Below is a running list of issues that BEGINNING students who are familiar with spreadsheet-based programs, such as Microsoft Excel, have when first learning R/RStudio.

“Where” is the data?

When you use a command such as dat <- read.csv("foo.csv"), R reads your dataset into memory on your computer. Any subsequent functions or operations you perform on dat in R will not affect your original file foo.csv, a comma-separated file stored somewhere on your computer. dat is now an R object. Any new objects that you create in R from dat, for example datsub <- dat[, c(1, 3, 5)], a new dataframe composed of the 1st, 3rd, and 5th columns of dat, are also stored in your computer’s memory, but NOT written to a file. In RStudio, such an object will appear in your environment, but once you close RStudio it will no longer be accessible unless you have written it to a file.
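A quick demonstration can make this concrete for students. The sketch below is only an illustration: the file contents and the column names id and score are made up, and the file is written to a temporary directory so nothing on your machine is touched.

```r
# Write a small example file to a temporary directory (hypothetical data).
path <- file.path(tempdir(), "foo.csv")
write.csv(data.frame(id = 1:3, score = c(10, 20, 30)), path, row.names = FALSE)

dat <- read.csv(path)          # dat is a copy held in memory
dat$score <- dat$score * 2     # changes the in-memory object only

dat$score                      # 20 40 60
read.csv(path)$score           # still 10 20 30: the file is untouched
```

Re-reading the file after modifying dat is the step that tends to convince students that the original is safe.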

Example R/RStudio Workflow

When you introduce loading in data, give students an overview of what their workflow may eventually look like:

  1. Read raw data into a dataframe in R (e.g., dat <- read.csv("raw.csv"))
  2. Clean, join, and massage your data, recording all steps in the .R file for reproducibility
  3. Export your cleaned dataframe datclean to a .csv file so you can access it later for analysis without having to repeat all your cleaning steps (e.g., setwd("~/cleandatdir"); write.csv(datclean, "YYYY-MM-DD-cleaneddata.csv"))
  4. Use your cleaned data (read it in) to carry out different analyses. Keep a record of all the analyses you do in a separate .R file.
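The four steps above could be sketched roughly as follows. This is a minimal illustration, not a prescribed layout: the directories raw_dir and clean_dir, the columns id and age, and the single cleaning step are all assumptions, and temporary directories are used in place of real project folders.

```r
# Hypothetical directories standing in for raw and cleaned data folders.
raw_dir   <- file.path(tempdir(), "raw")
clean_dir <- file.path(tempdir(), "clean")
dir.create(raw_dir, showWarnings = FALSE)
dir.create(clean_dir, showWarnings = FALSE)

# Stand-in for a raw file you would normally already have.
write.csv(data.frame(id = c(1, 2, 3), age = c(25, NA, 41)),
          file.path(raw_dir, "raw.csv"), row.names = FALSE)

# 1. Read in the raw data.
dat <- read.csv(file.path(raw_dir, "raw.csv"))

# 2. Clean: here, simply drop rows with a missing age.
datclean <- dat[!is.na(dat$age), ]

# 3. Export with a dated file name (Sys.Date() prints as YYYY-MM-DD).
outfile <- file.path(clean_dir, paste0(Sys.Date(), "-cleaneddata.csv"))
write.csv(datclean, outfile, row.names = FALSE)

# 4. In a separate analysis script, read the cleaned data back in.
analysis_dat <- read.csv(outfile)
mean(analysis_dat$age)
```

Building paths with file.path() is one alternative to calling setwd() before each read or write; either approach keeps it clear which file was used at which step.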

Example File Structure

To prevent confusion about where your data came from, what changes you made to it, and when, keep raw datasets, scripts, and processed datasets in separate directories. In your R code, include the setwd() commands so that you know which datasets were accessed and loaded for which steps of cleaning and analysis.

Below is an example file structure that works well for me (though mine is a bit more complicated than this):

00_DATA
   |--0_RAW
      rawpropvaldat.csv
      rawbrtid.csv
   |--1_WORKING
      2015-12-18-propval-clean.csv
   |--2_FINAL
      2016-02-05-propval-clean.csv
02_CODE
   propvalcleaning.R
   brtcleaning.R
   propvalanalysis.R
   brtidanalysis.R
03_OUTPUTS
   propvaltable.csv
   propvalhist.png
   probvallinearmodtab.csv
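If you want students to start from the same skeleton, the directories can be created from R. The sketch below assumes a project root called myproject (a made-up name) and places it in a temporary directory so it does not touch real files.

```r
# Hypothetical project root; a temporary directory is used so nothing is
# written to your real files.
root <- file.path(tempdir(), "myproject")

dirs <- file.path(root, c("00_DATA/0_RAW", "00_DATA/1_WORKING",
                          "00_DATA/2_FINAL", "02_CODE", "03_OUTPUTS"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# List the directories that were created, relative to the root.
list.dirs(root, full.names = FALSE, recursive = TRUE)
```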

 

What is the difference between a dataframe and a matrix?

Students should think of the R object dataframe as being like an Excel spreadsheet. Observations are stored in rows and the attributes of observations are stored in columns. As in Excel, columns can be of DIFFERENT data “types” (factor, character, numeric, etc). A matrix also looks like an Excel table, but all of its values must be of the SAME data type. Try demonstrating what can happen when you convert a column of the factor data type to numeric or character. Be especially careful with dates!
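One possible classroom demonstration is sketched below. The dataframe df and its columns name and year are made up for illustration.

```r
# A data frame can hold columns of different types.
df <- data.frame(name = c("a", "b", "c"),
                 year = factor(c("2010", "2015", "2020")),
                 stringsAsFactors = FALSE)
sapply(df, class)      # name: "character", year: "factor"

# A matrix holds one type, so mixing numbers and text coerces everything
# to character.
m <- cbind(id = 1:3, name = c("a", "b", "c"))
class(m[, "id"])       # "character"

# Converting a factor straight to numeric returns the internal level codes.
as.numeric(df$year)                  # 1 2 3
# Going through character first recovers the values.
as.numeric(as.character(df$year))    # 2010 2015 2020
```

The factor-to-numeric result is the one that surprises students most, since no error or warning is raised.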

What does the data look like?

I have found that beginning R users frequently try to ‘open’ spreadsheet-like views of their dataframes in the RStudio GUI to ‘see’ what is going on. They want to understand the ‘shape’ of their data, whether columns were successfully created, etc. Early on, teaching R commands such as head, tail, summary, dim, length, table, and hist really helps students start to understand their datasets in a different way. Students should get into the habit of using these commands even with small datasets so that when they work with larger datasets, they will no longer feel the urge to ‘scroll through’ the dataframe in order to understand it. dim can be combined with subsetting and logical commands (e.g., dim(dat[which(dat$age > 18), ])[1]) to get students to “ask” preliminary questions about their datasets. Walk the students through what the code says in English (e.g., how many rows of observations are there for which the observation’s age is over 18? Or, how many observations are over 18?)
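A short session along these lines might look like the sketch below. The dataframe and its columns age and group are invented for the example.

```r
# Hypothetical data frame used only for illustration.
dat <- data.frame(age   = c(12, 25, 31, 17, 44, 18),
                  group = c("a", "b", "b", "a", "c", "b"))

head(dat, 3)        # first three rows
dim(dat)            # rows and columns: 6 2
summary(dat$age)    # min, quartiles, mean, max
table(dat$group)    # counts per group: a = 2, b = 3, c = 1

# How many observations have an age over 18?
dim(dat[which(dat$age > 18), ])[1]   # 3
# An equivalent, shorter form: TRUE counts as 1 when summed.
sum(dat$age > 18, na.rm = TRUE)      # 3
```

Showing both forms of the count lets students check one answer against the other, which is a useful habit in itself.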

Wanting to see cell formulas

Unlike Excel, which can store in each cell both the data and any function used to compute it, R “loads in” tabular data from tabular sources and can perform functions on its columns, but will NOT store any record of those functions in the columns, only the output values. This is why it is important to keep a record of all the operations you perform on the raw dataset in a .R file (more on this later).
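Students who expect a derived column to behave like a live Excel formula may find a small example helpful. The columns price, qty, and total below are hypothetical.

```r
dat <- data.frame(price = c(100, 250, 80), qty = c(2, 1, 5))

# The new column holds computed values, not a live formula.
dat$total <- dat$price * dat$qty
dat$total                  # 200 250 400

# Changing an input does not update the derived column...
dat$price[1] <- 110
dat$total                  # still 200 250 400

# ...until the line that computes it is run again.
dat$total <- dat$price * dat$qty
dat$total                  # 220 250 400
```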

Fear of making changes to/overwriting original data, fear of iteration

When using a graphical user interface (GUI) to make plots and graphs, each click of the mouse to format something is a “step” one has to go through to get the desired effect. The upside is that the GUI often organizes the parameters to choose from very well. The downside is that your process is not very repeatable: to redo a plot, for example, you must go through and re-select all of the settings you want (adjust the margins, axes, colors, etc).

With R, convey to students that they should view the process of plotting as iterative, with all the “good” iterations being saved in their .R code. Everything is reproducible in seconds just by re-running the good lines of code: things that look good stay in, and things that don’t work can simply be taken out and the code re-run.
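As an illustration of what a plot looks like after a few iterations, here is a sketch using the cars dataset that ships with R. The title, labels, colors, and output file name are arbitrary choices, and the plot is drawn to a temporary PDF only so the example runs without an open plot window.

```r
# Draw to a temporary file so the sketch runs anywhere; in RStudio you
# would usually just call plot() and look at the Plots pane.
plotfile <- file.path(tempdir(), "speed-dist.pdf")
pdf(plotfile, width = 6, height = 4)

# The first iteration was simply: plot(cars$speed, cars$dist)
# Later iterations keep what worked and add to it.
plot(cars$speed, cars$dist,
     main = "Stopping distance by speed",
     xlab = "Speed (mph)", ylab = "Distance (ft)",
     pch = 19, col = "steelblue")
abline(lm(dist ~ speed, data = cars), lty = 2)   # added in a later pass

dev.off()
```

Re-running these few lines reproduces the figure exactly, which is the point to get across to students coming from a click-by-click GUI.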

R and Excel are also very different in that R loads in data and your R code potentially makes changes to it, but those changes do not overwrite the original dataset while you are working in R. If students are working with particularly large datasets, where cleaning steps may be lengthy, have them export the clean data to a csv file, which can then be re-loaded for analysis. Introduce the concept of keeping shortened vectors, matrices, or dataframes called tmp and test on which to test functions or do some preliminary analysis.
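The tmp idea could be demonstrated along these lines. The simulated data, the column names, and the scaling step are all stand-ins for whatever the students are actually working on.

```r
# Hypothetical "large" data frame.
set.seed(1)
dat <- data.frame(id = 1:100000, value = rnorm(100000))

# Keep a small piece to experiment on.
tmp <- head(dat, 100)

# Try the step on tmp first...
tmp$value_scaled <- (tmp$value - mean(tmp$value)) / sd(tmp$value)
summary(tmp$value_scaled)

# ...then apply it to the full data once it behaves as expected.
dat$value_scaled <- (dat$value - mean(dat$value)) / sd(dat$value)
```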

Stress to students that the commands they run in R will not disturb their original datasets. Similarly, any changes they make to the data in the RStudio session will not be ‘saved’ in the Excel sense, even if they save changes to their .R files.

Which brings me to the next point…

Understanding the relationship between the .R file and the R Console

When you ‘save’ an Excel file, you save all the changes you have made within the .xlsx file, including the data, functions that you have used, new columns you created, graphs and tables, etc. In RStudio, when you exit you may be asked if you would like to save a ‘workspace’ in the GUI, which may preserve the datasets you are working with and the variables you have defined, but stress to students that this is NOT a great way to store data. The next time you open RStudio you may be working on a different dataset, and you may not remember what you were doing when those variables were created.

Students should think of the .R file as a record of all the things that they did to get their data into the form they want for analysis (data cleaning). All cleaning steps should be in one file that has comments embedded within it. Among the last steps in the ‘cleaning code’ should be the commands setwd("~/cleandatadir") and write.csv(dat, "YYYY-MM-DD-cleaneddata.csv"). Once they start running different models, suggest that they keep a separate script that loads in the cleaned data file with dat <- read.csv("YYYY-MM-DD-cleaneddata.csv").
