Below is a running list of issues that BEGINNING students who are familiar with spreadsheet-based programs, such as Microsoft Excel, have when first learning R/RStudio.
“Where” is the data?
When you use a command such as dat <- read.csv("foo.csv"), R reads your dataset into memory on your computer. Any subsequent functions or operations you perform on dat in R will not affect your original file foo.csv, a comma-separated file stored somewhere on your computer. dat is now an R object. Any new objects that you create in R from dat, for example datsub <- dat[, c(1, 3, 5)], a new dataframe comprising the 1st, 3rd, and 5th columns of dat, are also stored in your computer’s memory, but NOT written to a file. In RStudio, this object will appear in your environment, but once you close RStudio, it will no longer be accessible (unless you save your workspace).
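To make the memory-versus-file distinction concrete, here is a minimal sketch. It assumes a small made-up stand-in for foo.csv written to a temporary directory; the column names and values are hypothetical and only serve the illustration.

```r
# Create a small stand-in for foo.csv in a temporary directory (hypothetical data).
csv_path <- file.path(tempdir(), "foo.csv")
write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15),
          csv_path, row.names = FALSE)

dat <- read.csv(csv_path)        # dat is a copy of the data held in memory
datsub <- dat[, c(1, 3, 5)]      # 1st, 3rd, and 5th columns; also in memory only
dat$a <- dat$a * 100             # changes the R object, not the file

on_disk <- read.csv(csv_path)    # re-read the file to check what is stored on disk
identical(on_disk$a, 1:3)        # compares the file's column with the original values
```

Re-reading the file is the quickest way to convince a student that nothing done to dat has touched the file on disk.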
Example R/RStudio Workflow
When you introduce loading in data, give students an overview of what their workflow may eventually look like:
- Read in raw data into a dataframe in R (eg:
- Clean, join, and massage your data, recording all steps in the .R file for reproducibility
- Export your cleaned dataframe datclean to a .csv file so you can access it later for analysis without having to repeat all your cleaning steps (eg: setwd("~home/cleandatdir"); write.csv(datclean, "YYYY-MM-DD-cleaneddata.csv"))
- Use your cleaned data (read it in) to carry out different analyses. Keep a record of all the analyses you do in a separate .R file.
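The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the working directory under tempdir(), the file name rawdata.csv, the made-up data, and the cleaning rule (dropping rows with missing scores) are all hypothetical stand-ins for whatever your own project uses.

```r
# Hypothetical working area; in practice these would be your own directories.
work_dir <- file.path(tempdir(), "workflow-demo")
dir.create(work_dir, showWarnings = FALSE)

# 1. Read raw data into a dataframe (a tiny made-up raw file is written first).
raw_path <- file.path(work_dir, "rawdata.csv")
write.csv(data.frame(id = c(1, 2, 2, 3), score = c(10, NA, 7, 5)),
          raw_path, row.names = FALSE)
dat <- read.csv(raw_path)

# 2. Clean: here, simply drop rows with a missing score.
datclean <- dat[!is.na(dat$score), ]

# 3. Export the cleaned dataframe under a dated file name.
clean_name <- paste0(format(Sys.Date(), "%Y-%m-%d"), "-cleaneddata.csv")
clean_path <- file.path(work_dir, clean_name)
write.csv(datclean, clean_path, row.names = FALSE)

# 4. Later, typically in a separate analysis script, read the cleaned data back in.
datclean2 <- read.csv(clean_path)
mean(datclean2$score)
```

Passing row.names = FALSE to write.csv keeps R from adding an extra column of row numbers, which would otherwise reappear as a spurious first column when the file is read back in.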
Example File Structure
To prevent confusion about where your data came from, what changes you made to it, and when, keep raw datasets, your scripts, and processed datasets in separate directories. In your R code, include the setwd() commands so that you know which datasets were accessed and loaded in for which steps of cleaning and analysis.
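As a rough sketch of what explicit setwd() calls look like in a script, the example below reads from one directory and writes to another. The directory names rawdata and cleandata, the file names, and the temporary project folder are assumptions made for this illustration, not a prescribed layout.

```r
# Hypothetical project layout, created under a temporary directory for the demo.
project   <- file.path(tempdir(), "project-demo")
raw_dir   <- file.path(project, "rawdata")
clean_dir <- file.path(project, "cleandata")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(raw_dir, "raw.csv"), row.names = FALSE)

old_wd <- getwd()            # remember where we started

setwd(raw_dir)               # raw data is read from here
dat <- read.csv("raw.csv")

setwd(clean_dir)             # processed data is written here
write.csv(dat, "cleaned.csv", row.names = FALSE)

setwd(old_wd)                # restore the original working directory
```

Because each read and write is preceded by a setwd() call, the script itself documents which directory every file came from or went to.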
Below is an example file structure that works well for me (though mine is a bit more complicated than this):