This is a step-by-step tutorial for getting started with R, a powerful programming language for data analysis and visualization. It is aimed at near complete beginners. You'll basically want to be comfortable with spreadsheets and with using your computer's command line.
I slapped this together quickly, so expect some weirdness. Feel free to email me with comments or questions at jpvelez | at | gmail.com
I learned the following stuff from the UCLA Statistics Department's great R tutorials, so check those out: https://www.ats.ucla.edu/stat/r/learning_modules.htm
First, we need to get some data to analyze. We'll be using a dataset of NYC school SAT scores (nyc_schools_sat_scores_clean.csv), which is attached to this gist. Download the data to your computer and pop it into Excel to examine it.
Every row in this dataset represents an NYC school that had MORE THAN FIVE students take the SAT exam in 2010. The columns include a school's "DBN" number (a unique id for every school), its name, the number of students who took the SAT that year, and the mean reading, math, and writing scores of those students. Think of each row as a school, and each column as one of that school's attributes.
NOTE: the attached csv file is a cleaned version of the raw data available on the NYC data portal: https://nycopendata.socrata.com/
A number of schools in the original data had 's' values instead of numbers in the SAT reading, math, and writing score columns. According to the data portal, these schools had fewer than 5 students take the SAT, so those scores have been suppressed (hence the 's') to protect their anonymity. To simplify things, I've gone ahead and removed those rows from the dataset. If you don't use the clean data, the following tutorial won't work.
Ok, use the command line to navigate to the directory where you saved the data, and type "R" (capital letter) to fire up R. You can also use the official R console, but then you'll need to explicitly set your working directory to the directory where the SAT data lives. I won't cover that. Google it. Don't be lazy.
Now, we need to read our data into R so we can do stuff with it. For that, we use the read.csv() function:
sat_scores = read.csv('nyc_school_sat_scores_clean.csv')
Here the read.csv function reads in the csv, which needs to be located in the working directory, and returns your data in a dataframe object that is then saved to a variable named sat_scores.
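If you want to see read.csv() in action without the real file, here's a minimal sketch. It writes a tiny made-up csv to a temporary file and reads it back. The rows, the school names, and the dbn column name are invented for illustration; they are not the real NYC data.

```r
# Write a tiny made-up csv to a temporary file (hypothetical data, not the NYC file)
csv_path = tempfile(fileext = '.csv')
writeLines(c('dbn,school_names,reading,math',
             '01A,Alpha High,400,410',
             '02B,Beta High,450,470'), csv_path)

# read.csv() returns a dataframe; the first line of the file becomes the column names
toy_scores = read.csv(csv_path)

class(toy_scores)   # what kind of object did we get back?
names(toy_scores)   # the column names taken from the header row
```

The only thing that changes with the real data is the file name you hand to read.csv().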
What is a dataframe object, you ask? In technicalese, it's a data structure that makes it easy to store and access tabular data with named columns. Think of it as a spreadsheet or table you can do stuff with.
To see the contents of your dataframe, just type it in:
sat_scores
names(sat_scores)
Throw your dataframe into the names function to see what columns of data are in there. This is the same thing as the column names in the first row of a spreadsheet.
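Typing the name of a big dataframe floods your screen, so here's a rough sketch of a few other base R functions for peeking at one. The toy dataframe is made up so the example runs on its own; swap in sat_scores once you've loaded the real data.

```r
# A made-up, three-school dataframe (not the real NYC numbers)
toy = data.frame(
  school_names = c('Alpha High', 'Beta High', 'Gamma High'),
  reading = c(400, 450, 380),
  math = c(410, 470, 390)
)

head(toy, 2)   # only the first 2 rows
nrow(toy)      # how many rows (schools) there are
ncol(toy)      # how many columns there are
str(toy)       # each column's name, type, and first few values
```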
The other big type of object in R is a vector. A vector is basically just a list. It could be a list of text, or of numbers, but it's usually numbers.
From here on out, type in the code first, try to understand what it does, and then read the description.
vector = c(1, 2, 3, 4)
This is how you make a new vector and save it to a variable named vector. If you don't save dataframes or vectors to variables, you can't use them later.
vector
This is how you inspect the contents of your new vector.
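Once you have a vector, there's a handful of things you can do with it right away. This is just a sketch using made-up numbers; the variable name scores is mine, not something from the dataset.

```r
# A made-up vector of four scores
scores = c(400, 450, 380, 520)

length(scores)   # how many values are in the vector
scores[2]        # the second value (R counts from 1, not 0)
mean(scores)     # the average of all the values
scores + 10      # arithmetic is applied to every value at once
scores > 400     # comparisons return a vector of TRUE/FALSE, one per value
```

That last line matters later: a vector of TRUE/FALSE values is what drives row filtering.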
Now we're going to slice and dice the data in our dataframe. This is called 'subsetting' data.
sat_scores$reading
This is the easiest way of accessing all the data in one of your columns. The style is dataframe$column. You punch this in, and the computer will return a vector of all the values in that column, in this case, all of the mean SAT reading scores for NYC schools (that had more than 5 students take the SAT in 2010).
reading_scores = sat_scores$reading
You can save the data in the reading column, i.e. the vector of mean reading scores, to a new variable just like above.
sat_scores[, 2]
You can also select columns using brackets like this. Actually, these brackets let you select both columns and rows. The first 'slot' in the brackets, before the comma, lets you specify what rows you want. We want all of them, so leave that blank. The second slot lets you specify which columns you want. So this code will get you a vector of school_names, because school_names are in column 2. If you don't remember a column's number (or name), use the names() function.
sat_scores[, c(2, 3, 4)]
You can select multiple columns. The way you do this is by putting a vector of the column numbers you want in that second column slot.
sat_scores[, 'school_names']
You can also specify what columns you want by using their names. The names must have quotes around them, because technically these are 'strings', or text objects. If you don't put quotes around a name, R thinks you're talking about a variable. If you haven't used that variable anywhere, it'll get pissy and throw an error at you.
sat_scores[, c('school_names', 'reading')]
You can also specify multiple columns using their names by putting them in a vector, just like we did for the column numbers. This code will return a new, two-column dataframe of school_names and reading scores. Every school will be in this new dataframe, because you left the first 'slot' in the brackets blank.
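To convince yourself that these are all different routes to the same data, here's a small sketch on a made-up dataframe (the schools and numbers are invented). It also shows one thing worth noticing: asking for one column gives you a vector, while asking for several gives you a dataframe.

```r
# A made-up, three-school dataframe (not the real NYC numbers)
toy = data.frame(
  school_names = c('Alpha High', 'Beta High', 'Gamma High'),
  reading = c(400, 450, 380),
  math = c(410, 470, 390)
)

by_dollar = toy$reading        # dataframe$column
by_number = toy[, 2]           # column number in the second slot
by_name   = toy[, 'reading']   # column name, in quotes, in the second slot

identical(by_dollar, by_number)   # TRUE: same vector
identical(by_dollar, by_name)     # TRUE: same vector

two_cols = toy[, c('school_names', 'reading')]
class(by_name)    # "numeric": a single column comes back as a vector
class(two_cols)   # "data.frame": several columns come back as a dataframe
```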
Let's try to filter out rows now.
sat_scores[sat_scores$math > 350, ]
This code will return a new dataframe that will contain only those rows (i.e. those schools) which had math scores ABOVE 350. This new dataframe will have all the columns - school name, number of testers, reading scores, etc - because you left the second slot blank, but it will only include schools that scored above 350 in math. In other words, this command says "get me every school that had math scores greater than 350." For some silly reason, you can't just write [math > 350,] because R doesn't know which dataframe that column belongs to. Maybe you have several dataframes with columns named 'math'. So you need to specify which column you're talking about by writing [sat_scores$math > 350,], the same syntax you used to access the reading scores above.
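It helps to see what's actually sitting in that first slot. Here's a sketch on made-up data: the comparison produces a vector of TRUE/FALSE values, one per row, and the brackets keep the rows marked TRUE. I've also included base R's subset() function, which lets you skip the dataframe$ prefix; the tutorial doesn't use it, so treat it as an optional alternative.

```r
# A made-up, three-school dataframe (not the real NYC numbers)
toy = data.frame(
  school_names = c('Alpha High', 'Beta High', 'Gamma High'),
  reading = c(400, 450, 380),
  math = c(410, 470, 390)
)

keep = toy$math > 400   # one TRUE/FALSE per row
keep

high_math = toy[keep, ]   # keep only the rows where keep is TRUE
high_math

# subset() looks up 'math' inside toy for you, so no toy$ is needed
also_high = subset(toy, math > 400)
also_high
```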
Now let's subset on both rows and columns.
sat_scores[sat_scores$math > 350, c(2, 4, 5)]
This code says "get me every row where math score > 350, but only show me the data in columns 2, 4, and 5." In other words, take our sat_scores table and spit out a new table that only shows the school name, reading, and math scores of schools that had mean math scores above 350. Got it?
sat = sat_scores[sat_scores$math != 's' ,]
This shows you another way you can subset rows. This code says "return every row that DOES NOT have an 's' value in its math column." != stands for DOES NOT EQUAL, while == stands for EQUAL. If you want to start with the raw, not-cleaned data from the NYC data portal, you could use this code to remove schools that have suppressed scores, i.e. 's' strings in many of their columns. Either way, the result is saved to a variable named sat, which the charts below use.
Alright, so now you can filter tables and access the data in their columns. That's nice, but a big vector (list) of numbers isn't very helpful. It doesn't give us insight. We need a way to summarize some of the data.
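One catch if you do go the raw-data route: a column that mixes numbers and 's' values gets read in as text, and it stays text even after you drop the 's' rows. Here's a sketch of the extra conversion step on made-up data. I'm assuming the raw column behaves like the one below; I haven't checked it against the actual portal file.

```r
# Made-up 'raw' data where one school's math score is suppressed
raw = data.frame(
  school_names = c('Alpha High', 'Beta High', 'Gamma High'),
  math = c('410', 's', '390'),
  stringsAsFactors = FALSE
)

kept = raw[raw$math != 's', ]   # drop the suppressed row
class(kept$math)                # still text, so mean() and hist() won't work yet

# Convert the text to numbers. Going through as.character() first also
# handles older R versions that read text columns in as factors.
kept$math = as.numeric(as.character(kept$math))
mean(kept$math)
```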
summary(sat_scores)
The summary function does just that. For columns that contain numbers, it prints summary statistics like the smallest, largest, and mean number in the list. For columns that contain text, like school_names, it counts how many times each unique text string occurs (on R 4.0 and later, read.csv keeps text as plain strings by default, so you'll just see the column's length and type instead). You can also use this function on individual vectors - summary(sat_scores$reading) or summary(reading_scores) - not just entire dataframes.
It's time for a little data viz. R makes it stupid simple to generate charts. Let's start with a histogram, which is a great way to visualize the distribution of the data in a single column.
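If you only want one of those numbers, you don't need summary() at all. Here's a sketch with a made-up vector showing the individual base R functions that give you the same pieces.

```r
# A made-up vector of mean reading scores
reading = c(380, 400, 450, 520)

summary(reading)   # min, quartiles, median, mean, and max in one go

min(reading)       # smallest value
max(reading)       # largest value
mean(reading)      # average
median(reading)    # middle value
```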
hist(sat$math)
The hist function takes in a vector and returns a histogram chart. It won't work if you feed it an entire dataframe - hist(sat_scores) - you need to specify a column. Remember, whenever you type data_frame$column, the computer returns a vector of all the values in the column, which then get fed into the hist function.
hist(sat$reading)
So this will show you the distribution of mean reading scores across NYC schools. Notice that a lot of them cluster between 350 and 450. This is a low score, and consistent with the median values we got from the summary function. Takeaway: NYC schools aren't doing very well.
hist(sat$writing)
You can do this for writing and math as well.
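The default histogram is fine for a quick look, but hist() takes a few optional arguments if you want to tidy it up. Here's a sketch using randomly generated, made-up scores (not the NYC data). The specific values I picked for breaks, the title, and the labels are just examples.

```r
set.seed(1)   # makes the made-up numbers repeatable
fake_math = round(rnorm(200, mean = 410, sd = 60))   # 200 invented scores

h = hist(fake_math,
         breaks = 20,                      # suggest roughly 20 bars
         main = 'Made-up math scores',     # chart title
         xlab = 'Mean math score',         # x axis label
         col = 'grey')                     # bar color

h$counts   # hist() also hands back the number of values in each bar
```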
Now let's make a scatterplot. These let us see two variables at once, and examine whether there's a relationship between the two.
plot(math ~ reading, sat)
This plot function looks similar to hist, but it's a little peculiar. First, it has two arguments or 'slots'. Instead of specifying columns with the dataframe$column syntax as you've been doing, the first argument tells R which columns to plot, and the second argument tells R which dataframe these columns belong to. Also there's a weird ~ in the first argument. Basically the (math ~ reading, ..) code says I want a scatterplot with math scores on the y axis and reading scores on the x axis: (y ~ x, ..). I think of it as "math mashed up with reading."
OK, so we have a scatterplot! Two observations: 1. Most schools cluster between 350 and 450 on BOTH their math and reading scores, which is consistent with our histograms and summaries. 2. Schools that have higher math scores tend to have higher reading scores. They tend to move together, that's why you see the dots moving up and to the right. That means there's an association between math and reading scores. Cool! If we didn't see that pattern, if the dots were all over the place, then there would be no association.
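If you want a single number to go with the picture, base R's cor() function computes the correlation between two vectors: close to 1 means they rise together, close to 0 means the dots are all over the place. The tutorial doesn't use it, so take this as an optional extra, sketched on made-up data.

```r
# Made-up scores for four schools (not the real NYC numbers)
toy = data.frame(
  reading = c(380, 400, 450, 520),
  math = c(390, 410, 470, 540)
)

plot(math ~ reading, toy)           # the same kind of scatterplot as above
r = cor(toy$math, toy$reading)      # correlation between the two columns
r
```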
library(lattice)
Lattice is a library that has functions that make fancier graphs than the ones built in to R. Use the library function to load it into R so you can use some of them.
xyplot(math ~ reading, sat)
xyplot is lattice's equivalent to the plot() function. It works the same way, but gives you pretty colors.
So we've got some charts. That's great. Before the end, I will tease you with a tiny bit of stats.
fit = lm(math ~ reading, sat)
This lm() function runs a linear regression on the data we visualized with a scatterplot. Very crudely, it tries to measure to what extent there's a linear relationship between math and reading scores. A linear relationship means "as math goes up, so does reading." The scatterplot suggested that schools with higher math scores tend to have higher reading scores; this is a more rigorous way of capturing that relationship.
abline(fit)
This function will take the linear regression object generated above and add a 'line of best fit' to our scatterplot. It draws on the chart made by plot(), not the lattice one, so if xyplot was the last chart you made, re-run plot(math ~ reading, sat) first.
Also, based on Juan's work, I cobbled together a version of this demo that uses Clojure's Incanter package to accomplish the same work. Incanter is a port of R's statistics and charting functions to Clojure.
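Here's the whole regression step in one place, sketched on made-up data so you can run it without the csv. The coef() call isn't in the tutorial above; I've added it because it pulls out the two numbers that define the line (where it crosses the y axis, and how steep it is).

```r
# Made-up scores for four schools (not the real NYC numbers)
toy = data.frame(
  reading = c(380, 400, 450, 520),
  math = c(390, 410, 470, 540)
)

fit = lm(math ~ reading, toy)   # regress math on reading

coef(fit)   # the intercept and the slope of the line of best fit

plot(math ~ reading, toy)   # draw the base R scatterplot first...
abline(fit)                 # ...then add the fitted line on top of it
```

A positive slope on reading is the numeric version of "the dots move up and to the right."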
https://gist.github.com/3292129