Last active
December 16, 2015 16:08
Introduction to R - Data Scientist Training for Librarians
http://rpubs.com/alexplanation/dst4l
| Introduction to R | |
| ======================================================== | |
| This file is online! | |
| http://rpubs.com/alexplanation/dst4l | |
| https://gist.github.com/alexstorer/5460479 | |
| ### **R** is an interpreted language like **Python** | |
| ```{r} | |
| "hello world" | |
| ``` | |
| If you use NumPy/SciPy, here is a conversion to R: | |
| http://mathesaurus.sourceforge.net/matlab-python-xref.pdf | |
| ### R is better than Python | |
| #### The language is designed to make statistics and data handling easy | |
| ```{r} | |
| mtcars["Honda Civic","mpg"] | |
| lm(mpg~cyl+gear+wt,data=mtcars) | |
| ``` | |
| #### Math, Statistics and Matrices work out of the box | |
| ```{r} | |
| 2+2 | |
| log(2^(0.25)) | |
| ``` | |
| #### Install add ons from inside R | |
| ```{r} | |
| # install.packages('ggplot2') | |
| library(ggplot2) | |
| ``` | |
| #### Superb graphing tools | |
| ```{r fig.width=7, fig.height=6} | |
| ggplot(data=mtcars,aes(x=mpg,y=wt,color=cyl))+geom_point() | |
| ``` | |
| #### Incredible IDE: __RStudio__ | |
|  | |
| You can even run RStudio on the internet, and set up your own FREE Amazon instance to run it: | |
| http://www.louisaslett.com/RStudio_AMI/ | |
| #### CRAN Task Views (http://cran.r-project.org/web/views/) summarize third-party packages for many types of analysis | |
| ### Python is better than R | |
| #### Installed by default | |
| #### Syntax is much easier to learn and use | |
| #### Powerful and intuitive data structures | |
| #### Outstanding text processing | |
| #### RPy2 lets you use R from Python | |
| ### In practice, I use both! | |
| ************ | |
| Today's Goal: Getting Started With R | |
| ------------------------------------ | |
| ### Today's Strategy | |
| Jump in with both feet, worry about the details later! More time spent _doing_. | |
| ### Let's get data into R | |
| ```{r} | |
| df <- read.csv('http://files.figshare.com/983501/nsfastgrantbib20121130.csv') | |
| ``` | |
| * This is the sort of thing designed to _just work_ in R | |
| * Unlike in most programming languages, assignment is usually written with the ```<-``` symbol (```=``` also works at the top level) | |
| * Our data is now in a _data frame_, the best data structure in R | |
| * Data Frames are like a matrix for data, but the rows and columns have names | |
| * Different data types can be in the same row (same observation) | |
| * We pull out data using the dollar sign ```$``` | |
| * We can also treat our data frame just like a matrix, and use square brackets (```[]```) | |
| * R has a special data type for missing data, called ```NA``` | |
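The points above can be sketched on a small made-up data frame. The name `toy` and its columns are illustrative assumptions, not part of the grant data:

```{r}
# Hypothetical toy data; not taken from the grant data set
toy <- data.frame(title = c("Paper A", "Paper B", "Paper C"),
                  citations = c(12, NA, 57),
                  stringsAsFactors = FALSE)
toy$title             # pull out one column with $
toy[3, "citations"]   # row 3, column "citations", matrix-style
is.na(toy$citations)  # NA marks the missing citation count
```

Both `$` and `[]` reach the same cells; `$` is handy for whole columns, while `[]` lets you pick rows and columns together.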
| ### Data Frame examples | |
| ```{r} | |
| names(df) | |
| df$PaperTitle[100] | |
| df[100,"PaperTitle"] | |
| head(df$PaperTitle) | |
| ``` | |
| ### Returning subsets of data frames | |
| ```{r} | |
| df[5:10,c("PaperTitle","GrantTitle")] | |
| ``` | |
| The ```c``` command, for combine, combines values into a vector. | |
| The ```:``` operator returns a vector of consecutive integers. | |
| ```{r} | |
| df[seq(5,10,2),c("GrantStartYear","GrantStartMonth")] | |
| ``` | |
| The ```seq``` command is a more flexible version of ```:``` | |
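A quick side-by-side of the three ways of building index vectors used so far. The numbers are arbitrary, chosen only for illustration:

```{r}
c(5, 7, 9)            # combine individual values into a vector
5:10                  # every integer from 5 to 10
seq(5, 10, 2)         # from 5 to 10 in steps of 2: 5 7 9
seq(0, 1, by = 0.25)  # seq also handles non-integer steps
```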
| ### A note on this data | |
| * Each row is a Grant-Paper combination | |
| * One paper can be on multiple rows | |
| * One grant can be on multiple rows | |
| * Was probably merged from two separate tables | |
| ### How many papers have more than 50 citations? | |
| ```{r} | |
| head(df$CitationCount>50) | |
| table(df$CitationCount>50) | |
| head(which(df$CitationCount>50)) | |
| paste("There are", length(which(df$CitationCount>50)), "papers") | |
| ``` | |
| * ```head``` returns the first few elements of what you give it | |
| * ```table``` counts how often each unique value occurs | |
| * ```which``` gives the indices of the ```TRUE``` values in a logical vector | |
| * ```length``` gives the number of elements in a vector | |
| * ```paste``` concatenates strings | |
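Here is how these functions chain together on a short made-up vector. The values in `cites` are hypothetical and include one missing value to show how `NA` behaves:

```{r}
# Hypothetical citation counts, including one missing value
cites <- c(3, 120, NA, 51, 8, 75)
over50 <- cites > 50   # logical vector; comparing NA gives NA
table(over50)          # counts of FALSE and TRUE (NA is left out by default)
which(over50)          # positions of the TRUE values: 2 4 6
paste("There are", length(which(over50)), "values over 50")
```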
| ### But these are not unique! | |
| ```{r} | |
| numPapers <- length(unique(df[df$CitationCount>50,"PaperTitle"])) | |
| ``` | |
| * ```df[df$CitationCount>50,"PaperTitle"]``` - just the titles of rows with more than 50 citations | |
| * ```unique``` removes any repeats | |
| * ```length``` tells us how many there are | |
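One caveat worth sketching: when the condition contains `NA`, logical indexing returns an `NA` entry, which `unique` would then count as a value. The vectors below are made up for illustration; wrapping the condition in `which` is one way to drop those entries:

```{r}
# Hypothetical titles and citation counts (one count is missing)
ex.titles <- c("A", "B", "B", "C", "D")
ex.cites  <- c(60, 80, 80, NA, 5)
ex.titles[ex.cites > 50]                          # "A" "B" "B" NA
ex.titles[which(ex.cites > 50)]                   # "A" "B" "B"
length(unique(ex.titles[which(ex.cites > 50)]))   # 2
```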
| ### How many different grants are in our data set? | |
| ```{r} | |
| # Try this one yourself! | |
| ``` | |
| ### The duplicated function | |
| Like ```unique```, but instead of the values it returns a logical vector marking each element that repeats an earlier one. | |
| ```{r} | |
| table(duplicated(df$GrantNumber)) | |
| ``` | |
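To see how `duplicated` relates to `unique`, here is a small sketch on a made-up vector of grant IDs (the IDs are hypothetical):

```{r}
ex.grants <- c("G1", "G2", "G1", "G3", "G2")
unique(ex.grants)                  # "G1" "G2" "G3"
duplicated(ex.grants)              # FALSE FALSE TRUE FALSE TRUE
ex.grants[!duplicated(ex.grants)]  # same values as unique(ex.grants)
```

Negating `duplicated` is useful for data frames, because the logical vector can select whole rows while keeping only the first row for each ID.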
| ### What is the mean of the money awarded for papers with more than 50 citations? Fewer than 10 citations? | |
| ```{r} | |
| popular.papers <- subset(df,CitationCount>50) | |
| less.popular.papers <- subset(df,CitationCount<10) | |
| mpp <- mean(popular.papers$AwardedUSD) | |
| lpp <- mean(less.popular.papers$AwardedUSD) | |
| paste(">50:",mpp,"<10",lpp) | |
| ``` | |
| * ```subset``` returns a data frame with only the matching elements | |
| * ```mean``` takes the mean of your data | |
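A small sketch of `subset` and `mean` together, on made-up numbers. It also shows what happens when a value is missing; whether the real award column has missing values is not something this example assumes either way:

```{r}
# Hypothetical data: one award amount is missing
toy.awards <- data.frame(citations = c(60, 5, 80, 2),
                         awarded   = c(100, NA, 300, 50))
popular <- subset(toy.awards, citations > 50)
mean(popular$awarded)             # 200
rest <- subset(toy.awards, citations < 10)
mean(rest$awarded)                # NA, because one amount is missing
mean(rest$awarded, na.rm = TRUE)  # 50
```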
| ### Make a table of award amounts by presidential administration | |
| ```{r} | |
| df$AwardAmountGroups <- cut(df$AwardedUSD,3,labels=c('low','med','hi')) | |
| df$Administration <- cut(df$GrantStartYear,breaks=c(1992,2000,2008,2013),labels=c('Clinton','Bush','Obama')) | |
| with(df,table(Administration,AwardAmountGroups)) | |
| ``` | |
| * ```$``` and ```<-``` can be used to assign a new column to a data frame | |
| * ```cut``` splits your data into groupings by value, either into N groups, or as directed | |
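When `cut` is given explicit breaks, each interval by default excludes its lower break and includes its upper one. A sketch with made-up years shows where the boundary years land:

```{r}
# Hypothetical start years, including the boundary years 2000 and 2008
ex.years <- c(1993, 2000, 2001, 2008, 2009, 2012)
ex.admin <- cut(ex.years, breaks = c(1992, 2000, 2008, 2013),
                labels = c("Clinton", "Bush", "Obama"))
# Intervals are (1992,2000], (2000,2008], (2008,2013]
table(ex.admin)
```

So a year equal to a break, such as 2000, falls into the earlier group.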
| ### Exercises | |
| * Do grants start more frequently at a given time of year? | |
| * What proportion of papers are refereed? | |
| * What state received the most grants? | |
| * Find the most cited paper. How many other papers were published on that grant? | |
| * How many papers refer to at least three grants? | |
| ### Solutions | |
| * Grant Start Month | |
| ```{r} | |
| df.grants <- df[!duplicated(df$GrantNumber),] | |
| table(df.grants$GrantStartMonth) | |
| plot(table(df.grants$GrantStartMonth)) | |
| ``` | |
| * Refereed Papers | |
| ```{r} | |
| df.papers <- df[!duplicated(df$PaperTitle),] | |
| table(df.papers$Refereed) | |
| ``` | |
| * States with grants | |
| ```{r} | |
| df.grants <- df[!duplicated(df$GrantNumber),] | |
| sort(table(df.grants$GranteeOrganizationState)) | |
| ``` | |
| * Most cited | |
| ```{r} | |
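| # Without na.rm, max() returns NA if any citation count is missing | |
| max.citations <- max(df$CitationCount) | |
| # na.rm=TRUE ignores the missing values | |
| max.citations <- max(df$CitationCount,na.rm=TRUE) | |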
| max.index <- which(df$CitationCount==max.citations) | |
| max.grants <- df[max.index,"GrantNumber"] | |
| # Note: this paper was funded by two grants! | |
| max.subset <- subset(df,GrantNumber==max.grants[1] | GrantNumber==max.grants[2]) | |
| length(unique(max.subset$PaperTitle)) | |
| # Here is a more elegant way: | |
| max.grants <- df[which.max(df$CitationCount),"GrantNumber"] | |
| max.subset <- subset(df,GrantNumber %in% max.grants) | |
| length(unique(max.subset$PaperTitle)) | |
| ``` | |
| * Three grants or more | |
| ```{r} | |
| papers <- aggregate(df$GrantNumber,list(df$PaperTitle),length) | |
| length(which(papers$x>=3)) | |
| ``` | |
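To make the `aggregate` call less mysterious, here is a sketch on a tiny made-up table of paper-grant pairs. The name `links` and its contents are hypothetical:

```{r}
# Hypothetical paper-grant pairs, one row per combination
links <- data.frame(paper = c("P1", "P1", "P1", "P2", "P3", "P3"),
                    grant = c("G1", "G2", "G3", "G1", "G2", "G4"))
counts <- aggregate(links$grant, list(links$paper), length)
counts               # Group.1 holds the paper, x the number of grants
sum(counts$x >= 3)   # papers with at least three grants: 1
```

`aggregate` names its output columns `Group.1` and `x` when given unnamed inputs, which is why the solution refers to `papers$x`.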
| #### Our goal! | |
|  | |
| ### What do we need in our data frame? | |
| **x-axis** - Year of Paper | |
| **y-axis** - Years Since Grant | |
| **color** - Number of Papers | |
| ```{r} | |
| # factor() makes this work whether or not read.csv stored the column as a factor | |
| levels(factor(df$PaperPubDate)) | |
| ``` | |
| #### We have to use regular expressions here | |
| ```{r} | |
| yearstr <- paste(df$PaperPubDate) | |
| unique(yearstr) | |
| yearstr <- gsub('.*/','',yearstr,perl=TRUE) | |
| unique(yearstr) | |
| yearstr <- gsub('\\D*','',yearstr,perl=TRUE) | |
| unique(yearstr) | |
| changeinds <- which(as.numeric(yearstr)<14) | |
| changevals <- as.numeric(yearstr[changeinds])+2000 | |
| yearstr[changeinds] <- as.character(changevals) | |
| unique(yearstr) | |
| changeinds <- which(nchar(yearstr)==2) | |
| changevals <- paste("19",yearstr[changeinds],sep="") | |
| yearstr[changeinds] <- changevals | |
| df$PaperPubDateNumeric <- as.numeric(yearstr) | |
| ``` | |
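The same cleanup idea can be tried on a few made-up strings first. The formats in `raw` are assumptions for illustration; the real `PaperPubDate` values may look different. This sketch also swaps the two index-based fix-ups for a nested `ifelse`:

```{r}
# Hypothetical date strings; the real formats may differ
raw <- c("03/2005", "Jun-98", "12/07")
yr <- gsub(".*/", "", raw, perl = TRUE)   # drop everything up to the last slash
yr <- gsub("\\D+", "", yr, perl = TRUE)   # drop any remaining non-digits
num <- as.numeric(yr)
num <- ifelse(num < 14, num + 2000,              # 0-13 become 2000-2013
       ifelse(num < 100, num + 1900, num))       # other two-digit years become 19xx
num   # 2005 1998 2007
```

Testing a regular expression on a handful of known strings like this makes it much easier to spot a pattern that removes too much.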
| #### Now make a new variable for years since grant | |
| ```{r} | |
| df$YearsSinceGrant <- df$PaperPubDateNumeric-df$GrantStartYear | |
| ``` | |
| #### Remove bad rows | |
| We drop rows with no papers, papers published before the grant's start year (?!), and unrefereed papers. | |
| ```{r} | |
| df.filt <- subset(df,YearsSinceGrant>=0) | |
| df.filt <- subset(df.filt,Refereed=="refereed") | |
| ``` | |
| ### Basic Plots | |
| ```{r fig.width=7, fig.height=6} | |
| ggplot(data=df.filt,aes(x=YearsSinceGrant))+geom_bar() | |
| ``` | |
| ```ggplot``` maps various aesthetics to components of the data. It then draws things based on these aesthetics, and any transformations. | |
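As a sketch of that idea on the built-in `mtcars` data (used here only for illustration), the mapping can be stored once and then drawn in different ways:

```{r}
library(ggplot2)
# Map three aesthetics (x, y, color) to three columns of the data
p <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl)))
p + geom_point()                    # draw one point per car
p + geom_point() + scale_y_log10()  # same mapping, with a transformed y axis
```

Nothing is drawn until a geom is added; the `aes` call only says which column feeds which visual property.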
| ```{r fig.width=7, fig.height=6} | |
| ggplot(data=df.filt,aes(x=YearsSinceGrant,y=CitationCount))+geom_point() | |
| ``` | |
| ```{r fig.width=7, fig.height=6} | |
| library(scales) | |
| ggplot(data=df.filt,aes(x=YearsSinceGrant,y=CitationCount))+geom_point()+scale_y_log10() | |
| ``` | |
| ### Final Plot | |
| ```{r fig.width=7, fig.height=6} | |
| d <-ggplot(df.filt,aes(x=GrantStartYear,y=YearsSinceGrant))+xlim(1995,2011)+ylim(0,16) | |
| d <- d + geom_bin2d(binwidth=c(1,1)) | |
| d | |
| ``` | |
| #### Change the color | |
| ```{r fig.width=7, fig.height=6} | |
| d <- d + scale_fill_gradient(low="paleturquoise1",high="midnightblue",name="# Papers") | |
| d | |
| ``` | |
| #### Change the theme | |
| ```{r fig.width=7, fig.height=6} | |
| d <- d + theme_bw() | |
| d <- d + theme(text=element_text(size=24)) | |
| d <- d + theme(panel.grid=element_blank()) | |
| d | |
| ``` | |
| #### Add a line | |
| ```{r fig.width=7, fig.height=6} | |
| d <- d + geom_abline(intercept=2012, slope=-1,size=2,linetype="dashed") | |
| d | |
| ``` | |
| #### Add a smoother | |
| ```{r fig.width=7, fig.height=6} | |
| d <- d + geom_smooth(color="red") | |
| d | |
| ``` | |
| #### Change the legend location | |
| ```{r fig.width=7, fig.height=6} | |
| library(grid) | |
| d <- d + theme(legend.direction='horizontal',legend.position=c(0.75,0.8), | |
| legend.key.width=unit(1,"cm"),legend.title.align=1) | |
| d | |
| ``` | |
| #### Change axis labels | |
| ```{r fig.width=7, fig.height=6} | |
| d <- d + xlab("Grant Start Year") + ylab("Years Since Grant Start") | |
| d <- d + ggtitle("References to NSF AST Grants") | |
| d | |
| ``` | |
| #### Label the future | |
| ```{r fig.width=7, fig.height=6} | |
| d <- d + annotate("text", x = 2002, y = 11, label = "The Future", size=7, angle=-45, fontface="italic") + coord_fixed() | |
| d | |
| ``` | |
| ### Improvements | |
| This figure is, I think, just OK. There are a lot of colors, and it can be hard to compare across colors sometimes. We also aren't comparing apples to apples when it comes to 1995's papers and 2010's papers, by virtue of there being fewer years from which to sample. | |
| Probably the easiest way to compare is just to plot the paper production in the first year for all years. | |
| #### Look just at the first year post-grant | |
| ```{r fig.width=7, fig.height=6} | |
| my.subset <- subset(df.filt,YearsSinceGrant<=1) | |
| d <- ggplot(data=my.subset,aes(x=GrantStartYear))+geom_histogram(binwidth=1) | |
| d | |
| ``` | |
| We can also use the `stat_bin` command with the `geom` option to plot this as a line instead of as bars. | |
| ```{r fig.width=7, fig.height=6} | |
| d <- ggplot(data=my.subset,aes(x=GrantStartYear))+stat_bin(binwidth=1,geom='line') | |
| d | |
| ``` | |
| Clearly, the number of papers in the first year has increased over time. You might ask, is this related to the number of grants? Is this related to the total amount of money? Let's compute and plot these. | |
| ```{r fig.width=7, fig.height=6} | |
| unique.grants <- unique(my.subset[,c("GrantStartYear","GrantNumber","AwardedUSD")]) | |
| a.money <- aggregate(unique.grants$AwardedUSD,list(unique.grants$GrantStartYear),sum) | |
| a.grants <- aggregate(unique.grants$GrantStartYear,list(unique.grants$GrantStartYear),length) | |
| a.papers <- aggregate(my.subset$PaperTitle,list(my.subset$GrantStartYear),length) | |
| new.df <- data.frame(year=a.money$Group.1, | |
| nGrants=a.grants$x/max(a.grants$x), | |
| nDollars=a.money$x/max(a.money$x), | |
| nPapers=a.papers$x/max(a.papers$x)) | |
| ggplot(data=new.df,aes(x=year))+ | |
| geom_line(aes(y=nGrants,color="Grants"))+ | |
| geom_line(aes(y=nDollars,color="Dollars"))+ | |
| geom_line(aes(y=nPapers,color="Papers"))+ | |
| scale_colour_manual("", | |
| breaks = c("Grants", "Dollars", "Papers"), | |
| values = c("red", "black", "blue"))+ | |
| ylab("Number (relative to maximum)") + xlab("Year") + ggtitle("Grants, Funding and Publications over Time") | |
| ``` | |
| Details on this and other styles are available at http://stackoverflow.com/questions/10349206/add-legend-to-ggplot2-line-plot | |
| We can also display this information in a different way, using densities as opposed to a 2-d heat map. | |
| ```{r fig.width=7, fig.height=6} | |
| ggplot(data=df.filt)+geom_density(aes(x=YearsSinceGrant,color=factor(GrantStartYear),fill=factor(GrantStartYear)),adjust=3,alpha=0.2) | |
| ``` | |
| This is still lacking. We'd like to split this up somehow to see what's going on more clearly! | |
| ```{r fig.width=7, fig.height=6} | |
| ggplot(data=df.filt)+geom_density(aes(x=YearsSinceGrant,color=factor(GrantStartYear),fill=factor(GrantStartYear)),adjust=3,alpha=0.2) + facet_grid(Administration~.) | |
| ``` | |