Created
June 22, 2016 01:26
library(shiny) | |
ui <- fluidPage( | |
selectInput('xcol', 'X Variable', names(iris)), | |
selectInput('ycol', 'Y Variable', names(iris)), | |
numericInput('clusters', 'Cluster count', 3, min = 1, max = 9), | |
plotOutput('plot1') | |
) | |
server <- function(input, output) { | |
# Combine the selected variables into a new data frame | |
selectedData <- reactive({ | |
iris[, c(input$xcol, input$ycol)] | |
}) | |
clusters <- reactive({ | |
kmeans(selectedData(), input$clusters) | |
}) | |
output$plot1 <- renderPlot({ | |
plot(selectedData(), col = clusters()$cluster,pch = 20, cex = 3) | |
points(clusters()$centers, pch = 4, cex = 4, lwd = 4) | |
}) | |
} | |
shinyApp(ui=ui, server=server) |
--- | |
title: intro_to_R | |
output: html_document | |
--- | |
## What is R? | |
R is a functional programming language for statistical computing. R has been widely adopted as the language of choice for statisticians and data scientists, and is used heavily in both academia and industry. | |
R is designed to be run interactively (you'll execute one line at a time) as opposed to a compiled language like C / C++ / Java. | |
```{r} | |
print("Hello World!") | |
``` | |
## Operations & Functions | |
R uses the standard mathematical operators `+,-,/,*`. | |
R's assignment operator is `<-`.
```{r} | |
x <- 1 | |
y <- x+1 | |
y | |
``` | |
R uses the following syntax to define functions: | |
```{r, eval=FALSE} | |
myFunc <- function(arguments){ | |
# this is a comment | |
result <- arguments | |
result # the function returns the value of the last expression, but functions can have side effects due to R's scoping rules!
} | |
# call the function like this | |
myResult <- myFunc(arguments) | |
``` | |
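The comment about side effects can be made concrete with a small sketch. The names `counter`, `addOneLocal`, and `addOneGlobal` are made up for illustration. Inside a function, `<-` creates a local variable, while `<<-` assigns to a variable in an enclosing environment (here, the global one):

```{r}
counter <- 0

# No side effect: `<-` inside the function creates a local `counter`,
# so the global `counter` is left untouched.
addOneLocal <- function() {
  counter <- counter + 1
  counter
}

# Side effect: `<<-` assigns to `counter` in the enclosing environment.
addOneGlobal <- function() {
  counter <<- counter + 1
  counter
}

localResult <- addOneLocal()      # the function returns 1
counterAfterLocal <- counter      # the global counter is still 0

globalResult <- addOneGlobal()    # the function returns 1
counterAfterGlobal <- counter     # the global counter is now 1
```

Functions that avoid `<<-` are usually easier to reason about, because their result depends only on their arguments.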
### Exercise 1: Let's create a function called `computeArea` that, given the radius of a circle, returns the area. The built-in constant for pi in R is `pi`.
Hint: RStudio supplies a number of shortcuts, one is for creating functions. Type `fun` and hit `tab`. | |
```{r} | |
# Exercise 1: Create a function that returns the area of a circle | |
``` | |
```{r} | |
# Solution | |
computeArea <- function(r) { | |
pi*r^2 | |
} | |
``` | |
## Packages | |
The base language R is extended by packages. Packages are primarily composed of functions and datasets. To install a package, either use the RStudio IDE Packages pane, or use the R command `install.packages("package_name")`. | |
When you install a package, R may ask you to select a CRAN mirror (repo); any choice will work.
To use a function or dataset in your code that is contained in a package, you have to load the package, `library(package_name)`. | |
Install and load the following packages: `dplyr`, `ggplot2`, and `devtools`. This is going to take a few minutes ... | |
```{r} | |
install.packages("ggplot2") | |
install.packages("dplyr") | |
install.packages("devtools")
library(ggplot2) | |
library(dplyr) | |
library(devtools) | |
``` | |
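If you re-run the workshop later, you may not want to reinstall everything. Here is a minimal sketch of one way to check first; `missingPackages` is a hypothetical helper written for this example, not part of any package. It also shows the `package::function` form, which calls a function without loading the whole package:

```{r}
# Return the subset of `pkgs` that is not installed yet.
# requireNamespace() returns TRUE when the package can be loaded.
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

toInstall <- missingPackages(c("ggplot2", "dplyr", "devtools"))
toInstall
# install.packages(toInstall)  # uncomment to install whatever is missing

# Call a single function without library(), using `::`
stats::median(c(1, 5, 9))
```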
Now we're going to grab some data. To do this we'll start by installing a package from GitHub using the devtools package we just installed and loaded.
```{r} | |
devtools::install_github("slopp/dataWorkshop2016") | |
library(dataWorkshop2016) | |
data("BoulderDaily") | |
B <- BoulderDaily | |
``` | |
## Data Types and Structures in R | |
Before we can explore our data, we need to understand data types and data structures in R. R supports many data types, including numeric, logical, character, integer, complex, and raw, as well as a variety of date formats. R also has a number of data structures, including vectors, matrices, lists, and data frames.
We can look at the structure of an R object using the R command `str`. We can view a dataset using a number of commands including: `head`, `tail`, `dim`, `nrow`, `ncol`, and `summary`. | |
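To make the types and structures above concrete, here is a small sketch on made-up values (none of these objects come from the workshop dataset; the station name and elevation are invented):

```{r}
# Atomic vectors: every element has the same type
temps    <- c(31.5, 28, 40.2)        # numeric
dayIndex <- c(1L, 2L, 3L)            # integer (note the L suffix)
snowed   <- c(TRUE, TRUE, FALSE)     # logical
dayNames <- c("Mon", "Tue", "Wed")   # character
class(temps); class(dayIndex); class(snowed); class(dayNames)

# A matrix is a vector with two dimensions
m <- matrix(1:6, nrow = 2)
dim(m)

# A list can mix types and lengths
station <- list(name = "Example station", elevation = 1650, temps = temps)
str(station)

# A data frame is a list of equal-length columns
toyTypes <- data.frame(day = dayIndex, temp = temps, snowed = snowed)
str(toyTypes)
summary(toyTypes$temp)
```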
### Exercise 2: Start to get familiar with the Boulder Daily Temperature dataset we just installed and called `B`. | |
Hint: To get help with a function, you can use the Help pane in RStudio, or one of: | |
```{r} | |
help(summary) | |
?str | |
``` | |
Hint: Use tab complete to quickly find functions and their arguments. | |
```{r} | |
# Exercise 2: Get familiar with the data types and data structures in our data set. | |
# What data type is B? | |
# What is the coldest temperature in the dataset? | |
# What is the hottest temperature in the dataset?
# How many days are missing precipitation data? | |
``` | |
Hint: You can interactively explore a data set in RStudio by either clicking on the dataset in the Environment pane, or running the `View` command, i.e. `View(B)`.
```{r} | |
#Solution: | |
str(B) # B is a data frame with 7 columns, 5 int and 2 num | |
View(B) # try using the filter command | |
head(B); tail(B) | |
summary(B) | |
# What is the coldest temperature in the dataset? | |
# the Min in tmin is -33 | |
# What is the hottest temperature in the dataset?
# the Max in tmax is 104 | |
# How many days are missing precipitation data? | |
# the NA's in precip is 4584 | |
``` | |
## Break/ Let's explore RStudio a bit... (15 minutes) | |
Each team will report one cool thing they found inside the IDE.
Hint: Tools -> Global Options | |
## Manipulating Data (base R) | |
Now that we've looked at our data, we'll learn how to play with it. You can grab one column of a data frame using the following syntax: `dataframe$column`. For example, let's verify the min/max temperatures we found using the `summary` command: | |
```{r} | |
min.temp <- B$tmin | |
min(min.temp, na.rm = TRUE) | |
max(B$tmax, na.rm = TRUE) | |
``` | |
R is a vectorized language! Let's create a new column that is the average temperature for each day in our dataset. | |
```{r} | |
B$tavg <- (B$tmax + B$tmin)/2
summary(B$tavg) | |
``` | |
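"Vectorized" means that arithmetic and comparisons apply to whole vectors at once, element by element. A small sketch on made-up numbers (the vectors `highs` and `lows` are invented for illustration), compared with the equivalent explicit loop:

```{r}
highs <- c(50, 62, 71)
lows  <- c(30, 40, 45)

# Vectorized: one expression handles every element
midVec <- (highs + lows) / 2    # 40 51 58
highs > 60                      # FALSE TRUE TRUE

# The same calculation written as a loop
midLoop <- numeric(length(highs))
for (i in seq_along(highs)) {
  midLoop[i] <- (highs[i] + lows[i]) / 2
}

identical(midVec, midLoop)      # TRUE
```

The vectorized form is shorter and typically faster, which is why column arithmetic on a data frame needs no loop.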
### Exercise 3: Add a column that contains the temperature differential for each day | |
```{r} | |
# Exercise 3: Add a column that contains the temperature differential for each day | |
``` | |
```{r} | |
#solution | |
B$tdiff <- B$tmax - B$tmin | |
``` | |
We subset data in R using `[row, col]` notation. Omit `row` or `col` if you want all of them. You can reference columns either by number or by label. To get more than one row or column, supply a vector such as `c(1,2,3)` as `row` or `col`, or use the shorthand `start:stop` for a consecutive range. Here are a few examples:
```{r} | |
# get the min temp for the first observation | |
B[1,5] | |
#get all of the columns for the 3rd observation | |
B[3,]
#get the first 3 min temperatures in the dataset | |
B[c(1,2,3), 5] | |
B[1:3,5] | |
B[1:3, "tmin"] | |
# or, we can use the $ operator and then brackets | |
B$tmin[1:3] | |
``` | |
We can dynamically subset using the `which` command and logical operators. For example, we can find the day of the year with the coldest temperature: | |
```{r} | |
coldest.temp <- min(B$tmin, na.rm=TRUE) | |
coldest.row <- which(B$tmin==coldest.temp) | |
coldest.day <- B[coldest.row,] | |
coldest.day | |
# or, all at once | |
B[which(B$tmin==min(B$tmin, na.rm=TRUE)),] | |
``` | |
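Why wrap the comparison in `which`? A sketch on a made-up data frame (`toyTemps` is invented for illustration) shows what happens when the column contains `NA`: the logical comparison yields `NA` for that row, and `which` drops it.

```{r}
toyTemps <- data.frame(day = 1:5, tmin = c(12, -3, NA, -3, 20))

coldest <- min(toyTemps$tmin, na.rm = TRUE)   # -3

toyTemps$tmin == coldest          # FALSE TRUE NA TRUE FALSE
which(toyTemps$tmin == coldest)   # 2 4, the NA is dropped

# Subsetting with which() returns only the matching rows
toyTemps[which(toyTemps$tmin == coldest), ]

# Subsetting with the raw logical vector also returns an all-NA row
toyTemps[toyTemps$tmin == coldest, ]

# which.min() returns the position of the first minimum only
which.min(toyTemps$tmin)          # 2
```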
### Exercise 4: What is the wettest day?
```{r}
# Exercise 4: What is the wettest day?
```
Hint: Use Ctrl+F to open the find-and-replace dialog, replace `tmin` with `precip` in the previous line of code, and then change `min` to `max`.
```{r} | |
#solution | |
B[which(B$precip==max(B$precip, na.rm=TRUE)),] | |
``` | |
## Manipulating Data using dplyr | |
The `dplyr` package provides a cleaner (and often faster) syntax for manipulating data in R. `dplyr` is based on verbs and the pipe operator `%>%`. | |
Some key verbs are: | |
`select` - choose the columns you want
`mutate` - add a column
`filter` - select rows based on a condition | |
`arrange` - sort your data by a column | |
`group_by` and `summarize` - aggregate your data | |
The pipe operator `%>%` lets us string together these verbs.
For example, let's find all the days where it was colder than -20 degrees:
```{r} | |
B %>% | |
filter(tmin < -20) %>% | |
select(year, month, day, tmin) | |
``` | |
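Two of the verbs listed above, `mutate` and `arrange`, can be sketched the same way. The data frame `toyVerbs` is made up for illustration so the result is easy to check by hand:

```{r}
library(dplyr)

toyVerbs <- data.frame(
  day  = 1:4,
  tmax = c(50, 62, 45, 71),
  tmin = c(30, 40, 20, 45)
)

verbResult <- toyVerbs %>%
  mutate(tdiff = tmax - tmin) %>%   # add a column: 20, 22, 25, 26
  arrange(desc(tdiff)) %>%          # sort, largest differential first
  select(day, tdiff)                # keep only the columns we need

verbResult
```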
The pipe operator works with other R functions too. This is how it works: | |
`fun(x,y)` is equivalent to `x %>% fun(y)`. | |
We'll see more examples of this later, but it can help keep our code readable. | |
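A quick sketch of that equivalence with ordinary base R functions (the vector `pipeInput` is made up; `%>%` is available once `dplyr` is loaded):

```{r}
library(dplyr)  # makes the %>% operator available

pipeInput <- c(4, 9, 16)

sqrt(pipeInput)           # 2 3 4
pipeInput %>% sqrt()      # the same call, written with the pipe

round(2.71828, 2)         # 2.72
2.71828 %>% round(2)      # the left-hand side becomes the first argument

# Chained pipes read left to right; this is sum(sqrt(pipeInput))
pipeTotal <- pipeInput %>% sqrt() %>% sum()
pipeTotal                 # 9
```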
Next, let's use the `group_by` command and `summarize` command to find the average temperature by month.
Hint: Don't forget about using `tab` for auto-complete! | |
```{r} | |
B %>% | |
select(month, tavg) %>% | |
group_by(month) %>% | |
summarise(monthly.avg.temp = mean(tavg, na.rm=TRUE)) | |
``` | |
`dplyr` also supports joins! If you're not familiar with joins, don't worry. If you are, try `?left_join`.
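For the curious, here is a minimal sketch of a left join on two made-up tables (`toyReadings` and `toyMonthNames` are invented for illustration). Every row of the left table is kept, and rows without a match get `NA`:

```{r}
library(dplyr)

toyReadings <- data.frame(
  month  = c(1, 1, 2, 4),
  precip = c(0.1, 0, 0.3, 0.2)
)

# A lookup table that has no entry for month 4
toyMonthNames <- data.frame(
  month = c(1, 2, 3),
  name  = c("Jan", "Feb", "Mar"),
  stringsAsFactors = FALSE
)

joined <- left_join(toyReadings, toyMonthNames, by = "month")
joined   # 4 rows; the month 4 row has NA in the name column
```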
### Exercise 5 Create a data frame that contains the average precipitation by month | |
```{r} | |
# Exercise 5 Create a data frame that contains the average precipitation by month | |
``` | |
```{r} | |
#Solution | |
B %>% | |
select(month, precip) %>% | |
group_by(month) %>% | |
summarise(monthly.avg.precip=mean(precip, na.rm=TRUE)) | |
``` | |
## Plotting with base R | |
There are many ways to plot our data in R. We'll start with some of the base R functions.
```{r} | |
hist(B$tavg) | |
``` | |
We can add some complexity by using formulas. The syntax for a formula in R is `y ~ x`. For example, we can create a boxplot of average temperature by month:
```{r} | |
boxplot(tavg ~ month, data=B) | |
``` | |
Or, we could do a scatter plot. For example, we could plot tmin vs tmax. | |
```{r} | |
plot(tmin~tmax, data=B) | |
``` | |
### Exercise 6: Plot the maximum temperature vs the temperature differential
```{r}
# Exercise 6: Plot the maximum temperature vs the temperature differential
``` | |
```{r} | |
#solution | |
plot(tdiff~tmax, data=B) | |
``` | |
There are many ways to improve base graphics in R. For example, you can add titles, change the axis labels, the color of points, the type of points, the size of points, etc. I'll let you explore these options at home. | |
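As a starting point, here is a sketch of a few of those options on a made-up data frame (`toyPlot` and its values are invented for illustration). `main`, `xlab`, `ylab`, `col`, `pch`, and `cex` are standard arguments of base `plot`:

```{r}
toyPlot <- data.frame(
  tmax  = c(35, 48, 60, 72, 85),
  tdiff = c(18, 22, 25, 30, 28)
)

plot(tdiff ~ tmax, data = toyPlot,
     main = "Temperature differential vs. maximum temperature",  # title
     xlab = "Maximum temperature",                               # x-axis label
     ylab = "Temperature differential",                          # y-axis label
     col  = "steelblue",                                         # point colour
     pch  = 19,                                                  # solid circles
     cex  = 1.5)                                                 # point size

# Add a dashed least-squares line on top of the existing plot
abline(lm(tdiff ~ tmax, data = toyPlot), lty = 2)
```

See `?par` and `?points` for the full list of graphical parameters and point types.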
## Plotting with the ggplot2 package | |
Like `dplyr` for data manipulation, `ggplot2` is a powerful package, in this case for creating graphics in R.
Let's recreate the plot we just made:
```{r} | |
ggplot(data=B)+ | |
geom_point(mapping=aes(x=tmax, y=tdiff)) | |
``` | |
First we create a general plot and then we add layers. In this case `geom_point` adds a layer of points. But there are many different layers we can add, for example `geom_histogram` adds a histogram, `geom_line` adds a line, etc. | |
The last step is to map our data to the different aesthetics using the `aes` function. An aesthetic is just something you can see on the plot. For instance, a point is defined by the `x` and `y` location. We can easily add other aesthetics:
```{r} | |
ggplot(data=B)+ | |
geom_point(mapping=aes(x=tmax, y=tdiff, colour=precip)) | |
``` | |
Other aesthetics include size, shape, etc. You can also specify aesthetics by hand: | |
```{r} | |
ggplot(data=B)+ | |
geom_point(mapping=aes(x=tmax, y=tdiff), colour="red") # fixed values go outside aes()
``` | |
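The difference between mapping an aesthetic to a column and setting it to a fixed value is easy to miss, so here is a side-by-side sketch on a made-up data frame (`toyAes` is invented for illustration). Anything inside `aes` is treated as data and gets a legend; a fixed value is passed to the geom outside of `aes`:

```{r}
library(ggplot2)

toyAes <- data.frame(
  tmax   = c(35, 48, 60, 72),
  tdiff  = c(18, 22, 25, 30),
  precip = c(0, 0.2, 0.5, 0.1)
)

# Mapped: colour varies with the precip column and a legend is drawn
pMapped <- ggplot(data = toyAes) +
  geom_point(mapping = aes(x = tmax, y = tdiff, colour = precip))

# Set: every point is red and slightly larger, with no legend
pSet <- ggplot(data = toyAes) +
  geom_point(mapping = aes(x = tmax, y = tdiff), colour = "red", size = 3)

pMapped
pSet
```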
We can also add layers to a plot. First we'll downsample our data so it's not just a blob, and we'll use the pipe operator to connect the filter step with the plot:
```{r} | |
B %>% | |
filter(month==1, year>2000) %>% | |
ggplot()+ | |
geom_point(mapping=aes(x=tmax, y=tdiff))+ | |
geom_smooth(mapping=aes(x=tmax, y=tdiff)) | |
``` | |
The key to `ggplot2` is getting your dataset exactly right first, then plotting! | |
For instance, let's create a barplot of average rainfall by month. We'll use both dplyr and ggplot2, and we'll connect the two using the pipe operator.
```{r} | |
B %>% | |
# first we need to manipulate our data | |
select(month, precip) %>% | |
group_by(month) %>% | |
summarize(avg.monthly.precipitation = mean(precip, na.rm=TRUE)) %>% | |
mutate(month = as.factor(month)) %>% | |
# now we'll add our plot | |
ggplot()+ | |
geom_bar(aes(x=month,y=avg.monthly.precipitation), stat="identity") | |
``` | |
Another example, this time adding a trend line with `geom_smooth`:
```{r}
B %>%
  # use dplyr to summarise the data
  group_by(year) %>%
  summarise(max.max.tmp = max(tmax, na.rm=TRUE)) %>%
  # add the plot
  ggplot(aes(x=year, y=max.max.tmp)) + # if we define our aesthetics here, all the layers will inherit them to save us some typing
  geom_point() +
  geom_smooth()
```
### Exercise 7: Create your own plot with `ggplot2` using at least 2 geoms, one of which we haven't seen before | |
Hint: In RStudio go to Help -> Cheatsheets | |
```{r} | |
# Exercise 7: Create your own plot with `ggplot2` | |
``` | |