Last active
December 24, 2015 20:59
-
-
Save geoffwoollard/6862144 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
stat545a-2013-hw05_woollard-geo | |
================================ | |
--- | |
```{r} | |
library(plyr) | |
library(mgcv) | |
library(lattice) | |
library(ggplot2) | |
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt" | |
iDat <- read.table(gdURL, header = TRUE, sep = '\t', quote = "\"") #initial data, with Oceania | |
removeContinent <- "Oceania" | |
gDat <- droplevels(subset(iDat, continent != removeContinent)) | |
str(gDat) | |
``` | |
Load the data and get rid of the continent `r removeContinent` since it has only `r length(unique(iDat[with(data = iDat, continent == removeContinent),]$country)) ` countries | |
## Lattice vs. ggplot2 | |
--- | |
The following `ddply` and `xyplot` wizardry comes from [Jenny's course notes](http://www.stat.ubc.ca/~jenny/STAT545A/hw04_univariateLattice.html) | |
```{r, tidy=FALSE, fig.width = 5, fig.height=5} | |
# modified from jenny's lattice solution to max and min per continent | |
lifeExpByContinent <- ddply(gDat, ~ continent + year, function(x) { | |
jLevels <- c("min", "max") | |
data.frame(lifeExp = range(x$lifeExp), | |
stat = factor(jLevels, levels = jLevels)) | |
}) | |
head(lifeExpByContinent) | |
# in lattice | |
xyplot(lifeExp ~ year | continent, lifeExpByContinent, | |
group = stat, type = "b", grid = "h", as.table = TRUE, | |
auto.key = list(columns = 2), | |
scales = list(y = list(log = TRUE, equispaced.log = FALSE))) | |
# same thing but with ggplot2 | |
plt <- ggplot(lifeExpByContinent, aes(y=lifeExp, x=year, color=stat)) | |
plt + geom_line() + geom_point() +facet_wrap(~continent) | |
``` | |
## gdpPercap vs. lifeExp | |
--- | |
This density plot uses `geom_hex()`, which I find easy on the eyes. The correlation between lifeExp and gdpPercap -- or better yet log(gdpPercap) -- is clear with a scatter plot. We use a [LOESS curve](http://en.wikipedia.org/wiki/Local_regression) with `geom_smooth(method='loess')`, since this is the option that is chosen with `geom_smooth(method='auto')` with so many data points; all the data makes the spread are a bit hard to see. | |
```{r fig.width = 10, fig.height=5} | |
# density plot | |
plt <- ggplot(gDat, aes(x = gdpPercap, y = lifeExp)) | |
plt + scale_x_log10() + geom_hex() | |
# scatter plot, showing raw data and spread | |
plt <- ggplot(gDat, aes(y=lifeExp,x=gdpPercap, color=continent)) | |
plt + facet_wrap(~continent) + geom_point(shape=1) + geom_smooth(method='loess') + scale_x_log10() | |
``` | |
## [Diabetes data](http://www.kaggle.com/c/pf2012-diabetes/data) from [kaggle.com](http://www.kaggle.com/) | |
Kaggle hosts data analysis competitions. Real world problems are posted, and experts from around the world provide solutins to prediction and optimization challenges. The Featured competitions as of `r today <- Sys.Date() ; format(today, format="%d %B %Y")` include: | |
* **$220 000** : Optimize flight routes based on current weather and traffic (GE) | |
* **$25 000** : Learning to rank hotels to maximize purchases (Expedia) | |
* **A job** : Keyword Extraction (Facebook) | |
```{r fig.width = 10, fig.height=5} | |
# import kaggle diabetes data | |
kDat <- read.table("/Users/geoffrey/Downloads/kaggleDataSets/diabetes_trainingSet/training_SyncTranscript.csv", header=TRUE, sep=",", na.strings=c("NULL", 0,"")) | |
# type of physician | |
kClean <- kDat[(!is.na(kDat$PhysicianSpecialty)),] # remove the NAs | |
problemFactor <- "Developmental – Behavioral Pediatrics" | |
kClean <- kClean[kClean$PhysicianSpecialty != problemFactor,] # one factor level is giving a hard time, after googling around I suspect the encoding of the dash | |
plt <- ggplot(kClean, | |
aes(x=reorder(PhysicianSpecialty,PhysicianSpecialty, length)) | |
) | |
plt + geom_histogram() + scale_y_log10() + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("PhysicianSpecialty") # vertical text | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment