Last active
December 24, 2015 06:29
-
-
Save geoffwoollard/6757752 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
stat545a-2013-hw04_woollard-geo | |
================================ | |
--- | |
```{r, echo=FALSE} | |
library(plyr) | |
library(xtable) | |
library(lattice) | |
#install.packages("reshape", dependencies=TRUE) | |
library(reshape) | |
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt" | |
iDat <- read.table(gdURL, header = TRUE, sep = '\t', quote = "\"") #initial data, with Oceania | |
removeContinent <- "Oceania" | |
gDat <- droplevels(subset(iDat, continent != removeContinent)) | |
``` | |
Load the data and get rid of the continent `r removeContinent` since it has only `r length(unique(iDat[with(data = iDat, continent == removeContinent),]$country)) ` countries | |
# Crique of another student's code | |
--- | |
This code comes from homework 3: gao-wen [source](https://gist.github.com/sibyl229/6668047#file-stat545a-2013-hw03_gao-wen-rmd) | [report](http://rpubs.com/less/stat545a-2013-hw03_gao-wen) | |
#### print_table | |
She first made a generalized function called `print_table`, this makes the code compact, although one improvement would be to include an option to round certain columns as desired. | |
```{r} | |
print_table = function(df) { | |
print(xtable(df), type = "html", include.rownames = FALSE) | |
} | |
``` | |
### Change in life expectancy | |
--- | |
She sometimes makes the code uber compact (2 lines!). | |
* Pro: avoid place holder names like `foo` for objects (like data.frames) that are not going to be used later | |
* Con: can make the line really long and hard to read, at least on rpubs which doesn't save the whitespace formatting of the source | |
In fact the original source is acttually much more readable than the rpubs version | |
#### original source | |
```{r, eval=FALSE, tidy=FALSE} | |
trimFrac <- 0.15 | |
print_table( | |
ddply(gDat, ~year, summarize, | |
trimmedMean = mean(lifeExp, trim=trimFrac))) | |
``` | |
#### rpubs | |
```{r, eval=FALSE, tidy=FALSE} | |
print_table(ddply(gDat, ~year, summarize, trimmedMean = mean(lifeExp, trim = trimFrac))) | |
``` | |
#### Plot the time course of median world life expenctancy | |
Ploting this data shows that it is increasing, but at a slower and slower rate | |
```{r fig.width = 3, fig.height=3} | |
trimFrac <- 0.15 | |
df <- ddply(gDat, ~year, summarize, meanLifeExp = mean(lifeExp, trim = trimFrac)) | |
xyplot(meanLifeExp ~ year, df, type=c("a","p")) | |
``` | |
### Rich and Poor Continents | |
--- | |
These questions use just a subset of the data from year 2007. Why the year 2007? This is the final year for each country. Perhaps it would be better to have a year variable set to 2007 instead of hardcoding it in, since it it referenced in multiple places. If there was just one variable controling it then it could be changed easily by just modifying it in one spot. | |
```{r, tidy=FALSE} | |
gdp_stat <- function(gDatCntnt) { | |
minGDP <- min(gDatCntnt$gdpPercap) | |
maxGDP <- max(gDatCntnt$gdpPercap) | |
return(data.frame(stat.type = c("min", "max"), GDP.per.cap = c(minGDP, maxGDP))) | |
} | |
gdpByContinentTall <- ddply(subset(gDat, year == 2007), ~continent, gdp_stat) | |
gdpByContinentTall <- arrange(gdpByContinentTall, GDP.per.cap) | |
``` | |
> One quick fix is to get the year from the data itself and then replace all references to `2007` with `finalYear` | |
```{r} | |
(finalYear = max(gDat$year)) | |
``` | |
#### Plot the max and min life expectancy in each continent | |
> [there's more than one way to skin a cat](http://en.wiktionary.org/wiki/there's_more_than_one_way_to_skin_a_cat) | |
The above results tell me that something of the ordering between continents (omitting Oceania): | |
* MIN: Africa < Asia < Americas < Europe | |
* MAX: Africa < Americas < Asia < Europe | |
<br> But this is just for the year 2007... what's going on in other years. Since there is only a few continents, we can use panel plots. | |
We tall table has `max` or `min` in one column and the value in another. This makes it very easy to plot on the same graph since we can set `groups = stat.type`. | |
```{r fig.width = 10, tidy=FALSE} | |
gdpByContYear <- ddply(gDat, ~continent + year, gdp_stat) | |
xyplot(GDP.per.cap ~ year | continent, | |
gdpByContYear, | |
groups = stat.type, auto.key=TRUE, type=c("a","p") | |
) | |
``` | |
> There is not as much information as a box plot or violin plot, but the presentation is perhaps cleaner. But what are we really interested in? We want to compare the different continents, right? I don't really like looking at different panels back and forth, so we can swap the conditioning argument (`|`) and `groups`: ` | A, groups = B` --> ` | B, groups = A`. | |
```{r fig.width = 10, tidy=FALSE} | |
xyplot(GDP.per.cap ~ year | stat.type, data = gdpByContYear, | |
groups = continent, auto.key=TRUE, type=c("a","p"), scale="free" | |
) | |
``` | |
> note the `scale="free"` argument , which automatically sets the scales on the panels. | |
### Interesting deviation from predicted life expectancy | |
--- | |
We can find "oputlying" countries by finding countries that have a large residual. The approach is: | |
* build function that gets the max residuals | |
** note that it finds the absolute value of the residual, which can be positive or negative depending on weather it is respectively above or below the fit. | |
* feed it into `ddply` | |
* order the output by the residuals. | |
The function `max_residual` is a function that we feed into `ddply` without the `summarize` function. It is customized to give back what we want, in this case the maximum residual and the year. | |
```{r, comment=NA, results='asis', tidy=FALSE} | |
max_residual <- function(cDat){ | |
clm <- lm(lifeExp ~ I(year-min(cDat$year)), data=cDat) | |
largestResidual <- abs(clm$residuals) | |
maxRsdl <- max(largestResidual) | |
return(c(Max.Residual=maxRsdl, | |
Year=cDat[which.max(largestResidual),'year'])) | |
} | |
interesting <- ddply(gDat, ~country, max_residual) | |
interesting <- arrange(interesting, abs(Max.Residual), decreasing=T) | |
print_table(head(interesting)) | |
``` | |
> At the end we checked the output with head. Here we see an issue with the `print_table` function -- it has `.00` after each year, which are always integers. What we need instead is to use the `digits` command in `xtable`. This could be packaged up into `print_table` easily since it works well with data.frames (there aren't issues with rounding factors). | |
#### My partial solution | |
```{r, comment=NA, results='asis'} | |
numCountries <- 11# number of interesting countries | |
rounded <- xtable(head(interesting,numCountries), digits=c(0,0,2,0)) # just use 0 for row number and country | |
print(rounded, type = "html", include.rownames = FALSE) | |
``` | |
#### Plot the top `r numCountries` interesting countries | |
We used the residuals to find countries and their associalted years. We can see the big residuals if we plot the lifeExp vs year with a line -- just look for the data points that are far from the line. | |
```{r, tidy=FALSE} | |
interesting <- arrange(interesting, abs(Max.Residual), decreasing = T) | |
xyplot(lifeExp ~ year | country, gDat, | |
subset = country %in% head(interesting,numCountries)$country, | |
type = c("p", "r")) | |
``` | |
As we can wee with Iraq, the largest residual is in `r subset(interesting, subset = country == "Iraq")$Year`, but this isn't some dramatic year, but the year before a change in trend. There are so many wars in Iraq, that it is difficult to relate historical events to the data: | |
* 1980-1988 [Iran-Iraq War](http://en.wikipedia.org/wiki/Iran%E2%80%93Iraq_War) | |
* 1990-1991 [Gulf War](http://en.wikipedia.org/wiki/Persian_Gulf_War) | |
* 2003 - present [Iraq War](http://en.wikipedia.org/wiki/Iraq_War) | |
# Choose your own adventure | |
--- | |
This answers the question **Find countries with sudden, substantial departures from the temporal trend in one of the quantitative measures**. | |
Why does the life expencancy decrease so dramatically? Two stricking examples were Cambodia and Rwanda, which had clear conflict times in their history (see [Khmer Rouge rule of Cambodia](http://en.wikipedia.org/wiki/Khmer_Rouge_rule_of_Cambodia) in the 1970s and the [Rwandan Genocide](http://en.wikipedia.org/wiki/Rwandan_Genocide) in the 1990s). But the life expectancy doesn't just reflect a over loss of life, it reflects changes in the average age of the population. Roiughly speaking the life expectancy changes when on the average more young people are dying than old people, or vice-versa. So I decided to look into the relation between population and life expencancy. Obviously I need to somehow normalize, since population is on the oder of 10^6, while life expencancy tends to fluctuate in the range 20-100. | |
```{r, tidy=FALSE} | |
wide <- ddply(subset(gDat, subset = country %in% head(interesting,numCountries)$country), | |
~country, summarize, | |
normPop=scale(pop, center=FALSE), | |
normLifeExp=scale(lifeExp, center=FALSE), | |
year=year | |
) | |
tall <- melt(wide, id=c('country','year')) | |
xyplot(data=tall, | |
value ~ year | country, | |
group = variable, auto.key=TRUE | |
) | |
``` | |
I think the take home message from here is that life exectancy is a much more sensative variable than overall population in visualizing major shifts in the population. There are only a few cases where the overall population is decreases over a 5 year interval (Lesotho, Rwanda, Cambodia), and at these time points the life expectancy has a more defined negative trend. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment