library(readr)
library(rvest)
library(animation)
library(stringr)
library(lubridate)
library(hms)
library(geofacet)
library(tidyverse)
---
title: "Scraping and visualizing How I Met Your Mother"
author: "Jorge Cimentada"
date: "7/10/2017"
output: html_document
---

How I Met Your Mother (HIMYM from here on) is a television series very similar to the classic 'Friends' series from the 90s. Following the release of the 'tidy text' book, I was looking for a project in which I could apply some of these skills. I decided to scrape all the transcripts from HIMYM and analyze patterns between characters. This post really pushed me to my limit in terms of web scraping and pattern matching, which was exactly what I wanted to improve in the first place. Let's begin!

My first task was to check whether there was any consistency in the URLs that store the transcripts. If you've ever watched HIMYM, you know there are about nine seasons, each with around 22 episodes. That makes roughly 200 episodes, give or take. It would be a big pain in the ass to manually write down 200 complicated URLs. Luckily, there is a way of finding the 200 links without writing
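The rest of that sentence is cut off here. As a hedged sketch (not the author's actual code), this is roughly how the episode links could be gathered with rvest; the index URL and the link pattern below are assumptions for illustration only.

library(rvest)
library(stringr)

# Hypothetical index page listing every HIMYM transcript (assumed URL)
index_url <- "http://transcripts.foreverdreaming.org/viewforum.php?f=177"

# Read the index page, pull every link, and keep the ones that look like
# episode pages. The "viewtopic" filter is a guess at the link pattern.
episode_links <-
  read_html(index_url) %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("viewtopic") %>%
  unique()

head(episode_links)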
# Functions
# These two functions could probably be combined into one: they are identical except that
# the first manipulates x in the "if" branch and y in the "else" branch, while the
# second does the opposite.

# First polygon
shrink_fun <- function(x, shrink, x_value = TRUE) {
  if (x_value) {
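The body of shrink_fun is cut off above, so its exact behaviour is unknown. As a hedged illustration of the idea the comment describes (one function handling both the x and the y case), here is a small self-contained stand-in that pulls one set of polygon coordinates toward their centroid by a factor shrink; it is a guess at the intent, not the original implementation.

# Hypothetical combined version: shrink either the x or the y coordinates toward
# their mean, depending on x_value, mirroring the "if"/"else" split described above.
shrink_toward_center <- function(x, y, shrink, x_value = TRUE) {
  if (x_value) {
    x <- mean(x) + (x - mean(x)) * shrink   # shrink the x coordinates
  } else {
    y <- mean(y) + (y - mean(y)) * shrink   # shrink the y coordinates
  }
  list(x = x, y = y)
}

# Example: pull a unit square's x coordinates halfway toward their centre
shrink_toward_center(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1), shrink = 0.5)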
"State","StateName","Year","Sex","Age","Cause","CauseName","Gap" | |
"1","Aguascalientes","1990","m",15,"g1","Amenable to medical service",0.000675098052688838 | |
"1","Aguascalientes","1990","m",16,"g1","Amenable to medical service",0.000667806966070827 | |
"1","Aguascalientes","1990","m",17,"g1","Amenable to medical service",0.00100441688809383 | |
"1","Aguascalientes","1990","m",18,"g1","Amenable to medical service",0.00136100033554243 | |
"1","Aguascalientes","1990","m",19,"g1","Amenable to medical service",0.00149372003684789 | |
"1","Aguascalientes","1990","m",20,"g1","Amenable to medical service",0.0013343531602672 | |
"1","Aguascalientes","1990","m",21,"g1","Amenable to medical service",0.000926874268287747 | |
"1","Aguascalientes","1990","m",22,"g1","Amenable to medical service",0.000352160113791911 | |
"1","Aguascalientes","1990","m",23,"g1","Amenable to medical service",0 |
library(tidyverse)
library(animation)

url <- "https://gist.githubusercontent.com/cimentadaj/a2226ca503031140caecb7add0670d81/raw/7f09b9f457e67f13acda2305b9ae391d277070a4/mexico_mortality.csv"
data <- read_csv(url)

other_new_data <-
  data %>%
  mutate(cause_recode = dplyr::recode(CauseName,
                                      'Road traffic' = 'Road traffic + Suicide',
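The recode() call is cut off above. A hedged guess at how it might continue, given that the new label merges road traffic with suicide, is shown below; the 'Suicide' category name is an assumption, and categories not listed are left unchanged by dplyr::recode().

# Hedged completion of the truncated pipeline; 'Suicide' as the second merged
# category is an assumption based on the new label above.
other_new_data <-
  data %>%
  mutate(cause_recode = dplyr::recode(CauseName,
                                      'Road traffic' = 'Road traffic + Suicide',
                                      'Suicide' = 'Road traffic + Suicide'))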
# (c) Repeat this simulation, but instead fit the model using t errors (see Exercise 6.6).
# The only change here is defining error1 with t-distributed errors instead of normal errors.

coefs <- array(NA, c(3, 1000))
se <- array(NA, c(3, 1000))

for (i in 1:ncol(coefs)) {
  x1 <- 1:100
  x2 <- rbinom(100, 1, 0.5)
  error1 <- rt(100, df = 4) * sqrt(5 * (4 - 2) / 4) + 0   # t-distributed errors
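Since Var(t_df) = df / (df - 2), a t4 draw has variance 2, and multiplying by sqrt(5 * (4 - 2) / 4) = sqrt(2.5) rescales the errors to variance 5. The rest of the loop is cut off above; a self-contained, hedged sketch of what the full part (c) loop might look like follows, with the true coefficients c(3, 0.1, 0.5) taken as an assumption about the exercise setup.

# Hedged sketch of the complete part (c) simulation; the true coefficients are assumed.
coefs <- array(NA, c(3, 1000))
se <- array(NA, c(3, 1000))

for (i in 1:ncol(coefs)) {
  x1 <- 1:100
  x2 <- rbinom(100, 1, 0.5)
  error1 <- rt(100, df = 4) * sqrt(5 * (4 - 2) / 4)       # t4 errors rescaled to variance 5
  y <- 3 + 0.1 * x1 + 0.5 * x2 + error1                   # assumed true model
  fit <- lm(y ~ x1 + x2)
  coefs[, i] <- coef(fit)                                 # store the three estimates
  se[, i] <- summary(fit)$coefficients[, "Std. Error"]    # and their standard errors
}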
# (b) Put the above step in a loop and repeat 1000 times. Calculate the
# confidence coverage for the 68% intervals for each of the three
# coefficients in the model.

coefs <- array(NA, c(3, 1000))
se <- array(NA, c(3, 1000))

# Naturally, these estimates will be different for anyone who runs this code
for (i in 1:ncol(coefs)) {
  x1 <- 1:100
  x2 <- rbinom(100, 1, 0.5)
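The loop body is truncated above. Once coefs and se are filled in, the 68% coverage the comment asks for can be computed along these lines; the true coefficient values c(3, 0.1, 0.5) are an assumption about the exercise setup.

# Hedged sketch: proportion of simulations in which each estimate +/- 1 SE interval
# covers the (assumed) true coefficient value.
true_coefs <- c(3, 0.1, 0.5)
coverage_68 <- rowMeans(coefs - se <= true_coefs & true_coefs <= coefs + se)
coverage_68   # each entry should be near 0.68 if the intervals are well calibrated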
# (a) Simulate data from this model. For simplicity, suppose the values of x1 are simply the integers
# from 1 to 100, and that the values of x2 are random and equally likely to be 0 or 1. Fit a linear
# regression (with normal errors) to these data and see if the 68% confidence intervals for the
# regression coefficients (for each, the estimates ±1 standard error) cover the true values.

library(arm)
library(broom)
library(hett)

set.seed(2131)
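The remainder of the part (a) code is cut off. A hedged sketch of one way to carry it out follows; the true coefficients c(3, 0.1, 0.5) and the error scale are assumptions about the exercise setup (the scale is chosen to match the variance-5 rescaling used in the t-error snippet earlier in this file).

# Hedged sketch of part (a): one simulated dataset, a normal-errors fit, and a check
# of whether each estimate +/- 1 SE interval covers the (assumed) true coefficient.
true_coefs <- c(3, 0.1, 0.5)

x1 <- 1:100
x2 <- rbinom(100, 1, 0.5)
y <- true_coefs[1] + true_coefs[2] * x1 + true_coefs[3] * x2 + rnorm(100, 0, sqrt(5))

fit <- lm(y ~ x1 + x2)
est <- coef(fit)
std_err <- summary(fit)$coefficients[, "Std. Error"]

# TRUE where the 68% interval covers the true value
est - std_err <= true_coefs & true_coefs <= est + std_err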
# Posterior predictive checking: continuing the previous exercise, use the fitted
# model from Exercise 12.2(b) to simulate a new dataset of CD4 percentages
# (with the same sample size and ages of the original dataset) for the final time
# point of the study, and record the average CD4 percentage in this sample.
# Repeat this process 1000 times and compare the simulated distribution to the
# observed CD4 percentage at the final time point for the actual data.

# Make the data similar to the model in mod2
finaltime_data <- subset(cd4, !is.na(treatmnt) & !is.na(baseage))
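The rest of this snippet is cut off. Below is a hedged sketch of how the 1000 replications could be carried out, assuming mod2 is the lmer fit from Exercise 12.2(b), that finaltime_data holds the final-time-point observations, that the outcome column is named CD4PCT, and that posterior uncertainty in the coefficients is ignored for simplicity (a cruder check than drawing parameter values with arm::sim). If mod2 was fit on a transformed outcome, the comparison should be made on that same scale.

# Hedged sketch, not the original code: simulate replicated outcomes from the fitted
# model and compare their averages with the observed final-time-point mean.
n_sims <- 1000

avg_cd4_rep <- replicate(n_sims, {
  y_rep <- predict(mod2, newdata = finaltime_data) +     # model's fitted values
    rnorm(nrow(finaltime_data), 0, sigma(mod2))          # plus residual noise
  mean(y_rep)
})

hist(avg_cd4_rep, main = "Replicated average CD4", xlab = "Average CD4")
abline(v = mean(finaltime_data$CD4PCT, na.rm = TRUE), col = "red", lwd = 2)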