Jorge Cimentada (cimentadaj)

library(readr)
library(rvest)
library(animation)
library(stringr)
library(lubridate)
library(hms)
library(geofacet)
library(tidyverse)
---
title: "Scraping and visualizing How I Met Your Mother"
author: "Jorge Cimentada"
date: "7/10/2017"
output: html_document
---
How I Met Your Mother (HIMYM hereafter) is a television series very similar to the classic 'Friends' series from the 90s. Following the release of the 'tidy text' book, I was looking for a project where I could apply some of these skills. I decided I would scrape all the transcripts from HIMYM and analyze patterns between characters. This post really pushed me to my limit in terms of web scraping and pattern matching, which is exactly what I wanted to improve in the first place. Let's begin!
My first task was to check whether there was any consistency in the URLs that store the transcripts. If you've ever watched HIMYM, you know there are nine seasons, each with about 22 episodes. That makes about 200 episodes, give or take. It would be a big pain in the ass to manually write down 200 complicated URLs. Luckily, there is a way of finding the 200 links without writing them all by hand.
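To give an idea of the approach, here is an illustrative sketch (not the post's actual code): read the page that lists all the episode transcripts and pull every link out of it with rvest. The index URL and the 'season' pattern below are placeholders.
library(rvest)

index_url <- "https://example.com/himym-transcripts"  # placeholder for the real index page
index_page <- read_html(index_url)

episode_links <-
  index_page %>%
  html_nodes("a") %>%              # every anchor tag on the index page
  html_attr("href") %>%            # keep only the link targets
  grep("season", ., value = TRUE)  # keep the ones that look like episode pages

length(episode_links)  # should be roughly 200, one per episode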
# Functions
# These two functions could probably be combined into one: they are identical except that
# the first manipulates x in the "if" branch and y in the "else" branch, while the second
# does the opposite.
# First polygon
shrink_fun <- function(x, shrink, x_value = TRUE) {
  if (x_value) {
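# Following up on the comment above, a minimal sketch of how the two functions might be
# merged into one, assuming each of them pulls one coordinate of a polygon towards its
# centre by a shrink factor; the helper and the shrinking rule are assumptions, not the
# gist's actual code.
shrink_coord <- function(coord, shrink) {
  centre <- mean(coord)
  centre + (coord - centre) * shrink  # pull the coordinate towards its centre
}

shrink_both <- function(x, y, shrink, x_value = TRUE) {
  # x_value picks which coordinate gets shrunk, replacing the two near-identical functions
  if (x_value) {
    list(x = shrink_coord(x, shrink), y = y)
  } else {
    list(x = x, y = shrink_coord(y, shrink))
  }
}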
"State","StateName","Year","Sex","Age","Cause","CauseName","Gap"
"1","Aguascalientes","1990","m",15,"g1","Amenable to medical service",0.000675098052688838
"1","Aguascalientes","1990","m",16,"g1","Amenable to medical service",0.000667806966070827
"1","Aguascalientes","1990","m",17,"g1","Amenable to medical service",0.00100441688809383
"1","Aguascalientes","1990","m",18,"g1","Amenable to medical service",0.00136100033554243
"1","Aguascalientes","1990","m",19,"g1","Amenable to medical service",0.00149372003684789
"1","Aguascalientes","1990","m",20,"g1","Amenable to medical service",0.0013343531602672
"1","Aguascalientes","1990","m",21,"g1","Amenable to medical service",0.000926874268287747
"1","Aguascalientes","1990","m",22,"g1","Amenable to medical service",0.000352160113791911
"1","Aguascalientes","1990","m",23,"g1","Amenable to medical service",0
library(tidyverse)
library(animation)

# Read the Mexico mortality data directly from the gist
url <- "https://gist.githubusercontent.com/cimentadaj/a2226ca503031140caecb7add0670d81/raw/7f09b9f457e67f13acda2305b9ae391d277070a4/mexico_mortality.csv"
data <- read_csv(url)

# Recode the cause-of-death labels, grouping related causes together
other_new_data <-
  data %>%
  mutate(cause_recode = dplyr::recode(CauseName,
                                      'Road traffic' = 'Road traffic + Suicide',
# (c) Repeat this simulation, but instead fit the model using t errors (see Exercise 6.6).
# The only change here is that error1 is drawn from a t distribution instead of a normal one
coefs <- array(NA, c(3, 1000))
se <- array(NA, c(3, 1000))
for (i in 1:ncol(coefs)) {
  x1 <- 1:100
  x2 <- rbinom(100, 1, 0.5)
  error1 <- rt(100, df = 4) * sqrt(5 * (4 - 2) / 4) + 0 # t errors with 4 df, rescaled so the error sd is sqrt(5)
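# As an aside on the rescaling above: a t variable with 4 degrees of freedom has variance
# 4 / (4 - 2) = 2, so multiplying by sqrt(5 * (4 - 2) / 4) = sqrt(2.5) yields errors with
# variance 5, i.e. an error sd of sqrt(5). A quick numerical check (not part of the exercise):
sd(rt(1e6, df = 4) * sqrt(5 * (4 - 2) / 4))  # should be close to sqrt(5), about 2.236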
# (b) Put the above step in a loop and repeat 1000 times. Calculate the
# confidence coverage for the 68% intervals for each of the three
# coefficients in the model.
coefs <- array(NA, c(3, 1000))
se <- array(NA, c(3, 1000))
# Naturally, these estimates will be different for anyone who runs this code
for (i in 1:ncol(coefs)) {
  x1 <- 1:100
  x2 <- rbinom(100, 1, 0.5)
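# Assuming the (truncated) loop goes on to simulate y, fit lm(y ~ x1 + x2), and store the
# estimates and standard errors column by column in coefs and se, and that true_coefs holds
# the three coefficients used to simulate y (a placeholder name, not from the gist), the
# coverage check at the end might look like this:
coverage <- rowMeans(abs(coefs - true_coefs) < se)  # share of intervals (estimate +/- 1 se) covering the truth
coverage  # each of the three entries should be close to 0.68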
# (a) Simulate data from this model. For simplicity, suppose the values of x1 are simply the integers
# from 1 to 100, and that the values of x2 are random and equally likely to be 0 or 1. Fit a linear
# regression (with normal errors) to these data and see if the 68% confidence intervals for the
# regression coefficients (for each, the estimates ±1 standard error) cover the true values.
library(arm)
library(broom)
library(hett)
set.seed(2131)
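# A minimal sketch of what part (a) might look like; the true coefficients and error sd
# below are made-up placeholders, not necessarily the values used in the gist.
b <- c(1, 2, 3)   # hypothetical true intercept and slopes
sigma <- sqrt(5)  # hypothetical true error sd

x1 <- 1:100
x2 <- rbinom(100, 1, 0.5)
y <- b[1] + b[2] * x1 + b[3] * x2 + rnorm(100, 0, sigma)

fit <- lm(y ~ x1 + x2)

# Does each 68% interval (estimate +/- 1 standard error) cover the true value?
abs(coef(fit) - b) < se.coef(fit)  # se.coef() comes from the arm package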
# Posterior predictive checking: continuing the previous exercise, use the fitted
# model from Exercise 12.2(b) to simulate a new dataset of CD4 percentages
# (with the same sample size and ages of the original dataset) for the final time
# point of the study, and record the average CD4 percentage in this sample.
# Repeat this process 1000 times and compare the simulated distribution to the
# observed CD4 percentage at the final time point for the actual data.
# Restrict the data to the same cases used to fit mod2 (non-missing treatmnt and baseage)
finaltime_data <- subset(cd4, !is.na(treatmnt) & !is.na(baseage))
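# A minimal sketch of the check described above, assuming mod2 is an lmer fit from
# Exercise 12.2(b) (something like y ~ time + treatmnt + baseage + (1 | newpid)) and that
# finaltime_data ends up holding one row per child at the final time point (the gist is
# truncated, so that step is assumed); the outcome column 'y' is also a placeholder name.
library(lme4)

n_sims <- 1000
y_hat <- predict(mod2, newdata = finaltime_data, allow.new.levels = TRUE)  # point prediction per row

# Draw a new outcome for every row, n_sims times, keeping the average CD4 percentage each time
sim_avg <- replicate(n_sims, mean(rnorm(length(y_hat), y_hat, sigma(mod2))))

# Compare the simulated averages with the observed average at the final time point
hist(sim_avg, main = "Simulated average CD4 percentage")
abline(v = mean(finaltime_data$y, na.rm = TRUE), col = "red", lwd = 2)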