---
title: "Scraping and visualizing How I Met Your Mother"
author: "Jorge Cimentada"
date: "7/10/2017"
output: html_document
---
How I Met Your Mother (HIMYM hereafter) is a television series very similar to the classic 'Friends' from the '90s. Following the release of the 'Tidy Text Mining' book, I was looking for a project in which I could apply some of those skills. I decided to scrape all the transcripts from HIMYM and analyze patterns between characters. This post really took me to the limit in terms of web scraping and pattern matching, which is exactly what I wanted to improve in the first place. Let's begin!
My first task was to figure out whether there was any consistency in the URLs that store the transcripts. If you've ever watched HIMYM, you know there are nine seasons, each with about 22 episodes. That makes roughly 200 episodes, give or take. It would be a big pain in the ass to manually write down 200 complicated URLs. Luckily, there is a way of collecting the 200 links without typing them out by hand.
First, we create the links for the 9 pages that contain all episodes (seasons 1 through 9).
```{r, echo = FALSE}
knitr::opts_chunk$set(message = FALSE,
                      warning = FALSE,
                      error = FALSE)
```
```{r}
library(rvest)
library(tidyverse)
library(stringr)
library(tidytext)

main_url <- "http://transcripts.foreverdreaming.org"
all_pages <- paste0(main_url, "/viewforum.php?f=177&start=", seq(0, 200, 25))
characters <- c("ted", "lily", "marshall", "barney", "narrator", "robin")
```
Each URL in `all_pages` contains all the episodes for that season (so around 22 links each). I also picked the characters we'll concentrate on for now. From here the job is very easy. We create a function that reads each link and parses the section containing all the episode links for that season. We can find that section with [SelectorGadget](http://selectorgadget.com/). We then grab the `href` attribute of every link in that section and finally create a tibble with each episode name together with its link.
```{r}
episode_getter <- function(link) {
  title_reference <-
    link %>%
    read_html() %>%
    html_nodes(".topictitle")

  episode_links <-
    title_reference %>%
    html_attr("href") %>%
    gsub("^.", "", .) %>%
    paste0(main_url, .) %>%
    setNames(title_reference %>% html_text()) %>%
    enframe(name = "episode_name", value = "link")

  episode_links
}

all_episodes <- map_df(all_pages, episode_getter)
all_episodes$id <- 1:nrow(all_episodes)
```
There we go! Now we have a very organized `tibble`.
```{r, eval = F}
all_episodes %>%
  View()
```
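If you'd rather see the structure without opening a viewer (a quick check of my own, not part of the original workflow), `glimpse()` shows the three columns we just built: the episode name, its link and an id.
```{r}
# One row per episode: the forum title, the full transcript link and an id.
all_episodes %>%
  glimpse()
```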
The remaining part is to actually scrape the text of each episode. We can work that out for a single episode, then turn it into a function and apply it to all episodes.
```{r}
episode_fun <- function(file) {
  file %>%
    read_html() %>%
    html_nodes(".postbody") %>%
    html_text() %>%
    str_split("\n|\t") %>%
    .[[1]] %>%
    data_frame(text = .) %>%
    filter(text != "",                      # Drop the many empty strings
           !str_detect(text, "^\\t"),       # Drop lines that are only \t
           !str_detect(text, "^\\[.*\\]$"), # Drop text wrapped in brackets
           !str_detect(text, "^\\(.*\\)$"), # Drop text wrapped in parentheses
           str_detect(text, "^.*:"),        # Keep only lines with dialogue (a colon)
           !str_detect(text, "^ad"))        # Drop lines that start with 'ad' (Google ads)
}
```
The above function reads an episode, turns the HTML text into a data frame and organizes it for text analysis. For example:
```{r}
episode_fun(all_episodes$link[15])
```
We now have a data frame with only the dialogue of each character. We need to apply that function to each episode and `bind` everything together. First, let's apply the function to every episode.
```{r}
all_episodes$text <- map(all_episodes$link, episode_fun)
```
The `text` list-column is now a very organized list with the text of each episode. However, manual inspection of some chapters reveals a small problem that limits our analysis greatly. One of the main interests of this document is to study the relationships and presence of the characters. For that, we need each line of text to be accompanied by the character who said it. Unfortunately, some of these scripts don't have that.
For example, check any episode from season [8](http://transcripts.foreverdreaming.org/viewforum.php?f=177&start=175) or [9](http://transcripts.foreverdreaming.org/viewforum.php?f=177&start=200). Whoever transcribed those episodes skipped the speaker names and simply wrote down the lines. There's nothing we can do about that for now, so we'll exclude these chapters. The same pattern also shows up in random episodes of seasons 4 and 6. We can exclude chapters based on the number of lines we parsed: on average, each episode has about 200 lines of dialogue, so anything significantly lower, say 30 or 50 lines, flags an episode where the speaker names are missing.
```{r}
all_episodes$count <- map_dbl(all_episodes$text, nrow)
```
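Before committing to a cutoff it's worth eyeballing the distribution of parsed lines. The histogram below is my own sanity check (not part of the original post); if the pattern described above holds, the episodes with missing speaker names should pile up well below the ~200-line average.
```{r}
# Distribution of parsed dialogue lines per episode. Episodes whose
# transcripts lack "Character:" prefixes should show up with very few lines.
all_episodes %>%
  ggplot(aes(count)) +
  geom_histogram(binwidth = 25)
```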
We can make the previous `tibble` a bit more organized by separating the episode-season column into separate season and episode numbers.
```{r}
all_episodes <-
  all_episodes %>%
  separate(episode_name, c("season", "episode"), "-", extra = "merge") %>%
  separate(season, c("season", "episode_number"), sep = "x")
```
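To see what those two `separate()` calls are doing, here's the same logic on a couple of made-up titles (assuming, as the scraped names suggest, that episodes follow the `1x01 - Pilot` pattern):
```{r}
# Toy illustration (not part of the scraped data): split "1x01 - Pilot"
# into a season, an episode number and the episode title.
tibble(episode_name = c("1x01 - Pilot", "2x10 - Some Episode")) %>%
  separate(episode_name, c("season", "episode"), "-", extra = "merge") %>%
  separate(season, c("season", "episode_number"), sep = "x")
```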
Great! We now have a very organized `tibble` with all the information we need. The next step is to break the lines down into words and start looking for general patterns. We do that by looping through all episodes that have over 100 lines (an arbitrary threshold) and unnesting each line for every **valid** character.
```{r}
lines_characters <-
  map(filter(all_episodes, count > 100) %>% pull(text), ~ {
    .x %>%
      separate(text, c("character", "text"), sep = ":", extra = 'merge') %>%
      unnest_tokens(character, character) %>%
      filter(str_detect(character, paste0(paste0("^", characters, "$"), collapse = "|"))) %>%
      mutate(lines_id = 1:nrow(.))
  }) %>%
  setNames(filter(all_episodes, count > 100) %>%
             unite(season_episode, season, episode_number, sep = "x") %>%
             pull(season_episode))
```
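Just to double check the split worked (a quick look of my own), we can peek at the first element of the list:
```{r}
# Each element of lines_characters is a tibble with one row per line of
# dialogue: the character who spoke, the text and a line id.
lines_characters[[1]] %>%
  head()
```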
Ok! Our text is sort of ready. Let's turn that list into something more organized and remove the stop words.
```{r}
words_per_character <-
  enframe(lines_characters) %>%
  unnest() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  filter(!word %in% characters) %>%
  arrange(name) %>%
  separate(name, c("season", "episode"), sep = "x", remove = FALSE) %>%
  mutate(name = factor(name, ordered = TRUE),
         season = factor(season, ordered = TRUE),
         episode = factor(episode, ordered = TRUE))
```
Just to make sure, let's look at the `tibble`.
```{r}
words_per_character
```
Perfect! One row per word, per character, per episode, together with the id of the line the word came from.
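As one last sanity check before plotting (my own addition), a simple count shows how many non stop-words each character gets overall:
```{r}
# Total number of words (after removing stop words) attributed to each character.
words_per_character %>%
  count(character, sort = TRUE)
```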
```{r}
first_episodes <-
  all_episodes %>%
  filter(count > 100, episode_number == "01 ") %>%
  pull(id)

words_per_character %>%
  split(.$name) %>%
  setNames(1:length(.)) %>%
  enframe(name = "episode_id") %>%
  unnest() %>%
  count(episode_id, character) %>%
  group_by(episode_id) %>%
  mutate(total_n = sum(n),
         perc = round(n / total_n, 2)) %>%
  ggplot(aes(as.numeric(episode_id), perc, group = character, colour = character)) +
  geom_line() +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = first_episodes, labels = paste("Season", 1:7)) +
  facet_wrap(~ character) +
  theme(axis.text.x = element_text(angle = 90))

# Marshall was the only one who increased over time.
# These lines are very difficult to compare in the same panel
# because there are a lot of ups and downs.
```
```{r}
words_per_character %>%
  count(name, character) %>%
  group_by(name) %>%
  mutate(total_n = sum(n),
         perc = round(n / total_n, 2)) %>%
  ggplot(aes(character, perc)) +
  geom_boxplot()

# Lily and Robin are not present very often, with only about
# ~10% of the dialogue.
```
```{r}
words_per_character %>%
  filter(!word %in% characters) %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(alpha = 0.8) +
  coord_flip()
```
```{r}
# Everyone is really negative!
# Although Marshall seems to be the happiest.
words_per_character %>%
  inner_join(get_sentiments("bing")) %>%
  count(name, character, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(net = positive - negative) %>%
  ggplot(aes(name, net, fill = character)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ character, ncol = 2, scale = "free_x")
```
```{r}
afinn <-
  words_per_character %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(name) %>%
  summarize(sentiment = sum(score), method = "afinn")

bing <-
  words_per_character %>%
  inner_join(get_sentiments("bing")) %>%
  count(name, sentiment) %>%
  spread(sentiment, n) %>%
  transmute(name, sentiment = positive - negative, method = "bing")

nrc <-
  words_per_character %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(name, sentiment) %>%
  spread(sentiment, n) %>%
  transmute(name, sentiment = positive - negative, method = "nrc")

bind_rows(afinn, bing, nrc) %>%
  ggplot(aes(name, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ method, ncol = 1, scale = "free_x")

# For AFINN, the first few seasons were mostly positive,
# but from there on it started to shift. The other two
# lexicons say something different! Maybe I have to check
# whether the tokenizing and the scraping were done properly.
```
```{r}
library(wordcloud)

words_per_character %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 200))

words_per_character %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  reshape2::acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)
```