---
title: "Scraping and visualizing How I Met Your Mother"
author: "Jorge Cimentada"
date: "7/10/2017"
output: html_document
---
How I Met Your Mother (HIMYM hereafter) is a television series very similar to the classic 'Friends' from the 90's. Following the release of the 'tidy text' book I was looking for a project in which I could apply some of these skills. I decided I would scrape all the transcripts from HIMYM and analyze patterns between characters. This post really took me to the limit in terms of web scraping and pattern matching, which was exactly what I wanted to improve in the first place. Let's begin!
My first task was to check whether there was any consistency in the URLs that store the transcripts. If you've ever watched HIMYM, you know there are nine seasons, each with about 22 episodes. That makes roughly 200 episodes, give or take. It would be a big pain in the ass to write down 200 complicated URLs by hand. Luckily, there is a way of finding the 200 links without typing them manually.
First, we create the links for the nine pages that list all episodes (season 1 through season 9).
```{r, echo = FALSE}
knitr::opts_chunk$set(message = FALSE,
                      warning = FALSE,
                      error = FALSE)
```
```{r}
library(rvest)
library(tidyverse)
library(stringr)
library(tidytext)

main_url <- "http://transcripts.foreverdreaming.org"
all_pages <- paste0("http://transcripts.foreverdreaming.org/viewforum.php?f=177&start=", seq(0, 200, 25))
characters <- c("ted", "lily", "marshall", "barney", "robin")
```
Each of the URLs in `all_pages` lists all episodes for one season (so around 22 links each). I also picked the characters we're going to concentrate on for now. From here the job is very easy. We create a function that reads each link and parses the section containing all episode links for that season. We can do that using [SelectorGadget](http://selectorgadget.com/) to find the CSS selector we're interested in. We then grab the `href` attribute of each link and finally create a tibble pairing each episode title with its link.
```{r}
episode_getter <- function(link) {
  title_reference <-
    link %>%
    read_html() %>%
    html_nodes(".topictitle")

  episode_links <-
    title_reference %>%
    html_attr("href") %>%
    gsub("^.", "", .) %>%
    paste0(main_url, .) %>%
    setNames(title_reference %>% html_text()) %>%
    enframe(name = "episode_name", value = "link")

  episode_links
}

all_episodes <- map_df(all_pages, episode_getter)
all_episodes$id <- 1:nrow(all_episodes)
```
There we go! Now we have a very organized `tibble`.
```{r, eval = F}
all_episodes %>%
  View()
```
The remaining part is to actually scrape and parse each episode's transcript. We can work that out for a single episode, then turn it into a function and apply it to all episodes.
```{r}
episode_fun <- function(file) {
  file %>%
    read_html() %>%
    html_nodes(".postbody") %>%
    html_text() %>%
    str_split("\n|\t") %>%
    .[[1]] %>%
    data_frame(text = .) %>%
    filter(!str_detect(text, "^\\s*$"),      # Drop empty lines
           !str_detect(text, "^\\t"),        # Drop lines that are only \t
           !str_detect(text, "^\\[.*\\]$"),  # Drop stage directions in brackets
           !str_detect(text, "^\\(.*\\)$"),  # Drop stage directions in parentheses
           str_detect(text, "^.+:"),         # Keep only lines with dialogue ('name: text')
           !str_detect(text, "^ad"))         # Drop lines that start with 'ad' (Google ads)
}
```
The above function reads each episode, turns the HTML text into a data frame and organizes it cleanly for text analysis. For example:
```{r}
episode_fun(all_episodes$link[15])
```
We now have a data frame with only dialogue for each character. We need to apply that function to each episode and `bind` everything together. We first apply the function to every episode.
```{r}
all_episodes$text <- map(all_episodes$link, episode_fun)
```
The `text` list-column is now a very organized list with the text of each episode. However, manual inspection of some episodes reveals a small error that limits our analysis greatly. Among the main interests of this document is to study relationships and presence between characters. For that, we need each line of text to be accompanied by the character who said it. Unfortunately, some of these scripts don't have that.
For example, check any episode from season [8](http://transcripts.foreverdreaming.org/viewforum.php?f=177&start=175) or [9](http://transcripts.foreverdreaming.org/viewforum.php?f=177&start=200). The transcriber didn't attribute the dialogue to characters and simply wrote down the lines. There's nothing we can do so far to fix that, so we'll be excluding these episodes. The same pattern also shows up in scattered episodes of season 4 and season 6. We can exclude episodes based on the number of lines we parsed: on average, each episode has about 200 lines of attributed dialogue, so anything significantly lower, like 30 or 50 lines, is an episode where the dialogue wasn't attributed.
```{r}
all_episodes$count <- map_dbl(all_episodes$text, nrow)
```
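Before settling on a cutoff, a quick sanity check is to look at which episodes fall below the 100-line threshold used later on; a minimal sketch:
```{r}
# Episodes with few attributed lines -- candidates for exclusion
all_episodes %>%
  filter(count < 100) %>%
  select(episode_name, count) %>%
  arrange(count)
```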
We can make the previous `tibble` a bit more organized by separating the episode-season column into separate season and episode numbers.
```{r}
all_episodes <-
  all_episodes %>%
  separate(episode_name, c("season", "episode"), "-", extra = "merge") %>%
  separate(season, c("season", "episode_number"), sep = "x")
```
Great! We now have a very organized `tibble` with all the information we need. The next step is to break the lines down into words and start looking for general patterns. We can do that by looping through all episodes that have over 100 lines (just an arbitrary threshold) and unnesting each line for each **valid** character.
```{r}
lines_characters <-
  map(filter(all_episodes, count > 100) %>% pull(text), ~ {
    .x %>%
      separate(text, c("character", "text"), sep = ":", extra = 'merge') %>%
      unnest_tokens(character, character) %>%
      filter(str_detect(character, paste0(paste0("^", characters, "$"), collapse = "|"))) %>%
      mutate(episode_lines_id = 1:nrow(.))
  }) %>%
  setNames(filter(all_episodes, count > 100) %>%
             unite(season_episode, season, episode_number, sep = "x") %>%
             pull(season_episode))
```
Ok! Our text is almost ready. Let's turn that list into something more organized and remove stop words (and the character names themselves).
```{r}
words_per_character <-
  enframe(lines_characters) %>%
  unnest() %>%
  mutate(all_lines_id = 1:nrow(.)) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  filter(!word %in% characters) %>%
  arrange(name) %>%
  separate(name, c("season", "episode"), sep = "x", remove = FALSE) %>%
  mutate(name = factor(name, ordered = TRUE),
         season = factor(season, ordered = TRUE),
         episode = factor(episode, ordered = TRUE)) %>%
  filter(season != "07")
```
Just to make sure, let's look at the `tibble`.
```{r}
words_per_character
```
Perfect! One row per word, per character, per episode, with the id of the line the word came from.
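As a quick sanity check, we can also count how many words each character contributes overall:
```{r}
# Rough word totals per character
words_per_character %>%
  count(character, sort = TRUE)
```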
Alright, let's get our hands dirty. First, let's visualize the presence of each character in terms of words over time.
```{r}
# Filtering position of the first episode of each season to
# position the X axis in the next plot.
first_episodes <-
  all_episodes %>%
  filter(count > 100, episode_number == "01 ") %>%
  pull(id)

words_per_character %>%
  split(.$name) %>%
  setNames(1:length(.)) %>%
  enframe(name = "episode_id") %>%
  unnest() %>%
  count(episode_id, character) %>%
  group_by(episode_id) %>%
  mutate(total_n = sum(n),
         perc = round(n / total_n, 2)) %>%
  ggplot(aes(as.numeric(episode_id), perc, group = character, colour = character)) +
  geom_line() +
  geom_smooth(method = "lm") +
  scale_colour_discrete(guide = FALSE) +
  scale_x_continuous(name = "Seasons",
                     breaks = first_episodes, labels = paste0("S", 1:7)) +
  scale_y_continuous(name = "Percentage of words per episode") +
  theme_minimal() +
  facet_wrap(~ character, ncol = 3)

# Marshall was the only one who increased over time.
# These lines are very difficult to compare in the same panel
# because there are a lot of ups and downs.
```
Ted is clearly the character with the highest number of words per episode, followed by Barney. Lily and Robin, the only two women, have a very low presence compared to the men. In fact, if one looks closely, Lily seems to have decreased slightly over time, hitting an all-time low in season 4. Marshall, Lily's partner in the show, has a much lower presence than both Barney and Ted, but he has been catching up over time.
We also see an interesting pattern where Barney has a lot of peaks, suggesting that he gains predominance in specific episodes, whereas Ted maintains an overall higher level of words per episode. And when Ted does have peaks, they usually sit below his trend line.
Looking at the distributions
```{r}
# devtools::install_github("clauswilke/ggjoy")
library(ggjoy)

words_per_character %>%
  split(.$name) %>%
  setNames(1:length(.)) %>%
  enframe(name = "episode_id") %>%
  unnest() %>%
  count(season, episode_id, character) %>%
  group_by(episode_id) %>%
  mutate(total_n = sum(n),
         perc = round(n / total_n, 2)) %>%
  ggplot(aes(x = perc, y = character, fill = character)) +
  geom_joy(scale = 0.85) +
  scale_fill_discrete(guide = FALSE) +
  scale_y_discrete(name = NULL, expand = c(0.01, 0)) +
  scale_x_continuous(name = "Percentage of words", expand = c(0.01, 0)) +
  ggtitle("Percentage of words per season") +
  facet_wrap(~ season, ncol = 7) +
  theme_minimal()
```
we see the differences much more clearly. For example, we see Barney's peaks throughout every season, with Season 6 showing a clear peak around 40%. On the other hand, their distributions don't change that much over time, suggesting that the presence of each character is very similar across seasons. Don't get me wrong, there are differences, like Lily in Season 2 versus Season 6, but in overall terms the previous plot suggests no increase over seasons, and this plot suggests that between seasons there's not a lot of change in their distributions that would affect the overall mean.
If you've watched the TV series, you'll remember Barney constantly repeating his trademark word: legendary! Although it is a bit cumbersome to count the occurrences of that catchphrase once we've unnested each sentence into words, we can at least count the number of words per character and see whether some characters have distinctive words.
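As a quick aside, we can at least count how often the word 'legendary' itself shows up and who says it, using the `words_per_character` tibble from above:
```{r}
# How often does 'legendary' come up, and who says it?
words_per_character %>%
  filter(word == "legendary") %>%
  count(character, sort = TRUE)
```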
```{r}
count_words <-
  words_per_character %>%
  filter(!word %in% characters) %>%
  count(character, word, sort = TRUE)

count_words %>%
  group_by(character) %>%
  top_n(20) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  facet_wrap(~ character, scales = "free_y")
```
Here we see that a lot of the words we captured are nouns or expressions which are common to everyone, such as 'yeah', 'hey' or 'time'. We can weight down commonly used words in favour of words which are important to a character but don't get repeated a lot overall. We can do that with `bind_tf_idf()`, which, for each character, decreases the weight of commonly used words and increases the weight of words that are not used very much across the whole collection or corpus of documents (see section 3.3 of http://tidytextmining.com/tfidf.html).
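To make the weighting concrete: treating each character as a 'document', tf-idf multiplies a word's frequency within that character's dialogue by the natural log of the number of characters divided by the number of characters who use the word. A minimal sketch computing it by hand (the `tf_idf_manual` column is just for this illustration); it should broadly agree with the `bind_tf_idf()` call below:
```{r}
# Hand-rolled tf-idf on count_words, one 'document' per character
count_words %>%
  group_by(character) %>%
  mutate(tf = n / sum(n)) %>%                 # term frequency within a character
  group_by(word) %>%
  mutate(idf = log(n_distinct(count_words$character) / n_distinct(character)),
         tf_idf_manual = tf * idf) %>%        # down-weights words everyone uses
  ungroup() %>%
  arrange(desc(tf_idf_manual))
```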
```{r}
count_words %>%
  bind_tf_idf(word, character, n) %>%
  arrange(desc(tf_idf)) %>%
  group_by(character) %>%
  top_n(20) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  facet_wrap(~ character, scales = "free_y")
```
Now Barney has a very distinctive word usage, a particularly sexist one, with words such as cougar, bang (as in 'bang that girl', probably) and tits. We also see the word legendary as the third most repeated word, something we were expecting! On the other hand, Ted gets words like professor (he is one) and aunt (because of 'aunt Lily' and such).
```{r}
library(wordcloud)

# Overall word cloud across all characters
words_per_character %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 200))

# Comparison cloud of positive vs. negative words (Bing lexicon)
words_per_character %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  bind_tf_idf(word, sentiment, n) %>%
  arrange(desc(tf_idf)) %>%
  reshape2::acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)
```
## Experimental section
In this section (which I only vaguely started), I'm attempting to predict who said what in the seasons that didn't have dialogue attributed to characters.
```{r}
# All distinct sentiment labels across the lexicons bundled in tidytext
all_sentiments <- sentiments[!is.na(sentiments$sentiment), ]$sentiment %>% unique

# Empty matrix: one row per word in words_per_character, one column per sentiment
sentiment_matrix <-
  vector("numeric", nrow(words_per_character)) %>%
  list() %>%
  rep(all_sentiments %>% length) %>%
  setNames(all_sentiments) %>%
  as.data.frame() %>%
  as.matrix()

# For every lexicon word, flag the rows of words_per_character that match it
# and add those flags to that word's sentiment columns
for (.x in filter(sentiments, !is.na(sentiment)) %>% .$word %>% unique) {
  word_to_search <- filter(sentiments, !is.na(sentiment), word == .x)
  sentiments_word <- unique(word_to_search$sentiment)

  matrix_try <-
    ifelse(word_to_search$word %>% unique == words_per_character$word, 1, 0) %>%
    list() %>%
    rep(sentiments_word %>% length) %>%
    setNames(sentiments_word) %>%
    as.data.frame() %>%
    as.matrix()

  sentiment_matrix[, sentiments_word] <- sentiment_matrix[, sentiments_word] + matrix_try
}
```
```{r}
# Collapse the word-level sentiment flags into one row of counts per line of dialogue
sentence_features <-
  words_per_character %>%
  bind_cols(as_tibble(sentiment_matrix)) %>%
  arrange(all_lines_id) %>%
  split(.$all_lines_id) %>%
  map(~ select(.x, -(1:7)) %>% colSums) %>%  # drop the seven bookkeeping columns, keep sentiment flags
  reduce(bind_rows)
```
```{r}
sentence_features_complete <-
  sentence_features %>%
  mutate(id_line = words_per_character %>%
           bind_cols(as_tibble(sentiment_matrix)) %>%
           arrange(all_lines_id) %>%
           distinct(all_lines_id) %>%
           pull(all_lines_id))
```
```{r}
# Recode each character name as a numeric label for the models below
characters_lookup <- setNames(seq_along(characters), characters)

sentence_features_complete <-
  left_join(sentence_features_complete,
            enframe(lines_characters) %>%
              unnest() %>%
              mutate(all_lines_id = 1:nrow(.)),
            by = c("id_line" = "all_lines_id")) %>%
  select(id_line, name, character, text, episode_lines_id, everything()) %>%
  mutate(character = as.numeric(characters_lookup[character]))
```
```{r}
library(caret)

train_index <- createDataPartition(sentence_features_complete$character, p = 0.8, list = FALSE)

sentence_features_complete <-
  sentence_features_complete %>%
  select(character,
         trust,
         fear,
         negative,
         sadness,
         anger,
         surprise,
         positive,
         disgust,
         joy,
         anticipation,
         uncertainty,
         litigious,
         constraining,
         superfluous)

sentence_train <- slice(sentence_features_complete, train_index)
sentence_test <- slice(sentence_features_complete, -train_index)

lda_mod <- MASS::lda(character ~ ., data = sentence_train)
lda_pred <- predict(lda_mod, newdata = sentence_test)$class %>% as.numeric
confusionMatrix(factor(lda_pred, levels = seq_along(characters)),
                factor(sentence_test$character, levels = seq_along(characters)))
```
```{r}
svm_mod <- e1071::svm(as.factor(character) ~ .,
                      data = sentence_train, kernel = "radial", scale = FALSE,
                      cost = 1)

sentence_test$pred <- predict(svm_mod, newdata = sentence_test)
confusionMatrix(sentence_test$pred,
                factor(sentence_test$character, levels = seq_along(characters)))
```
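If this ever worked well enough, the natural next step would be to build the same sentiment features for the lines of the unattributed episodes and feed them to `predict()`. A minimal sketch, assuming a hypothetical `unlabeled_features` data frame with the same sentiment columns used in `sentence_train`:
```{r, eval = FALSE}
# Hypothetical: 'unlabeled_features' would hold one row per unattributed line,
# with the same sentiment columns used to train svm_mod
predicted_codes <- predict(svm_mod, newdata = unlabeled_features)

# Map the numeric codes back to character names
predicted_speaker <- characters[as.numeric(as.character(predicted_codes))]
head(predicted_speaker)
```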