---
title: "Capstone First Milestone Report: Feature Engineering"
author: "Amir Hossein Rahnama"
date: "11 June 2016"
output: html_document
---
####Introduction
In this report, I start by showing a summary of all three SwiftKey data sources in terms of size. You can obtain the data with the following code:
```{r eval=FALSE}
download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
              destfile = 'Coursera-SwiftKey.zip')
unzip('Coursera-SwiftKey.zip')
```
The data is text-based. Each row represents a document: a piece of news, a blog post, or a tweet, with documents separated by newlines.
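To get a feel for this one-document-per-line format, you can peek at the first couple of lines of a source once the archive is unzipped. A minimal sketch; the relative path below is an assumption about where you unzipped the archive:
```{r eval=FALSE}
# Read just the first two documents (lines) of the blogs source.
# The path is an assumption about where Coursera-SwiftKey.zip was unzipped.
readLines("final/en_US/en_US.blogs.txt", n = 2)
```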
####General NLP process
In any NLP process, we first need to clean the text. Free text is usually noisy: people mistype words, and many words are repeated very often yet can be removed without the text losing its meaning. Some of these are called stop words, such as ```is``` and ```can```. Also, since we use methods that transform text into numeric values, such as word frequencies, it is not necessary to count words that share the same stem (like ```going``` and ```go```) separately: doing so takes the focus of the analysis away from meaning and makes the result more sparse.
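As a quick illustration of these two ideas, here is a minimal sketch of stop-word removal and stemming with the ```tm``` and ```SnowballC``` packages (the sample sentence is made up for illustration):
```{r eval=FALSE}
library(tm)
library(SnowballC)

sample_text <- "She is going to the market and he can go later"

# Drop common English stop words such as "is", "to", "and" and "the"
no_stops <- removeWords(sample_text, stopwords("english"))

# Reduce inflected forms to a common stem, e.g. "going" and "go" both become "go"
stemDocument(no_stops)
```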
####Preliminary Notes
Here is the list of packages used for this report:

* wordcloud
* tm
* SnowballC
* slam
* ggplot2
* RWeka

```{r, echo=FALSE, warning=FALSE}
library(wordcloud)
library(tm)
library(SnowballC)
library(slam)
library(ggplot2)
library(RWeka)
```
Please note that the code for some results is echoed off for readability; you can see the full report with all details [here](https://gist.github.com/ambodi/694f8fdfe9ffcc3b2b22070220f4f7d4). I should also emphasize that I am using the English corpus for this report.
###Summary of the data
We use ```readLines``` in R to read each source and perform a high-level summary of the data sources:
```{r, echo=FALSE, warning=FALSE}
con <- file("/Users/ara/dev/personal/r/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con)
close(con)
con <- file("/Users/ara/dev/personal/r/final/en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
con <- file("/Users/ara/dev/personal/r/final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con)
close(con)
```
Using the ```summarize``` function I wrote, we show the total number of lines and words in each source:
```{r}
summarize <- function(d) {
  # Number of lines and number of whitespace-separated words in a source
  summary <- data.frame(length(d), sum(sapply(strsplit(d, "\\s+"), length)))
  return(summary)
}
```
Here's how I used the function to show a summary of the sources:
```{r}
result <- data.frame(summarize(news))
result <- rbind(result, summarize(blogs))
result <- rbind(result, summarize(twitter))
rownames(result) <- c('news', 'blogs', 'twitter')
colnames(result) <- c('Total Number of Lines', 'Total Number of Words')
result
```
###Loading data
Due to the file size and the memory limitations of working with the original sources, I decided to use 30% of the data for this report. Using the ```readLines``` function and the line counts obtained above, we load only the first 30% of lines from each source (a sketch of how these counts can be derived follows the code block):
```{r}
con <- file("/Users/ara/dev/personal/r/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con, 269786, encoding = 'UTF-8')
close(con)
con <- file("/Users/ara/dev/personal/r/final/en_US/en_US.news.txt", "r")
news <- readLines(con, 303072, encoding = 'UTF-8')
close(con)
con <- file("/Users/ara/dev/personal/r/final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con, 708044, encoding = 'UTF-8')
close(con)
```
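The hard-coded line counts passed to ```readLines``` above correspond to roughly 30% of each source. A minimal sketch of how such counts could be derived from the ```result``` summary table built earlier (the column name is the one assigned above):
```{r eval=FALSE}
# Roughly 30% of the total number of lines in each source
floor(0.3 * result[, 'Total Number of Lines'])
```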
###Cleaning data
Here are the steps we use to clean the data in each source, using a function I wrote called ```clean```:
```{r}
clean <- function(docs) {
  docs <- removeNumbers(docs)
  docs <- removePunctuation(docs)
  docs <- stripWhitespace(docs)
  docs <- stemDocument(docs)
  return(docs)
}
```
The ```clean``` function performs the following actions (a small illustration follows the list):

* It removes numbers from the text source
* It removes all punctuation marks such as '-' or ','
* It removes extra whitespace
* It reduces different inflections of a word to a single stem: a term like ```working``` is mapped to ```work```.
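To see the effect on a single made-up sentence (a small sketch; the exact stemmed output depends on the Porter stemmer behind ```stemDocument```):
```{r eval=FALSE}
# Numbers and punctuation are dropped, extra whitespace is collapsed,
# and inflected forms such as "working" are reduced to their stem "work".
clean("She was working  on 3 projects, every day!")
```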
Let's apply the clean function to our data sources:
```{r}
blogs <- clean(blogs)
news <- clean(news)
twitter <- clean(twitter)
```
###Document Term Matrix
In this section, we build a document-term matrix (DTM) from our cleaned sources. In simple terms, each document (a news article, tweet, or blog post) gets its own row in the DTM, and each term in the data source gets a column holding the frequency of that term in the document. Needless to say, if a term from the source does not appear in a document, the corresponding value is 0. We use both the ```tm``` and ```slam``` packages to create the matrices for our sources. We showcase three n-gram models, where n is the length of the word sequence: a unigram model (n=1) treats each token (or word) as a sequence, a bigram model (n=2) considers combinations of two consecutive words, and so forth.
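As a tiny illustration of this structure, here is a sketch that builds a DTM from a made-up two-document corpus and inspects it:
```{r eval=FALSE}
# A made-up two-document corpus, just to show the shape of a DTM
toy_corp <- Corpus(VectorSource(c("the cat sat on the mat",
                                  "the dog sat on the log")))
toy_dtm <- DocumentTermMatrix(toy_corp)

# Rows are documents, columns are terms, and each cell holds a term frequency
inspect(toy_dtm)
```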
####Blogs
For the blogs data source, we transform the source into a data frame so that we can build a Corpus object from it, which ```TermDocumentMatrix``` requires. Note that we use the flag ```stopwords = TRUE``` to remove stop words, which do not add much value, from the matrix. We then sum each term's frequency across all documents, which gives the total number of appearances of each word or term, and order the terms with the most popular on top:
```{r}
blogs_frame <- data.frame(V1 = blogs, stringsAsFactors = FALSE)
blogs_corp <- Corpus(DataframeSource(blogs_frame))
dtm <- TermDocumentMatrix(blogs_corp, control = list(removePunctuation = TRUE, stopwords = TRUE))
# Terms are rows in a TermDocumentMatrix, so summing over rows gives per-term frequencies
freq <- row_sums(dtm, na.rm = T)
ordered_frequency <- order(freq, decreasing = TRUE)
```
Let's see the most popular terms:
```{r}
freq[head(ordered_frequency)]
```
And the least popular terms:
```{r}
freq[tail(ordered_frequency)]
```
As you can see, a lot of the top terms have 3 letters. In order to get a clearer picture of our data, we limit the matrix to terms with at least 4 letters and keep this rule for the other sources as well; we also define the bigram and trigram tokenizers used from here on (illustrated right after the code block):
```{r}
options(mc.cores=1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm <- DocumentTermMatrix(blogs_corp, control = list(wordLengths = c(4, 25), bounds = list(global = c(3, 30))))
freq <- col_sums(dtm, na.rm = T)
ordered_frequency <- order(freq, decreasing = TRUE)
dtm_bigram <- TermDocumentMatrix(blogs_corp, control = list(removePunctuation = TRUE, stopwords = TRUE, tokenize = BigramTokenizer))
# Again, terms are rows in a TermDocumentMatrix, so we sum over rows
freq_bigram <- row_sums(dtm_bigram, na.rm = T)
ordered_frequency_bigram <- order(freq_bigram, decreasing = TRUE)
dtm_trigram <- TermDocumentMatrix(blogs_corp, control = list(removePunctuation = TRUE, stopwords = TRUE, tokenize = TrigramTokenizer))
freq_trigram <- row_sums(dtm_trigram, na.rm = T)
ordered_frequency_trigram <- order(freq_trigram, decreasing = TRUE)
```
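To make the n-gram tokenizers concrete, here is a minimal sketch of applying them to a single made-up sentence (not part of the data):
```{r eval=FALSE}
# For "I love data science", the bigram tokenizer yields the pairs of
# consecutive words "I love", "love data" and "data science"; the trigram
# tokenizer yields the corresponding three-word sequences.
BigramTokenizer("I love data science")
TrigramTokenizer("I love data science")
```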
Let's see the most popular terms after redefining our matrices for the 1- to 3-gram models:
```{r}
freq[head(ordered_frequency)]
```
And the least popular terms:
```{r}
freq[tail(ordered_frequency)]
```
And the bigram:
```{r}
freq_bigram[head(ordered_frequency_bigram)]
```
And the least popular terms:
```{r}
freq_bigram[tail(ordered_frequency_bigram)]
```
And the trigram:
```{r}
freq_trigram[head(ordered_frequency_trigram)]
```
And the least popular terms:
```{r}
freq_trigram[tail(ordered_frequency_trigram)]
```
#####Wordcloud
To get a better sense of the most frequent terms in our corpus than looking at the head or tail of the vector, we create a visualization called a word cloud using the ```wordcloud``` library. First we need to transform our named numeric vector into a data frame so that ```wordcloud``` can visualize the result:
```{r, warning=FALSE}
term_occurances_blogs = data.frame(term=names(freq),occurrences=freq)
wordcloud(words = term_occurances_blogs$term, freq = term_occurances_blogs$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
And for bigram:
```{r, warning=FALSE}
term_occurances_blogs_bigram = data.frame(term=names(freq_bigram),occurrences=freq_bigram)
wordcloud(words = term_occurances_blogs_bigram$term, freq = term_occurances_blogs_bigram$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
And for trigram:
```{r, warning=FALSE}
term_occurances_blogs_trigram = data.frame(term=names(freq_trigram),occurrences=freq_trigram)
wordcloud(words = term_occurances_blogs_trigram$term, freq = term_occurances_blogs_trigram$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
####Histogram
A word cloud is a nice visualization, but it does not reflect the actual frequency values as well as a good old-fashioned histogram does. Let us have a look at a histogram of the same source:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_blogs, occurrences>40), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
And here's the bigram version:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_blogs_bigram, occurrences>40), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
And finally the trigram version:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_blogs_trigram, occurrences>40), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
####News
Now let's have a look at the news data source:
```{r}
news_frame <- data.frame(V1 = news, stringsAsFactors = FALSE)
news_corp <- Corpus(DataframeSource(news_frame))
dtm_news <- DocumentTermMatrix(news_corp, control = list(wordLengths = c(4, 25), bounds = list(global = c(3, 30))))
freq_news <- col_sums(dtm_news, na.rm = T)
ordered_frequency_news <- order(freq_news, decreasing = TRUE)
dtm_news_bigram <- DocumentTermMatrix(news_corp, control = list(wordLengths = c(4, 25), bounds = list(global = c(3, 30)), tokenize = BigramTokenizer))
freq_news_bigram <- col_sums(dtm_news_bigram, na.rm = T)
ordered_frequency_news_bigram <- order(freq_news_bigram, decreasing = TRUE)
dtm_news_trigram <- DocumentTermMatrix(news_corp, control = list(wordLengths = c(4, 25), bounds = list(global = c(3, 30)), tokenize = TrigramTokenizer))
freq_news_trigram <- col_sums(dtm_news_trigram, na.rm = T)
ordered_frequency_news_trigram <- order(freq_news_trigram, decreasing = TRUE)
```
Let's see the most popular terms:
```{r}
freq_news[head(ordered_frequency_news)]
```
And the least popular terms:
```{r}
freq_news[tail(ordered_frequency_news)]
```
And for the bigram:
```{r}
freq_news_bigram[head(ordered_frequency_news_bigram)]
```
And the least popular terms:
```{r}
freq_news_bigram[tail(ordered_frequency_news_bigram)]
```
And for the trigram:
```{r}
freq_news_trigram[head(ordered_frequency_news_trigram)]
```
And the least popular terms:
```{r}
freq_news_trigram[tail(ordered_frequency_news_trigram)]
```
#####Wordcloud
Let's render the word cloud again; note that it is limited to the 200 most frequent terms:
```{r, warning=FALSE}
term_occurances_news = data.frame(term=names(freq_news),occurrences=freq_news)
wordcloud(words = term_occurances_news$term, freq = term_occurances_news$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
And bigram:
```{r, warning=FALSE}
term_occurances_news_bigram = data.frame(term=names(freq_news_bigram),occurrences=freq_news_bigram)
wordcloud(words = term_occurances_news_bigram$term, freq = term_occurances_news_bigram$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
And trigram:
```{r, warning=FALSE}
term_occurances_news_trigram = data.frame(term=names(freq_news_trigram),occurrences=freq_news_trigram)
wordcloud(words = term_occurances_news_trigram$term, freq = term_occurances_news_trigram$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
####Histogram
Let us have a look at a histogram of the news source:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_news, occurrences>35), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
And let's do the same for bigram:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_news_bigram, occurrences>35), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
And finally, the trigram:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_news_trigram, occurrences>35), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
####Twitter
Now let's have a look at the Twitter data source:
```{r}
twitter_frame <- data.frame(V1 = twitter, stringsAsFactors = FALSE)
twitter_corp <- Corpus(DataframeSource(twitter_frame))
dtm_twitter <- DocumentTermMatrix(twitter_corp, control = list(wordLengths = c(4, 25), bounds = list(global = c(3, 30))))
freq_twitter <- col_sums(dtm_twitter, na.rm = T)
ordered_frequency_twitter <- order(freq_twitter, decreasing = TRUE)
dtm_twitter_bigram <- DocumentTermMatrix(twitter_corp, control = list(wordLengths = c(4, 25), bounds = list(global = c(3, 30)), tokenize = BigramTokenizer))
freq_twitter_bigram <- col_sums(dtm_twitter_bigram, na.rm = T)
ordered_frequency_twitter_bigram <- order(freq_twitter_bigram, decreasing = TRUE)
dtm_twitter_trigram <- DocumentTermMatrix(twitter_corp, control = list(wordLengths = c(4, 25), bounds = list(global = c(3, 30)), tokenize = TrigramTokenizer))
freq_twitter_trigram <- col_sums(dtm_twitter_trigram, na.rm = T)
ordered_frequency_twitter_trigram <- order(freq_twitter_trigram, decreasing = TRUE)
```
Let's see the most popular terms:
```{r}
freq_twitter[head(ordered_frequency_twitter)]
```
And the least popular terms:
```{r}
freq_twitter[tail(ordered_frequency_twitter)]
```
And for the bigram:
```{r}
freq_twitter_bigram[head(ordered_frequency_twitter_bigram)]
```
And the least popular terms:
```{r}
freq_twitter_bigram[tail(ordered_frequency_twitter_bigram)]
```
And the trigram version:
```{r}
freq_twitter_trigram[head(ordered_frequency_twitter_trigram)]
```
And the least popular terms:
```{r}
freq_twitter_trigram[tail(ordered_frequency_twitter_trigram)]
```
#####Wordcloud
Let's render the word cloud again; note that it is limited to the 200 most frequent terms:
```{r, warning=FALSE}
term_occurances_twitter = data.frame(term=names(freq_twitter),occurrences=freq_twitter)
wordcloud(words = term_occurances_twitter$term, freq = term_occurances_twitter$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
Bigram:
```{r, warning=FALSE}
term_occurances_twitter_bigram = data.frame(term=names(freq_twitter_bigram),occurrences=freq_twitter_bigram)
wordcloud(words = term_occurances_twitter_bigram$term, freq = term_occurances_twitter_bigram$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
Trigram:
```{r, warning=FALSE}
term_occurances_twitter_trigram = data.frame(term=names(freq_twitter_trigram),occurrences=freq_twitter_trigram)
wordcloud(words = term_occurances_twitter_trigram$term, freq = term_occurances_twitter_trigram$occurrences, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
```
####Histogram
Let us have a look at a histogram of the Twitter source:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_twitter, occurrences>35), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
Bigram:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_twitter_bigram, occurrences>35), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
Trigram:
```{r, echo=FALSE, warning=FALSE}
p <- ggplot(subset(term_occurances_twitter_trigram, occurrences>35), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
####Summary
We showed some early results of cleaning the SwiftKey data. We also showed that, using the document-term matrix method, we can create a matrix with documents as rows and terms (tokens or words) as columns. Transformations such as removing stop words, numbers, and punctuation were used to create a less sparse model of our data. We applied 1- to 3-gram models to each data source, showing a word cloud and a histogram of the terms for each.