Created
February 16, 2011 14:57
Comparison of NLTK and tm.
#@author Michael J Bommarito II
#@date Feb 16, 2011
library(tm)  # stemDocument additionally needs the SnowballC package installed
# Load the tweets (tab-separated, no header row) and drop duplicate rows
tweets <- unique(read.table('data/tweets_25bahman.csv', sep="\t", quote="", comment.char="", header=FALSE, nrows=100000, stringsAsFactors=FALSE))
names(tweets) <- c("id", "date", "user", "text")
# Build the corpus and then apply the tm pre-processing methods
corpus <- Corpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, stripWhitespace)
# In tm 0.6 and later, wrap base functions such as tolower in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)