@sureshdontha
Forked from hopped/ml-with-nb-spam.R
Created May 27, 2019 05:56
Filtering mobile spam messages with Naive Bayes (includes text mining transformations)
# Download data set via:
# http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# import libraries
library(caret)
library(e1071)
library(tm)
library(SnowballC)
# read in the dataset
df <- read.table("SMSSpamCollection", sep="\t", header=FALSE, stringsAsFactors=FALSE, quote="", col.names=c("type", "text"))
df$type <- factor(df$type)
# build the corpus
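# NOTE: VectorSource() takes no encoding argument in current tm releases, and
# base functions such as tolower must be wrapped in content_transformer()
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, content_transformer(tolower))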
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
stopWordList <- stopwords('english')
# NOTE: to add stopwords or keep specific words, adjust the list, e.g.:
#
# stopWordList <- c(stopwords('english'), 'add-this-word')
# stopWordList <- setdiff(stopWordList, 'keep-this-word')
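# A minimal sketch of the stopword adjustment described in the NOTE above,
# assuming the tm package is installed. The toy.* names and the words "txt"
# and "not" are illustrative choices, not part of the original script.
toy.stopwords <- c(tm::stopwords("english"), "txt")  # add a custom stopword
toy.stopwords <- setdiff(toy.stopwords, "not")       # keep "not" in the text
# setdiff() leaves the list unchanged when the word is absent, whereas negative
# indexing with an empty which() result drops every element:
toy.words <- c("a", "b")
toy.words[-which(toy.words == "z")]  # character(0)
setdiff(toy.words, "z")              # "a" "b"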
corpus <- tm_map(corpus, removeWords, stopWordList)
# strip whitespace before stemming so the stemmed corpus gets every cleanup step
corpus <- tm_map(corpus, stripWhitespace)
corpus.stemmed <- tm_map(corpus, stemDocument)
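# To see what the cleanup steps do, the same tm transformations can be called
# directly on a character vector. This is only a sketch on a made-up message
# (toy.msg and toy.stems are hypothetical names), assuming tm and SnowballC
# are installed.
toy.msg <- "WINNER!! Call 0800 now to claim the prizes"
toy.msg <- tolower(toy.msg)
toy.msg <- tm::removePunctuation(toy.msg)
toy.msg <- tm::removeNumbers(toy.msg)
toy.msg <- tm::removeWords(toy.msg, tm::stopwords("english"))
toy.msg <- tm::stripWhitespace(toy.msg)  # collapse the gaps left by removals
# stem each remaining word separately
toy.stems <- SnowballC::wordStem(strsplit(trimws(toy.msg), " ")[[1]])
print(toy.stems)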
# build the document-term-matrix
dtm <- DocumentTermMatrix(corpus.stemmed)
# keep only terms that occur at least 10 times across all messages
dict <- findFreqTerms(dtm, 10)
dtm.sparse <- DocumentTermMatrix(corpus.stemmed, list(dictionary = dict))
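# Sketch of the frequency filter on a made-up three-message corpus (all toy.*
# names are hypothetical), assuming tm is installed. findFreqTerms(x, n)
# returns the terms whose total count across all documents is at least n, and
# the dictionary option restricts the rebuilt matrix to those terms.
toy.corpus <- tm::VCorpus(tm::VectorSource(c("free prize call",
                                             "call me later",
                                             "free call now")))
toy.dtm <- tm::DocumentTermMatrix(toy.corpus)
toy.dict <- tm::findFreqTerms(toy.dtm, 2)
toy.dtm.small <- tm::DocumentTermMatrix(toy.corpus, list(dictionary = toy.dict))
print(toy.dict)
dim(toy.dtm.small)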
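# recode term counts as a presence/absence factor ("No" / "Yes")
convert_to_factor <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels=c(0,1), labels=c("No", "Yes"))
  return(x)
}
# NOTE: apply() returns a character matrix of "No"/"Yes" values
dtm.final <- apply(dtm.sparse, MARGIN=2, convert_to_factor)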
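# Sketch of the count-to-flag conversion on a made-up 3 x 2 count matrix (all
# toy.* names are hypothetical, base R only). The anonymous function mirrors
# the recoding step above; apply() coerces the factor results to character.
toy.counts <- matrix(c(0, 2, 1,
                       3, 0, 0),
                     nrow = 3,
                     dimnames = list(c("doc1", "doc2", "doc3"), c("free", "call")))
toy.flags <- apply(toy.counts, MARGIN = 2, function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
})
print(toy.flags)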
# split into training and test sets (roughly 75% / 25%, in file order)
sms.train <- dtm.final[1:4169,]
sms.test <- dtm.final[4170:5559,]
classes <- df$type
sms.train.classes <- classes[1:4169]
sms.test.classes <- classes[4170:5559]
# NOTE: compare train and test set class distribution
# prop.table(table(sms.train.classes))
# prop.table(table(sms.test.classes))
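# The fixed row ranges above keep the file order. As one alternative sketch
# (not what the original script does), caret::createDataPartition() draws a
# stratified random split so both parts keep roughly the same ham/spam ratio.
# The toy.* names and the 80/20 toy labels are made up for illustration.
set.seed(42)
toy.type <- factor(rep(c("ham", "spam"), times = c(80, 20)))
toy.idx <- caret::createDataPartition(toy.type, p = 0.75, list = FALSE)
prop.table(table(toy.type[toy.idx]))   # class shares in the training part
prop.table(table(toy.type[-toy.idx]))  # class shares in the test part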
nb <- naiveBayes(sms.train, sms.train.classes, laplace=1)
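# Why laplace=1 matters, as a base-R arithmetic sketch with made-up counts (all
# toy.* names are hypothetical). A word that never occurs in one class would
# get a conditional probability of 0 and zero out the whole Naive Bayes
# product. Additive smoothing adds a pseudo-count to each of the k outcomes.
# This is the textbook formula; e1071's internal bookkeeping may differ.
toy.n.spam <- 20   # spam messages in a toy training set
toy.n.yes <- 0     # of those, messages containing the word
toy.k <- 2         # outcomes per feature: "No" and "Yes"
toy.raw <- toy.n.yes / toy.n.spam
toy.smoothed <- (toy.n.yes + 1) / (toy.n.spam + 1 * toy.k)
c(raw = toy.raw, smoothed = toy.smoothed)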
pred <- predict(nb, sms.test)
confusionMatrix(pred, sms.test.classes)
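# By default, confusionMatrix() treats the first factor level as the positive
# class, which would be "ham" if the levels are in alphabetical order. The
# sketch below computes spam precision and recall by hand from a made-up
# prediction vector (all toy.* names are hypothetical, base R only).
toy.pred  <- factor(c("spam", "ham", "ham",  "spam", "ham"), levels = c("ham", "spam"))
toy.truth <- factor(c("spam", "ham", "spam", "spam", "ham"), levels = c("ham", "spam"))
toy.tab <- table(Predicted = toy.pred, Actual = toy.truth)
toy.precision <- toy.tab["spam", "spam"] / sum(toy.tab["spam", ])  # flagged messages that are spam
toy.recall    <- toy.tab["spam", "spam"] / sum(toy.tab[, "spam"])  # spam messages that were flagged
print(toy.tab)
c(precision = toy.precision, recall = toy.recall)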