Last active
August 9, 2020 19:54
primaryobjects/094d24084d1045c011b7
Simple example of classifying text in R with machine learning (the tm text-mining library, caret, and a Bayesian generalized linear model). Keywords: classify, tfidf, tdm, term document matrix.
library(caret)
library(tm)
# Training data.
data <- c('Cats like to chase mice.', 'Dogs like to eat big bones.')
corpus <- VCorpus(VectorSource(data))
# Create a document term matrix.
tdm <- DocumentTermMatrix(corpus, list(removePunctuation = TRUE, stopwords = TRUE, stemming = TRUE, removeNumbers = TRUE))
# Convert to a data.frame for training and assign a classification (factor) to each document.
train <- as.matrix(tdm)
train <- cbind(train, c(0, 1))
colnames(train)[ncol(train)] <- 'y'
train <- as.data.frame(train)
train$y <- as.factor(train$y)
# Train.
fit <- train(y ~ ., data = train, method = 'bayesglm')
# Check accuracy on training.
predict(fit, newdata = train)
# Test data.
data2 <- c('Bats eat bugs.')
corpus <- VCorpus(VectorSource(data2))
tdm <- DocumentTermMatrix(corpus, control = list(dictionary = Terms(tdm), removePunctuation = TRUE, stopwords = TRUE, stemming = TRUE, removeNumbers = TRUE))
test <- as.matrix(tdm)
# Check accuracy on test.
predict(fit, newdata = test)
> data
[1] "Cats like to chase mice."    "Dogs like to eat big bones."
> train
  big bone cat chase dog eat like mice y
1   0    0   1     1   0   0    1    1 0
2   1    1   0     0   1   1    1    0 1
> predict(fit, newdata = train)
[1] 0 1
> data2
[1] "Bats eat bugs."
> test
  big bone cat chase dog eat like mice
1   0    0   0     0   0   1    0    0
> predict(fit, newdata = test)
[1] 1
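The description mentions tf-idf, but the matrices above hold raw term counts. Here is a minimal sketch of how tf-idf weighting could be requested through tm's `weighting` control option. It assumes tm is installed (plus SnowballC for stemming); `docs` and `dtm_tfidf` are names introduced here for illustration and are not part of the gist.

```r
library(tm)

# Same two training sentences as in the gist.
docs <- c('Cats like to chase mice.', 'Dogs like to eat big bones.')
corpus <- VCorpus(VectorSource(docs))

# Same preprocessing as the gist, plus tf-idf weighting instead of raw counts.
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(
  removePunctuation = TRUE,
  stopwords = TRUE,
  stemming = TRUE,
  removeNumbers = TRUE,
  weighting = weightTfIdf
))

# A term that occurs in every document (here 'like') gets weight 0,
# because its inverse document frequency is log2(2 / 2) = 0.
weights <- as.matrix(dtm_tfidf)
print(weights)
```

With only two documents the weights are not very informative; the sketch is only meant to show where the option goes.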
There's a problem in the "Train." step:
fit <- train(y ~ ., data = train, method = 'bayesglm')
It produces this output:
Error in model.frame.default(form = y ~ ., data = train, na.action = na.fail) :
  invalid type (list) for variable 'y'
Yes, there is a problem there, because caret already has a function called 'train'. In line 12, you assign a data matrix to the same name, which masks the function in your workspace.
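To see how R treats a name that is used for both a function and a data object, here is a small base-R sketch. It uses `mean` as a stand-in for caret's `train` (an assumption made purely so the example runs without extra packages). Note that in call position R skips bindings that are not functions, so the clash is confusing to read but may not by itself explain the error above.

```r
# Stand-in for the clash: `mean` plays the role of caret's `train`.
mean <- c(1, 2, 3)  # a data object that reuses a function's name

# In call position R skips non-function bindings,
# so this still dispatches to base::mean.
result <- mean(mean)

# Used as a plain value, the name now refers to the data, not the function.
is_function <- is.function(mean)

# Removing the data object restores the usual lookup.
rm(mean)
```

If the clash is a concern, one option is to give the data a different name, for example `train_df` (a hypothetical name), and call `caret::train(y ~ ., data = train_df, method = 'bayesglm')` explicitly.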
Hello, thank you very much for sharing this, but I believe
predict(fit, newdata = train) should be run on the test set rather than the training set, as this link suggests: https://cran.r-project.org/web/packages/caret/vignettes/caret.html
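A minimal base-R sketch of what scoring on held-out data could look like. The vectors `actual` and `predicted` are hypothetical placeholders (the gist provides no true label for its test sentence); in practice `predicted` would come from something like predict(fit, newdata = test).

```r
# Hypothetical true labels for four held-out documents (not from the gist).
actual <- factor(c(0, 1, 1, 0), levels = c(0, 1))

# Hypothetical predictions, standing in for predict(fit, newdata = test).
predicted <- factor(c(0, 1, 0, 0), levels = c(0, 1))

# Cross-tabulate predictions against the true classes.
confusion <- table(predicted = predicted, actual = actual)
print(confusion)

# Share of held-out documents that were classified correctly.
accuracy <- mean(predicted == actual)
print(accuracy)
```

Accuracy measured on the training documents tends to be optimistic, which is why the held-out comparison is the more useful number.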