@cigrainger
Created February 28, 2014 18:11
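library(tm)
library(textcat)
library(dplyr)
library(topicmodels)

# Loads the `patents_titles` data frame (expects an `appln_title` column).
load('patents_titles.rdata')
# write.table(patents_titles,file='patentstitles.csv')
# patents_titles <- patents_titles[sample(1:nrow(patents_titles),100000,replace=FALSE),]

# Normalise titles: lower-case, then strip everything except letters, digits and spaces.
patents_titles$appln_title <- tolower(patents_titles$appln_title)
patents_titles$appln_title <- gsub("[^[:alnum:] ]", "", patents_titles$appln_title)

# Detect each title's language and keep only the English ones.
patents_titles$language <- textcat(patents_titles$appln_title)
english_titles <- filter(patents_titles, language == 'english')

# Build the corpus; named `corpus` rather than `c` to avoid masking base::c().
corpus <- Corpus(VectorSource(english_titles$appln_title))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)

dtm <- DocumentTermMatrix(corpus)
# Terms that occur at least 1000 times across all titles.
findFreqTerms(dtm, 1000)

# LDA() stops with an error if any document has no terms left, so drop empty rows first.
dtm <- dtm[slam::row_sums(dtm) > 0, ]
lda <- LDA(dtm, k = 50)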