library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345, 1:10])
#### prepare workspace
rm(list = ls(all.names = TRUE))
gc()
#### get data into the R session
# set R's working directory
setwd("C:/Users/Marwick/Downloads/JSTOR") # change this to where you downloaded the data!
# Get zip file of CSVs from JSTOR and unzip
# this may take a few minutes...
unzip("2013.4.20.FxFmBVYd.zip")
---
title: "Untitled"
author: "Ben Marwick"
date: "Wednesday, September 24, 2014"
output: html_document
---
## Introduction
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. You should read through this entire document very carefully before making any changes or pressing any buttons.
library(topicmodels)
data(AssociatedPress)
train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]
train.lda <- LDA(train, k = 5)
# Determine the posterior probabilities of the topics
# for each document and of the terms for each topic
############################################################
# History of Topics in American Archaeology
############################################################
#### prepare workspace
rm(list = ls(all.names = TRUE))
gc()
#### get data into the R session
# set R's working directory
# get reproducible data
reut21578 <- system.file("texts", "crude", package = "tm")
(r <- Corpus(DirSource(reut21578),
             readerControl = list(reader = readReut21578XMLasPlain)))
# give Java 2 GB of heap space; this must be set before the JVM is loaded
options(java.parameters = "-Xmx2g")
# install and load packages if not already installed
list.of.packages <- c("tm", "openNLP", "openNLPmodels.en")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
Package: mypackage
Title: What the package does (short line)
Version: 0.1
Authors@R: "First Last <[email protected]> [aut, cre]"
Description: What the package does (paragraph)
Depends:
    R (>= 3.1.1)
License: MIT
LazyData: true
VignetteBuilder: knitr
# Speed tests of different ways to read in large numbers of CSV files,
# specifically read.csv.sql, read.csv (optimised) and fread
library(sqldf)
setwd("~/Downloads/wordcounts")
# list.files() takes a regular expression, not a glob
files <- sample(list.files(".", pattern = "\\.(csv|CSV)$"), 10000)
############# read.csv.sql ###################
system.time(
# List data objects in memory with size (in kB) and mode. From http://r.789695.n4.nabble.com/Size-of-an-object-in-workspace-tp823649p823653.html
ls.kb <- function(pos = 1, sorted = FALSE){
  .result <- sapply(ls(pos = pos, all.names = TRUE),
                    function(..x) object.size(eval(as.symbol(..x))))
  if (sorted){
    .result <- rev(sort(.result))
  }
  .ls <-
    as.data.frame(rbind(as.matrix(.result), "**Total" = sum(.result)))
  names(.ls) <- "Size (kB)"
# test of storing a DTM on disk...
# reproducible data
library(tm)
data(crude)
dtm <- DocumentTermMatrix(crude)
library(filehash)
dbCreate("testDB")
db <- dbInit("testDB")