@benmarwick
benmarwick / ngrams.R
Created April 12, 2013 07:57
How to extract n-grams from a corpus with R's tm and RWeka packages. From http://tm.r-forge.r-project.org/faq.html
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
---
title: "Untitled"
author: "Ben Marwick"
date: "Wednesday, September 24, 2014"
output: html_document
---
## Introduction
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. You should read through this entire document very carefully before making any changes or pressing any buttons.
@benmarwick
benmarwick / topics-for-new-corpus.R
Created April 20, 2013 04:01
Fit a topic model to one corpus, then calculate the probability of those topics in a new corpus. Using R on Windows 7.
library(topicmodels)
data(AssociatedPress)               # DocumentTermMatrix of 2246 AP articles
train <- AssociatedPress[1:100, ]   # fit the model on the first 100 documents
test  <- AssociatedPress[101:150, ] # hold out the next 50 as the "new" corpus
train.lda <- LDA(train, 5)          # LDA with k = 5 topics
# Determine the posterior probabilities of the topics
# for each document and of the terms for each topic
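# The preview stops at this comment; a minimal sketch of the step it
# describes, using topicmodels' posterior() with the objects defined above:
train.topics <- posterior(train.lda)        # $terms and $topics for the training set
test.topics  <- posterior(train.lda, test)  # topic probabilities for the new corpus
round(test.topics$topics[1:5, ], 3)         # inspect the first five held-out documents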
@benmarwick
benmarwick / history-of-topics-JSTOR-DfR.R
Last active December 16, 2015 11:09
Start with a zip file from JSTOR's DfR and end with a summary of the topics, historical trends in the topics, topics that are becoming more prominent, and topics that have declined. Entirely in R; topics modeled using LDA.
############################################################
# History of Topics in American Archaeology
############################################################
#### prepare workspace
rm(list = ls(all.names = TRUE))
gc()
#### get data into the R session
# set R's working directory
setwd("C:/Users/Marwick/Downloads/JSTOR") # change this to where you downloaded the data!
# get the zip file of CSVs from JSTOR and unzip it; this may take a few minutes...
unzip("2013.4.20.FxFmBVYd.zip")
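# The preview ends here; a plausible next step after the unzip above
# (the wordcounts/ folder and file pattern are assumptions based on how
# DfR delivers word-count data, not code from the gist itself):
csv.files <- list.files("wordcounts", pattern = "\\.csv$", full.names = TRUE)
length(csv.files)  # how many documents arrived in the archive?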
@benmarwick
benmarwick / POS-tagging-speed-tests.R
Created April 24, 2013 06:38
Speed tests of part-of-speech tagging with and without garbage collection.
# get reproducible data: the Reuters crude corpus that ships with tm
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(r <- Corpus(DirSource(reut21578),
             readerControl = list(reader = readReut21578XMLasPlain)))
# voodoo to give Java 2 GB of RAM; must be set before the JVM is loaded,
# i.e. before attaching any rJava-based package such as openNLP
options(java.parameters = "-Xmx2g")
# install packages if not already present, then load them
list.of.packages <- c("tm", "openNLP", "openNLPmodels.en")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
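# The preview cuts off here; a plausible completion of the install-and-load
# idiom (the lapply() step is an assumption, not necessarily the gist's code;
# note openNLPmodels.en is hosted at http://datacube.wu.ac.at rather than on
# CRAN, so install.packages() may need a repos argument):
if (length(new.packages)) install.packages(new.packages)
invisible(lapply(list.of.packages, library, character.only = TRUE))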
@benmarwick
benmarwick / DESCRIPTION
Last active December 16, 2015 15:49
Basic package creation steps in RStudio and GitHub
Package: mypackage
Title: What the package does (short line)
Version: 0.1
Authors@R: "First Last <first.last@example.com> [aut, cre]"
Description: What the package does (paragraph)
Depends:
    R (>= 3.1.1)
License: MIT + file LICENSE
LazyData: true
Suggests:
    knitr
VignetteBuilder: knitr
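The file above is just the DESCRIPTION; as a minimal sketch of the creation steps the title refers to, the standard devtools calls (the exact sequence is an assumption, not the gist's own list) would be:
library(devtools)
create("mypackage")     # scaffold mypackage/ with DESCRIPTION, R/, etc.
# ... write functions in R/, add roxygen2 comments ...
document("mypackage")   # generate NAMESPACE and man/ pages
check("mypackage")      # run R CMD check
install("mypackage")    # install locally; push the folder to GitHub when it passes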
@benmarwick
benmarwick / various_speed_tests.R
Last active January 10, 2020 19:23
Speed tests of different ways to read large numbers of CSV files into R, specifically read.csv.sql, read.csv (optimised), and fread, plus parallel processing and disk storage options (filehash)
# Speed tests of different ways to read in large numbers of CSV files
# specifically read.csv.sql, read.csv (optimised) and fread
library(sqldf)
setwd("~/Downloads/wordcounts")
files <- sample(list.files(".", pattern = "\\.csv$", ignore.case = TRUE), 10000)
############# read.csv.sql ###################
# (the preview truncates inside system.time(); the lapply() call below is a
#  guessed completion using read.csv.sql's default query "select * from file")
system.time(
  dat.sql <- lapply(files, read.csv.sql)
)
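# A sketch of the matching fread test implied by the gist description
# (structure assumed; only the read.csv.sql chunk survives the preview):
############# fread ##########################
library(data.table)
system.time(
  dat.dt <- lapply(files, fread)
)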
@benmarwick
benmarwick / ls.kB.R
Last active December 17, 2015 14:29
List data object in memory with size (in kB) and mode. From http://r.789695.n4.nabble.com/Size-of-an-object-in-workspace-tp823649p823653.html
# List data object in memory with size (in kB) and mode. From http://r.789695.n4.nabble.com/Size-of-an-object-in-workspace-tp823649p823653.html
ls.kb <- function(pos = 1, sorted = FALSE){
  # size in bytes of every object in the given environment
  .result <- sapply(ls(pos = pos, all.names = TRUE),
                    function(..x) object.size(eval(as.symbol(..x))))
  if (sorted){
    .result <- rev(sort(.result))
  }
  # one row per object, plus a grand-total row
  .ls <- as.data.frame(rbind(as.matrix(.result), "**Total" = sum(.result)))
  names(.ls) <- "Size (kB)"
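# The preview truncates inside ls.kb(); a minimal completion, assuming the
# original converts bytes to kB and returns the data frame (the mode column
# mentioned in the description is omitted here):
  .ls[[1]] <- round(.ls[[1]] / 1024, 1)  # object.size() reported bytes
  .ls
}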
@benmarwick
benmarwick / DTM_disk_storage.R
Created June 2, 2013 08:22
Storing a DocumentTermMatrix on disk (i.e. out of memory) using the filehash and ff packages in R
# test of storing a DTM on disk...
# reproducible data
library(tm)
data(crude)
dtm <- DocumentTermMatrix(crude)
library(filehash)
dbCreate("testDB")     # create a file-backed database on disk
db <- dbInit("testDB") # open a connection to it
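# The preview ends here; a minimal sketch of the store-and-fetch round trip
# using filehash's documented API (the key name "dtm" is my choice):
dbInsert(db, "dtm", dtm)             # write the DTM out to the disk database
dtm.from.disk <- dbFetch(db, "dtm")  # pull it back into memory only when needed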