@benmarwick
benmarwick / ngrams.R
Created April 12, 2013 07:57
How to extract n-grams from a corpus with R's tm and RWeka packages. From http://tm.r-forge.r-project.org/faq.html
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
---
title: "Untitled"
author: "Ben Marwick"
date: "Wednesday, September 24, 2014"
output: html_document
---
## Introduction
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. You should read through this entire document very carefully before making any changes or pressing any buttons.
@benmarwick
benmarwick / topics-for-new-corpus.R
Created April 20, 2013 04:01
Fit a topic model to one corpus, then calculate the probability of those topics in a new corpus. Using R on Windows 7.
library(topicmodels)
data(AssociatedPress)               # DocumentTermMatrix of 2246 AP articles
train <- AssociatedPress[1:100, ]   # fit the model on the first 100 documents
test  <- AssociatedPress[101:150, ] # hold out the next 50 as the "new" corpus
train.lda <- LDA(train, 5)          # LDA with k = 5 topics
# Determine the posterior probabilities of the topics
# for each document and of the terms for each topic
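# The preview stops at this comment; a minimal sketch of the step it
# describes, using topicmodels' posterior() with the objects defined above:
train.topics <- posterior(train.lda)        # $terms and $topics for the training set
test.topics  <- posterior(train.lda, test)  # topic probabilities for the new corpus
round(test.topics$topics[1:5, ], 3)         # inspect the first five held-out documents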
@benmarwick
benmarwick / history-of-topics-JSTOR-DfR.R
Last active December 16, 2015 11:09
Start with a zip file from JSTOR's DfR and end with a summary of the topics, historical trends in the topics, topics that are becoming more prominent, and topics that have declined. Entirely in R; topics modeled using LDA.
############################################################
# History of Topics in American Archaeology
############################################################
#### prepare workspace
rm(list = ls(all.names = TRUE))
gc()
#### get data into the R session
# set R's working directory
setwd("C:/Users/Marwick/Downloads/JSTOR") # change this to where you downloaded the data!
# get the zip file of CSVs from JSTOR and unzip it; this may take a few minutes...
unzip("2013.4.20.FxFmBVYd.zip")
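# The preview ends here; a plausible next step after the unzip above
# (the wordcounts/ folder and file pattern are assumptions based on how
# DfR delivers word-count data, not code from the gist itself):
csv.files <- list.files("wordcounts", pattern = "\\.csv$", full.names = TRUE)
length(csv.files)  # how many documents arrived in the archive?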
@benmarwick
benmarwick / POS-tagging-speed-tests.R
Created April 24, 2013 06:38
Speed tests of part-of-speech tagging with and without garbage collection.
# get reproducible data: the Reuters crude corpus that ships with tm
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(r <- Corpus(DirSource(reut21578),
             readerControl = list(reader = readReut21578XMLasPlain)))
# voodoo to give Java 2 GB of RAM; must be set before the JVM is loaded,
# i.e. before attaching any rJava-based package such as openNLP
options(java.parameters = "-Xmx2g")
# install packages if not already present, then load them
list.of.packages <- c("tm", "openNLP", "openNLPmodels.en")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
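# The preview cuts off here; a plausible completion of the install-and-load
# idiom (the lapply() step is an assumption, not necessarily the gist's code;
# note openNLPmodels.en is hosted at http://datacube.wu.ac.at rather than on
# CRAN, so install.packages() may need a repos argument):
if (length(new.packages)) install.packages(new.packages)
invisible(lapply(list.of.packages, library, character.only = TRUE))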
@benmarwick
benmarwick / DESCRIPTION
Last active December 16, 2015 15:49
Basic package creation steps in RStudio and GitHub
Package: mypackage
Title: What the package does (short line)
Version: 0.1
Authors@R: "First Last <first.last@example.com> [aut, cre]"
Description: What the package does (paragraph)
Depends:
    R (>= 3.1.1)
License: MIT + file LICENSE
LazyData: true
Suggests:
    knitr
VignetteBuilder: knitr
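The file above is just the DESCRIPTION; as a minimal sketch of the creation steps the title refers to, the standard devtools calls (the exact sequence is an assumption, not the gist's own list) would be:
library(devtools)
create("mypackage")     # scaffold mypackage/ with DESCRIPTION, R/, etc.
# ... write functions in R/, add roxygen2 comments ...
document("mypackage")   # generate NAMESPACE and man/ pages
check("mypackage")      # run R CMD check
install("mypackage")    # install locally; push the folder to GitHub when it passes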
@benmarwick
benmarwick / various_speed_tests.R
Last active January 10, 2020 19:23
Speed tests of different ways to read large numbers of CSV files into R, specifically read.csv.sql, read.csv (optimised), and fread, plus parallel processing and disk storage options (filehash)
# Speed tests of different ways to read in large numbers of CSV files
# specifically read.csv.sql, read.csv (optimised) and fread
library(sqldf)
setwd("~/Downloads/wordcounts")
files <- sample(list.files(".", pattern = "\\.csv$", ignore.case = TRUE), 10000)
############# read.csv.sql ###################
# (the preview truncates inside system.time(); the lapply() call below is a
#  guessed completion using read.csv.sql's default query "select * from file")
system.time(
  dat.sql <- lapply(files, read.csv.sql)
)
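# A sketch of the matching fread test implied by the gist description
# (structure assumed; only the read.csv.sql chunk survives the preview):
############# fread ##########################
library(data.table)
system.time(
  dat.dt <- lapply(files, fread)
)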
@benmarwick
benmarwick / ls.kB.R
Last active December 17, 2015 14:29
List data object in memory with size (in kB) and mode. From http://r.789695.n4.nabble.com/Size-of-an-object-in-workspace-tp823649p823653.html
# List data object in memory with size (in kB) and mode. From http://r.789695.n4.nabble.com/Size-of-an-object-in-workspace-tp823649p823653.html
ls.kb <- function(pos = 1, sorted = FALSE){
  # size in bytes of every object in the given environment
  .result <- sapply(ls(pos = pos, all.names = TRUE),
                    function(..x) object.size(eval(as.symbol(..x))))
  if (sorted){
    .result <- rev(sort(.result))
  }
  # one row per object, plus a grand-total row
  .ls <- as.data.frame(rbind(as.matrix(.result), "**Total" = sum(.result)))
  names(.ls) <- "Size (kB)"
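# The preview truncates inside ls.kb(); a minimal completion, assuming the
# original converts bytes to kB and returns the data frame (the mode column
# mentioned in the description is omitted here):
  .ls[[1]] <- round(.ls[[1]] / 1024, 1)  # object.size() reported bytes
  .ls
}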
@benmarwick
benmarwick / DTM_disk_storage.R
Created June 2, 2013 08:22
Storing a DocumentTermMatrix on disk (i.e. out of memory) using the filehash and ff packages in R
# test of storing a DTM on disk...
# reproducible data
library(tm)
data(crude)
dtm <- DocumentTermMatrix(crude)
library(filehash)
dbCreate("testDB")     # create a file-backed database on disk
db <- dbInit("testDB") # open a connection to it
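# The preview ends here; a minimal sketch of the store-and-fetch round trip
# using filehash's documented API (the key name "dtm" is my choice):
dbInsert(db, "dtm", dtm)             # write the DTM out to the disk database
dtm.from.disk <- dbFetch(db, "dtm")  # pull it back into memory only when needed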