@benmarwick
benmarwick / R2MALLET-loop.r
Created January 16, 2013 07:58
Uses R to drive MALLET and fit topic models with different numbers of topics. A log-likelihood value is extracted from each model, and the values from all models are plotted to see which number of topics gives the highest log likelihood and so best suits the corpus. For a Windows machine.
# set up system environment for R and MALLET
MALLET_HOME <- "c:/mallet-2.0.7" # location of the bin directory
Sys.setenv("MALLET_HOME" = MALLET_HOME)
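# prepend Java to the existing PATH rather than overwriting it
Sys.setenv(PATH = paste("c:/Program Files (x86)/Java/jre7/bin", Sys.getenv("PATH"), sep = ";"))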
# configure variables and filenames for MALLET
## here using MALLET's built-in example data
# set list of topic numbers to iterate over
seq <- seq(2, 100, 1)
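The selection step named in the description (extract a log likelihood from each model, then keep the number of topics with the highest value) might be sketched as follows. This assumes MALLET's training output contains lines of the form `<iteration> LL/token: <value>`; the helper name `extract_ll` and the log lines are invented for illustration and are not real results.

```r
# Hypothetical helper: return the last LL/token value found in captured MALLET output.
# Assumes lines look like "<1000> LL/token: -8.91"; adjust the pattern if yours differ.
extract_ll <- function(log_lines) {
  hits <- grep("LL/token:", log_lines, fixed = TRUE, value = TRUE)
  if (length(hits) == 0) return(NA_real_)
  as.numeric(sub("^.*LL/token:[[:space:]]*", "", hits[length(hits)]))
}

# Made-up console output for three topic counts (placeholders, not real results)
topic_counts <- c(5, 10, 20)
logs <- list(
  c("<10> LL/token: -9.80", "<20> LL/token: -9.40"),
  c("<10> LL/token: -9.50", "<20> LL/token: -8.90"),
  c("<10> LL/token: -9.60", "<20> LL/token: -9.10")
)

# One log likelihood per model, then the topic count with the highest value
ll <- vapply(logs, extract_ll, numeric(1))
best_k <- topic_counts[which.max(ll)]
```

Calling `plot(topic_counts, ll, type = "b")` afterwards would give the kind of curve the description says is inspected by eye.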
benmarwick / R2MALLET-loop-linux.r
Last active December 11, 2015 06:28
Uses R to drive MALLET and fit topic models with different numbers of topics. A log-likelihood value is extracted from each model, and the values from all models are plotted to see which number of topics gives the highest log likelihood and so best suits the corpus. For a Linux machine.
# R interface with MALLET to loop over different numbers of topics
# on a linux machine
# first, download MALLET
# second, install java
# configure variables and filenames for MALLET
## here using MALLET's built-in example data
# set list of topic numbers to iterate over
benmarwick / googlesheet2R.r
Last active December 11, 2015 18:19
Use a Google spreadsheet as data in R (with RCurl and options to support https)
# More detail: http://blog.revolutionanalytics.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html and
# http://exploredata.wordpress.com/2012/08/20/importing-a-google-spreadsheet-into-r/
googsheet <- "full URL of google doc here, must end with &output=csv"
require(RCurl)
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
myCsv <- getURL(googsheet)
data <- read.csv(textConnection(myCsv), stringsAsFactors = FALSE)
library(sqldf)
sqldf("SELECT
day
, avg(temp) as avg_temp
FROM beaver2
GROUP BY
day;")
# day avg_temp
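For comparison, the same per-day average can be computed without SQL. This is a base R sketch using `aggregate()`, offered as an alternative to the `sqldf` query above rather than as part of the original gist; the name `avg_by_day` is chosen here for illustration.

```r
# beaver2 ships with R's built-in datasets package
avg_by_day <- aggregate(temp ~ day, data = beaver2, FUN = mean)

# rename the value column to match the alias used in the SQL query
names(avg_by_day)[2] <- "avg_temp"
avg_by_day
```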
benmarwick / VLOOKUP-with-R.R
Last active April 17, 2017 19:29
A collection of (mostly other people's) methods for reproducing Excel's VLOOKUP function in R
# Methods for doing Excel's VLOOKUP with R
# sample data
x <- data.frame(id = c(1, 2, 3, 4), name = c('foo', 'bar', 'bob', 'joe'))
y <- data.frame(idblah = c(5, 2, 4, 3, 1), sex = c('m', 'f', 'f', 'm', 'm'))
z <- data.frame(id = c(1, 2, 3, 4, 5), sex = c('g', 'b', 'b', 'g', 'g'))
# function to find a single value: match val against the first column of df
# and return the first hit from column col
vlookup <- function(val, df, col) {
  df[df[[1]] == val, col][1]
}
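Two other common ways to get VLOOKUP-style behaviour over a whole column are `match()` and `merge()`. The sketch below reuses the sample `x` and `y` from above; the variable names `looked_up` and `merged` are introduced here for illustration.

```r
# sample data (same as above)
x <- data.frame(id = c(1, 2, 3, 4), name = c('foo', 'bar', 'bob', 'joe'))
y <- data.frame(idblah = c(5, 2, 4, 3, 1), sex = c('m', 'f', 'f', 'm', 'm'))

# match(): position of each x$id within y$idblah, used to index y$sex
looked_up <- y$sex[match(x$id, y$idblah)]

# merge(): a left join that keeps every row of x, keyed on differently named columns
merged <- merge(x, y, by.x = "id", by.y = "idblah", all.x = TRUE)
```

`match()` returns `NA` for ids with no counterpart, which mirrors VLOOKUP's `#N/A`; `merge()` with `all.x = TRUE` does the same for unmatched rows.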
benmarwick / 3-correlation-methods.R
Created February 19, 2013 09:12
Three ways to calculate correlation in R. Basics of the common correlation statistics (Pearson/Kendall/Spearman), the newer distance correlation statistic (Brownian distance covariance) and the even newer maximal information coefficient (a maximal information-based nonparametric exploration (MINE) statistic) in R
# three correlation methods
duration <- faithful$eruptions # the eruption durations
waiting <- faithful$waiting # the waiting period
plot(duration, waiting)
cor(duration, waiting)
cor.test(duration, waiting)
# distance correlation statistic
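The three classical statistics named in the description are all available through the `method` argument of `cor()`. A minimal sketch that computes them side by side on the same `faithful` data (the vector name `r` is chosen here for illustration):

```r
duration <- faithful$eruptions  # the eruption durations
waiting <- faithful$waiting     # the waiting period

# compute each coefficient with the same call, varying only the method
methods <- c("pearson", "kendall", "spearman")
r <- vapply(methods, function(m) cor(duration, waiting, method = m), numeric(1))
round(r, 3)
```

Pearson measures linear association, while Kendall and Spearman are rank-based and so respond to any monotonic relationship.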
benmarwick / HTML2DTM.r
Created February 22, 2013 08:13
Take a folder of HTML files and convert them to a document term matrix for text mining. Includes removal of non-ASCII characters and iterative removal of stopwords
# get data
setwd("C:/Downloads/html") # this folder has only the HTML files
html <- list.files()
# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
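The pipeline named in the description (HTML to text, drop non-ASCII characters, drop stopwords, build a document-term matrix) can be illustrated on a toy scale with base R alone. This is a sketch under simplifying assumptions, not the gist's `tm`-based method: the two documents, the stopword list and the helper names `strip_tags` and `ascii_only` are all invented here, and the regex tag stripper is far cruder than a real HTML parser.

```r
# two tiny made-up HTML documents and a made-up stopword list
docs <- c(a = "<p>The caf\u00e9 is open</p>", b = "<div>Open the door</div>")
stopwords <- c("the", "is")

strip_tags <- function(x) gsub("<[^>]+>", " ", x)             # crude tag removal
ascii_only <- function(x) gsub("[^ -~]", "", x, perl = TRUE)  # keep printable ASCII only

clean <- ascii_only(strip_tags(docs))

# lower-case, split on anything that is not a letter, drop blanks and stopwords
tokens <- lapply(clean, function(d) {
  words <- strsplit(tolower(d), "[^a-z]+")[[1]]
  words[nzchar(words) & !(words %in% stopwords)]
})
names(tokens) <- names(docs)

# count each vocabulary term per document: rows are documents, columns are terms
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens, function(w) tabulate(match(w, vocab), nbins = length(vocab)),
                integer(length(vocab))))
colnames(dtm) <- vocab
dtm
```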
benmarwick / auto-install-and-load-packages.R
Last active February 21, 2021 01:20
A method to automatically download, install and load R packages that only requires the package name to be typed once. Credits: http://stackoverflow.com/a/8176099/1036500 & http://stackoverflow.com/a/4090208/1036500
list.of.packages <- c("xx", "yy") # replace xx and yy with package names
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages, require, character.only = TRUE)
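The same idea can be wrapped in a small function. The sketch below is one possible variant, not the method from the linked Stack Overflow answers: `missing_packages` is a hypothetical helper name, and it checks availability with `requireNamespace()` instead of scanning `installed.packages()`.

```r
# Hypothetical helper: which of these packages are not yet installed?
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

pkgs <- c("stats", "utils")  # base packages, so the sketch runs anywhere

# install only what is missing, then attach everything
to_install <- missing_packages(pkgs)
if (length(to_install) > 0) install.packages(to_install)
invisible(lapply(pkgs, library, character.only = TRUE))
```

Using `library()` rather than `require()` for the final step makes a failed load stop with an error instead of returning `FALSE` silently.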
benmarwick / ngram-RMySQL-windows-7
Last active December 15, 2015 18:49
'How to work with Google n-gram data sets in R using MySQL' http://rpsychologist.com/how-to-work-with-google-ngram-data-sets-in-r-using-mysql/ Customizations for making this work on my setup (Windows 7 x64)
http://rpsychologist.com/how-to-work-with-google-ngram-data-sets-in-r-using-mysql/
# get ngram data (files a-z) from
http://books.google.com/ngrams/datasets
# get the a-z files into one big CSV file, use cmd in folder containing all the csv files
http://www.solveyourtech.com/merge-csv-files/
copy *.csv all-ngrams.csv
# get MySQL, install, install client libraries, fuss about to make a new database
benmarwick / 3D_PCA.r
Last active December 15, 2015 21:59
Using R for three-dimensional plotting of the output of a PCA
pc <- prcomp(~ . - Species, data = iris, scale. = TRUE)
library(rgl)
plot3d(pc$x[,1:3], xlab="Component 1", ylab="Component 2", zlab="Component 3", type="n", box=FALSE, axes=TRUE)
decorate3d(xlab = "x", ylab = "y", zlab = "z",
box = TRUE, axes = TRUE, main = NULL, sub = NULL,
top = TRUE, aspect = FALSE, expand = 1.03)
spheres3d(pc$x[,1:3], radius=0.1, col=rep(c("red","green","black"), each = 50))
grid3d(c("x", "y+", "z"))
text3d(pc$x[,1:3], text=rownames(pc$x), adj=1.3)
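Before reading much into a three-dimensional plot, it can help to check how much of the total variance the first three components carry. A minimal sketch using only base R; the names `var_explained` and `cum3` are introduced here, and the PCA is refitted on the four numeric `iris` columns so the block runs on its own.

```r
# same PCA as above, written with explicit columns instead of a formula
pc <- prcomp(iris[, 1:4], scale. = TRUE)

# proportion of variance per component: squared standard deviations, normalised
var_explained <- pc$sdev^2 / sum(pc$sdev^2)

# share of the total variance shown by the three plotted components
cum3 <- sum(var_explained[1:3])
round(var_explained, 3)
```

`summary(pc)` reports the same proportions in its "Proportion of Variance" row.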