Tom Cook trcook

@trcook
trcook / notes_on_docker.md
Last active August 29, 2015 00:37
some quick ideas about how to leverage docker and aws for hpc

To get a rocker/docker image running cluster-wise with R on AWS:

The problem is that R already schedules and runs the parallelization tasks, so we need to expose that to the broader cluster and allow interconnections between containers.

One option would be to set up a redis container to manage parallel tasks via the doRedis package in R (this limits our parallelization options, though, since some commands don't run in parallel via foreach but instead run via stuff like parApply -- ???)
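The queue-based pattern behind that idea can be sketched in a few lines of Python. This is only an illustration of the architecture, not the doRedis implementation: `queue.Queue` is an in-process stand-in for the redis server, threads stand in for worker containers, and `run_work_queue` and `worker_fn` are hypothetical names.

```python
import queue
import threading


def run_work_queue(tasks, worker_fn, n_workers=3):
    """Hand tasks to workers through a shared queue and collect the results.

    Assumption: queue.Queue stands in for the redis server, and each thread
    stands in for a worker container pulling tasks from it.
    """
    tasks = list(tasks)
    task_q = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = task_q.get()
            if item is None:  # sentinel: no more work for this worker
                break
            idx, payload = item
            out = worker_fn(payload)
            with lock:
                results[idx] = out

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # The "master" only enqueues work; it never talks to a worker directly.
    for idx, payload in enumerate(tasks):
        task_q.put((idx, payload))
    for _ in threads:
        task_q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    # Return results in the order the tasks were submitted.
    return [results[i] for i in range(len(tasks))]
```

For example, `run_work_queue([1, 2, 3, 4], lambda x: x * x)` squares each task on whichever worker picks it up. The point of the design is that master and workers only share the queue, which is what lets the workers live in separate containers.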

Try ECS -- set up several container instances. Specify each slave and master node as taking up the full resources of its container instance. There should be essentially 3 big steps here.

@trcook
trcook / distance match.r
Last active August 29, 2015 14:28
This is a quick way to use Levenshtein distance to match two sets of entries in a 'fuzzy' way. The key is to match each entry in one set with its nearest neighbor in the other set, measuring distance as 'edit distance'.
my_match<-function(m,sp,sp_match='name_spelling',m_match='team',robust=T){
require(RecordLinkage)
require(plyr)
matches<-ldply(m[,m_match],.fun=function(x){
distances<-levenshteinDist(x,sp[,c(sp_match)])
most_likely<-sp[which(distances==min(distances)),]
if(robust==T){
if(length(most_likely[,1])>1){
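The nearest-neighbor idea described above can also be sketched in plain Python. This is a minimal illustration, not a port of the R function: `levenshtein` and `nearest_matches` are hypothetical names, the candidate list is assumed to be non-empty, and ties are simply all returned rather than resolved.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def nearest_matches(entries, candidates):
    """Map each entry to the candidate(s) at minimum edit distance.

    Assumes candidates is non-empty. Ties are kept, so each value is a list.
    """
    matches = {}
    for entry in entries:
        distances = [levenshtein(entry, c) for c in candidates]
        best = min(distances)
        matches[entry] = [c for c, d in zip(candidates, distances) if d == best]
    return matches
```

Calling `nearest_matches(["Yankes"], ["Yankees", "Red Sox"])` pairs the misspelled entry with `"Yankees"`. An entry with more than one match signals an ambiguous case that probably needs a manual look.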
@trcook
trcook / mturk.py
Last active August 29, 2015 14:24
auto assign mturk qualifications from a hit using boto
from boto import *
mturk=connect_mturk()
x=mturk.get_all_hits()
m=list(x)
b=m
m=list(x)
@trcook
trcook / ggplot.r
Created June 11, 2015 20:52
random notes on ggplot
# use aes_string when constructing a plot in a function.
# when needed, use deparse(substitute(x)) to get an argument as a string -- sometimes useful in aes_string
# use paste in aes_string to construct a string itself
# reset data source in geom_text to avoid chunky text in labels.
# all of these little tricks are used in this function:
my_grid_plot_function<-function(dat=eugene_mod2_qi,byrange=seq(0,1,.1),by_iter='entry_recency',group="factor(ambiguitymA)",xax="betmB",print=F,outfile=file.path(dissertation_root,"Construction/Graphics/grid1.pdf")){
require(gridExtra)
eval(substitute(dat[,j_iter:=x],list(x=as.name(by_iter))))
@trcook
trcook / workaround.md
Created June 4, 2015 02:22
the eccentricities of relogit

relogit is helpful, but it doesn't play nicely with formulas. If you pass a stored formula to relogit, use this workaround to avoid the error "object of type 'symbol' is not subsettable":

require(Zelig)
form1<-formula(y~x+z)
relogit(as.formula(form1),data=...)
@trcook
trcook / trycatch.R
Created May 29, 2015 20:35
Exit Statuses TryCatch in R
#' tryCatch in R is set up in a really weird way. Usually you pass handlers for 3+ conditions as parameters (message, warning, error), each one the function you would like run to handle that condition.
#' As such, you want to make use of the return command to return an exit status.
#' The big caveat is that you want to use return to give exit statuses, but in the expression being 'tried' itself you can't use return (unless you want to wrap the expression in a function, which is a pain).
#' To get an exit status, then, make the last value of the expression your exit status, and give the alternative exit statuses as return statements in the handler parameters to tryCatch.
#' example:
exit_status<-tryCatch({
x<-1

tmux shortcuts & cheatsheet

start new:

tmux

start new with session name:

tmux new -s myname
@trcook
trcook / 0 ec2 and s3 notes.md
Last active August 29, 2015 14:17
my random notes on using ec2 and s3

About s3 and location syntax

Bucket location in 's3:' notation is s3://<bucket name>. This is different from the public HTTP URI, which will be https://<bucket name>.s3.amazonaws.com/file

For the most part, getting/putting is done with s3 syntax.
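The relationship between the two notations can be written out as a small Python helper. This is a sketch under stated assumptions: `s3_uri_to_https` is a hypothetical name, it only produces the `https://<bucket name>.s3.amazonaws.com/...` form mentioned above, and it does no URL-encoding of the key or checking that the object is actually public.

```python
def s3_uri_to_https(uri):
    """Convert s3://<bucket>/<key> into https://<bucket>.s3.amazonaws.com/<key>.

    Assumption: the key needs no URL-encoding and the bucket name is valid.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("expected an s3:// URI, got: " + uri)
    # Everything up to the first slash is the bucket; the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name in: " + uri)
    return "https://{}.s3.amazonaws.com/{}".format(bucket, key)
```

For example, `s3_uri_to_https("s3://my-bucket/data/file.csv")` gives the public-style address for the same object, where `my-bucket` is a placeholder bucket name.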

getting and putting (ec2)

  1. install s3cmd via apt-get
  2. s3cmd --configure
@trcook
trcook / botoexamples.py
Last active August 29, 2015 14:17
useful boto stuff
# make sure boto is configured
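import re  # re.search is called by its qualified name below
from datetime import datetime  # needed for datetime.isoformat / datetime.now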
from boto import *
con=connect_ec2()
con.get_spot_price_history(instance_type='c4.large')
hist= con.get_spot_price_history(start_time=datetime.isoformat(datetime.now()),availability_zone='us-east-1c')
# get prices for all c4 instances that are compute optimized and usually reasonably priced
[m.string for m in [re.search('c4.large|c4.xlarge|c4.2xlarge',str(i)) for i in hist if i.price<.03] if m]
'''
@trcook
trcook / InstallR.sh
Last active August 29, 2015 14:16
Script that installs a bleeding-edge R on an EC2 server running a Linux distro that's not 3 years old. See: http://www.stat.yale.edu/~jay/EC2/CreateFromScratch.html
#!/bin/sh
##
## InstallR.sh for use on Amazon EC2
##
## Jay Emerson and Susan Wang, May 2013
##
## Added "doextras" for Apache, Rserve/FastRWeb, Shiny server (JE June 2013)
##
## -------------------------
## To log in the first time: