neilkod’s gists

neilkod / gist:3167664

Created July 24, 2012 02:37

upsert and preserve existing values?

	for itm in itms:
	    data = itm
	    rec=coll.find({'text':data})
	    if rec.count() == 0:
	        print "new item: {0}".format(itm.strip())
	        coll.insert({'text':data,'count':0,'last_posted': OLD_DATE})


	can the coll.find/coll.insert pair be rewritten to use an upsert (coll.update) while preserving any existing values of count (integer) and last_posted (date)?
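One possible answer, as a sketch under assumptions: MongoDB 2.4 and later provide the `$setOnInsert` update operator, whose fields are applied only when an upsert actually inserts a new document, so existing `count` and `last_posted` values are left untouched. The helper name `upsert_new_item` is hypothetical, and `coll` is assumed to be a PyMongo collection (PyMongo 3+ for `update_one`).

```python
def upsert_new_item(coll, text, old_date):
    """Create a default record for `text` unless one already exists.

    Assumes `coll` is a PyMongo collection and the server supports
    $setOnInsert (MongoDB 2.4+).
    """
    result = coll.update_one(
        {'text': text},              # match on the text field
        {'$setOnInsert': {           # applied only when a document is inserted
            'count': 0,
            'last_posted': old_date,
        }},
        upsert=True,
    )
    # upserted_id is None when a matching document already existed
    return result.upserted_id is not None
```

Inside the original loop this would replace the find/count/insert sequence, for example `if upsert_new_item(coll, itm, OLD_DATE): ...` to report new items.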

neilkod / elephant bird demo.pig

Created June 8, 2012 22:30

elephant bird demo.pig

	sample pig script, runs fine in local mode. the elephantbird magic is the JsonLoader() in the LOAD command and then
	converting user to a java map so that i can extract screen_name. I haven't read the docs yet but there may be a better way to do this. I'm sure I can combine the two generate statements into one, this is just a first attempt.

	REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
	REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
	REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
	raw = LOAD '/Users/nkodner/clean_tweets/with_deletedaa' using com.twitter.elephantbird.pig.load.JsonLoader();
	bah = limit raw 100;
	cc = foreach bah generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
	dd = foreach cc generat

neilkod / gist:2898057

Created June 8, 2012 20:49

ant test output

	nkodner@hadoop4 pig-0.10.0$ ant test
	Buildfile: /Users/nkodner/Downloads/pig-0.10.0/build.xml

	test:

	ivy-download:
	[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
	[get] To: /Users/nkodner/Downloads/pig-0.10.0/ivy/ivy-2.2.0.jar
	[get] Not modified - so not downloaded

neilkod / gist:2897924

Created June 8, 2012 20:11

	goal is (text,id,user.screen_name)

	REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
	REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
	REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
	raw = LOAD '/Users/nkodner/tweetsxxxxxx' using com.twitter.elephantbird.pig.load.JsonLoader();
	lmtd = limit raw 100;
	cc = foreach lmtd generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
	dd = foreach cc generate text,id,user#'screen_name' as name:chararray;
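For comparison, the same projection can be sketched in plain Python, assuming one JSON tweet per input line. `project_tweet` is a hypothetical name, and records that lack these fields simply yield `None` values.

```python
import json

def project_tweet(line):
    """Return (text, id, user.screen_name) for one JSON-encoded tweet."""
    tweet = json.loads(line)
    user = tweet.get('user') or {}   # tolerate records without a user object
    return (tweet.get('text'), tweet.get('id'), user.get('screen_name'))
```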

neilkod / gist:2868503

Created June 4, 2012 13:48

bash one-liner to download the google books 1-gram data

	nkodner@hadoop4 ~$ for i in {0..9}; do curl -O http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20090715-${i}.csv.zip; done
	  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	100  196M  100  196M    0     0  8514k      0  0:00:23  0:00:23 --:--:-- 16.0M
	  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	100  196M  100  196M    0     0  16.6M      0  0:00:11  0:00:11 --:--:-- 14.1M
	  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	100  196M  100  196M    0     0  15.9M      0  0:00:12  0:00:12 --:--:-- 12.3M

neilkod / benford.sh

Created June 3, 2012 20:45

benfords law on twitter data

	nkodner@hadoop4 strip_numbers$ cat numbers_from_12milliontweets.txt | awk '{print substr($1,0,1)}' | sort -n | uniq -c | sort -n
	69606 7
	70809 9
	80228 6
	80468 8
	125992 0
	131495 4
	194264 5
	369118 3
	394841 2
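The pipeline above counts the first character of the first field on each line. A rough Python equivalent, as a sketch only: the function names are hypothetical, and `benford_expected` adds the textbook Benford proportion log10(1 + 1/d) for digits 1 through 9 so the counts can be compared against it.

```python
import math
from collections import Counter

def leading_digit_counts(lines):
    """Count the first character of the first whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:                       # skip blank lines
            counts[fields[0][0]] += 1
    return counts

def benford_expected(digit):
    """Expected share of leading digit `digit` (1-9) under Benford's law."""
    return math.log10(1 + 1.0 / digit)
```

Sorting `leading_digit_counts(open(path)).items()` by count reproduces the ordering of the final `sort -n`.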

neilkod / strip_tweet.py

Created June 3, 2012 14:58

strip entities (urls, hashtags, usernames) from a tweet

	note: tweets are in json format, coming from STDIN.

	for each entity in entities, grab the start and end position. because they can appear in any order, put the (start, end) on a list. after extracting all of the entities, reverse the list and trim the string(tweet text) appropriately.

	I'll clean this up and put it in a proper repo. it's some yak-shaving i needed to do for my latest data project.


	#!/bin/python
	import json, sys
	def strip_items(str, start_pos, end_pos):
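The preview cuts off here, so the following is only a sketch of the approach described in the note, not the gist's actual code. It assumes the standard tweet layout in which `entities` maps each entity type (`urls`, `hashtags`, `user_mentions`) to a list of objects carrying `indices: [start, end]`, and it reads "reverse the list" as sorting the spans in descending order; `strip_entities` is a hypothetical name.

```python
def strip_entities(tweet):
    """Return the tweet text with every entity span cut out."""
    text = tweet['text']
    spans = []
    # Entities can arrive in any order, so collect all (start, end) pairs first.
    for items in tweet.get('entities', {}).values():
        for item in items:
            start, end = item['indices']
            spans.append((start, end))
    # Cut from the rightmost span backwards so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text
```

With tweets arriving on STDIN, one per line, this could be driven by `for line in sys.stdin: print(strip_entities(json.loads(line)))`. Cutting right to left is what keeps the remaining indices correct; removing spans left to right would shift every later offset.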

neilkod / funcs.awk

Created May 29, 2012 20:03

simple histogram in awk

	nkodner@hadoop4 tmp$ cat coins.txt
	gold 1 1986 USA American Eagle
	gold 1 1908 Austria-Hungary Franz Josef 100 Korona
	silver 10 1981 USA ingot
	gold 1 1984 Switzerland ingot
	gold 1 1979 RSA Krugerrand
	gold 0.5 1981 RSA Krugerrand
	gold 0.1 1986 PRC Panda
	silver 1 1986 USA Liberty dollar
	gold 0.25 1986 USA Liberty 5-dollar piece

neilkod / export.sql

Created May 25, 2012 16:58

export all rows where modified dt field is > 01-jan-2010

	exports data that was modified (incl. created) after a certain date, in this case 01-oct-2011

	uses analytic function to decide whether or not to add the column delimiter. in my case, i'm using || as a delimiter since my data contains tabs and commas

	it generates /tmp/TABLE_NAME.cmd.sql and then executes it while spooling TABLE_NAME.txt.

	usage:
	$ sqlplus user/pass@db @export <TABLE_NAME> <MOD_DT_FIELD_NAME>

	set echo off feedb off head off pages 0 lines 500 trimspool on verify off termout off array 1000

neilkod / gist:2270775

Created April 1, 2012 02:39

	data at https://raw.github.com/neilkod/2012_mb_corporate_run/master/data/results_2012.tsv

	> raw_data = read.csv('/path/to/data',header=FALSE, sep='\t',stringsAsFactors=FALSE)
	> names(raw_data) <- c('overall_position','gender_position','bib','name','time','seconds','minutes','gender','team')
	> raw_data[raw_data$team=="Motorola Mobility",]
	    overall_position gender_position  bib            name  time
	37                37              33 2271   Roberto Munoz 20:52
	43                43              39 2253   Nicolas Guyot 21:05
	95                95              87 2264     Steve Lloyd 22:15
	125              125             112 2231 Ronald Bochenek 22:35
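The same filter can be sketched with Python's standard `csv` module, assuming the file is tab-separated with no header row and the nine columns named in the R session above. `rows_for_team` is a hypothetical helper name.

```python
import csv

COLUMNS = ['overall_position', 'gender_position', 'bib', 'name', 'time',
           'seconds', 'minutes', 'gender', 'team']

def rows_for_team(tsv_lines, team):
    """Return the result rows (as dicts) whose team column equals `team`.

    `tsv_lines` is any iterable of tab-separated lines, e.g. an open file.
    """
    reader = csv.DictReader(tsv_lines, fieldnames=COLUMNS, delimiter='\t')
    return [row for row in reader if row['team'] == team]
```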

neil kodner neilkod