for itm in itms:
    data = itm
    rec = coll.find({'text': data})
    if rec.count() == 0:
        print "new item: {0}".format(itm.strip())
        coll.insert({'text': data, 'count': 0, 'last_posted': OLD_DATE})

Can the coll.find/coll.insert be rewritten to use upsert (coll.update) **and preserve any existing values of count (integer) and last_posted (date)**?
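One possible answer is MongoDB's `$setOnInsert` operator, which applies its fields only when the upsert actually inserts a document, so an existing `count` and `last_posted` are left untouched. A minimal sketch, assuming a PyMongo 3+ collection (`update_one` with `upsert=True`) and a server new enough to support `$setOnInsert` (MongoDB 2.4 or later); the helper name `add_if_new` is hypothetical and not part of the original snippet:

```python
def add_if_new(coll, text, old_date):
    """Insert {'text': text} with default fields only if it does not exist yet.

    `coll` is assumed to be a pymongo Collection (assumption, not from the
    original snippet). Returns True when a new document was inserted and
    False when a matching document already existed.
    """
    result = coll.update_one(
        {'text': text},
        # $setOnInsert is ignored when the filter matches an existing
        # document, so stored values of count and last_posted are preserved.
        {'$setOnInsert': {'count': 0, 'last_posted': old_date}},
        upsert=True,
    )
    # upserted_id is None unless the call created a new document.
    return result.upserted_id is not None
```

Inside the loop, `if add_if_new(coll, itm, OLD_DATE):` followed by the "new item" print would stand in for the find/count/insert sequence, with one round trip per item instead of two.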
Sample Pig script; runs fine in local mode. The elephant-bird magic is the JsonLoader() in the LOAD command, followed by converting user to a Java map so that I can extract screen_name. I haven't read the docs yet, so there may be a better way to do this. I'm sure I can combine the two generate statements into one; this is just a first attempt.
REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/clean_tweets/with_deletedaa' using com.twitter.elephantbird.pig.load.JsonLoader();
bah = limit raw 100;
cc = foreach bah generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generat
nkodner@hadoop4 pig-0.10.0$ ant test
Buildfile: /Users/nkodner/Downloads/pig-0.10.0/build.xml
test:
ivy-download:
      [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
      [get] To: /Users/nkodner/Downloads/pig-0.10.0/ivy/ivy-2.2.0.jar
      [get] Not modified - so not downloaded
Goal is (text,id,user.screen_name).
REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/tweetsxxxxxx' using com.twitter.elephantbird.pig.load.JsonLoader();
lmtd = limit raw 100;
cc = foreach lmtd generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generate text,id,user#'screen_name' as name:chararray;
nkodner@hadoop4 ~$ for i in {0..9}; do curl -O http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20090715-${i}.csv.zip; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  196M  100  196M    0     0  8514k      0  0:00:23  0:00:23 --:--:-- 16.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  196M  100  196M    0     0  16.6M      0  0:00:11  0:00:11 --:--:-- 14.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  196M  100  196M    0     0  15.9M      0  0:00:12  0:00:12 --:--:-- 12.3M
nkodner@hadoop4 strip_numbers$ cat numbers_from_12milliontweets.txt |awk '{print substr($1,0,1)}'|sort -n|uniq -c|sort -n
 69606 7
 70809 9
 80228 6
 80468 8
125992 0
131495 4
194264 5
369118 3
394841 2
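The same leading-character tally can be sketched in Python with `collections.Counter`. This is only an illustrative equivalent of the pipeline above, not something from the original session; the function name `leading_char_counts` is hypothetical, and unlike the awk version it skips blank lines instead of counting them as an empty entry.

```python
from collections import Counter


def leading_char_counts(lines):
    """Count the first character of the first whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0][0]] += 1
    return counts
```

Passing an open file handle for numbers_from_12milliontweets.txt and printing `sorted(counts.items(), key=lambda kv: kv[1])` would list the characters from least to most frequent, matching the ordering of the final `sort -n`.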
Note: tweets are in JSON format, coming from STDIN.
For each entity in entities, grab the start and end position. Because entities can appear in any order, put each (start, end) pair on a list. After extracting all of the entities, reverse the list and trim the string (the tweet text) accordingly, working from the end so that earlier positions stay valid.
I'll clean this up and put it in a proper repo. It's some yak-shaving I needed to do for my latest data project.
#!/bin/python
import json, sys
def strip_items(str, start_pos, end_pos):
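The approach described in the note (collect the spans, then cut them out from the end of the text backwards) might look roughly like the sketch below. It is not the original script, which is cut off above. It assumes the classic Twitter JSON layout in which `entities` maps entity types such as `hashtags`, `urls`, and `user_mentions` to lists of objects carrying an `indices` pair; the function name `strip_entities` is hypothetical, and it sorts the spans in descending order rather than simply reversing the list, so that arbitrary input order is handled.

```python
def strip_entities(tweet):
    """Return the tweet text with every entity span removed.

    `tweet` is a dict parsed from one line of tweet JSON (assumed layout:
    tweet['entities'][kind] is a list of dicts with an 'indices' pair).
    """
    text = tweet["text"]
    spans = []
    for items in tweet.get("entities", {}).values():
        for item in items:
            start, end = item["indices"]
            spans.append((start, end))
    # Cut from the highest start position down so that earlier offsets
    # are still correct after each removal.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text
```

To process tweets from STDIN as in the note, each line would be parsed with `json.loads` and passed to `strip_entities`. The function leaves the surrounding whitespace in place, so a caller may want to collapse runs of spaces afterwards.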
nkodner@hadoop4 tmp$ cat coins.txt
gold     1    1986  USA                 American Eagle
gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
silver  10    1981  USA                 ingot
gold     1    1984  Switzerland         ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
gold     0.1  1986  PRC                 Panda
silver   1    1986  USA                 Liberty dollar
gold     0.25 1986  USA                 Liberty 5-dollar piece
Exports data that was modified (or created) after a certain date, in this case 01-Oct-2011.
Uses an analytic function to decide whether or not to add the column delimiter. In my case, I'm using || as the delimiter since my data contains tabs and commas.
It generates /tmp/TABLE_NAME.cmd.sql and then executes it while spooling to TABLE_NAME.txt.
usage:
$ sqlplus user/pass@db @export <TABLE_NAME> <MOD_DT_FIELD_NAME>
set echo off feedb off head off pages 0 lines 500 trimspool on verify off termout off array 1000
Data at https://raw.github.com/neilkod/2012_mb_corporate_run/master/data/results_2012.tsv
> raw_data = read.csv('/path/to/data',header=FALSE, sep='\t',stringsAsFactors=FALSE)
> names(raw_data) <- c('overall_position','gender_position','bib','name','time','seconds','minutes','gender','team')
> raw_data[raw_data$team=="Motorola Mobility",]
    overall_position gender_position  bib            name  time
37                37              33 2271   Roberto Munoz 20:52
43                43              39 2253   Nicolas Guyot 21:05
95                95              87 2264     Steve Lloyd 22:15
125              125             112 2231 Ronald Bochenek 22:35