Russell Jurney rjurney

@rjurney
rjurney / .bash_profile
Created June 22, 2012 04:10
Error when storing to HCatalog from Pig
export HADOOP_HOME=/home/hadoop
export HCAT_HOME=/usr/local/hcat
export PIG_HOME=/home/hadoop/pig-0.10.0
export HIVE_HOME=/home/hadoop/hive-0.9.0
export FORREST_HOME=/home/hadoop/apache-forrest-0.9
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HIVE_HOME/lib/hive-metastore-0.9.0.jar:\
$HIVE_HOME/lib/libthrift-0.7.0.jar:$HIVE_HOME/lib/hive-exec-0.9.0.jar:$HIVE_HOME/lib/libfb303-0.7.0.jar:\
$HIVE_HOME/lib/jdo2-api-2.3-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:$HIVE_HOME/lib/slf4j-api-1.6.1.jar
export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:10001
rjurney / faceted_search.json
Created July 7, 2012 02:56
Elasticsearch: Faceted search for emails from one address
russell-jurneys-macbook-pro:pig rjueny$ curl -XPOST "http://localhost:9200/email/email/_search?pretty=true" -d '
{
"query" : { "*" },
"facets" : {
"tags" : { "terms" : {"email.froms.from.address" : "[email protected]"} }
}
}
'
{
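For comparison, here is a minimal sketch of what a well-formed request body could look like under the legacy `facets` section of the search API (since replaced by aggregations). It is not the gist author's query. The helper name `facet_request_body` and the address `someone@example.com` are hypothetical, and the field name is copied from the request above without knowing the index mapping.

```python
import json


def facet_request_body(facet_field, filter_field=None, filter_value=None, size=10):
    """Build a search body with a legacy terms facet (hypothetical helper)."""
    if filter_field is None:
        # No filter given: match every document.
        query = {"match_all": {}}
    else:
        # Restrict hits to documents whose field holds exactly this term.
        query = {"term": {filter_field: filter_value}}
    return {
        "query": query,
        "facets": {
            # "tags" is only a label for this facet in the response.
            "tags": {"terms": {"field": facet_field, "size": size}}
        },
    }


# Facet over sender addresses across all emails.
body = facet_request_body("email.froms.from.address")
print(json.dumps(body, indent=2))

# Same facet, limited to one (hypothetical) sender address.
filtered = facet_request_body(
    "email.froms.from.address",
    filter_field="email.froms.from.address",
    filter_value="someone@example.com",
)
```

The printed JSON can be passed to curl with `-d` in place of the body shown above.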
rjurney / input.json
Created July 13, 2012 03:29
Fill in blanks in a time series in Javascript where the range of the data is "00"-"23"
[{"total":16,"sent_hour":"08"},{"total":16,"sent_hour":"09"},{"total":24,"sent_hour":"10"},{"total":14,"sent_hour":"11"},{"total":6,"sent_hour":"12"},{"total":22,"sent_hour":"13"},{"total":32,"sent_hour":"14"},{"total":14,"sent_hour":"15"},{"total":10,"sent_hour":"16"},{"total":10,"sent_hour":"17"},{"total":4,"sent_hour":"18"},{"total":8,"sent_hour":"20"},{"total":6,"sent_hour":"21"},{"total":20,"sent_hour":"22"},{"total":2,"sent_hour":"23"}]
rjurney / fill_in_blanks.py
Created July 13, 2012 04:46
Python version of same
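def fill_in_blanks(in_data):
    """Return one entry per hour "00"-"23", with total 0 for hours missing from in_data."""
    out_data = list()
    hours = ['%02d' % i for i in range(24)]
    for hour in hours:
        # Look for an existing entry for this hour
        entry = [x for x in in_data if x['sent_hour'] == hour]
        if entry:
            out_data.append(entry[0])
        else:
            # No data for this hour: fill the blank with a zero total
            out_data.append({'sent_hour': hour, 'total': 0})
    return out_data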
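A quick way to check the behavior is to run it on a few entries from input.json. The sketch below is a variation, not the gist's code: it indexes the input by hour with a dict instead of scanning the list for each hour, and keeps the first entry for an hour as the original does. The `sample` list is a three-entry excerpt of the input above.

```python
def fill_in_blanks(in_data):
    """Return one entry per hour "00"-"23", with total 0 for missing hours."""
    by_hour = {}
    for row in in_data:
        # setdefault keeps the first entry seen for an hour, like entry[0] in the gist
        by_hour.setdefault(row['sent_hour'], row)
    hours = ['%02d' % i for i in range(24)]
    return [by_hour.get(hour, {'sent_hour': hour, 'total': 0}) for hour in hours]


# Excerpt of input.json: hour "19" is missing between "18" and "20"
sample = [
    {"total": 16, "sent_hour": "08"},
    {"total": 4, "sent_hour": "18"},
    {"total": 8, "sent_hour": "20"},
]

filled = fill_in_blanks(sample)
print(len(filled))   # 24
print(filled[19])    # {'sent_hour': '19', 'total': 0}
```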
rjurney / email.js
Created August 10, 2012 15:45
Simple Node Server that Returns a JSON Document from MongoDB
// Connect to the MongoDB 'enron' database and its 'emails' collection
var Db = require("mongodb").Db,
    Server = require("mongodb").Server;
var db = new Db("enron", new Server("127.0.0.1", 27017, {}));
db.open(function(err, n_db) {
  if (err) throw err;
  db = n_db;
});
var collection = db.collection("emails");
// Setup a simple API server returning JSON
var http = require('http');
rjurney / common_crawl_sequence.pig
Created September 3, 2012 04:35
Loading Common Crawl text data in Pig with SequenceFileLoader
grunt> pages = load 'data/textData-00000' using SequenceFileLoader() as (key:chararray, value:chararray);
grunt> describe pages;
pages: {key: chararray,value: chararray}
grunt> ILLUSTRATE pages;
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rjurney / email_schema.pig
Created September 6, 2012 21:29
Problem passing Schema from front-end to back-end via UDFContext - schema will not parse!
grunt> describe emails
emails: {message_id: chararray,date: chararray,from: (address: chararray,name: chararray),subject: chararray,body: chararray,tos: {ARRAY_ELEM: (address: chararray,name: chararray)},ccs: {ARRAY_ELEM: (address: chararray,name: chararray)},bccs: {ARRAY_ELEM: (address: chararray,name: chararray)}}
rjurney / flask_splat.py
Created December 11, 2012 05:26
How do I do this splatization in Flask without being SO FREAKING UGLY?
# Enable /emails and /emails/ to serve the last 20 emails in our inbox unless otherwise specified
default_offsets = {'offset1': 0, 'offset2': 0 + config.EMAIL_RANGE}
@app.route('/', defaults=default_offsets)
@app.route('/emails', defaults=default_offsets)
@app.route('/emails/', defaults=default_offsets)
@app.route("/emails/<int:offset1>/<int:offset2>")
def list_emaildb(offset1, offset2):
    offset1 = int(offset1)
    offset2 = int(offset2)
    emails = emaildb.find()[offset1:offset2]  # Uses a MongoDB cursor
rjurney / example.pig
Created December 24, 2012 07:20 — forked from anonymous/Example.pig
I want to extend Pig's existing XMLLoader to go beyond capturing the text inside a tag and to actually create a Pig mapping of the Document Object Model the XML represents. This would be similar to elephant-bird's JsonLoader. Semi-structured data can vary, so this behavior can be risky but... I want people to be able to load JSON and XML data ea…
characters = load 'example.xml' using XMLLoader('character');
describe characters
{properties:map[], name:chararray, born:datetime, qualification:chararray}
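To make the idea concrete outside Pig, here is a rough Python sketch of mapping an XML element tree to nested records. It is not the proposed XMLLoader and not Pig code. The function `element_to_record` and the sample document are hypothetical, since example.xml is not shown. Attributes go into a `properties` map and child tags become fields, loosely mirroring the schema above. Values stay strings, so `born` is not parsed into a datetime here.

```python
import xml.etree.ElementTree as ET


def element_to_record(elem):
    """Convert an element to a dict: attributes -> 'properties', children -> fields."""
    # Note: a child tag literally named "properties" would collide with this key.
    record = {"properties": dict(elem.attrib)}
    for child in elem:
        if len(child) == 0 and not child.attrib:
            # Leaf element with no attributes: keep just its text
            value = (child.text or "").strip()
        else:
            # Nested element: recurse to preserve structure
            value = element_to_record(child)
        if child.tag in record:
            # Repeated tag: collect the values into a list
            if not isinstance(record[child.tag], list):
                record[child.tag] = [record[child.tag]]
            record[child.tag].append(value)
        else:
            record[child.tag] = value
    return record


# Hypothetical stand-in for example.xml
SAMPLE = """
<characters>
  <character id="1">
    <name>Example Name</name>
    <born>1900-01-01</born>
    <qualification>example</qualification>
  </character>
</characters>
""".strip()

root = ET.fromstring(SAMPLE)
records = [element_to_record(c) for c in root.iter("character")]
print(records[0])
```

Because semi-structured input varies, two records from the same file can end up with different fields or with a list where another record has a scalar. That is the risk the description mentions.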
rjurney / gist:4662459
Created January 29, 2013 07:27
Works!
/* Avro uses json-simple, and is in piggybank until Pig 0.12, where AvroStorage and TrevniStorage are builtins */
REGISTER /me/Software/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/Software/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /me/Software/pig/contrib/piggybank/java/piggybank.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
REGISTER /me/Software/varaha/lib/*.jar /* */
REGISTER /me/Software/varaha/target/varaha-1.0-SNAPSHOT.jar