@cneud
Created April 18, 2014 15:55
Pig script for ARC analysis using WarcBase and Tika UDFs
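-- Register the WarcBase fat jar, which provides ArcLoader and the UDFs used below.
register './warcbase_kb/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
-- Load the ARC records as (url, date, mime, content) tuples.
raw = load '/tmp/IAH-20080430204825-00000-blackbook.arc.gz' using
org.warcbase.pig.ArcLoader() as (url: chararray, date:chararray, mime:chararray, content:chararray);
-- Truncate the date to its first 12 characters and detect the MIME type from the content.
a = foreach raw generate url,mime,content,SUBSTRING(date,0,12) as date,org.warcbase.pig.piggybank.DetectMimeType(content) as tikaMime;
-- Keep only records detected as HTML.
b = filter a by (tikaMime == 'text/html');
-- Extract the raw text from each HTML record, then detect its language.
c = foreach b generate url,mime,tikaMime,date,org.warcbase.pig.piggybank.ExtractRawText(content) as txt;
d = foreach c generate url,mime,tikaMime,date,org.warcbase.pig.piggybank.DetectLanguage(txt) as lang;
-- Count records per (lang, date) group and print the result.
e = group d by (lang,date);
f = foreach e generate $0, COUNT($1);
dump f;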
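
To make the shape of the final output easier to picture, here is a minimal plain-Python sketch of what the last three statements compute: one count per (lang, date) pair, with the date cut to its first 12 characters. It does not use WarcBase or Tika. The sample records, the function name `count_by_lang_and_date`, and the assumption that ARC dates are 14-digit `YYYYMMDDhhmmss` strings are all illustrative and not taken from the gist.

```python
from collections import Counter

# Hypothetical sample records: (url, ARC date, detected language).
# These values are made up for illustration; they do not come from the ARC file.
records = [
    ("http://example.org/a", "20080430204825", "en"),
    ("http://example.org/b", "20080430204859", "en"),
    ("http://example.org/c", "20080430204901", "nl"),
]


def count_by_lang_and_date(rows):
    """Count rows per (lang, date), with date cut to its first 12 characters."""
    counts = Counter()
    for _url, date, lang in rows:
        # Mirrors SUBSTRING(date,0,12): assuming YYYYMMDDhhmmss, this drops the seconds.
        counts[(lang, date[:12])] += 1
    return counts


counts = count_by_lang_and_date(records)
for (lang, date), n in sorted(counts.items()):
    print(lang, date, n)
```

Each printed row corresponds to one tuple that `dump f;` would emit: the group key followed by the number of records in that group.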