@neilkod
Created August 9, 2010 16:13
register piggybank.jar
raw = LOAD 'parsed/' USING PigStorage('\t') AS (id:chararray,timestamp:chararray,screenname:chararray,tweet:chararray);
-- filter the tweets to only the ones that have a hashtag containing the f-word
fltr = FILTER raw BY tweet matches '.*\\#\\p{Alpha}*[Ff][Uu][Cc][Kk].*?';
-- extract the regex-matched hashtag itself; because of the greedy leading .*, this captures the last matching hashtag in each tweet.
-- note: there has to be a better way to do this
extrctd = FOREACH fltr GENERATE FLATTEN(org.apache.pig.piggybank.evaluation.string.RegexExtract(tweet,'.*(\\#\\p{Alpha}*[Ff][Uu][Cc][Kk].*?\\b)',1)) as (tweet:chararray);
-- convert to lowercase, group by hashtag, count each group, and sort by count (descending).
lowrd = FOREACH extrctd GENERATE FLATTEN(org.apache.pig.piggybank.evaluation.string.LOWER(tweet));
grpd = GROUP lowrd by $0;
cntd = FOREACH grpd GENERATE $0 as hashtag,COUNT(lowrd) as cnt;
srtd = ORDER cntd by cnt DESC;
store srtd into 'hashlower';
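
-- A possible simplification of the extract/lowercase/count steps above. This is a
-- sketch under assumptions, not the original author's code: it assumes a Pig release
-- (0.8 or later) where REGEX_EXTRACT and LOWER are built-in functions, so the
-- piggybank calls and FLATTEN are not needed. The aliases tags, tag_grps, tag_cnts,
-- tag_srtd and the 'hashlower_builtin' output path are made up for illustration.
-- The pattern is wrapped in .* on both sides so it should behave the same whether
-- the function matches the whole string or only part of it; \\w* is assumed to be
-- an acceptable stand-in for the original .*?\\b tail.
tags = FOREACH fltr GENERATE LOWER(REGEX_EXTRACT(tweet, '.*(#\\p{Alpha}*[Ff][Uu][Cc][Kk]\\w*).*', 1)) AS hashtag;
tag_grps = GROUP tags BY hashtag;
tag_cnts = FOREACH tag_grps GENERATE group AS hashtag, COUNT(tags) AS cnt;
tag_srtd = ORDER tag_cnts BY cnt DESC;
STORE tag_srtd INTO 'hashlower_builtin';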