@grundprinzip
Last active August 29, 2015 13:56
/* Load the n-gram data: one tab-separated (n-gram, count) pair per line */
data = LOAD '/data/ngrams_2/*' USING PigStorage('\t') AS ( value:chararray, cnt:int);
/* Split each n-gram on the space into a (leftv, rightv) tuple and keep the count */
splitted = FOREACH data GENERATE STRSPLIT(value, ' ') as (leftv:CHARARRAY, rightv:CHARARRAY), cnt;
/* Load Prepositions */
prepositions = LOAD '/data/prepositions.txt' USING PigStorage('\t') as (name:CHARARRAY);
/* Join on the right word, then keep only the count ($1) and the preposition ($2) */
result = JOIN splitted by $0.rightv, prepositions by $0;
result = FOREACH result GENERATE $1, $2;
/* Aggregate the result: total count per preposition; the constant 1 marks the right side */
grouped_right = GROUP result by name;
grouped_right = FOREACH grouped_right GENERATE group, SUM(result.cnt), (int) 1;
/* Repeat the join and aggregation for the left word; the constant 0 marks the left side */
result = JOIN splitted by $0.leftv, prepositions by $0;
result = FOREACH result GENERATE $1, $2;
grouped_left = GROUP result by name;
grouped_left = FOREACH grouped_left GENERATE group, SUM(result.cnt), (int) 0;
/* Combine both sides into one relation and store it */
final_result = UNION grouped_left, grouped_right;
describe final_result;
store final_result into '/data/prepositions_grouped_by_pos' using PigStorage('\t');