@ottomata
Created October 4, 2012 16:47
HTTP Response Status in bannerImpressions.log and 404.log (filtered by BannerController) 2012-10-03 between 12:00 and 16:00 UTC
# pig -p input=/user/otto/logs/404/404.log -p output=/user/otto/logs/404/status_counts_1200-1600_BannerController -p begin='2012-10-03T12:00:00' -p end='2012-10-03T16:00:00' -f ./status_count_filtered_by_time_and_BannerController.pig
404 1179718
# pig -p input=/user/otto/logs/banner1/bannerImpressions-sampled1.log-20121004 -p output=/user/otto/logs/banner1/status_counts_1200-1600 -p begin='2012-10-03T12:00:00' -p end='2012-10-03T16:00:00' -f ./status_count_filtered_by_time.pig
200 86224268
206 49
404 29
500 362
000 9367
400 3
403 182
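To put the bannerImpressions counts above in proportion, here is a small Python sketch that computes each status's share of the total. The numbers are copied from the output above; the helper name `share` is just for illustration and is not part of the original job.

```python
# Counts copied verbatim from the bannerImpressions status_counts output.
counts = {
    "200": 86224268,
    "206": 49,
    "404": 29,
    "500": 362,
    "000": 9367,
    "400": 3,
    "403": 182,
}

total = sum(counts.values())


def share(status):
    """Fraction of all counted requests that returned this status."""
    return counts[status] / total


# Print one tab-separated row per status, largest count first.
for status, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{status}\t{n}\t{share(status):.6%}")
```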
-- Filters for requests between $begin and $end, and groups by http_status.
REGISTER 'piggybank.jar'
DEFINE RegexExtract org.apache.pig.piggybank.evaluation.string.RegexExtract();
LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (
    hostname:chararray,
    udplog_sequence:chararray,
    timestamp:chararray,
    request_time:chararray,
    remote_addr:chararray,
    http_status:chararray,
    bytes_sent:chararray,
    request_method:chararray,
    uri:chararray,
    proxy_host:chararray,
    content_type:chararray,
    referer:chararray,
    x_forwarded_for:chararray,
    user_agent
);
LOG_FIELDS = FILTER LOG_FIELDS BY (timestamp >= '$begin') AND (timestamp < '$end');
STATUS = FOREACH LOG_FIELDS GENERATE FLATTEN (RegexExtract(http_status, '.*(\\d\\d\\d).*', 1)) as status:chararray;
STATUS_COUNT = FOREACH (GROUP STATUS BY $0 PARALLEL 3) GENERATE $0, COUNT($1) as num;
STORE STATUS_COUNT into '$output';
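For checking the script above against a handful of lines locally, here is a rough Python sketch of the same steps: split on spaces, keep rows with `begin <= timestamp < end`, pull three digits out of `http_status`, and count. The timestamp filter is a plain string comparison, which works because ISO-8601 timestamps of the same shape sort lexicographically in time order. The sample lines, the hostname, and the `TCP_HIT/200` status format are made up for illustration. Unlike the Pig job, which would group non-matching statuses under null, this sketch skips them.

```python
import re
from collections import Counter

# Field positions follow the AS (...) schema in the LOAD statement.
TIMESTAMP, HTTP_STATUS = 2, 5

# Same pattern as the Pig script: the greedy leading .* means the
# captured group is the last run of three digits in the field.
STATUS_RE = re.compile(r".*(\d\d\d).*")


def status_counts(lines, begin, end):
    """Count status codes for lines with begin <= timestamp < end."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) <= HTTP_STATUS:
            continue  # too short to carry a status field
        # String comparison is enough for same-format ISO-8601 timestamps.
        if not (begin <= fields[TIMESTAMP] < end):
            continue
        match = STATUS_RE.fullmatch(fields[HTTP_STATUS])
        if match:  # assumption: non-matching rows are dropped here
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, truncated after the uri field.
sample = [
    "cp1001 1 2012-10-03T11:59:59 0.001 10.0.0.1 TCP_HIT/200 512 GET http://example.org/a",
    "cp1001 2 2012-10-03T12:00:00 0.001 10.0.0.1 TCP_HIT/200 512 GET http://example.org/b",
    "cp1001 3 2012-10-03T13:30:00 0.002 10.0.0.2 TCP_MISS/404 128 GET http://example.org/c",
    "cp1001 4 2012-10-03T16:00:00 0.001 10.0.0.3 TCP_HIT/200 512 GET http://example.org/d",
]

result = status_counts(sample, "2012-10-03T12:00:00", "2012-10-03T16:00:00")
for status, num in sorted(result.items()):
    print(f"{status}\t{num}")
```

The first and last sample lines fall outside the window: the lower bound is inclusive and the upper bound is exclusive, as in the Pig FILTER.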
-- Filters for requests between $begin and $end, then filters for URIs containing 'BannerController', and groups by http_status.
REGISTER 'piggybank.jar'
DEFINE RegexExtract org.apache.pig.piggybank.evaluation.string.RegexExtract();
LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (
    hostname:chararray,
    udplog_sequence:chararray,
    timestamp:chararray,
    request_time:chararray,
    remote_addr:chararray,
    http_status:chararray,
    bytes_sent:chararray,
    request_method:chararray,
    uri:chararray,
    proxy_host:chararray,
    content_type:chararray,
    referer:chararray,
    x_forwarded_for:chararray,
    user_agent
);
LOG_FIELDS = FILTER LOG_FIELDS BY (timestamp >= '$begin') AND (timestamp < '$end');
LOG_FIELDS = FILTER LOG_FIELDS BY (uri matches '.*BannerController.*');
STATUS = FOREACH LOG_FIELDS GENERATE FLATTEN (RegexExtract(http_status, '.*(\\d\\d\\d).*', 1)) as status:chararray;
STATUS_COUNT = FOREACH (GROUP STATUS BY $0 PARALLEL 3) GENERATE $0, COUNT($1) as num;
STORE STATUS_COUNT into '$output';
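A note on the `uri matches '.*BannerController.*'` filter above: Pig's `matches` has to match the whole string, which is why the pattern is wrapped in `.*` on both sides. The Python sketch below shows the same behaviour with `re.fullmatch`. The URIs are hypothetical examples, not lines from the logs.

```python
import re

# Same pattern as the Pig FILTER. It must cover the entire URI,
# so the .* on each side absorbs whatever surrounds the keyword.
BANNER_RE = re.compile(r".*BannerController.*")


def is_banner_request(uri):
    """True if the URI contains 'BannerController' (case-sensitive)."""
    return BANNER_RE.fullmatch(uri) is not None


# Hypothetical URIs for illustration only.
uris = [
    "http://example.org/wiki/Special:BannerController?cache=/cn.js",
    "http://example.org/wiki/Main_Page",
    "http://example.org/w/bannercontroller",
]

for uri in uris:
    print(is_banner_request(uri), uri)
```

Without the surrounding `.*`, a whole-string match on `BannerController` alone would reject every URI that has anything before or after the keyword.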