Andrew Otto ottomata

@ottomata
ottomata / 404.log_status_counts.tsv
Created October 4, 2012 16:47
HTTP Response Status in bannerImpressions.log and 404.log (filtered by BannerController) 2012-10-03 between 12:00 and 16:00 UTC
# pig -p input=/user/otto/logs/404/404.log -p output=/user/otto/logs/404/status_counts_1200-1600_BannerController -p begin='2012-10-03T12:00:00' -p end='2012-10-03T16:00:00' -f ./status_count_filtered_by_time_and_BannerController.pig
404 1179718
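The command above runs status_count_filtered_by_time_and_BannerController.pig, which is not shown in this listing. The following is only a minimal sketch of what such a script might look like, not the original. It assumes the same space-delimited field layout as the other scripts here, and it assumes the timestamp field is ISO 8601 text in the same form as the begin and end parameters, so that string comparison agrees with chronological order. The aliases IN_RANGE, STATUS, and STATUS_COUNT are hypothetical names.

```pig
-- Sketch only: count HTTP statuses for BannerController requests in [$begin, $end).
LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent:chararray);
-- Assumption: timestamps look like '2012-10-03T12:00:00...', so lexicographic order is chronological order.
IN_RANGE = FILTER LOG_FIELDS BY (timestamp >= '$begin') AND (timestamp < '$end') AND (uri MATCHES '.*BannerController.*');
-- Pull the three-digit status code out of the http_status field.
STATUS = FOREACH IN_RANGE GENERATE REGEX_EXTRACT(http_status, '.*(\\d\\d\\d).*', 1) AS status:chararray;
STATUS_COUNT = FOREACH (GROUP STATUS BY status) GENERATE group AS status, COUNT(STATUS) AS num;
STORE STATUS_COUNT INTO '$output';
```

Under these assumptions, each output row would hold a status code and its count, the same shape as the row shown above.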
2011-10 de 1323354
2011-10 en 9421017
2011-10 es 1925608
2011-10 fr 972096
2011-10 ja 1359201
2011-10 ru 1165275
2011-11 de 3426583
2011-11 en 23302315
2011-11 es 4592934
2011-11 fr 2516589
@ottomata
ottomata / rc_page_requests.csv
Created October 9, 2012 15:48
monthly subdomain counts
DEFINE EXTRACT org.apache.pig.builtin.REGEX_EXTRACT_ALL();
LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent);
-- only count text/html. '-' Comes from varnish.
-- See: https://gerrit.wikimedia.org/r/gitweb?p=analytics/wikistats.git;a=blob;f=squids/SquidCountArchiveProcessLogRecord.pm;h=5b0d03d6473ce5d63afc6f9495af8651bd90f74b;hb=HEAD#l18
LOG_FIELDS = FILTER LOG_FIELDS BY content_type == 'text/html' OR (content_type == '-' AND uri MATCHES '.*(.*(\\.m\\..*?\\/wiki\\/|\\.m\\..*?\\/w\\/index.php).*).*');
-- only count 200 and 302 response statuses
LOG_FIELDS = FILTER LOG_FIELDS BY (http_status MATCHES '^.*(200|302)$');
-- Extract the Month and subdomain out of the request log fields
MONTH_SUBDOMAIN = FOREACH LO

#### Filtering on content_type and http_status
```
(content_type == '-' AND 
uri MATCHES '.*(.*(\\.m\\..*?\\/wiki\\/|\\.m\\..*?\\/w\\/index.php).*).*'))
  AND (http_status == 200 OR http_status == 302)
```

6950454000

#### Filtering on just content_type
```
(content_type == 'text/html' OR 
```
@ottomata
ottomata / gist:3860933
Created October 9, 2012 19:35
2012-09 404 counts.
2012-09 de 117320000
2012-09 en 626649000
2012-09 es 120097000
2012-09 fr 111205000
2012-09 ja 110618000
2012-09 ru 111562000
@ottomata
ottomata / gist:3861365
Created October 9, 2012 20:55
HTTP Response Status Monthly Counts, 2011-11 through 2012-09
2011-11 000 1000
2011-11 200 14079352000
2011-11 206 17290000
2011-11 301 812654000
2011-11 302 1179290000
2011-11 304 1501532000
2011-11 400 18181000
2011-11 401 166000
2011-11 403 1054012000
2011-11 404 969383000
@ottomata
ottomata / gist:3868011
Created October 10, 2012 19:56
Group By referrer, filter on BannerController
LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent);
LOG_FIELDS = FILTER LOG_FIELDS BY (uri matches '.*BannerController.*');
REFERER = FOREACH LOG_FIELDS GENERATE referer;
COUNT = FOREACH (GROUP REFERER BY $0 PARALLEL 7) GENERATE $0, COUNT($1) as num;
COUNT_SORTED = ORDER COUNT BY num DESC;
DUMP COUNT_SORTED;
STORE COUNT_SORTED into '$output';
LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent:chararray);
-- MATCHES must match the entire string, so the mobile pattern is wrapped in .* on both sides.
CANONICAL_STATUS = FOREACH LOG_FIELDS GENERATE (uri MATCHES '.*\\.m\\..*' ? 'mobile' : 'desktop') as canonical:chararray, REGEX_EXTRACT(http_status, '.*(\\d\\d\\d).*', 1) as status:chararray;
COUNT = FOREACH (GROUP CANONICAL_STATUS BY (canonical, status) PARALLEL 7) GENERATE FLATTEN(group), COUNT($1) as num;
STORE COUNT into '$output';
DEFINE EXTRACT org.apache.pig.builtin.REGEX_EXTRACT_ALL();
LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent:chararray);
CANONICAL_STATUS = FOREACH LOG_FIELDS GENERATE (uri MATCHES '.*\\.m\\..*' ? 'mobile' : 'desktop') as canonical:chararray, FLATTEN (EXTRACT(http_status, '.*(\\d\\d\\d)')) as status:chararray;
COUNT = FOREACH (GROUP CANONICAL_STATUS BY (canonical, status) PARALLEL 7) GENERATE FLATTEN(group), COUNT($1) as num;
COUNT = ORDER COUNT BY $0,$1;
STORE COUNT into '$output';
- <property>
- <name>mapreduce.job.reuse.jvm.num.tasks</name>
- <value>-1</value>
- </property>
-
- <property>
- <name>mapreduce.child.java.opts</name>
- <value>-Xmx512M</value>
- </property>
-