(content_type == '-' AND
uri MATCHES '.*(.*(\\.m\\..*?\\/wiki\\/|\\.m\\..*?\\/w\\/index.php).*).*'))
AND (http_status == 200 OR http_status == 302)```
6950454000
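To make the filter above easier to reason about, here is a minimal Python sketch of the same predicate. It is an illustration, not the job that produced the count: the function name `is_page_view` is hypothetical, and the assumption that the status field may carry a prefix before the three-digit code (for example `TCP_MISS/200`) is inferred from the suffix-style status match used in the Pig script rather than stated in the text.

```python
import re

# Same pattern as the Pig filter, with Pig's doubled backslashes reduced to
# single ones. Pig's MATCHES has to match the whole string, hence fullmatch.
MOBILE_URI = re.compile(r".*(.*(\.m\..*?\/wiki\/|\.m\..*?\/w\/index.php).*).*")


def is_page_view(content_type: str, uri: str, http_status: str) -> bool:
    """Return True if a log record passes the content_type/status filter."""
    is_html = content_type == "text/html"
    # A content_type of '-' is accepted only for mobile wiki URIs.
    is_mobile = content_type == "-" and MOBILE_URI.fullmatch(uri) is not None
    # Assumption: the status code is the trailing part of the field.
    status_ok = http_status.endswith(("200", "302"))
    return (is_html or is_mobile) and status_ok
```

A record with `content_type == '-'` passes only when its URI contains a `.m.` host segment followed by `/wiki/` or `/w/index.php`; every other non-HTML record is dropped.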
#### Filtering on just content_type
```(content_type == 'text/html' OR
| # pig -p input=/user/otto/logs/404/404.log -p output=/user/otto/logs/404/status_counts_1200-1600_BannerController -p begin='2012-10-03T12:00:00' -p end='2012-10-03T16:00:00' -f ./status_count_filtered_by_time_and_BannerController.pig | |
| 404 1179718 |
| 2011-10 de 1323354 | |
| 2011-10 en 9421017 | |
| 2011-10 es 1925608 | |
| 2011-10 fr 972096 | |
| 2011-10 ja 1359201 | |
| 2011-10 ru 1165275 | |
| 2011-11 de 3426583 | |
| 2011-11 en 23302315 | |
| 2011-11 es 4592934 | |
| 2011-11 fr 2516589 |
| DEFINE EXTRACT org.apache.pig.builtin.REGEX_EXTRACT_ALL(); | |
| LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent); | |
| -- only count text/html. '-' Comes from varnish. | |
| -- See: https://gerrit.wikimedia.org/r/gitweb?p=analytics/wikistats.git;a=blob;f=squids/SquidCountArchiveProcessLogRecord.pm;h=5b0d03d6473ce5d63afc6f9495af8651bd90f74b;hb=HEAD#l18 | |
| LOG_FIELDS = FILTER LOG_FIELDS BY content_type == 'text/html' OR (content_type == '-' AND uri MATCHES '.*(.*(\\.m\\..*?\\/wiki\\/|\\.m\\..*?\\/w\\/index.php).*).*'); | |
| -- only count 200 and 302 response statuses | |
| LOG_FIELDS = FILTER LOG_FIELDS BY (http_status MATCHES '^.*(200|302)$'); | |
| -- Extract the Month and subdomain out of the request log fields | |
| MONTH_SUBDOMAIN = FOREACH LO |
| 2012-09 de 117320000 | |
| 2012-09 en 626649000 | |
| 2012-09 es 120097000 | |
| 2012-09 fr 111205000 | |
| 2012-09 ja 110618000 | |
| 2012-09 ru 111562000 |
| 2011-11 000 1000 | |
| 2011-11 200 14079352000 | |
| 2011-11 206 17290000 | |
| 2011-11 301 812654000 | |
| 2011-11 302 1179290000 | |
| 2011-11 304 1501532000 | |
| 2011-11 400 18181000 | |
| 2011-11 401 166000 | |
| 2011-11 403 1054012000 | |
| 2011-11 404 969383000 |
| LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent); | |
| LOG_FIELDS = FILTER LOG_FIELDS BY (uri matches '.*BannerController.*'); | |
| REFERER = FOREACH LOG_FIELDS GENERATE referer; | |
| COUNT = FOREACH (GROUP REFERER BY $0 PARALLEL 7) GENERATE $0, COUNT($1) as num; | |
| COUNT_SORTED = ORDER COUNT BY num DESC; | |
| DUMP COUNT_SORTED; | |
| STORE COUNT_SORTED into '$output'; |
| LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent:chararray); | |
| -- MATCHES tests the whole string, so the mobile pattern is wrapped in .* on both sides. | |
| CANONICAL_STATUS = FOREACH LOG_FIELDS GENERATE (uri MATCHES '.*\\.m\\..*' ? 'mobile' : 'desktop') as canonical:chararray, FLATTEN (RegexExtract(http_status, '.*(\\d\\d\\d).*', 1)) as status:chararray; | |
| COUNT = FOREACH (GROUP CANONICAL_STATUS BY (canonical, status) PARALLEL 7) GENERATE FLATTEN(group), COUNT($1) as num; | |
| STORE COUNT into '$output'; |
| DEFINE EXTRACT org.apache.pig.builtin.REGEX_EXTRACT_ALL(); | |
| LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (hostname:chararray, udplog_sequence:chararray, timestamp:chararray, request_time:chararray, remote_addr:chararray, http_status:chararray, bytes_sent:chararray, request_method:chararray, uri:chararray, proxy_host:chararray, content_type:chararray, referer:chararray, x_forwarded_for:chararray, user_agent:chararray); | |
| -- MATCHES tests the whole string, so the mobile pattern is wrapped in .* on both sides. | |
| CANONICAL_STATUS = FOREACH LOG_FIELDS GENERATE (uri MATCHES '.*\\.m\\..*' ? 'mobile' : 'desktop') as canonical:chararray, FLATTEN (EXTRACT(http_status, '.*(\\d\\d\\d)')) as status:chararray; | |
| COUNT = FOREACH (GROUP CANONICAL_STATUS BY (canonical, status) PARALLEL 7) GENERATE FLATTEN(group), COUNT($1) as num; | |
| COUNT = ORDER COUNT BY $0,$1; | |
| STORE COUNT into '$output'; |
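The grouping that this script performs can be sketched in Python for a handful of records. This is only an illustration under stated assumptions: `status_counts` is a hypothetical helper, the input is reduced to `(uri, http_status)` pairs, and an empty string stands in for the null that Pig would produce when no status code is found.

```python
import re
from collections import Counter

# Whole-string matches, mirroring the semantics of Pig's MATCHES.
MOBILE = re.compile(r".*\.m\..*")
STATUS = re.compile(r".*(\d\d\d)")  # same pattern the script passes to EXTRACT


def status_counts(records):
    """Count records per (canonical, status) pair, sorted like ORDER BY $0,$1."""
    counts = Counter()
    for uri, http_status in records:
        canonical = "mobile" if MOBILE.fullmatch(uri) else "desktop"
        match = STATUS.fullmatch(http_status)
        # Empty string is a stand-in for Pig's null when nothing matches.
        status = match.group(1) if match else ""
        counts[(canonical, status)] += 1
    return sorted(counts.items())
```

Under whole-string semantics a bare `\.m\.` pattern would not match a full URI, which is why the sketch surrounds it with `.*` on both sides.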
| - <property> | |
| - <name>mapreduce.job.reuse.jvm.num.tasks</name> | |
| - <value>-1</value> | |
| - </property> | |
| - | |
| - <property> | |
| - <name>mapreduce.child.java.opts</name> | |
| - <value>-Xmx512M</value> | |
| - </property> | |
| - |