Note: reference article
Basic query example:
SELECT status_code,
COUNT(status_code) AS requests
FROM fastly_logs.example_com
GROUP BY requests
ORDER BY requests DESCCreate a table:
CREATE EXTERNAL TABLE `www_buzzfeed_com`(
`client_ip` string COMMENT '',
`timestamp` timestamp COMMENT '',
`method` string COMMENT '',
`path` string COMMENT '',
`http_protocol` string COMMENT '',
`status_code` smallint COMMENT '',
`response_size` int COMMENT '',
`tls_version` string COMMENT '',
`cache_status` string COMMENT '')
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='(\\S+) \\[(.+)\\] \\\"(\\S+) (\\S+) (\\S+)\\\" (\\d{3}) (\\d+) (\\S+) (\\S+)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bf-logs-archive/Fastly/www.buzzfeed.com'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1512424557')Note: the
input.regexmatches the following log format from Fastly:
12.345.678.90 [2019-04-01 00:00:01.544] "GET /foo/bar HTTP/1.1" 200 1212 TLSv1.2 HIT-CLUSTER
%h [%{%Y-%m-%d %H:%M:%S}t.%{msec_frac}t] "%r" %>s %B %{tls.client.protocol}V %{fastly_info.state}V
Rebuild indices (e.g. Athena doesn't know about new content otherwise):
MSCK REPAIR TABLE www_buzzfeed_comSelect requests based on a specific timestamp value:
SELECT * FROM www_buzzfeed_com WHERE timestamp = timestamp '2017-12-04 21:50:01.646' AND cache_status LIKE 'HIT%'Select requests based on datetime partition field:
SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02'Select requests based on a specific date range:
SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND cache_status LIKE 'MISS%'Faster than LIKE statement:
SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND regexp_like(cache_status, '^(MISS|PASS)')Order data so the most recent appears at the top and only the most recent 10 records:
SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02' ORDER BY timestamp DESC limit 10Note: see peformance tricks here
To drop a table:
DROP TABLE `www_stage_buzzfeed_com_json`;https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html
CREATE EXTERNAL TABLE `www_stage_buzzfeed_com_json`(
`backend` string,
`bodySize` int,
`contentType` string,
`fastlyState` string,
`host` string,
`ip` string,
`method` string,
`protocol` string,
`statusCode` smallint,
`timestamp` timestamp,
`tlsVersion` string,
`url` string,
`userAgent` string)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'true')
LOCATION
's3://bf-logs-archive/Fastly/json/www-stage.buzzfeed.com'
TBLPROPERTIES (
'has_encrypted_data'='false')Notice that camelCase fields are lowercased (e.g. fastlyState becomes fastlystate):
SELECT url, statuscode, fastlystate FROM www_stage_buzzfeed_com_json WHERE dt='2019-04-04' AND backend = '' ORDER BY timestamp DESC LIMIT 10To format output from command line:
cat ~/Downloads/2019-04-04T11_55_00.000-EKBlIHaTVrnIxJ7EPna7.log | tail -n 1 | python -m json.tool
{
"backend": "",
"bodySize": "240",
"contentType": "application/json",
"fastlyState": "HIT",
"host": "www.buzzfeed.com",
"ip": "70.50.102.167",
"method": "GET",
"protocol": "HTTP/1.1",
"statusCode": "200",
"timestamp": "2019-04-04 11:59:59.650",
"tlsVersion": "TLSv1.2",
"url": "/manifest.json",
"userAgent": "Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G960W Build/PPR1.180610.011) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36"
}If you set a field to have a field of type int and it's sent as an empty string "" by default by the provider of the data, then expect Hive DB to complain loudly when trying to search for data.