Skip to content

Instantly share code, notes, and snippets.

@vaquarkhan
Forked from Integralist/AWS Athena SQL.md
Created October 28, 2022 21:22
Show Gist options
  • Select an option

  • Save vaquarkhan/984110309bdc6f539c8425f54fa8d2ce to your computer and use it in GitHub Desktop.

Select an option

Save vaquarkhan/984110309bdc6f539c8425f54fa8d2ce to your computer and use it in GitHub Desktop.
[AWS Athena SQL] #aws #athena #s3 #sql

Note: reference article

Basic query example:

SELECT status_code,
         COUNT(status_code) AS requests
FROM fastly_logs.example_com
GROUP BY  requests
ORDER BY  requests DESC

Create a table:

CREATE EXTERNAL TABLE `www_buzzfeed_com`(
  `client_ip` string COMMENT '', 
  `timestamp` timestamp COMMENT '', 
  `method` string COMMENT '', 
  `path` string COMMENT '', 
  `http_protocol` string COMMENT '', 
  `status_code` smallint COMMENT '', 
  `response_size` int COMMENT '', 
  `tls_version` string COMMENT '', 
  `cache_status` string COMMENT '')
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='(\\S+) \\[(.+)\\] \\\"(\\S+) (\\S+) (\\S+)\\\" (\\d{3}) (\\d+) (\\S+) (\\S+)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bf-logs-archive/Fastly/www.buzzfeed.com'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1512424557')

Note: the input.regex matches the following log format from Fastly:

12.345.678.90 [2019-04-01 00:00:01.544] "GET /foo/bar HTTP/1.1" 200 1212 TLSv1.2 HIT-CLUSTER

%h [%{%Y-%m-%d %H:%M:%S}t.%{msec_frac}t] "%r" %>s %B %{tls.client.protocol}V %{fastly_info.state}V

Rebuild indices (e.g. Athena doesn't know about new content otherwise):

MSCK REPAIR TABLE www_buzzfeed_com

Select requests based on a specific timestamp value:

SELECT * FROM www_buzzfeed_com WHERE timestamp = timestamp '2017-12-04 21:50:01.646' AND cache_status LIKE 'HIT%'

Select requests based on datetime partition field:

SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02'

Select requests based on a specific date range:

SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND cache_status LIKE 'MISS%'

Faster than LIKE statement:

SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND regexp_like(cache_status, '^(MISS|PASS)')

Order data so the most recent appears at the top and only the most recent 10 records:

SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02' ORDER BY timestamp DESC limit 10

Note: see peformance tricks here

To drop a table:

DROP TABLE `www_stage_buzzfeed_com_json`;

JSON Format

https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html

CREATE EXTERNAL TABLE `www_stage_buzzfeed_com_json`(
  `backend` string,
  `bodySize` int,
  `contentType` string,
  `fastlyState` string,
  `host` string,
  `ip` string,
  `method` string,
  `protocol` string,
  `statusCode` smallint,
  `timestamp` timestamp,
  `tlsVersion` string,
  `url` string,
  `userAgent` string)
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true')
LOCATION
  's3://bf-logs-archive/Fastly/json/www-stage.buzzfeed.com'
TBLPROPERTIES (
  'has_encrypted_data'='false')

Notice that camelCase fields are lowercased (e.g. fastlyState becomes fastlystate):

SELECT url, statuscode, fastlystate FROM www_stage_buzzfeed_com_json WHERE dt='2019-04-04' AND backend = '' ORDER BY timestamp DESC LIMIT 10

To format output from command line:

cat ~/Downloads/2019-04-04T11_55_00.000-EKBlIHaTVrnIxJ7EPna7.log | tail -n 1 | python -m json.tool

{
    "backend": "",
    "bodySize": "240",
    "contentType": "application/json",
    "fastlyState": "HIT",
    "host": "www.buzzfeed.com",
    "ip": "70.50.102.167",
    "method": "GET",
    "protocol": "HTTP/1.1",
    "statusCode": "200",
    "timestamp": "2019-04-04 11:59:59.650",
    "tlsVersion": "TLSv1.2",
    "url": "/manifest.json",
    "userAgent": "Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G960W Build/PPR1.180610.011) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36"
}

Caution!

If you set a field to have a field of type int and it's sent as an empty string "" by default by the provider of the data, then expect Hive DB to complain loudly when trying to search for data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment