Created
December 20, 2016 19:56
-
-
Save duckworth/adc47d00bcfcbcafc616ca51b5946bbc to your computer and use it in GitHub Desktop.
AWS ALB Logs Athena RegexSerDe
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs ( | |
type string, | |
timestamp string, | |
elb string, | |
client_ip string, | |
client_port int, | |
target_ip string, | |
target_port int, | |
request_processing_time double, | |
target_processing_time double, | |
response_processing_time double, | |
elb_status_code string, | |
target_status_code string, | |
received_bytes bigint, | |
sent_bytes bigint, | |
request_verb string, | |
url string, | |
protocol string, | |
user_agent string, | |
ssl_cipher string, | |
ssl_protocol string, | |
target_group_arn string, | |
trace_id string ) | |
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' | |
WITH SERDEPROPERTIES ( | |
'serialization.format' = '1','input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) ([^ ]*)$' ) | |
LOCATION 's3://somebucket/AWSLogs/'; |
It seems like logs have recently started ending with a trailing space which breaks most of the parsers Ive seen. That should be easy to fix. I also added a little regex and couple fields for separating the http route and query params out: https://gist.github.com/szinck/d456fbf691483ab77d2453c316db3371
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Per AWS ALB log documentation when a client doesn't send a full request the target is set to '-'. Thus the regex here won't return a result for target_ip:target_port and Athena returns a blank line. Can be fixed by replacing the second instance of ([^ ]*):([0-9]*) with ([^ ]*) and merging target_ip string, target_port int into target string.
Processing times, may also fail if they're replaced with -1 when the load balancer can't process the request. Fix by changing the corresponding ([.0-9]*) with ([-.0-9]*)