Skip to content

Instantly share code, notes, and snippets.

@duckworth
Created December 20, 2016 19:56
Show Gist options
  • Save duckworth/adc47d00bcfcbcafc616ca51b5946bbc to your computer and use it in GitHub Desktop.
Save duckworth/adc47d00bcfcbcafc616ca51b5946bbc to your computer and use it in GitHub Desktop.
AWS ALB Logs Athena RegexSerDe
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
type string,
timestamp string,
elb string,
client_ip string,
client_port int,
target_ip string,
target_port int,
request_processing_time double,
target_processing_time double,
response_processing_time double,
elb_status_code string,
target_status_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1','input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) ([^ ]*)$' )
LOCATION 's3://somebucket/AWSLogs/';
@raphK
Copy link

raphK commented Jul 14, 2017

Per AWS ALB log documentation when a client doesn't send a full request the target is set to '-'. Thus the regex here won't return a result for target_ip:target_port and Athena returns a blank line. Can be fixed by replacing the second instance of ([^ ]*):([0-9]*) with ([^ ]*) and merging target_ip string, target_port int into target string.

Processing times, may also fail if they're replaced with -1 when the load balancer can't process the request. Fix by changing the corresponding ([.0-9]*) with ([-.0-9]*)

@szinck
Copy link

szinck commented Aug 13, 2017

It seems like logs have recently started ending with a trailing space which breaks most of the parsers Ive seen. That should be easy to fix. I also added a little regex and couple fields for separating the http route and query params out: https://gist.github.com/szinck/d456fbf691483ab77d2453c316db3371

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment