As of version 1.9, Apache Drill can natively ingest and query web server logs. To configure Drill to read server logs, add an httpd entry to the formats section of the dfs storage plugin configuration:
"httpd": {
"type": "httpd",
"logFormat": "%h %t \"%r\" %>s %b \"%{Referer}i\" \"%{user-agent}i\"",
"timestampFormat": null
}
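The escaped quotes in logFormat are easy to get wrong when the block is typed by hand. As a minimal sketch (Python is used here only for illustration and is not part of Drill), the snippet below starts from a raw Apache LogFormat string and lets a JSON serializer produce the escaped form shown above. The variable names are hypothetical.

```python
import json

# Raw Apache LogFormat string, as it would be written without JSON escaping.
raw_log_format = '%h %t "%r" %>s %b "%{Referer}i" "%{user-agent}i"'

# Same keys as the configuration block above.
httpd_format = {
    "httpd": {
        "type": "httpd",
        "logFormat": raw_log_format,
        "timestampFormat": None,  # serialized as JSON null
    }
}

# json.dumps escapes the embedded double quotes as \" automatically.
config_text = json.dumps(httpd_format, indent=2)
print(config_text)
```

Generating the block this way avoids unbalanced or unescaped quotes, which would make the configuration invalid JSON.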
The logFormat setting must match the format of your log files; otherwise Drill cannot parse them correctly. The table below lists the fields that can be included in log files.
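To make the matching requirement concrete, here is a rough Python sketch (an illustration only, not how Drill parses logs internally) that hand-translates the logFormat from the configuration above into a regular expression and applies it to two made-up log lines. The sample lines and the regular expression are assumptions. The group names follow the variable names in this section's field table, with dots and hyphens replaced by underscores because Python group names cannot contain them.

```python
import re

# Hand-written regex equivalent of:
#   %h %t "%r" %>s %b "%{Referer}i" "%{user-agent}i"
LOG_PATTERN = re.compile(
    r'(?P<connection_client_host>\S+) '       # %h
    r'\[(?P<request_receive_time>[^\]]+)\] '  # %t
    r'"(?P<request_firstline>[^"]*)" '        # "%r"
    r'(?P<request_status_last>\d{3}) '        # %>s
    r'(?P<response_body_bytesclf>\d+|-) '     # %b ("-" when no body was sent)
    r'"(?P<request_referer>[^"]*)" '          # "%{Referer}i"
    r'"(?P<request_user_agent>[^"]*)"'        # "%{user-agent}i"
)

# Invented sample line that follows the configured format.
matching_line = (
    '192.0.2.10 [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "http://example.com/start" "Mozilla/5.0"'
)

# Invented line in a different layout (extra logname and user fields,
# no referer or user agent), so it does not fit the configured format.
mismatched_line = (
    '192.0.2.10 - frank [10/Oct/2023:13:55:36 -0700] '
    '"GET /index.html HTTP/1.1" 200 2326'
)

fields = LOG_PATTERN.fullmatch(matching_line).groupdict()
print(fields["connection_client_host"], fields["request_status_last"])
print(LOG_PATTERN.fullmatch(mismatched_line))  # None
```

The second line fails to match because its field layout differs from the format string, which is the same kind of mismatch that prevents a log file from being parsed.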
The timestampFormat setting is optional. If you supply a format for the time stamp, Drill parses the times in the log files into Drill dates.
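As an illustration of what parsing the time stamp involves, the sketch below converts an Apache-style %t value into a date-time object in Python. This is not Drill code: the sample value is invented, and the pattern is Python strptime syntax, which will generally differ from the pattern syntax Drill expects in timestampFormat.

```python
from datetime import datetime

# Invented sample of an Apache %t value, without the enclosing brackets.
raw_time = "10/Oct/2023:13:55:36 -0700"

# Python strptime pattern: day/month abbreviation/year:hour:minute:second offset.
# Assumes an English locale for the month abbreviation.
parsed = datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z")

print(parsed.isoformat())  # 2023-10-10T13:55:36-07:00
```

Once the value is a real date-time rather than a string, it can be compared, sorted, and aggregated by time.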
Format String | Variable Name |
---|---|
%a | connection.client.ip |
%{c}a | connection.client.peerip |
%A | connection.server.ip |
%B | response.body.bytes |
%b | response.body.bytesclf |
%{Foobar}C | request.cookies.* |
%D | server.process.time |
%{Foobar}e | server.environment.* |
%f | server.filename |
%h | connection.client.host |
%H | request.protocol |
%{Foobar}i | request.header.* |
%k | connection.keepalivecount |
%l | connection.client.logname |
%L | request.errorlogid |
%m | request.method |
%{Foobar}n | server.module_note.* |
%{Foobar}o | response.header.* |
%p | request.server.port.canonical |
%{canonical}p | connection.server.port.canonical |
%{local}p | connection.server.port |
%{remote}p | connection.client.port |
%P | connection.server.child.processid |
%{pid}P | connection.server.child.processid |
%{tid}P | connection.server.child.threadid |
%{hextid}P | connection.server.child.hexthreadid |
%q | request.querystring |
%r | request.firstline |
%R | request.handler |
%s | request.status.original |
%>s | request.status.last |
%t | request.receive.time |
%{msec}t | request.receive.time.begin.msec |
%{begin:msec}t | request.receive.time.begin.msec |
%{end:msec}t | request.receive.time.end.msec |
%{usec}t | request.receive.time.begin.usec |
%{begin:usec}t | request.receive.time.begin.usec |
%{end:usec}t | request.receive.time.end.usec |
%{msec_frac}t | request.receive.time.begin.msec_frac |
%{begin:msec_frac}t | request.receive.time.begin.msec_frac |
%{end:msec_frac}t | request.receive.time.end.msec_frac |
%{usec_frac}t | request.receive.time.begin.usec_frac |
%{begin:usec_frac}t | request.receive.time.begin.usec_frac |
%{end:usec_frac}t | request.receive.time.end.usec_frac |
%T | response.server.processing.time |
%u | connection.client.user |
%U | request.urlpath |
%v | connection.server.name.canonical |
%V | connection.server.name |
%X | response.connection.status |
%I | request.bytes |
%O | response.bytes |
%{cookie}i | request.cookies |
%{set-cookie}o | response.cookies |
%{user-agent}i | request.user-agent |
%{referer}i | request.referer |
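To relate a logFormat string to the table, the following sketch (illustrative Python, not part of Drill) splits a format string into its directives and looks each one up. The lookup dictionary is a hand-copied subset of the table above, and the function name is hypothetical. Header names inside braces are assumed to match case-insensitively, since the table lists %{referer}i in lowercase while the sample configuration uses %{Referer}i.

```python
import re

# Subset of the table above: format directive -> variable name.
DIRECTIVES = {
    "%h": "connection.client.host",
    "%t": "request.receive.time",
    "%r": "request.firstline",
    "%>s": "request.status.last",
    "%b": "response.body.bytesclf",
    "%{referer}i": "request.referer",
    "%{user-agent}i": "request.user-agent",
}

# A directive is "%", an optional {argument}, an optional "<" or ">", and a letter.
DIRECTIVE_RE = re.compile(r"%(?:\{[^}]*\})?[<>]?[A-Za-z]")


def variables_for(log_format):
    """Return the table's variable name for each directive in log_format."""
    names = []
    for directive in DIRECTIVE_RE.findall(log_format):
        # Assumption: the {argument} part is compared case-insensitively.
        key = re.sub(r"\{[^}]*\}", lambda m: m.group(0).lower(), directive)
        names.append(DIRECTIVES.get(key, "(not in this subset)"))
    return names


log_format = '%h %t "%r" %>s %b "%{Referer}i" "%{user-agent}i"'
for name in variables_for(log_format):
    print(name)
```

For the sample configuration this prints seven variable names, one for each directive in the format string.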
In addition to the ability to read raw log files, there are two functions intended to be used whilst analyzing log files:
parse_url(<url>)
: This function accepts a URL as an argument and returns a map of the URL's protocol, authority, host, and path.
parse_query(<query_string>)
: This function accepts a query string and returns a key/value pairing of the variables submitted in the request.
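For readers unfamiliar with this kind of decomposition, the sketch below shows the analogous operations using Python's standard urllib.parse module. It is only an analogy: the URL is invented, and the exact keys and value types that Drill's parse_url and parse_query return are not shown here.

```python
from urllib.parse import parse_qsl, urlsplit

# Invented URL used purely for illustration.
url = "https://www.example.com:8443/search/results?q=drill&page=2"

parts = urlsplit(url)
url_map = {
    "protocol": parts.scheme,   # "https"
    "authority": parts.netloc,  # "www.example.com:8443"
    "host": parts.hostname,     # "www.example.com"
    "path": parts.path,         # "/search/results"
}

# Key/value pairs from the query string, similar in spirit to parse_query.
query_map = dict(parse_qsl(parts.query))

print(url_map)
print(query_map)  # {'q': 'drill', 'page': '2'}
```

Splitting the request URL and query string into named parts like this is what makes it possible to filter or group log records by host, path, or individual request parameters.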
A separate function, available at https://github.com/cgivre/drill-useragent-function, parses User Agent strings and returns a map of the pertinent information.