@mwfrost
Created May 24, 2012 12:31
Apache logs in R
require(plyr)
require(lubridate)
log <- read.table(file='httpd.combine.20120509')
# read.table splits on whitespace, so in the file I used the timestamp and its time zone offset came in as two fields (time and V5).
names(log) <- c('host', 'identity', 'user', 'time', 'V5', 'request', 'status', 'bytes', 'referer', 'agent')
# Paste the two fields together
log$time <- paste(log$time, log$V5, sep=' ')
# remove the extra field
log$V5 <- NULL
# extract the URIs from the request field by stripping the leading method (GET, POST, HEAD, ...) and the trailing protocol version
log$uri <- sub(' HTTP/[0-9.]+$', '', sub('^[A-Z]+ ', '', log$request))
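# convert the timestamp to something R can work with (POSIXct stores cleanly in a data frame column); the time zone offset after the space is not parsed
log$rtime <- as.POSIXct(strptime(log$time, '[%d/%b/%Y:%H:%M:%S'))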
# Identify the obvious bots from the agent field (case-insensitive, so "Bot" is caught too)
log$isbot <- grepl('bot', log$agent, ignore.case = TRUE)
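# return a named subset of the log
views <- function(x, subset.code) {
  switch(subset.code,
    'nobots' = subset(x, !isbot),                # drop the rows flagged as bots
    'pdfs' = subset(x, grepl('\\.pdf$', uri)),   # keep requests whose URI ends in .pdf
    stop('unknown subset.code: ', subset.code)   # fail loudly instead of returning NULL
  )
}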
views(log, 'pdfs')