Splunk Storm recipes for Heroku logs.
@chuckg · Last active May 18, 2020

Application statistics

average/median/95th percentile response time/connect
sourcetype=syslog 
  | regex method="GET|POST|PUT|DELETE" 
  | eval connect_time=tonumber(replace(connect, "[ms]", "")) / 1000
  | eval response_time=tonumber(replace(service, "[ms]", "")) / 1000
  | timechart span="1m" avg(response_time) as "response avg"
                      , median(response_time) as "response median"
                      , perc95(response_time) as "response 95%"
                      , avg(connect_time) as "connect avg"
requests / timeouts
sourcetype=syslog 
  | regex method="GET|POST|PUT|DELETE" 
  | timechart span="1m" count as "# requests"
                      , count(eval(match(status, "503"))) as "# timeouts"
response times / requests / timeouts
sourcetype=syslog 
  | regex method="GET|POST|PUT|DELETE" 
  | eval connect_time=tonumber(replace(connect, "[ms]", "")) / 1000
  | eval response_time=tonumber(replace(service, "[ms]", "")) / 1000
  | bin _time span=15m
  | stats avg(response_time) as "response avg"
                      , median(response_time) as "response median"
                      , perc95(response_time) as "response 95%"
                      , avg(connect_time) as "connect avg"
                      , count(eval(match(status, "503"))) as "# timeouts"
                      , count as "# requests"
                      by _time
Requests > 10s
sourcetype=syslog 
  | regex method="GET|POST" 
  | eval response_time=tonumber(replace(service, "[ms]", "")) 
  | where response_time > 10000 
  | stats count, avg(response_time), max(response_time) by path
Response times grouped as ranges
sourcetype=syslog 
  | regex method="GET|POST" 
  | eval response_time=tonumber(replace(service, "[ms]", "")) 
  | rex field=path "(?<location>[^?]+)"
  | eval range=response_time/1000
  | rangemap field=range lightgreen=2-3 darkgreen=3-5 yellow=5-10 brown=10-15 orange=15-20 pink=20-25 red=25-100 default=green
  | stats count(eval(match(range, "^green$"))) as "<2s"
        , count(eval(match(range, "lightgreen"))) as "2..3s"
        , count(eval(match(range, "darkgreen"))) as "3..5s"
        , count(eval(match(range, "yellow"))) as "5..10s"
        , count(eval(match(range, "brown"))) as "10..15s"
        , count(eval(match(range, "orange"))) as "15..20s"
        , count(eval(match(range, "pink"))) as "20..25s"
        , count(eval(match(range, "red"))) as ">25s"
        , count as "Total"
        , values(status) as "Status Codes"
        , avg(response_time)
        , max(response_time) 
    by location

Delayed Job statistics

Default log lines:

  • Jobname completed after <float>
  • <int> jobs processed at <float> j/s, <int> failed ...
Basic delayed job stats
sourcetype="syslog" "jobs processed at"
  | rex "app (?<dyno>\w+\.\d+) -" 
  | regex dyno="worker"
  | rex "(?<completed>\d+) jobs processed at (?<per_second>\d+(.\d+)?) j/s, (?<failed>\d+) failed ..."
  | eval jobs_completed=tonumber(completed) 
  | eval jobs_per_second=tonumber(per_second) 
  | eval jobs_failed=tonumber(failed) 
  | eval jobs_count=jobs_completed + jobs_failed
  | eval jobs_avg_runtime=1/jobs_per_second
  | timechart span=3h sum(jobs_count) as "# jobs"
                    , sum(jobs_failed) as "# failed"
                    , perc95(jobs_avg_runtime) as "95th run-time"
                    , avg(jobs_avg_runtime) as "avg run-time"

Heroku errors

  • R14: memory quota exceeded (>512 MB)
  • R15: memory quota vastly exceeded (more than twice the quota) and autokill incoming
Generic errors (client and server)
at=error | timechart span="1hr" count by desc
Generic errors (server only)
at=error | regex source!="client" | timechart span="1hr" count by desc
Memory quota errors:
R14 OR R15 | timechart span="1d" count

Sexier version, split by error code and message:
R14 OR R15 | rex "- Error (?<error_code>R\d\d) \((?<error_message>.+)\)" | eval error=error_code." ".error_message | timechart span="5m" count by error
Request timeouts by hour:
code=H12 | timechart span="1hr" count by desc
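Memory quota errors by dyno (a sketch that isn't in the original set: it reuses the dyno rex from the load average recipes below and assumes R14/R15 lines carry the same heroku/app prefix):
R14 OR R15 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | timechart span="1hr" count by dyno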

Load average

Using load_avg_5m as the measure (reference: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages).

load average/min/max for all services (includes web, workers, jobs, scheduler, other)
sourcetype="syslog" measure="load_avg_5m" 
  | timechart span="15m" avg(val) as avg, max(val) as max, min(val) as min by sourcetype
load average/min/max by application (web/workers)
sourcetype="syslog" measure="load_avg_5m" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="(web|worker)"
  | timechart span="15m" avg(val) as avg, max(val) as max, min(val) as min by sourcetype
load avg/min/max by web
sourcetype="syslog" measure="load_avg_5m" 
  | rex "heroku (?<dyno>\w+\.\d+) -" 
  | regex dyno="web"
  | timechart span="3hr" avg(val) as "average load", min(val) as "min load", max(val) as "max load"
load average by web dyno
sourcetype="syslog" measure="load_avg_5m" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="web"
  | timechart span="15m" avg(val) as avg by dyno
load avg/min/max by workers
sourcetype="syslog" measure="load_avg_5m" 
  | rex "heroku (?<dyno>\w+\.\d+) -" 
  | regex dyno="worker"
  | timechart span="3hr" avg(val) as "average load", min(val) as "min load", max(val) as "max load"
load average by worker dyno
sourcetype="syslog" measure="load_avg_5m" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="worker"
  | timechart span="15m" avg(val) as avg by dyno

Memory usage

average/min/max memory for all services (includes web, workers, jobs, scheduler, other)
sourcetype="syslog" measure="memory_total" 
  | timechart span="15m" avg(val) as avg, max(val) as max, min(val) as min by sourcetype
average/min/max memory by application (web/workers)
sourcetype="syslog" measure="memory_total" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="(web|worker)"
  | timechart span="15m" avg(val) as avg, max(val) as max, min(val) as min by sourcetype
average/min/max memory by web
sourcetype="syslog" measure="memory_total" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="web"
  | timechart span="3hr" avg(val) as avg, max(val) as max, min(val) as min
average memory by web dyno
sourcetype="syslog" measure="memory_total" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="web"
  | timechart span="15m" avg(val) as avg by dyno  
average memory by worker dyno
sourcetype="syslog" measure="memory_total" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="worker"
  | timechart span="15m" avg(val) as avg by dyno  

Splunk Recipes for Heroku

A collection of Splunk recipes for Heroku logs. Instructions for setting up Splunk Storm with Heroku can be found here. For the vast majority of these recipes you'll need to have enabled the Heroku Labs feature log-runtime-metrics for your application.
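If you haven't enabled it yet, something along these lines via the Heroku CLI should do it (your-app-name is a placeholder; the app may need a restart before the metrics start flowing):
heroku labs:enable log-runtime-metrics --app your-app-name
heroku restart --app your-app-name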

Unicorn

Monitor how often Unicorn Worker Killer is killing web workers.

unicorn worker restarts
"exceeds memory limit" 
  | timechart span="1m"
unicorn worker restarts by dyno
"exceeds memory limit" 
  | rex "(heroku|app) (?<dyno>\w+\.\d+) -" 
  | regex dyno="web"
  | timechart span="1m" count by dyno