@martinamps
Created November 21, 2015 16:15
Facebook: Monitoring at 250 Gbit/s
Monitoring Challenges
Collect monitoring data
Snapshot of system, monitoring changes over time
Hard at scale
Analyze monitoring data
Hard to do fast/efficiently.
Detect anomalies
Hard to get a clear signal - false positives create noise
Monitoring as a service
Service owners know their services better
Monitoring team focuses on building high quality tools
Service owners monitor their services
Incoming data is only bound by available resources and engineers' imagination
Configurable data collection
Streaming aggregation (see the sketch after this list)
Scalable and fast time series storage
Powerful time series query engine
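The notes don't show Facebook's implementation; below is a minimal sketch of the streaming-aggregation idea, assuming a tumbling one-minute window keyed by metric name (class and metric names here are hypothetical):

```python
import time
from collections import defaultdict

class StreamingAggregator:
    """Pre-aggregates raw events into fixed windows before they are
    written to time series storage (illustrative only)."""

    def __init__(self, window_secs=60):
        self.window_secs = window_secs
        # (metric, window_start) -> running sum and count
        self.buckets = defaultdict(lambda: {"sum": 0.0, "count": 0})

    def record(self, metric, value, ts=None):
        ts = time.time() if ts is None else ts
        window = int(ts) // self.window_secs * self.window_secs
        bucket = self.buckets[(metric, window)]
        bucket["sum"] += value
        bucket["count"] += 1

    def flush(self):
        """Return (metric, window_start, mean, count) rows for storage."""
        rows = [(m, w, b["sum"] / b["count"], b["count"])
                for (m, w), b in sorted(self.buckets.items())]
        self.buckets.clear()
        return rows

agg = StreamingAggregator()
agg.record("web.login.success", 1.0, ts=1000)
agg.record("web.login.success", 1.0, ts=1010)
print(agg.flush())  # [('web.login.success', 960, 1.0, 2)]
```

Pre-aggregating this way bounds write volume by the number of distinct (metric, window) pairs rather than by the raw event rate.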
Scaling data collection
started in 2008
Single service. read/write -> Maestro (did everything) -> storage
Scaled up; Maestro became a point of failure
Added routers and aggregators to help scale (June 2011, ~1M events/sec)
MySQL as storage layer
Some tables grew to 100s of GB
Had to write migration scripts to balance writes better
In 2012, switched to HBase as the storage layer; fewer hotspots.
2013: built custom in-memory time series cache (Gorilla) to improve query performance
2014: queries slow if not in the TS cache. 2 petabytes of data. Wanted to bring all data into RAM; 90% of queries were within the last 24 hours of data
16 TB of data per 24 hours. Built compression achieving ~11x (1.36 bytes/data point), shrinking a day of data to ~1.3 TB (sketched below)
70% faster queries (faster queries meant more queries; more data analysis is good!)
2015: Added 2 weeks of data to Gorilla using flash storage (275M dp/s)
Spam becomes an issue. Engineers storing useless metrics. Still working on this.
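The notes don't say how the compression works; the published Gorilla paper describes delta-of-delta timestamp encoding plus XOR'ing consecutive float values. A toy sketch of the timestamp half (the variable-length bit packing that produces the actual space savings is omitted):

```python
def delta_of_delta(timestamps):
    """Gorilla-style timestamp encoding, simplified: store the first
    timestamp, the first delta, then only delta-of-deltas. Regularly
    spaced points make the delta-of-deltas almost all zero, which the
    real format stores in a single bit each."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

# Points collected every 60 seconds compress extremely well:
first, first_delta, dods = delta_of_delta([1000, 1060, 1120, 1180, 1240])
assert first_delta == 60 and dods == [0, 0, 0]
```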
Consuming monitoring data
All charts use time series query engine. Reduce large datasets to consumable form.
Top3/bottom3; 50th, 90th, 95th percentiles; group_by(datacenter); top(3, std. dev)
Transform time series to moving avg
Combine, e.g., login success + login failed -> percentage graph
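The query language itself isn't shown in the notes; a toy Python equivalent of the last two transforms (hypothetical function names):

```python
def moving_avg(series, window=5):
    """Transform a series into its trailing moving average."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def success_percentage(successes, failures):
    """Combine two counter series (e.g. login success + login failed)
    into a single percentage series."""
    return [100.0 * s / (s + f) if s + f else 0.0
            for s, f in zip(successes, failures)]

print(moving_avg([1, 2, 3, 4, 5], window=3))  # [1.0, 1.5, 2.0, 3.0, 4.0]
print(success_percentage([99, 98], [1, 2]))   # [99.0, 98.0]
```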
Anomaly Detection
Helps you keep your infrastructure healthy. Goal is to find parts of the system misbehaving before they affect users.
Data exploration
How to find issues: find relevant metrics, then find anomalous metrics. With 4bn time series, finding interesting metrics is hard.
Fixed threshold detection gives clear signal (either over or under).
Streaming detection is good for a single time series, hard for formulas/multiple time series
Query-based detection requires fast queries. Much better though.
Without thresholds it's hard; need to detect deviations/outliers
Ideally can adjust sensitivity to balance false positives / false negatives
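The talk doesn't name a specific algorithm; one common baseline consistent with these notes pairs a fixed threshold with a z-score outlier test whose sensitivity is tunable (a hypothetical sketch, not Facebook's detector):

```python
import statistics

def fixed_threshold(series, upper):
    """Clear signal: indices of points above a fixed threshold."""
    return [i for i, v in enumerate(series) if v > upper]

def zscore_outliers(series, sensitivity=3.0):
    """Indices of points more than `sensitivity` standard deviations
    from the mean. Lowering `sensitivity` catches more anomalies at
    the cost of more false positives; raising it does the opposite."""
    mu = statistics.fmean(series)
    sigma = statistics.stdev(series)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(series)
            if abs(v - mu) / sigma > sensitivity]

data = [10, 11, 9, 10, 12, 10, 42, 11, 10]
print(fixed_threshold(data, upper=40))  # [6]
print(zscore_outliers(data, 2.0))       # [6]
```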
Takeaways
Collection: Data is only bound by engineers' imagination
Consumption: Fast queries can drive more data analysis
Detection: Query-based detection is simpler for complex queries
Fixed thresholds give clear signal, but don’t always work