Facebook: Monitoring at 250 Gbit/s
Monitoring Challenges
Collect monitoring data
Snapshot of the system, monitoring changes over time
Hard at scale
Analyze monitoring data
Hard to do fast/efficiently
Detect anomalies
Hard to get a clear signal - false positives create noise
Monitoring as a service
Service owners know their services better
Monitoring team focuses on building high quality tools
Service owners monitor their services
Incoming data is only bound by available resources and engineers' imagination
Configurable data collection
Streaming aggregation (rough sketch after this list)
Scalable and fast time series storage
Powerful time series query engine
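A rough sketch of what streaming aggregation buys you (my own illustration, not Facebook's actual pipeline; the class and method names are made up): bucket raw events into fixed time windows and ship one aggregated point per window to storage instead of every raw event.

```python
import time
from collections import defaultdict

class StreamingAggregator:
    """Bucket incoming data points into fixed time windows and emit one
    aggregated value per (key, window) instead of sending every raw event
    to the time series store."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.buckets = defaultdict(lambda: {"sum": 0.0, "count": 0})

    def record(self, key, value, ts=None):
        ts = time.time() if ts is None else ts
        window_start = int(ts // self.window) * self.window
        bucket = self.buckets[(key, window_start)]
        bucket["sum"] += value
        bucket["count"] += 1

    def flush(self):
        # Emit (key, window_start, sum, count) tuples and clear local state.
        out = [(key, win, b["sum"], b["count"])
               for (key, win), b in self.buckets.items()]
        self.buckets.clear()
        return out

agg = StreamingAggregator(window_seconds=60)
agg.record("web.requests", 1)
agg.record("web.requests", 1)
print(agg.flush())  # one aggregated point instead of two raw events
```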
Scaling data collection
Started in 2008
Single service: read/write -> Maestro (did everything) -> storage
As it scaled up, Maestro became a point of failure
Added routers and aggregators to help scale (June 2011, ~1M events/sec)
MySQL as the storage layer
Some tables got to 100s of GB
Had to write migration scripts to balance writes better
In 2012, switched to HBase as the storage layer; fewer hotspots
2013: built a custom time series cache to improve query performance
2014: Queries slow if not in the TS cache. 2 petabytes of data. Wanted to bring all data into RAM. 90% of queries touched only the last 24 hours of data
16 TB of data per 24 hours. Built compression that achieved ~11x reduction, 1.36 bytes per data point, so a day fits in ~1.3 TB (back-of-envelope check after this timeline)
70% faster queries (faster queries meant more queries, and more data analysis is good!)
2015: Added 2 weeks of data to Gorilla using flash storage (275M data points/sec)
Spam becomes an issue: engineers storing useless metrics. Still working on this.
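Back-of-envelope check of the compression figures quoted above, assuming an uncompressed point is 16 bytes (8-byte timestamp + 8-byte double value); that per-point size is my assumption, the talk didn't state it:

```python
# Check the quoted numbers: 16 TB/day raw, 1.36 bytes/point compressed.
RAW_BYTES_PER_DAY = 16e12          # ~16 TB/day
RAW_BYTES_PER_POINT = 16           # assumed: 8-byte timestamp + 8-byte double
COMPRESSED_BYTES_PER_POINT = 1.36  # figure quoted in the talk

points_per_day = RAW_BYTES_PER_DAY / RAW_BYTES_PER_POINT        # ~1e12 points
compressed_bytes = points_per_day * COMPRESSED_BYTES_PER_POINT  # ~1.36e12 bytes
ratio = RAW_BYTES_PER_POINT / COMPRESSED_BYTES_PER_POINT        # ~11.8x

print(f"~{compressed_bytes / 1e12:.2f} TB/day compressed, ~{ratio:.1f}x smaller")
# -> ~1.36 TB/day compressed, ~11.8x smaller (the talk rounds to ~1.3 TB / 11x)
```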
Consuming monitoring data
All charts use the time series query engine. Reduce large datasets to a consumable form. (Toy examples after this list.)
top 3 / bottom 3, p50/p90/p95 percentiles, group_by(datacenter), top(3, std dev)
Transform a time series into a moving average
Combine series, e.g. login success + login failed -> percentage graph
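Toy versions of these reductions on plain Python lists, just to make the ideas concrete; the function names are illustrative and not Facebook's query engine API:

```python
def moving_avg(series, window=3):
    """Transform a series into its trailing moving average."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def success_rate(successes, failures):
    """Combine two series (e.g. login success + login failed) into a percentage."""
    return [100.0 * s / (s + f) if (s + f) else 0.0
            for s, f in zip(successes, failures)]

def percentile(series, p):
    """Nearest-rank percentile, e.g. p=95 for p95."""
    ordered = sorted(series)
    k = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[k]

logins_ok = [980, 990, 940, 700, 985]
logins_fail = [20, 10, 60, 300, 15]
print(moving_avg(logins_ok))
print(success_rate(logins_ok, logins_fail))  # the dip stands out as a percentage
print(percentile(logins_ok, 95))
```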
Anomaly Detection
Helps you keep your infrastructure healthy. The goal is to find parts of the system misbehaving before they affect users.
Data exploration
How to find issues: find relevant metrics, find anomalous metrics. With 4bn time series, finding interesting metrics is hard.
Fixed threshold detection gives a clear signal (either over or under)
Streaming detection is good for a single time series, hard for formulas/multiple time series
Query-based detection requires fast queries. Much better though.
Without thresholds it's hard. Need to detect deviations/outliers
Ideally you can adjust sensitivity to balance false positives vs. false negatives (toy example after this list)
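A toy comparison of the two approaches: a fixed threshold gives a clear over/under signal, while a deviation-based detector handles series with no obvious threshold and exposes a sensitivity knob that trades false positives against false negatives. This is my own illustration (a simple z-score check); the talk didn't specify the statistical method Facebook uses:

```python
import statistics

def fixed_threshold_alert(value, upper):
    """Clear signal: fire whenever the value crosses a fixed threshold."""
    return value > upper

def zscore_alert(history, value, sensitivity=3.0):
    """Deviation-based detection when no fixed threshold exists.
    Lower sensitivity catches more anomalies but creates more false positives."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False
    return abs(value - mean) / stdev > sensitivity

recent = [120, 118, 125, 119, 122, 121, 117, 123]
print(fixed_threshold_alert(310, upper=300))       # True: clear over/under signal
print(zscore_alert(recent, 180, sensitivity=3.0))  # True: large deviation from normal
print(zscore_alert(recent, 126, sensitivity=3.0))  # False: within normal noise
```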
Collection: Data is only bound by engineers' imagination
Consumption: Fast queries can drive more data analysis
Detection: Query-based detection is simpler for complex queries
Fixed thresholds give a clear signal, but don't always work