@martinamps
Created November 21, 2015 16:15
Facebook: Monitoring at 250 Gbit/s
Monitoring Challenges
Collect monitoring data
Snapshot of system, monitoring changes over time
Hard at scale
Analyze monitoring data
Hard to do fast/efficiently.
Detect anomalies
Hard to get a clear signal - false positives create noise
Monitoring as a service
Service owners know their services better
Monitoring team focuses on building high quality tools
Service owners monitor their services
Incoming data is only bound by available resources and engineers' imagination
Configurable data collection
Streaming aggregation (see the sketch after this list)
Scalable and fast time series storage
Powerful time series query engine
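The notes don't show Facebook's implementation; below is a minimal sketch of the streaming-aggregation idea, assuming a tumbling one-minute window keyed by metric name (class and metric names here are hypothetical):

```python
import time
from collections import defaultdict

class StreamingAggregator:
    """Pre-aggregates raw events into fixed windows before they are
    written to time series storage (illustrative only)."""

    def __init__(self, window_secs=60):
        self.window_secs = window_secs
        # (metric, window_start) -> running sum and count
        self.buckets = defaultdict(lambda: {"sum": 0.0, "count": 0})

    def record(self, metric, value, ts=None):
        ts = time.time() if ts is None else ts
        window = int(ts) // self.window_secs * self.window_secs
        bucket = self.buckets[(metric, window)]
        bucket["sum"] += value
        bucket["count"] += 1

    def flush(self):
        """Return (metric, window_start, mean, count) rows for storage."""
        rows = [(m, w, b["sum"] / b["count"], b["count"])
                for (m, w), b in sorted(self.buckets.items())]
        self.buckets.clear()
        return rows

agg = StreamingAggregator()
agg.record("web.login.success", 1.0, ts=1000)
agg.record("web.login.success", 1.0, ts=1010)
print(agg.flush())  # [('web.login.success', 960, 1.0, 2)]
```

Pre-aggregating this way bounds write volume by the number of distinct (metric, window) pairs rather than by the raw event rate.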
Scaling data collection
started in 2008
Single service. read/write -> Maestro (did everything) -> storage
Scaled up; Maestro became a point of failure
Added routers and aggregators to help scale (June 2011, ~1M events/sec)
MySQL as storage layer
Some tables grew to 100s of GB
Had to write migration scripts to balance writes better
In 2012, switched to HBase as the storage layer; fewer hotspots.
2013: built custom in-memory time series cache (Gorilla) to improve query performance
2014: queries slow if not in the TS cache. 2 petabytes of data. Wanted to bring all data into RAM; 90% of queries were within the last 24 hours of data
16 TB of data per 24 hours. Built compression achieving ~11x (1.36 bytes/data point), shrinking a day of data to ~1.3 TB (sketched below)
70% faster queries (faster queries meant more queries; more data analysis is good!)
2015: Added 2 weeks of data to Gorilla using flash storage (275M dp/s)
Spam becomes an issue. Engineers storing useless metrics. Still working on this.
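The notes don't say how the compression works; the published Gorilla paper describes delta-of-delta timestamp encoding plus XOR'ing consecutive float values. A toy sketch of the timestamp half (the variable-length bit packing that produces the actual space savings is omitted):

```python
def delta_of_delta(timestamps):
    """Gorilla-style timestamp encoding, simplified: store the first
    timestamp, the first delta, then only delta-of-deltas. Regularly
    spaced points make the delta-of-deltas almost all zero, which the
    real format stores in a single bit each."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

# Points collected every 60 seconds compress extremely well:
first, first_delta, dods = delta_of_delta([1000, 1060, 1120, 1180, 1240])
assert first_delta == 60 and dods == [0, 0, 0]
```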
Consuming monitoring data
All charts use time series query engine. Reduce large datasets to consumable form.
Top3/bottom3; 50th, 90th, 95th percentiles; group_by(datacenter); top(3, std. dev)
Transform time series to moving avg
Combine, e.g., login success + login failed -> percentage graph
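The query language itself isn't shown in the notes; a toy Python equivalent of the last two transforms (hypothetical function names):

```python
def moving_avg(series, window=5):
    """Transform a series into its trailing moving average."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def success_percentage(successes, failures):
    """Combine two counter series (e.g. login success + login failed)
    into a single percentage series."""
    return [100.0 * s / (s + f) if s + f else 0.0
            for s, f in zip(successes, failures)]

print(moving_avg([1, 2, 3, 4, 5], window=3))  # [1.0, 1.5, 2.0, 3.0, 4.0]
print(success_percentage([99, 98], [1, 2]))   # [99.0, 98.0]
```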
Anomaly Detection
Helps you keep your infrastructure healthy. Goal is to find parts of the system misbehaving before they affect users.
Data exploration
How to find issues: find relevant metrics, then find anomalous metrics. With 4bn time series, finding interesting metrics is hard.
Fixed threshold detection gives clear signal (either over or under).
Streaming detection is good for a single time series, hard for formulas/multiple time series
Query-based detection requires fast queries. Much better though.
Without thresholds it's hard; need to detect deviations/outliers
Ideally can adjust sensitivity to balance false positives / false negatives
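The talk doesn't name a specific algorithm; one common baseline consistent with these notes pairs a fixed threshold with a z-score outlier test whose sensitivity is tunable (a hypothetical sketch, not Facebook's detector):

```python
import statistics

def fixed_threshold(series, upper):
    """Clear signal: indices of points above a fixed threshold."""
    return [i for i, v in enumerate(series) if v > upper]

def zscore_outliers(series, sensitivity=3.0):
    """Indices of points more than `sensitivity` standard deviations
    from the mean. Lowering `sensitivity` catches more anomalies at
    the cost of more false positives; raising it does the opposite."""
    mu = statistics.fmean(series)
    sigma = statistics.stdev(series)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(series)
            if abs(v - mu) / sigma > sensitivity]

data = [10, 11, 9, 10, 12, 10, 42, 11, 10]
print(fixed_threshold(data, upper=40))  # [6]
print(zscore_outliers(data, 2.0))       # [6]
```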
Takeaways
Collection: Data is only bound by engineers' imagination
Consumption: Fast queries can drive more data analysis
Detection: Query-based detection is simpler for complex queries
Fixed thresholds give clear signal, but don’t always work