@roc
Created November 20, 2014 15:57
Data analysis on big systems
----------------------------
Theo Schlossnagle
http://www.slideshare.net/postwait/math-41511295
R/SciPy stuff
- Why are you doing the math that you're doing?
- response to data and problems
Monitoring
- Have some way of classifying signals
- online/offline modeling of data
tl;dr - hire PhDs!
- Not given a clean problem
- Never monitor the rate of something, because then you have to specify the time window
- track the running total instead
- poll it at as regular intervals as possible
- then you can take a derivative to get the rate
- derive only from POSITIVE value changes (so a counter reset in your statistics engine doesn't show up as a negative rate)
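A minimal sketch of the "track the total, then take a derivative" idea. The function name and the sample data are made up for illustration; the only assumption is that a backwards step in the total means the counter was reset, so that interval is skipped rather than reported as a negative rate.

```python
def rates_from_counter(samples):
    """Turn (timestamp_seconds, running_total) samples into per-second rates.

    Intervals where the total goes backwards are assumed to be counter
    resets and are skipped. Unchanged totals are kept as a rate of 0.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        dv = v1 - v0
        if dt <= 0 or dv < 0:
            continue  # out-of-order sample or counter reset
        rates.append((t1, dv / dt))
    return rates


# Hypothetical samples: the counter resets between t=20 and t=30.
samples = [(0, 100), (10, 150), (20, 210), (30, 5), (40, 45)]
print(rates_from_counter(samples))  # [(10, 5.0), (20, 6.0), (40, 4.0)]
```

Because the time delta comes from the timestamps, the rate stays correct even when polling is not perfectly regular.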
Categorization issues
- bring out derivatives
- does it go up/down, have spikes... discrete stochastic transforms
maximum entropy/Bayesian
Statistics are only useful when the p-value is less than 0.1
Without large-scale systems, you need a greater frequency of metrics
- instead of measuring average latency, have it report every single allocation within the data centre
- track average of residuals derived from signal
EWM vs SQM
- exponentially weighted mean
- very efficient computationally
- low memory usage
- hard to repeat offline (how far back would you track?)
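A rough sketch of an exponentially weighted mean, to show why it is cheap: the whole state is one number. The class name and the smoothing factor alpha are illustrative choices, not something from the talk. Seeding with the first sample is one common convention; the result depends on where you start feeding data, which is the "hard to repeat offline" problem.

```python
class ExponentiallyWeightedMean:
    """Online exponentially weighted mean with O(1) memory."""

    def __init__(self, alpha):
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no samples seen yet

    def update(self, x):
        if self.value is None:
            self.value = float(x)  # seed with the first sample (a convention)
        else:
            # Move the mean a fraction alpha of the way towards the new sample.
            self.value += self.alpha * (x - self.value)
        return self.value


ewm = ExponentiallyWeightedMean(alpha=0.5)
for sample in [10, 20, 20]:
    print(ewm.update(sample))  # 10.0, then 15.0, then 17.5
```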
CUSUM test - applies hypothesis test
- has issues when there is noise in the signal
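One possible shape for a CUSUM check, as a sketch only: a one-sided (upward) cumulative sum against a known target mean. The parameter names target, slack and threshold are assumptions for this example. The slack term is the usual way to absorb small noise, and picking it badly is where the noise problems show up.

```python
def cusum_upper(values, target, slack, threshold):
    """Return the index of the first sample at which the upper CUSUM
    statistic exceeds threshold, or None if it never does."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate only deviations above target + slack; never go below 0.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None


# Hypothetical signal whose mean shifts from 10 to 13 at index 3.
signal = [10, 10, 10, 13, 13, 13, 13]
print(cusum_upper(signal, target=10, slack=1, threshold=5))  # 5
```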
Can do Tukey test
- non-parametric statistical test
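The notes don't say which Tukey test was meant. Assuming it is Tukey's fences (the quartile-based outlier rule, which makes no assumption about the distribution), a minimal sketch looks like this. The function name and the conventional multiplier k=1.5 are choices made for this example.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences:
    below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical latencies with one obvious spike.
latencies = [10, 11, 12, 12, 13, 13, 14, 15, 40]
print(tukey_outliers(latencies))  # [40]
```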
Summarisation & extraction
- take at high-velocity
- summarise as a histogram (introduces error though)
- extract a useful lower-dimensional signal
- do e.g. CUSUM
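A small sketch of the histogram trade-off, assuming fixed-width bins (the talk may well have used a different binning scheme). The helper names and the data are invented for illustration. The summary is tiny, but anything computed from it is only accurate to within a bin.

```python
from collections import Counter


def summarise(values, bin_width):
    """Count values per fixed-width bin, keyed by bin index."""
    return Counter(int(v // bin_width) for v in values)


def approx_mean(hist, bin_width):
    """Estimate the mean from the histogram using bin midpoints."""
    total = sum(hist.values())
    return sum((b + 0.5) * bin_width * n for b, n in hist.items()) / total


values = [1, 2, 3, 14, 15, 27]           # hypothetical raw measurements
hist = summarise(values, bin_width=10)   # bins 0, 1, 2 hold 3, 2, 1 values
print(sum(values) / len(values))         # exact mean, about 10.33
print(approx_mean(hist, bin_width=10))   # histogram estimate, about 11.67
```

The gap between the two printed means is the error introduced by summarising.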