Created
November 20, 2014 15:57
| Data analysis on big systems | |
| ---------------------------- | |
| Theo Schlossnagle | |
| http://www.slideshare.net/postwait/math-41511295 | |
| | | |
| |/ | | |
| ,,,,, ,+ /| | |
| / \ () | || | |
| \ C '\ /|_() || | |
| ) _| .'___/,,,// || | |
| .'=. (____E.' / / \ || | |
| | \)`-\ _F_.' \ c `\ || | |
| \ \ !'__/ ) _| || | |
| \ \,' / /`._( || | |
| |`. .' / \ \ || | |
| \ `-' | .-. | | \ E || | |
| >====[] | \ |__| | O OE || | |
| / |_/ | |___)| `.__j____ \|E || | |
| \_ | || __`.________ `. |""|\| | |
| \ |\ ||| \///_ _|__|_| | |
| \ __ | \ ||`""\\""""//"' \`. \ | | |
| |[__]| \ ||.---\\__//---. | | \____| | |
| ||__|/ / \|____________|\ |/ | | |
| | | / || || /| | | | |
| -----| |/------------||-||-/`| |----------| | |
| /| | || ||/`-|___| | | |
| /\| | || \\._ [____] h| | |
| /`.|____| || \\ `-/ '`._ j| | |
| `=.\____/ || \\__`-.____) w| | |
| ) '`--. _.-||-._ `""""""" | | |
| `='====' ,-' ' ` `-. | | |
| `-.________.-' | | |
| R/SciPy stuff | |
| - Why are you doing the math that you're doing | |
| - response to data and problems | |
| Monitoring | |
| - Have some way of classifying signals | |
| - online/offline modeling of data | |
| tl;dr - hire PhDs! | |
| - Not given a clean problem | |
| - Never monitor the rate of something, because you have to specify the time window it is measured over | |
| - track the total | |
| - ask at as regular intervals as possible | |
| - then you can take a derivative | |
| - derive only from POSITIVE value changes (so a reset of your statistics engine doesn't show up as a negative rate) | |
| Categorization issues | |
| - bring out derivatives | |
| - does it go up/down, have spikes... discrete stoic transforms | |
| maximum entropy/Bayesian | |
| Statistics are only useful when the p-value is less than 0.1 | |
| Without large scale systems, you need greater frequency of metrics | |
| - instead of measuring avg latency, have it report every single allocation within data centre | |
| - track average of residuals derived from signal | |
| EWM vs SWM | |
| - exponentially weighted mean | |
| - very efficient computationally | |
| - low memory usage | |
| - hard to repeat offline (how far back would you track?) | |
| CUSUM test - applies hypothesis test | |
| - has issues when there is noise in the signal | |
| Can do Tukey test | |
| - non-parametric statistical test | |
| Summarisation & extraction | |
| - take at high-velocity | |
| - summarise as a histogram (introduces error though) | |
| - extract useful lower-dimensional signal | |
| - do e.g. CUSUM | |
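
A rough sketch of how three ideas from these notes could fit together: derive rates from a running total using only non-negative changes, smooth them with an exponentially weighted mean, and run a CUSUM check on the result. None of this code is from the talk. The function names (`positive_rates`, `ewm`, `cusum_upper`), the parameters (`alpha`, `target`, `slack`, `threshold`) and the sample numbers are all made up for illustration, and the CUSUM shown is the common one-sided textbook form, which may differ from what was presented.

```python
from typing import Iterable, List, Optional


def positive_rates(totals: Iterable[float], interval_s: float) -> List[float]:
    """Per-second rates from a running total sampled every `interval_s` seconds.

    A drop in the total is assumed to be a counter reset, so that step is
    skipped instead of being reported as a negative rate.
    """
    rates: List[float] = []
    prev: Optional[float] = None
    for total in totals:
        if prev is not None:
            delta = total - prev
            if delta >= 0:  # keep zero and positive changes only
                rates.append(delta / interval_s)
        prev = total
    return rates


def ewm(values: Iterable[float], alpha: float) -> List[float]:
    """Exponentially weighted mean: one float of state, O(1) work per sample."""
    out: List[float] = []
    mean: Optional[float] = None
    for v in values:
        mean = v if mean is None else alpha * v + (1 - alpha) * mean
        out.append(mean)
    return out


def cusum_upper(values: Iterable[float], target: float,
                slack: float, threshold: float) -> Optional[int]:
    """One-sided upper CUSUM.

    Accumulates how far each sample exceeds `target + slack` (never going
    below zero) and returns the index of the first sample at which the
    accumulated sum exceeds `threshold`, or None if it never does.
    """
    s = 0.0
    for i, v in enumerate(values):
        s = max(0.0, s + (v - target - slack))
        if s > threshold:
            return i
    return None


# Hypothetical running total sampled every 10 seconds, with a reset
# after the fourth sample and a jump in rate near the end.
totals = [0, 10, 20, 30, 5, 15, 25, 65, 105, 145]
rates = positive_rates(totals, interval_s=10.0)
smoothed = ewm(rates, alpha=0.5)
alarm_at = cusum_upper(rates, target=1.0, slack=0.5, threshold=4.0)
print(rates, smoothed, alarm_at)
```

The EWM keeps only its last value, which is why it is cheap in memory. It is also why it is awkward to reproduce offline, since the result depends on where you start.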