Ben and I wanted to take a minute today to thank everyone for coming to our tutorial. We hope you all learned something new and useful, and we encourage everyone to continue the lively discussions from the sessions throughout this week and beyond. To help keep those discussions going, here is a quick rundown of the topics we covered, along with some additional resources for those interested in learning more.
- The tutorial GitHub repo contains the slides and exercises, and should stay up for a while.
- Your demo accounts on Wakari.io are not permanent, but it's super easy to sign up for a free account. Wakari is in active development, so if there's a feature you want or an annoyance you don't, feel free to give us a shout!
- pandas' Series and DataFrame extend NumPy to enable more expressive data processing.
- Indices are optional, but allow features like selection of date ranges.
- Handles missing data well, provided you tell it what missing data looks like with `na_values`.
- Resampling and reindexing are powerful. Learn them, love them (see the sketch below).
- Exercise 1: TODO: FILL THIS IN
- Exercise 2: TODO: FILL THIS IN
- Recommended text: Python for Data Analysis, by Wes McKinney.
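To tie those points together, here is a minimal sketch of the ideas above. The file name, column names, and `na_values` markers are hypothetical stand-ins, not the tutorial's actual data:

```python
import pandas as pd

# Hypothetical CSV; parse the 'date' column into a DatetimeIndex and
# tell pandas what missing data looks like via na_values.
df = pd.read_csv('data.csv', index_col='date', parse_dates=True,
                 na_values=['-', 'N/A'])

# A DatetimeIndex makes selecting date ranges trivial.
january = df['2013-01-01':'2013-01-31']

# Resample to monthly means, then reindex to a full year of month-ends,
# introducing NaN wherever data is missing.
monthly = df['value'].resample('M').mean()
full_year = monthly.reindex(pd.date_range('2013-01-31', periods=12, freq='M'))
```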
- IPCluster clients talk to a central controller, which in turn wrangles remote nodes, each running one or more engines.
- An engine is like a thread: you can run an engine on the same node as a controller, and nodes can run more than one engine.
- Configuration is flexible, but somewhat poorly documented. For development, run `ipcluster start -n 3` to start three engines, then connect to them from IPython with `from IPython.parallel import Client; client = Client()` (see the sketch below).
- Execute commands with view methods, e.g. `direct.execute('foo()')`, not `client.execute('foo()')`.
- IPCluster is ideal for embarrassingly parallel workloads that are CPU/GPU/RAM-heavy and light on data transfer.
- Exercise: MCMC sampling for Bayesian Estimation. <------ LINK TO BUNDLE
- Recommended notebook: Introduction to Parallel Python with IPCluster and Wakari, Ian Stokes-Rees.
- Recommended text: Doing Bayesian Data Analysis, John K. Kruschke.
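For reference, a minimal sketch of the IPCluster workflow described above, assuming you have already run `ipcluster start -n 3` in a shell (the `slow_square` function is just a stand-in for real work):

```python
from IPython.parallel import Client

# Connect to the controller started by `ipcluster start -n 3`.
client = Client()

# A DirectView targets a set of engines; here, all of them.
direct = client[:]

# execute() runs a statement on every engine in the view.
direct.execute('import math')

# map_sync() scatters the inputs across engines, runs the function,
# and gathers the results -- a good fit for embarrassingly parallel work.
def slow_square(x):
    return x ** 2

print(direct.map_sync(slow_square, range(10)))
```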
- MapReduce is good for data-heavy workloads that can be implemented as a set of filters (map) and aggregators (reduce).
- Can generate large amounts of network traffic.
- Obligatory SQL analogy of a MapReduce job (made concrete in the sketch below):
  - The map step filters data (SQL's WHERE clause) and outputs key/value pairs, with the key playing the role of the GROUP BY column.
  - Partitioning routes the key/value pairs so that all pairs with the same key go to the same reduce node.
  - The reduce step performs the aggregate function (COUNT, SUM, GROUP_CONCAT) on all values sharing a key.
- MapReduce as a concept is separate from its implementations. Popular implementations include Disco and Hadoop.
- Exercise: Bitly data from `.gov` and `.mil`.
- Extra exercise in repo: WikiLogs
- Recommended blog posts: Map / Reduce – A Visual Explanation, by Ayende Rahien, and its follow-up, What is map/reduce for, anyway?
- Recommended reading: MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
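To make the SQL analogy concrete, here is a toy, single-process sketch of the three phases in plain Python; real frameworks like Disco and Hadoop distribute these same steps across nodes. The sample records are invented for illustration:

```python
from collections import defaultdict

# Toy records: (domain, bytes) -- think of the SQL table being aggregated.
records = [('gov', 120), ('mil', 300), ('gov', 80), ('com', 50)]

# Map: filter rows (WHERE) and emit key/value pairs (key = GROUP BY column).
mapped = [(domain, size) for domain, size in records
          if domain in ('gov', 'mil')]

# Partition/shuffle: route every pair with the same key to the same reducer.
partitions = defaultdict(list)
for key, value in mapped:
    partitions[key].append(value)

# Reduce: apply the aggregate (here SUM) to each key's values.
totals = {key: sum(values) for key, values in partitions.items()}
print(totals)  # {'gov': 200, 'mil': 300}
```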
- Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) find the axes with highest variance.
- These high-variance axes represent the "important" variables.
- Implementations are available in NumPy/SciPy and scikit-learn.
- K-means clustering tries to group "similar" data points together.
- The number of clusters K is an input parameter. This is good or bad depending on the problem.
- Methods like the Bayesian Information Criterion (BIC) can help determine K from the data if it is unknown.
- Exercises: K-means and PCA clustering (see the sketch below).
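As a starting point for the exercises, a minimal scikit-learn sketch of both techniques on synthetic data (the array shapes and K=3 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic data: 200 samples in 5 dimensions.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)

# PCA projects onto the two axes with the highest variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# K-means groups "similar" points; the number of clusters K is an input.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(reduced)
print(kmeans.labels_[:10])
```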