Week One Experiences with Unified Telemetry Spark Stuff

Author: Gregg Lind

Analysis 1: Flash on Day 0 (Point in Time)

Approach: get_pings for a day.

Result: this was fine, and went according to plan.

Get a cohort of ids...

get_pings(...).filter(isADay1Ping).map(itemgetter('clientId'))

There's not really a way to avoid this :) It's really wasteful though! The filter throws away about 99% of the pings.
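
The cohort step, sketched in plain Python so it runs without a cluster. The ping layout and the `is_day1_ping` rule (fields `clientId`, `profileCreationDay`, `submissionDay`) are assumptions for illustration, not the real telemetry schema. Only the filter-then-map shape comes from the line above.

```python
from operator import itemgetter

# Hypothetical ping records; the real telemetry schema is not shown in these notes.
pings = [
    {"clientId": "a", "profileCreationDay": 16600, "submissionDay": 16600},
    {"clientId": "b", "profileCreationDay": 16000, "submissionDay": 16600},
    {"clientId": "a", "profileCreationDay": 16600, "submissionDay": 16600},
    {"clientId": "c", "profileCreationDay": 16599, "submissionDay": 16600},
]

def is_day1_ping(ping):
    """Assumed rule: the ping was submitted on the day the profile was created."""
    return ping["submissionDay"] == ping["profileCreationDay"]

# Same shape as get_pings(...).filter(isADay1Ping).map(itemgetter('clientId')),
# collected into a set so later membership tests are cheap and duplicates vanish.
cohort = set(map(itemgetter("clientId"), filter(is_day1_ping, pings)))

# Fraction of pings the filter throws away.
kept = sum(1 for p in pings if is_day1_ping(p))
discarded = 1 - kept / float(len(pings))
```

Collecting into a set also dedupes clients that sent more than one ping that day.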

Then:

get_pings(some later days).filter(one of the client ids!).aggregate(complicated, seqOp, combOp)

Aggregates are hard to write and grok. The filtering for clientIds has to live somewhere. This smells, and is 99% wasteful.
get_clients_history, filter, then????
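
For reference, `aggregate(zeroValue, seqOp, combOp)` folds each partition with `seqOp` starting from the zero value, then merges the per-partition results with `combOp`. A minimal plain-Python sketch of that contract, with the client id filter living inside `seqOp`. The `cohort` set, the ping fields, and the `local_aggregate` helper are all made up for illustration; the helper is a stand-in, not Spark.

```python
import copy
from functools import reduce

cohort = {"a", "c"}  # hypothetical client ids picked on day 0

def seq_op(acc, ping):
    """Fold one ping into a per-partition accumulator, skipping non-cohort clients."""
    if ping["clientId"] in cohort:
        acc[ping["clientId"]] = acc.get(ping["clientId"], 0) + 1
    return acc

def comb_op(left, right):
    """Merge two per-partition accumulators."""
    for client_id, count in right.items():
        left[client_id] = left.get(client_id, 0) + count
    return left

def local_aggregate(partitions, zero, seq_op, comb_op):
    """Stand-in for RDD.aggregate: fold each partition, then combine the results."""
    folded = [reduce(seq_op, part, copy.deepcopy(zero)) for part in partitions]
    return reduce(comb_op, folded, copy.deepcopy(zero))

# Two fake partitions of pings.
partitions = [
    [{"clientId": "a"}, {"clientId": "b"}],
    [{"clientId": "a"}, {"clientId": "c"}, {"clientId": "z"}],
]
pings_per_client = local_aggregate(partitions, {}, seq_op, comb_op)
```

On the cluster the equivalent call would presumably be `rdd.aggregate({}, seq_op, comb_op)`. The waste is still there: every ping is read just so the membership test can throw most of them away.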

This is slow as molasses. Cost should be O(fraction), i.e. proportional to the fraction of clients in the cohort.
Use the executive stream.
```
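# Load the executive stream as a DataFrame (Spark 1.3 API).
eStream = sqlContext.load("s3://telemetry-parquet/ExecutiveStream", "parquet")
# One 1-tuple per id (assumes clientIds holds id strings; tuple("abc") would split them).
idTable = sqlContext.createDataFrame([(cid,) for cid in clientIds], ['clientId'])

# Inner join keeps only executive stream rows for clients in the cohort.
idTable.join(eStream, idTable.clientId == eStream.clientId, "inner")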
```
This should work awesomely. But lots of things go wrong.
- sqlContext.rdd has weird bugs.
- eStream.show() is fine. But eStream.limit().collect() is totally busted in Spark 1.3.
- Lots of pieces of this have far from optimal algorithms. Unpredictable failures.

getAllPacketsForClientList([cid, cid],submission_date)
Index more 'traits' on the way in?
- get all created on day, and all forward packets
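
One way to picture that wish: if pings were indexed by client id on the way in, fetching a cohort's packets would cost time proportional to the cohort, not to the whole day. A toy in-memory sketch; `build_index` and `get_all_packets_for_client_list` are hypothetical names modeled on the call above, the `clientId` and `submissionDate` fields are assumed, and "on or after submission_date" is my reading of "all forward packets".

```python
from collections import defaultdict

def build_index(pings):
    """Index pings by clientId once, on the way in."""
    index = defaultdict(list)
    for ping in pings:
        index[ping["clientId"]].append(ping)
    return index

def get_all_packets_for_client_list(index, client_ids, submission_date):
    """Return pings for the given clients submitted on or after submission_date."""
    return [
        ping
        for cid in client_ids
        for ping in index.get(cid, [])
        if ping["submissionDate"] >= submission_date  # YYYYMMDD strings sort by date
    ]

# Toy data, not real telemetry.
pings = [
    {"clientId": "a", "submissionDate": "20150601"},
    {"clientId": "b", "submissionDate": "20150601"},
    {"clientId": "a", "submissionDate": "20150603"},
    {"clientId": "c", "submissionDate": "20150602"},
]
index = build_index(pings)
forward = get_all_packets_for_client_list(index, ["a", "c"], "20150602")
```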

Trying to remember the CURRENT MACHINE is a pain.
In the email, a line says:

here: https://telemetry-dash.mozilla.org/cluster/monitor/j-972SCL5SRTUO

This is annoying, b/c then I can't copy the whole line
At the monitor page, the text could be more scriptable

http://analysis.telemetry.mozilla.org/cluster/monitor/j-2RFX65L0H4RBI

could be more like
```
echo "machine name" > ~/.currentcluster
ssh -i key $(cat ~/.currentcluster) ...  
```
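
In the meantime, a small helper can pull the monitor URL out of the pasted email line and stash the cluster id. A sketch only: it assumes the line always has the `.../cluster/monitor/<id>` shape shown above, and it stores the id, since these notes do not show how to turn that id into the machine name that `ssh` needs. The function names are made up.

```python
import os
import re

# Matches the monitor URLs quoted above and captures the trailing cluster id.
MONITOR_URL = re.compile(r"https?://\S+/cluster/monitor/(j-[A-Z0-9]+)")

def parse_monitor_line(line):
    """Return (url, cluster_id) from a pasted email line, or None if absent."""
    match = MONITOR_URL.search(line)
    if match is None:
        return None
    return match.group(0), match.group(1)

def remember_cluster(line, path="~/.currentcluster"):
    """Write the cluster id from the pasted line to path and return it."""
    parsed = parse_monitor_line(line)
    if parsed is None:
        raise ValueError("no monitor URL found in: %r" % line)
    with open(os.path.expanduser(path), "w") as handle:
        handle.write(parsed[1] + "\n")
    return parsed[1]
```

With this, pasting the whole "here: ..." line works, prefix and all.
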
Tell me about SparkShell!
```
ssh -i key  -L 4040:localhost:4040  -L ...
```
This wires it, but it's not quite right.
The scp stuff is all grody, because the machine names change.
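
One way to take the sting out of changing machine names is to read the host from one file and build every `ssh` and `scp` command from it. A sketch under assumptions: `~/.currentcluster` holds the current host name (as proposed above), `key` is the path to your key file, and the helper names are made up. Only the standard `-i` and `-L` flags are used.

```python
import os

def current_host(path="~/.currentcluster"):
    """Read the current machine name from the file written when the cluster starts."""
    with open(os.path.expanduser(path)) as handle:
        return handle.read().strip()

def ssh_argv(host, key, forwards=()):
    """Build an ssh command line; forwards are 'port:host:port' strings for -L."""
    argv = ["ssh", "-i", key]
    for spec in forwards:
        argv += ["-L", spec]
    return argv + [host]

def scp_argv(host, key, remote_path, local_path):
    """Build an scp command line that copies remote_path from host to local_path."""
    return ["scp", "-i", key, "%s:%s" % (host, remote_path), local_path]

# Example with a placeholder host; pass the list to subprocess.call to run it.
example = ssh_argv("example-host", "key", ["4040:localhost:4040"])
```

In real use the host would come from `current_host()`, so a new cluster only means rewriting one file.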