Author: Gregg Lind
Approach: get_pings for a day.
Result: this was fine, and went according to plan.
-
Get a cohort of ids...
get_pings(...).filter(isADay1Ping).map(itemgetter('clientId'))
Not really a way to avoid this :) It's really wasteful though! (The filter throws away ~99% of what gets read.)
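A minimal sketch of that first step, assuming the standard moztelemetry get_pings helper; isADay1Ping is a stand-in predicate (the notes don't say how "day 1" is detected), and the app/channel/date/fraction arguments are placeholders for whatever actually defines the cohort:

    from operator import itemgetter
    from moztelemetry.spark import get_pings

    def isADay1Ping(ping):
        # Placeholder predicate: put the real "is this the client's first day?"
        # check here; the notes don't spell out how that is decided.
        return True

    # sc is the SparkContext the notebook already provides.
    pings = get_pings(sc, app="Firefox", channel="release",
                      submission_date="20150601", fraction=0.1)
    clientIds = (pings
                 .filter(isADay1Ping)
                 .map(itemgetter("clientId"))
                 .distinct()
                 .collect())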
Then:
-
get_pings(some later days).filter(one of the client ids!).aggregate(complicated, seqOp, combOp)
Aggregates are hard to write and grok. The filtering on clientIds has to live somewhere. This smells, and is 99% wasteful.
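A sketch of that second step under the same assumptions; a broadcast set keeps the clientId filter cheap to ship, and the aggregate is a deliberately trivial stand-in (ping counts per client) for whatever the real seqOp/combOp compute:

    # Broadcast the cohort so executors can do cheap membership tests.
    cohort = sc.broadcast(set(clientIds))

    # One later day shown; the real run would repeat this over a date range.
    later = get_pings(sc, app="Firefox", channel="release",
                      submission_date="20150608", fraction=0.1)
    inCohort = later.filter(lambda p: p.get("clientId") in cohort.value)

    def seqOp(acc, ping):
        # Fold one ping into the per-client accumulator (here: just a count).
        return acc + 1

    def combOp(a, b):
        # Merge two partial accumulators.
        return a + b

    countsByClient = (inCohort
                      .map(lambda p: (p["clientId"], p))
                      .aggregateByKey(0, seqOp, combOp)
                      .collect())

This is still the 99%-wasteful scan the note complains about; the broadcast only tidies up where the filter lives.
-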
get_clients_history, filter, then???? This is slow as molasses. Should be O(fraction).
-
Use the executive stream.
eStream = sqlContext.load("s3://telemetry-parquet/ExecutiveStream", "parquet")
idTable = sqlContext.createDataFrame(map(tuple, clientIds), ['clientId'])
idTable.join(eStream, idTable.clientId == eStream.clientId, "inner")
This should work awesome. But lots of things go wrong.
sqlContext.rdd has weird bugs. eStream.show() is fine, but eStream.limit().collect() is totally busted in 1.3.
-
Lots of pieces of this have far sub-optimal algorithms. Unpredictable failures.
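For what it's worth, simple per-client rollups can stay inside the DataFrame API rather than going through collect(); a sketch, assuming eStream and idTable from above (the notes only confirm that show() behaved in 1.3):

    joined = idTable.join(eStream, idTable.clientId == eStream.clientId, "inner")
    # Group on idTable.clientId explicitly, since both sides carry a clientId column.
    perClient = joined.groupBy(idTable.clientId).count()
    perClient.show()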
-
getAllPacketsForClientList([cid, cid], submission_date)
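Nothing like this helper exists as far as these notes go; a hypothetical sketch of what it could look like, built on the same get_pings + broadcast-set filter as above:

    def getAllPacketsForClientList(clientIds, submission_date):
        # Hypothetical helper: fetch one day's pings and keep only those from
        # the listed clients. The app/channel kwargs are placeholders.
        wanted = sc.broadcast(set(clientIds))
        pings = get_pings(sc, app="Firefox", channel="release",
                          submission_date=submission_date)
        return pings.filter(lambda p: p.get("clientId") in wanted.value)

It still scans the whole day's pings, so it only tidies the call site; the 99%-wasteful read is unchanged.
-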
Index more 'traits' on the way in?
- get all created on day, and all forward packets
- The Cluster Creation website fixes most of the problems.
- The executive summary stream
- Example notebooks
- ipython is all set up nicely.
-
Trying to remember the CURRENT MACHINE is a pain.
-
In the email, a line says:
here: https://telemetry-dash.mozilla.org/cluster/monitor/j-972SCL5SRTUO
This is annoying, b/c then I can't copy the whole line.
-
At the monitor page, the text could be more scriptable:
http://analysis.telemetry.mozilla.org/cluster/monitor/j-2RFX65L0H4RBI
could be more like
echo "machine name" > ~/.currentcluster
ssh -i key $(cat ~/.currentcluster) ...
-
Tell me about SparkShell!
ssh -i key -L 4040:localhost:4040 -L ...
This wires it up, but it's not quite right.
-
The scp stuff is all grody, because the machine names change.