@gregglind
Created November 5, 2015 17:34
Some notes from the first week of Spark use.

Week One Experiences with Unified Telemetry Spark Stuff

Author: Gregg Lind

Analysis 1: Flash on Day 0 (Point in Time)

Approach: get_pings for a day.

Result: this was fine, and went according to plan.

Analysis 2: Flash Impact on 30 day Retention

Approaches

  1. Get a cohort of ids...

    get_pings(...).filter(isADay1Ping).map(itemgetter('clientId'))

    There isn't really a way to avoid this :) It's really wasteful though: the filter throws away 99% of the pings.
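
A minimal sketch of the cohort step above, run on a plain Python list rather than a real RDD so it works without a cluster. The predicate `is_a_day1_ping` and the `profileCreationDate` field are assumptions for illustration; the notes never define `isADay1Ping`.

```python
from operator import itemgetter

# Hypothetical predicate: assumes each ping records the day its profile
# was created, and that "day 1" means created on the cohort day.
def is_a_day1_ping(ping, cohort_day="20151101"):
    return ping.get("profileCreationDate") == cohort_day

# Made-up sample data standing in for the result of get_pings(...).
sample_pings = [
    {"clientId": "a", "profileCreationDate": "20151101"},
    {"clientId": "b", "profileCreationDate": "20150901"},
    {"clientId": "c", "profileCreationDate": "20151101"},
    {"clientId": "a", "profileCreationDate": "20151101"},
]

# Same filter-then-map shape as the one-liner above; the set removes
# clients that sent more than one ping on the cohort day.
client_ids = set(map(itemgetter("clientId"), filter(is_a_day1_ping, sample_pings)))
```

On an RDD the same shape would presumably end in `.distinct().collect()` to pull the ids back to the driver.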

Then:

  1. get_pings(some later days).filter(one of the client ids!).aggregate(complicated, seqOp, combOp)

    Aggregates are hard to write and grok. The clientId filtering has to live somewhere. This smells, and is 99% wasteful.

  2. get_clients_history, filter, then????

    This is slow as molasses. The cost should scale with the fraction of clients requested, i.e. O(fraction).

  3. Use the executive stream.

    eStream = sqlContext.load("s3://telemetry-parquet/ExecutiveStream", "parquet")
    # clientIds is assumed to hold one-element sequences, e.g. [['id1'], ['id2']]
    idTable = sqlContext.createDataFrame(map(tuple, clientIds), ['clientId'])

    idTable.join(eStream, idTable.clientId == eStream.clientId, "inner")

    This should work awesomely, but lots of things go wrong.

    • sqlContext.rdd has weird bugs.
    • eStream.show() is fine. But eStream.limit().collect() is totally busted in Spark 1.3.
    • Lots of pieces of this use far-from-optimal algorithms, and the failures are unpredictable.
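
Since approach 1 leans on `aggregate(zeroValue, seqOp, combOp)`, here is a small sketch of what the two functions do, emulated locally with `functools.reduce` over lists that stand in for partitions. The `day` field, the cohort, and `local_aggregate` are hypothetical; this only illustrates the pattern and is not the telemetry API.

```python
from functools import reduce

cohort = {"a", "c"}  # client ids picked in the cohort step

def seq_op(acc, ping):
    """Fold one ping into a partition-local {clientId: set(days)} dict."""
    cid = ping["clientId"]
    if cid in cohort:  # the clientId filter ends up living inside seqOp
        acc.setdefault(cid, set()).add(ping["day"])
    return acc

def comb_op(left, right):
    """Merge the accumulators of two partitions."""
    for cid, days in right.items():
        left.setdefault(cid, set()).update(days)
    return left

def local_aggregate(partitions, zero_factory, seq_op, comb_op):
    """Local stand-in for aggregate: fold each partition, then merge."""
    partials = [reduce(seq_op, part, zero_factory()) for part in partitions]
    return reduce(comb_op, partials, zero_factory())

# Two made-up partitions of later-day pings.
partitions = [
    [{"clientId": "a", "day": 1}, {"clientId": "b", "day": 1}],
    [{"clientId": "a", "day": 2}, {"clientId": "c", "day": 30}],
]
days_seen = local_aggregate(partitions, dict, seq_op, comb_op)
retained = {cid for cid, days in days_seen.items() if 30 in days}
```

The sketch takes a zero-value factory (`dict`) instead of a single zero value so that each partition gets its own fresh accumulator.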

Improvements Wanted

  1. getAllPacketsForClientList([cid, cid],submission_date)

  2. Index more 'traits' on the way in?

    • get all created on day, and all forward packets
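
One way to picture the two wishes above together: if pings were indexed by client id and submission date on the way in, a client-list lookup would only touch the requested clients. The sketch below is purely hypothetical (in-memory dict, made-up field names, and an extra `index` argument that the wished-for signature does not have); it shows the access pattern, not an existing function.

```python
from collections import defaultdict

def build_client_index(pings):
    """Group pings by (clientId, submission_date) once, at ingestion time."""
    index = defaultdict(list)
    for ping in pings:
        index[(ping["clientId"], ping["submission_date"])].append(ping)
    return index

def get_all_packets_for_client_list(index, client_ids, submission_date):
    """Return the packets for the given clients on one submission date."""
    packets = []
    for cid in client_ids:
        # .get avoids creating empty entries for unknown clients
        packets.extend(index.get((cid, submission_date), []))
    return packets

# Made-up sample data.
pings = [
    {"clientId": "a", "submission_date": "20151101", "n": 1},
    {"clientId": "b", "submission_date": "20151101", "n": 2},
    {"clientId": "a", "submission_date": "20151102", "n": 3},
]
index = build_client_index(pings)
packets = get_all_packets_for_client_list(index, ["a"], "20151101")
```

The lookup cost here depends on the number of requested clients rather than on the total number of pings, which is the O(fraction) behavior asked for earlier.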

412 Things that are awesome.

  1. Cluster Creation website fixes most of the problems.
  2. The executive summary stream
  3. Example notebooks
  4. ipython is all set up nicely.

55 Nits

Cluster Creation

  1. Trying to remember the CURRENT MACHINE is a pain.

  2. In the email, a line says:

    here: https://telemetry-dash.mozilla.org/cluster/monitor/j-972SCL5SRTUO

    This is annoying, because the leading "here:" means I can't copy the whole line.

  3. At the monitor page, the text could be more scriptable

    http://analysis.telemetry.mozilla.org/cluster/monitor/j-2RFX65L0H4RBI

    could be more like

    echo "machine name" > ~/.currentcluster
    ssh -i key $(cat ~/.currentcluster) ...  
    
  4. Tell me about SparkShell!

    ssh -i key  -L 4040:localhost:4040  -L ...
    

    This wires up the port forwarding, but it's not quite right.

  5. The scp stuff is all grody, because the machine names change.
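
A small sketch of the `~/.currentcluster` idea from the nits above, written as a Python helper so the changing machine name lives in one file. The function names, the key path, and the example host are all hypothetical; only the `ssh -i key -L 4040:localhost:4040` shape comes from the notes.

```python
import os

def save_current_cluster(hostname, path="~/.currentcluster"):
    """Remember the current machine name in a small dotfile."""
    with open(os.path.expanduser(path), "w") as fh:
        fh.write(hostname.strip() + "\n")

def load_current_cluster(path="~/.currentcluster"):
    """Read back the machine name saved by save_current_cluster."""
    with open(os.path.expanduser(path)) as fh:
        return fh.read().strip()

def ssh_command(key, path="~/.currentcluster", forwards=(4040,)):
    """Build an ssh argument list with one -L forward per port."""
    args = ["ssh", "-i", key]
    for port in forwards:
        args += ["-L", "{0}:localhost:{0}".format(port)]
    args.append(load_current_cluster(path))
    return args
```

The returned list can be handed to `subprocess.call`, and an scp variant could reuse `load_current_cluster` the same way.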
