@SamPenrose
Last active August 29, 2015 14:26
{"nbformat_minor": 0, "cells": [{"source": "This is a stripped-down version of Roberto's excellent notebook:\n\nhttp://nbviewer.ipython.org/gist/vitillo/3047c0d896b08f75c403\n\nIt drills down into the specifics of subsessionCounter gaps.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 1, "cell_type": "code", "source": "from moztelemetry import get_pings, get_pings_properties", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 2, "cell_type": "code", "source": "build_ids = (\"20150722000000\", \"20150729999999\")\nmain_pings = get_pings(sc,\n app=\"Firefox\",\n channel=\"nightly\",\n build_id=build_ids,\n doc_type=\"main\",\n schema=\"v4\")", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 3, "cell_type": "code", "source": "total = main_pings.count()\nprint total # 1,292,730 pings", "outputs": [{"output_type": "stream", "name": "stdout", "text": "1292730\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "We will look at 128,000 pings.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 5, "cell_type": "code", "source": "import binascii\ntenth = main_pings.filter(lambda d: binascii.crc32((d or {}).get(\"clientId\", 'a')) % 100 < 10) # 'a' -> 11\ntenth_total = tenth.count()\nprint tenth_total # 128,543 pings", "outputs": [{"output_type": "stream", "name": "stdout", "text": "128543\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 7, "cell_type": "code", "source": "subset_10 = get_pings_properties(tenth, [\"clientId\",\n \"meta/documentId\",\n \"meta/submissionDate\",\n \"meta/creationTimestamp\",\n \"environment/system/os/name\",\n \"payload/info/reason\",\n \"payload/info/sessionId\",\n \"payload/info/subsessionId\",\n \"payload/info/previousSessionId\",\n \"payload/info/previousSubsessionId\",\n \"payload/info/subsessionCounter\",\n \"payload/info/profileSubsessionCounter\",\n 
\"payload/simpleMeasurements/firstPaint\",\n \"payload/simpleMeasurements/savedPings\",\n \"payload/simpleMeasurements/uptime\",\n \"payload/histograms/STARTUP_CRASH_DETECTED\"])", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 8, "cell_type": "code", "source": "from operator import itemgetter\ndef dedupe_and_sort(group):\n key, history = group\n seen = set()\n result = []\n for fragment in history:\n id = fragment[\"meta/documentId\"]\n if id in seen:\n continue \n seen.add(id)\n result.append(fragment) \n result.sort(key=itemgetter(\"payload/info/profileSubsessionCounter\"))\n return result\n\ngrouped = subset_10.groupBy(lambda x: x[\"clientId\"]).map(dedupe_and_sort).collect()", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 17, "cell_type": "code", "source": "len(grouped) # 7,421 clients.", "outputs": [{"execution_count": 17, "output_type": "execute_result", "data": {"text/plain": "7421"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 9, "cell_type": "code", "source": "from collections import defaultdict\ndef duplicate_pssc(grouped):\n dupes = 0\n dupe_clients = set()\n\n for history in grouped:\n counts = defaultdict(int)\n\n for fragment in history:\n key = fragment[\"payload/info/profileSubsessionCounter\"]\n counts[key] += 1\n\n for _, v in counts.iteritems():\n if v > 1:\n dupes += 1\n dupe_clients.add(history[0][\"clientId\"])\n break\n\n print 100.0*dupes/len(grouped)\n return dupe_clients\n \ndupe_clients = duplicate_pssc(grouped)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "2.8837083951\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 16, "cell_type": "code", "source": "len(dupe_clients)", "outputs": [{"execution_count": 16, "output_type": "execute_result", "data": {"text/plain": "214"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "3% of clients 
have a subsessionCounter which under-incremented.\nRoberto discarded them, so let's follow that.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 10, "cell_type": "code", "source": "dd_grouped = filter(lambda h: h[0][\"clientId\"] not in dupe_clients, grouped)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 18, "cell_type": "code", "source": "len(dd_grouped)", "outputs": [{"execution_count": 18, "output_type": "execute_result", "data": {"text/plain": "7207"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 13, "cell_type": "code", "source": "def missing_subsession_ids(groups):\n missing = set()\n ok = 0\n for history in groups:\n ok += len(history)\n last = history[0]\n for current in history[1:]:\n last_pssc = last[\"payload/info/profileSubsessionCounter\"]\n this_pssc = current[\"payload/info/profileSubsessionCounter\"]\n if last_pssc + 1 != this_pssc:\n missing.add((current['clientId'], current['payload/info/previousSubsessionId']))\n ok -= 1\n last = current\n return missing, ok\nprint missing_subsession_ids([dd_grouped[0]])", "outputs": [{"output_type": "stream", "name": "stdout", "text": "(set([]), 34)\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 14, "cell_type": "code", "source": "missing_pings, ok_ping_count = missing_subsession_ids(dd_grouped)\nprint len(missing_pings), ok_ping_count", "outputs": [{"output_type": "stream", "name": "stdout", "text": "556 116848\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 15, "cell_type": "code", "source": "556.0/116848", "outputs": [{"execution_count": 15, "output_type": "execute_result", "data": {"text/plain": "0.004758318499246885"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "So 0.4% of pings are missing. Victory! Except, we assume\none missing ping in each gap. 
Let's count them.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 26, "cell_type": "code", "source": "def count_missing_subsessions(groups):\n present = 0\n missing = []\n negative_gaps = 0\n for history in groups:\n present += len(history)\n last = history[0]\n for current in history[1:]:\n last_pssc = last[\"payload/info/profileSubsessionCounter\"]\n this_pssc = current[\"payload/info/profileSubsessionCounter\"]\n gap = this_pssc - (last_pssc + 1)\n if gap:\n present -= gap\n missing.append(gap)\n if gap < 0:\n negative_gaps += 1\n last = current\n return present, missing, negative_gaps\nprint count_missing_subsessions([dd_grouped[0]])", "outputs": [{"output_type": "stream", "name": "stdout", "text": "(34, [], 0)\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 27, "cell_type": "code", "source": "present, missing, negative_gaps = count_missing_subsessions(dd_grouped)\nmissing_count = sum(missing)\nprint present, missing_count, negative_gaps\nprint (1.0*missing_count)/present", "outputs": [{"output_type": "stream", "name": "stdout", "text": "108832 8572 0\n0.0787635989415\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "We are missing a whopping 8% of pings!", "cell_type": "markdown", "metadata": {}}, {"execution_count": 28, "cell_type": "code", "source": "missing.sort()", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 38, "cell_type": "code", "source": "[missing[i] for i in range(50, 550, 50)]+[missing[-1]] # Tall tail!", "outputs": [{"execution_count": 38, "output_type": "execute_result", "data": {"text/plain": "[1, 1, 1, 1, 1, 1, 2, 3, 6, 28, 529]"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 32, "cell_type": "code", "source": "sum(missing[:500])", "outputs": [{"execution_count": 32, "output_type": "execute_result", "data": {"text/plain": "1309"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": 
true}}, {"source": "... but most of them are from a handful of clients.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 33, "cell_type": "code", "source": "1309.0 / 108832", "outputs": [{"execution_count": 33, "output_type": "execute_result", "data": {"text/plain": "0.012027712437518377"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "... even if we damp, rather than discard, the long tail", "cell_type": "markdown", "metadata": {}}, {"execution_count": 34, "cell_type": "code", "source": "(1.0 * (sum(missing[:500]) + 6 * 56)) / 108832", "outputs": [{"execution_count": 34, "output_type": "execute_result", "data": {"text/plain": "0.015115039694207586"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "Conclusion:\n\n* In 2.9% of clients (214 of 7,421), the subsessionCounter under-increments.\n* In 4.5% of clients, either it over-increments or we lose pings.\n\n - If it never over-increments, then 8% of pings are lost.\n - BUT, most of those losses are from 56 of 7,421 (or 7,207) clients.\n - Specifically 82% of lost pings are from 0.8% of clients.\n\n* Maybe we can learn more by closely examining the pings with large gaps.", "cell_type": "markdown", "metadata": {}}, {"source": "One more way of looking at it:\n\n- 98.5% of clients have 0.5% of subsessions missing or over-incremented.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 41, "cell_type": "code", "source": "all_but_106 = sum(missing[:450]) # 674\nprint \"clients:\", (7421-106.0) / 7421\nprint \"pings:\", (1.0*all_but_106) / 128543", "outputs": [{"output_type": "stream", "name": "stdout", "text": "clients: 0.985716210753\npings: 0.00524338159215\n"}], "metadata": {"collapsed": false, "trusted": true}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 2", "name": "python2", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "2.7.9", "name": 
"python", "file_extension": ".py", "pygments_lexer": "ipython2", "codemirror_mode": {"version": 2, "name": "ipython"}}}}
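The notebook's two estimates (0.4% versus 8% of pings missing) differ because the first counts each gap in profileSubsessionCounter once, while the second sums the size of each gap. A minimal sketch of that distinction, written for Python 3 rather than the notebook's Python 2 kernel; the function name `count_gaps` and the sample history are hypothetical and not part of the original notebook:

```python
def count_gaps(counters):
    """For one client's sorted counter values, return (gaps, missing).

    gaps    -- how many places the sequence skips at least one value
    missing -- how many counter values are skipped in total
    """
    gaps = 0
    missing = 0
    for prev, cur in zip(counters, counters[1:]):
        gap = cur - (prev + 1)  # 0 when the counter advanced by exactly one
        if gap > 0:
            gaps += 1
            missing += gap
    return gaps, missing


# Hypothetical history: value 3 is skipped, then values 6 through 9 are skipped.
history = [1, 2, 4, 5, 10]
gaps, missing = count_gaps(history)
print(gaps, missing)
```

With one wide gap in the history, `missing` grows much faster than `gaps`, which is the long-tail effect the notebook observes across clients.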