Last active
August 29, 2015 14:26
Save SamPenrose/43251a594c63abaceea3 to your computer and use it in GitHub Desktop.
{"nbformat_minor": 0, "cells": [{"source": "This is a stripped-down version of Roberto's excellent notebook:\n\nhttp://nbviewer.ipython.org/gist/vitillo/3047c0d896b08f75c403\n\nIt drills down into the specifics of subsessionCounter gaps.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 1, "cell_type": "code", "source": "from moztelemetry import get_pings, get_pings_properties", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 2, "cell_type": "code", "source": "build_ids = (\"20150722000000\", \"20150729999999\")\nmain_pings = get_pings(sc,\n app=\"Firefox\",\n channel=\"nightly\",\n build_id=build_ids,\n doc_type=\"main\",\n schema=\"v4\")", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 3, "cell_type": "code", "source": "total = main_pings.count()\nprint total # 1,292,730 pings", "outputs": [{"output_type": "stream", "name": "stdout", "text": "1292730\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "We will look at 128,000 pings.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 5, "cell_type": "code", "source": "import binascii\ntenth = main_pings.filter(lambda d: binascii.crc32((d or {}).get(\"clientId\", 'a')) % 100 < 10) # 'a' -> 11\ntenth_total = tenth.count()\nprint tenth_total # 128,543 pings", "outputs": [{"output_type": "stream", "name": "stdout", "text": "128543\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 7, "cell_type": "code", "source": "subset_10 = get_pings_properties(tenth, [\"clientId\",\n \"meta/documentId\",\n \"meta/submissionDate\",\n \"meta/creationTimestamp\",\n \"environment/system/os/name\",\n \"payload/info/reason\",\n \"payload/info/sessionId\",\n \"payload/info/subsessionId\",\n \"payload/info/previousSessionId\",\n \"payload/info/previousSubsessionId\",\n \"payload/info/subsessionCounter\",\n \"payload/info/profileSubsessionCounter\",\n 
\"payload/simpleMeasurements/firstPaint\",\n \"payload/simpleMeasurements/savedPings\",\n \"payload/simpleMeasurements/uptime\",\n \"payload/histograms/STARTUP_CRASH_DETECTED\"])", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 8, "cell_type": "code", "source": "from operator import itemgetter\ndef dedupe_and_sort(group):\n key, history = group\n seen = set()\n result = []\n for fragment in history:\n id = fragment[\"meta/documentId\"]\n if id in seen:\n continue \n seen.add(id)\n result.append(fragment) \n result.sort(key=itemgetter(\"payload/info/profileSubsessionCounter\"))\n return result\n\ngrouped = subset_10.groupBy(lambda x: x[\"clientId\"]).map(dedupe_and_sort).collect()", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 17, "cell_type": "code", "source": "len(grouped) # 7,421 clients.", "outputs": [{"execution_count": 17, "output_type": "execute_result", "data": {"text/plain": "7421"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 9, "cell_type": "code", "source": "from collections import defaultdict\ndef duplicate_pssc(grouped):\n dupes = 0\n dupe_clients = set()\n\n for history in grouped:\n counts = defaultdict(int)\n\n for fragment in history:\n key = fragment[\"payload/info/profileSubsessionCounter\"]\n counts[key] += 1\n\n for _, v in counts.iteritems():\n if v > 1:\n dupes += 1\n dupe_clients.add(history[0][\"clientId\"])\n break\n\n print 100.0*dupes/len(grouped)\n return dupe_clients\n \ndupe_clients = duplicate_pssc(grouped)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "2.8837083951\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 16, "cell_type": "code", "source": "len(dupe_clients)", "outputs": [{"execution_count": 16, "output_type": "execute_result", "data": {"text/plain": "214"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "3% of clients 
have a subsessionCounter which under-incremented.\nRoberto discarded them, so let's follow that.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 10, "cell_type": "code", "source": "dd_grouped = filter(lambda h: h[0][\"clientId\"] not in dupe_clients, grouped)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 18, "cell_type": "code", "source": "len(dd_grouped)", "outputs": [{"execution_count": 18, "output_type": "execute_result", "data": {"text/plain": "7207"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 13, "cell_type": "code", "source": "def missing_subsession_ids(groups):\n missing = set()\n ok = 0\n for history in groups:\n ok += len(history)\n last = history[0]\n for current in history[1:]:\n last_pssc = last[\"payload/info/profileSubsessionCounter\"]\n this_pssc = current[\"payload/info/profileSubsessionCounter\"]\n if last_pssc + 1 != this_pssc:\n missing.add((current['clientId'], current['payload/info/previousSubsessionId']))\n ok -= 1\n last = current\n return missing, ok\nprint missing_subsession_ids([dd_grouped[0]])", "outputs": [{"output_type": "stream", "name": "stdout", "text": "(set([]), 34)\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 14, "cell_type": "code", "source": "missing_pings, ok_ping_count = missing_subsession_ids(dd_grouped)\nprint len(missing_pings), ok_ping_count", "outputs": [{"output_type": "stream", "name": "stdout", "text": "556 116848\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 15, "cell_type": "code", "source": "556.0/116848", "outputs": [{"execution_count": 15, "output_type": "execute_result", "data": {"text/plain": "0.004758318499246885"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "So 0.4% of pings are missing. Victory! Except, we assume\none missing ping in each gap. 
Let's count them.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 26, "cell_type": "code", "source": "def count_missing_subsessions(groups):\n present = 0\n missing = []\n negative_gaps = 0\n for history in groups:\n present += len(history)\n last = history[0]\n for current in history[1:]:\n last_pssc = last[\"payload/info/profileSubsessionCounter\"]\n this_pssc = current[\"payload/info/profileSubsessionCounter\"]\n gap = this_pssc - (last_pssc + 1)\n if gap:\n present -= gap\n missing.append(gap)\n if gap < 0:\n negative_gaps += 1\n last = current\n return present, missing, negative_gaps\nprint count_missing_subsessions([dd_grouped[0]])", "outputs": [{"output_type": "stream", "name": "stdout", "text": "(34, [], 0)\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 27, "cell_type": "code", "source": "present, missing, negative_gaps = count_missing_subsessions(dd_grouped)\nmissing_count = sum(missing)\nprint present, missing_count, negative_gaps\nprint (1.0*missing_count)/present", "outputs": [{"output_type": "stream", "name": "stdout", "text": "108832 8572 0\n0.0787635989415\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "We are missing a whopping 8% of pings!", "cell_type": "markdown", "metadata": {}}, {"execution_count": 28, "cell_type": "code", "source": "missing.sort()", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 38, "cell_type": "code", "source": "[missing[i] for i in range(50, 550, 50)]+[missing[-1]] # Tall tail!", "outputs": [{"execution_count": 38, "output_type": "execute_result", "data": {"text/plain": "[1, 1, 1, 1, 1, 1, 2, 3, 6, 28, 529]"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 32, "cell_type": "code", "source": "sum(missing[:500])", "outputs": [{"execution_count": 32, "output_type": "execute_result", "data": {"text/plain": "1309"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": 
true}}, {"source": "... but most of them are from a handful of clients.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 33, "cell_type": "code", "source": "1309.0 / 108832", "outputs": [{"execution_count": 33, "output_type": "execute_result", "data": {"text/plain": "0.012027712437518377"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "... even if we damp, rather than discard, the long tail", "cell_type": "markdown", "metadata": {}}, {"execution_count": 34, "cell_type": "code", "source": "(1.0 * (sum(missing[:500]) + 6 * 56)) / 108832", "outputs": [{"execution_count": 34, "output_type": "execute_result", "data": {"text/plain": "0.015115039694207586"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "Conclusion:\n\n* In 2.9% of clients (214 of 7,421), the subsessionCounter under-increments.\n* In 4.5% of clients, either it over-increments or we lose pings.\n\n - If it never over-increments, then 8% of pings are lost.\n - BUT, most of those losses are from 56 of 7,421 (or 7,207) clients.\n - Specifically 82% of lost pings are from 0.8% of clients.\n\n* Maybe we can learn more by closely examining the pings with large gaps.", "cell_type": "markdown", "metadata": {}}, {"source": "One more way of looking at it:\n\n- 98.5% of clients have 0.5% of subsessions missing or over-incremented.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 41, "cell_type": "code", "source": "all_but_106 = sum(missing[:450]) # 674\nprint \"clients:\", (7421-106.0) / 7421\nprint \"pings:\", (1.0*all_but_106) / 128543", "outputs": [{"output_type": "stream", "name": "stdout", "text": "clients: 0.985716210753\npings: 0.00524338159215\n"}], "metadata": {"collapsed": false, "trusted": true}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 2", "name": "python2", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "2.7.9", "name": 
"python", "file_extension": ".py", "pygments_lexer": "ipython2", "codemirror_mode": {"version": 2, "name": "ipython"}}}}
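The gap-counting cells in the notebook can be restated as a small standalone sketch that runs without Spark or moztelemetry. It is written for Python 3 (the notebook itself uses Python 2), and the names `summarize_gaps` and `tail_share` and the toy histories are hypothetical, introduced only for illustration. It assumes, as the notebook does after discarding duplicates, that no client repeats a `profileSubsessionCounter` value. One deliberate difference: it reports missing pings as a share of expected pings (observed plus missing), which is only one possible denominator and not the one used in the notebook's cells.

```python
COUNTER_KEY = "payload/info/profileSubsessionCounter"


def summarize_gaps(histories, counter_key=COUNTER_KEY):
    """Return (observed_pings, sorted_gap_sizes) for per-client ping histories.

    Each history is a list of dicts holding an integer under counter_key.
    A gap of size n means n consecutive counter values were never seen.
    """
    observed = 0
    gaps = []
    for history in histories:
        counters = sorted(ping[counter_key] for ping in history)
        observed += len(counters)
        for previous, current in zip(counters, counters[1:]):
            gap = current - previous - 1
            if gap > 0:
                gaps.append(gap)
    return observed, sorted(gaps)


def tail_share(gaps, tail_size):
    """Fraction of all missing pings contributed by the tail_size largest gaps."""
    total = sum(gaps)
    if total == 0 or tail_size <= 0:
        return 0.0
    return sum(sorted(gaps)[-tail_size:]) / total


# Toy data (hypothetical): three clients, two of which have gaps.
histories = [
    [{COUNTER_KEY: 1}, {COUNTER_KEY: 2}, {COUNTER_KEY: 3}],   # no gap
    [{COUNTER_KEY: 1}, {COUNTER_KEY: 4}],                     # counters 2 and 3 missing
    [{COUNTER_KEY: 5}, {COUNTER_KEY: 6}, {COUNTER_KEY: 10}],  # counters 7, 8, 9 missing
]

observed, gaps = summarize_gaps(histories)
missing = sum(gaps)
missing_share = missing / (observed + missing)  # share of expected pings
largest_gap_share = tail_share(gaps, 1)

print(observed, gaps, missing_share, largest_gap_share)
```

Sorting the gap sizes first makes the long tail easy to inspect, and `tail_share` gives a direct way to ask how much of the apparent loss comes from the few largest gaps rather than from typical clients.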