Last active
November 15, 2022 02:36
-
-
Save stharrold/17589e6809d249942debe3a5c43d38cc to your computer and use it in GitHub Desktop.
20170928_pandas_timeseries_rolling_large-window_unique-count.ipynb
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# How to efficiently compute a rolling unique count in a `pandas` time series?\n", | |
"\n", | |
"I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people visiting the building in the last 365 days (i.e. a rolling unique count with a window of 365 days).\n", | |
"\n", | |
"`pandas` does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are a large number of unique visitors and/or a large window. (The actual data is larger than this example.)\n", | |
"\n", | |
"Is there a better way to calculate than what I've done below? I'm not sure why the fast method I made, `windowed_nunique` (under \"Speed test 3\"), is off by 1.\n", | |
"\n", | |
"Thanks for any help!\n", | |
"\n", | |
"Related links:\n", | |
"* https://github.com/pandas-dev/pandas/issues/14336" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Initialization" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Import libraries.\n", | |
"import pandas as pd\n", | |
"import numba\n", | |
"import numpy as np" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Created data of people visiting a building:\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Date</th>\n", | |
" <th>PersonId</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2010-01-01</td>\n", | |
" <td>76</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2010-01-01</td>\n", | |
" <td>63</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2010-01-01</td>\n", | |
" <td>89</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2010-01-01</td>\n", | |
" <td>81</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2010-01-01</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>2010-01-02</td>\n", | |
" <td>22</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>2010-01-02</td>\n", | |
" <td>83</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>2010-01-02</td>\n", | |
" <td>78</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>2010-01-02</td>\n", | |
" <td>47</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>2010-01-02</td>\n", | |
" <td>68</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>2010-01-02</td>\n", | |
" <td>72</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11</th>\n", | |
" <td>2010-01-03</td>\n", | |
" <td>89</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12</th>\n", | |
" <td>2010-01-03</td>\n", | |
" <td>94</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13</th>\n", | |
" <td>2010-01-03</td>\n", | |
" <td>44</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>14</th>\n", | |
" <td>2010-01-04</td>\n", | |
" <td>67</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>15</th>\n", | |
" <td>2010-01-04</td>\n", | |
" <td>88</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>2010-01-04</td>\n", | |
" <td>90</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>17</th>\n", | |
" <td>2010-01-05</td>\n", | |
" <td>30</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>18</th>\n", | |
" <td>2010-01-05</td>\n", | |
" <td>90</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>19</th>\n", | |
" <td>2010-01-05</td>\n", | |
" <td>70</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>20</th>\n", | |
" <td>2010-01-06</td>\n", | |
" <td>10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>21</th>\n", | |
" <td>2010-01-06</td>\n", | |
" <td>77</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>22</th>\n", | |
" <td>2010-01-07</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>23</th>\n", | |
" <td>2010-01-08</td>\n", | |
" <td>78</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>24</th>\n", | |
" <td>2010-01-08</td>\n", | |
" <td>81</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25</th>\n", | |
" <td>2010-01-08</td>\n", | |
" <td>49</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>26</th>\n", | |
" <td>2010-01-08</td>\n", | |
" <td>96</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>27</th>\n", | |
" <td>2010-01-08</td>\n", | |
" <td>92</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>28</th>\n", | |
" <td>2010-01-09</td>\n", | |
" <td>35</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>29</th>\n", | |
" <td>2010-01-09</td>\n", | |
" <td>69</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9151</th>\n", | |
" <td>2014-12-26</td>\n", | |
" <td>89</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9152</th>\n", | |
" <td>2014-12-26</td>\n", | |
" <td>54</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9153</th>\n", | |
" <td>2014-12-26</td>\n", | |
" <td>56</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9154</th>\n", | |
" <td>2014-12-26</td>\n", | |
" <td>76</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9155</th>\n", | |
" <td>2014-12-26</td>\n", | |
" <td>95</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9156</th>\n", | |
" <td>2014-12-26</td>\n", | |
" <td>32</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9157</th>\n", | |
" <td>2014-12-27</td>\n", | |
" <td>90</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9158</th>\n", | |
" <td>2014-12-27</td>\n", | |
" <td>73</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9159</th>\n", | |
" <td>2014-12-28</td>\n", | |
" <td>90</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9160</th>\n", | |
" <td>2014-12-28</td>\n", | |
" <td>55</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9161</th>\n", | |
" <td>2014-12-28</td>\n", | |
" <td>88</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9162</th>\n", | |
" <td>2014-12-28</td>\n", | |
" <td>49</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9163</th>\n", | |
" <td>2014-12-28</td>\n", | |
" <td>93</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9164</th>\n", | |
" <td>2014-12-29</td>\n", | |
" <td>51</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9165</th>\n", | |
" <td>2014-12-29</td>\n", | |
" <td>63</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9166</th>\n", | |
" <td>2014-12-29</td>\n", | |
" <td>27</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9167</th>\n", | |
" <td>2014-12-29</td>\n", | |
" <td>92</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9168</th>\n", | |
" <td>2014-12-29</td>\n", | |
" <td>53</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9169</th>\n", | |
" <td>2014-12-30</td>\n", | |
" <td>66</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9170</th>\n", | |
" <td>2014-12-30</td>\n", | |
" <td>92</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9171</th>\n", | |
" <td>2014-12-30</td>\n", | |
" <td>94</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9172</th>\n", | |
" <td>2014-12-30</td>\n", | |
" <td>75</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9173</th>\n", | |
" <td>2014-12-30</td>\n", | |
" <td>27</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9174</th>\n", | |
" <td>2014-12-30</td>\n", | |
" <td>99</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9175</th>\n", | |
" <td>2014-12-31</td>\n", | |
" <td>83</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9176</th>\n", | |
" <td>2014-12-31</td>\n", | |
" <td>42</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9177</th>\n", | |
" <td>2014-12-31</td>\n", | |
" <td>44</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9178</th>\n", | |
" <td>2015-01-01</td>\n", | |
" <td>93</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9179</th>\n", | |
" <td>2015-01-01</td>\n", | |
" <td>30</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9180</th>\n", | |
" <td>2015-01-01</td>\n", | |
" <td>80</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>9181 rows × 2 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Date PersonId\n", | |
"0 2010-01-01 76\n", | |
"1 2010-01-01 63\n", | |
"2 2010-01-01 89\n", | |
"3 2010-01-01 81\n", | |
"4 2010-01-01 7\n", | |
"5 2010-01-02 22\n", | |
"6 2010-01-02 83\n", | |
"7 2010-01-02 78\n", | |
"8 2010-01-02 47\n", | |
"9 2010-01-02 68\n", | |
"10 2010-01-02 72\n", | |
"11 2010-01-03 89\n", | |
"12 2010-01-03 94\n", | |
"13 2010-01-03 44\n", | |
"14 2010-01-04 67\n", | |
"15 2010-01-04 88\n", | |
"16 2010-01-04 90\n", | |
"17 2010-01-05 30\n", | |
"18 2010-01-05 90\n", | |
"19 2010-01-05 70\n", | |
"20 2010-01-06 10\n", | |
"21 2010-01-06 77\n", | |
"22 2010-01-07 15\n", | |
"23 2010-01-08 78\n", | |
"24 2010-01-08 81\n", | |
"25 2010-01-08 49\n", | |
"26 2010-01-08 96\n", | |
"27 2010-01-08 92\n", | |
"28 2010-01-09 35\n", | |
"29 2010-01-09 69\n", | |
"... ... ...\n", | |
"9151 2014-12-26 89\n", | |
"9152 2014-12-26 54\n", | |
"9153 2014-12-26 56\n", | |
"9154 2014-12-26 76\n", | |
"9155 2014-12-26 95\n", | |
"9156 2014-12-26 32\n", | |
"9157 2014-12-27 90\n", | |
"9158 2014-12-27 73\n", | |
"9159 2014-12-28 90\n", | |
"9160 2014-12-28 55\n", | |
"9161 2014-12-28 88\n", | |
"9162 2014-12-28 49\n", | |
"9163 2014-12-28 93\n", | |
"9164 2014-12-29 51\n", | |
"9165 2014-12-29 63\n", | |
"9166 2014-12-29 27\n", | |
"9167 2014-12-29 92\n", | |
"9168 2014-12-29 53\n", | |
"9169 2014-12-30 66\n", | |
"9170 2014-12-30 92\n", | |
"9171 2014-12-30 94\n", | |
"9172 2014-12-30 75\n", | |
"9173 2014-12-30 27\n", | |
"9174 2014-12-30 99\n", | |
"9175 2014-12-31 83\n", | |
"9176 2014-12-31 42\n", | |
"9177 2014-12-31 44\n", | |
"9178 2015-01-01 93\n", | |
"9179 2015-01-01 30\n", | |
"9180 2015-01-01 80\n", | |
"\n", | |
"[9181 rows x 2 columns]" | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Create data of people visiting a building.\n", | |
"\n", | |
"np.random.seed(seed=0)\n", | |
"dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')\n", | |
"window = 365 # days\n", | |
"num_pids = 100\n", | |
"probs = np.linspace(start=0.001, stop=0.1, num=num_pids)\n", | |
"\n", | |
"df = pd\\\n", | |
" .DataFrame(\n", | |
" data=[(date, pid)\n", | |
" for (pid, prob) in zip(range(num_pids), probs)\n", | |
" for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],\n", | |
" columns=['Date', 'PersonId'])\\\n", | |
" .sort_values(by='Date')\\\n", | |
" .reset_index(drop=True)\n", | |
"\n", | |
"print(\"Created data of people visiting a building:\")\n", | |
"df" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Speed reference" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"3.37 ms ± 82.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"# This counts the number of people visiting the building, not the number of unique people.\n", | |
"# Provided as a speed reference.\n", | |
"df.rolling(window='{:d}D'.format(window), on='Date').count()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Speed test 1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"2.49 s ± 275 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Save results as a reference to check calculation accuracy.\n", | |
"ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Speed test 2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Define a custom function and implement a just-in-time compiler.\n", | |
"@numba.jit(nopython=True)\n", | |
"def nunique(arr):\n", | |
" return len(set(arr))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"443 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Check accuracy of results.\n", | |
"test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values\n", | |
"assert all(ref == test)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Speed test 3" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Define a custom function and implement a just-in-time compiler.\n", | |
"@numba.jit(nopython=True)\n", | |
"def windowed_nunique(dates, pids, window):\n", | |
" r\"\"\"Track number of unique persons in window,\n", | |
" reading through arrays only once.\n", | |
"\n", | |
" Args:\n", | |
" dates (numpy.ndarray): Array of dates as number of days since epoch.\n", | |
" pids (numpy.ndarray): Array of integer person identifiers.\n", | |
" window (int): Width of window in units of difference of `dates`.\n", | |
"\n", | |
" Returns:\n", | |
" ucts (numpy.ndarray): Array of unique counts.\n", | |
"\n", | |
" Raises:\n", | |
" AssertionError: Raised if `len(dates) != len(pids)`\n", | |
"\n", | |
" Notes:\n", | |
" * May be off by 1 compared to `pandas.core.window.Rolling`\n", | |
" with a time series alias offset.\n", | |
"\n", | |
" \"\"\"\n", | |
"\n", | |
" # Check arguments.\n", | |
" assert dates.shape == pids.shape\n", | |
"\n", | |
" # Initialize counters.\n", | |
" idx_min = 0\n", | |
" idx_max = dates.shape[0]\n", | |
" date_min = dates[idx_min]\n", | |
" pid_min = pids[idx_min]\n", | |
" pid_max = np.max(pids)\n", | |
" pid_cts = np.zeros(pid_max, dtype=np.int64)\n", | |
" pid_cts[pid_min] = 1\n", | |
" uct = 1\n", | |
" ucts = np.zeros(idx_max, dtype=np.int64)\n", | |
" ucts[idx_min] = uct\n", | |
" idx = 1\n", | |
"\n", | |
" # For each (date, person)...\n", | |
" while idx < idx_max:\n", | |
"\n", | |
" # If person count went from 0 to 1, increment unique person count.\n", | |
" date = dates[idx]\n", | |
" pid = pids[idx]\n", | |
" pid_cts[pid] += 1\n", | |
" if pid_cts[pid] == 1:\n", | |
" uct += 1\n", | |
"\n", | |
" # For past dates outside of window...\n", | |
" while (date - date_min) > window:\n", | |
"\n", | |
" # If person count went from 1 to 0, decrement unique person count.\n", | |
" pid_cts[pid_min] -= 1\n", | |
" if pid_cts[pid_min] == 0:\n", | |
" uct -= 1\n", | |
" idx_min += 1\n", | |
" date_min = dates[idx_min]\n", | |
" pid_min = pids[idx_min]\n", | |
"\n", | |
" # Record unique person count.\n", | |
" ucts[idx] = uct\n", | |
" idx += 1\n", | |
"\n", | |
" return ucts" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Cast dates to integers.\n", | |
"df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')\n", | |
"df['DateEpoch'] = df['DateEpoch'].astype(int)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"110 µs ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"windowed_nunique(\n", | |
" dates=df['DateEpoch'].values,\n", | |
" pids=df['PersonId'].values,\n", | |
" window=window)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Check accuracy of results.\n", | |
"test = windowed_nunique(\n", | |
" dates=df['DateEpoch'].values,\n", | |
" pids=df['PersonId'].values,\n", | |
" window=window)\n", | |
"# Note: Method may be off by 1.\n", | |
"assert all(np.isclose(ref, np.asarray(test), atol=1))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Where reference ('ref') calculation of number of unique people doesn't match 'test':\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Date</th>\n", | |
" <th>PersonId</th>\n", | |
" <th>DateEpoch</th>\n", | |
" <th>ref</th>\n", | |
" <th>test</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>2331</th>\n", | |
" <td>2011-04-14</td>\n", | |
" <td>83</td>\n", | |
" <td>15078</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2332</th>\n", | |
" <td>2011-04-14</td>\n", | |
" <td>91</td>\n", | |
" <td>15078</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2333</th>\n", | |
" <td>2011-04-14</td>\n", | |
" <td>33</td>\n", | |
" <td>15078</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2393</th>\n", | |
" <td>2011-04-29</td>\n", | |
" <td>95</td>\n", | |
" <td>15093</td>\n", | |
" <td>95.0</td>\n", | |
" <td>96</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2394</th>\n", | |
" <td>2011-04-29</td>\n", | |
" <td>70</td>\n", | |
" <td>15093</td>\n", | |
" <td>95.0</td>\n", | |
" <td>96</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2395</th>\n", | |
" <td>2011-04-29</td>\n", | |
" <td>65</td>\n", | |
" <td>15093</td>\n", | |
" <td>95.0</td>\n", | |
" <td>96</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2396</th>\n", | |
" <td>2011-04-29</td>\n", | |
" <td>59</td>\n", | |
" <td>15093</td>\n", | |
" <td>95.0</td>\n", | |
" <td>96</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2397</th>\n", | |
" <td>2011-04-29</td>\n", | |
" <td>98</td>\n", | |
" <td>15093</td>\n", | |
" <td>95.0</td>\n", | |
" <td>96</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2464</th>\n", | |
" <td>2011-05-11</td>\n", | |
" <td>58</td>\n", | |
" <td>15105</td>\n", | |
" <td>94.0</td>\n", | |
" <td>95</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2465</th>\n", | |
" <td>2011-05-11</td>\n", | |
" <td>35</td>\n", | |
" <td>15105</td>\n", | |
" <td>94.0</td>\n", | |
" <td>95</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2466</th>\n", | |
" <td>2011-05-11</td>\n", | |
" <td>92</td>\n", | |
" <td>15105</td>\n", | |
" <td>94.0</td>\n", | |
" <td>95</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2467</th>\n", | |
" <td>2011-05-11</td>\n", | |
" <td>28</td>\n", | |
" <td>15105</td>\n", | |
" <td>94.0</td>\n", | |
" <td>95</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3337</th>\n", | |
" <td>2011-10-29</td>\n", | |
" <td>84</td>\n", | |
" <td>15276</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3338</th>\n", | |
" <td>2011-10-29</td>\n", | |
" <td>30</td>\n", | |
" <td>15276</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3339</th>\n", | |
" <td>2011-10-29</td>\n", | |
" <td>91</td>\n", | |
" <td>15276</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3340</th>\n", | |
" <td>2011-10-29</td>\n", | |
" <td>81</td>\n", | |
" <td>15276</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3383</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>78</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3384</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>53</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3385</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>42</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3386</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>99</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3387</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>83</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3388</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>69</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3389</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>26</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3390</th>\n", | |
" <td>2011-11-07</td>\n", | |
" <td>63</td>\n", | |
" <td>15285</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6017</th>\n", | |
" <td>2013-04-19</td>\n", | |
" <td>70</td>\n", | |
" <td>15814</td>\n", | |
" <td>98.0</td>\n", | |
" <td>99</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6018</th>\n", | |
" <td>2013-04-19</td>\n", | |
" <td>92</td>\n", | |
" <td>15814</td>\n", | |
" <td>98.0</td>\n", | |
" <td>99</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6019</th>\n", | |
" <td>2013-04-19</td>\n", | |
" <td>79</td>\n", | |
" <td>15814</td>\n", | |
" <td>98.0</td>\n", | |
" <td>99</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6020</th>\n", | |
" <td>2013-04-19</td>\n", | |
" <td>60</td>\n", | |
" <td>15814</td>\n", | |
" <td>98.0</td>\n", | |
" <td>99</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6021</th>\n", | |
" <td>2013-04-19</td>\n", | |
" <td>51</td>\n", | |
" <td>15814</td>\n", | |
" <td>98.0</td>\n", | |
" <td>99</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6022</th>\n", | |
" <td>2013-04-19</td>\n", | |
" <td>80</td>\n", | |
" <td>15814</td>\n", | |
" <td>98.0</td>\n", | |
" <td>99</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6107</th>\n", | |
" <td>2013-05-05</td>\n", | |
" <td>88</td>\n", | |
" <td>15830</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6108</th>\n", | |
" <td>2013-05-05</td>\n", | |
" <td>97</td>\n", | |
" <td>15830</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6109</th>\n", | |
" <td>2013-05-05</td>\n", | |
" <td>98</td>\n", | |
" <td>15830</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6698</th>\n", | |
" <td>2013-09-01</td>\n", | |
" <td>93</td>\n", | |
" <td>15949</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6699</th>\n", | |
" <td>2013-09-01</td>\n", | |
" <td>49</td>\n", | |
" <td>15949</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6700</th>\n", | |
" <td>2013-09-01</td>\n", | |
" <td>61</td>\n", | |
" <td>15949</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6701</th>\n", | |
" <td>2013-09-01</td>\n", | |
" <td>35</td>\n", | |
" <td>15949</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6702</th>\n", | |
" <td>2013-09-01</td>\n", | |
" <td>70</td>\n", | |
" <td>15949</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6703</th>\n", | |
" <td>2013-09-01</td>\n", | |
" <td>83</td>\n", | |
" <td>15949</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6704</th>\n", | |
" <td>2013-09-01</td>\n", | |
" <td>30</td>\n", | |
" <td>15949</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6882</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>80</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6883</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>71</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6884</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>31</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6885</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>53</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6886</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>76</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6887</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>99</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6888</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>85</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6889</th>\n", | |
" <td>2013-10-06</td>\n", | |
" <td>82</td>\n", | |
" <td>15984</td>\n", | |
" <td>96.0</td>\n", | |
" <td>97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7998</th>\n", | |
" <td>2014-05-12</td>\n", | |
" <td>96</td>\n", | |
" <td>16202</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7999</th>\n", | |
" <td>2014-05-12</td>\n", | |
" <td>32</td>\n", | |
" <td>16202</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8000</th>\n", | |
" <td>2014-05-12</td>\n", | |
" <td>24</td>\n", | |
" <td>16202</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8001</th>\n", | |
" <td>2014-05-12</td>\n", | |
" <td>29</td>\n", | |
" <td>16202</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9057</th>\n", | |
" <td>2014-12-08</td>\n", | |
" <td>73</td>\n", | |
" <td>16412</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9058</th>\n", | |
" <td>2014-12-08</td>\n", | |
" <td>66</td>\n", | |
" <td>16412</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9059</th>\n", | |
" <td>2014-12-08</td>\n", | |
" <td>37</td>\n", | |
" <td>16412</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9060</th>\n", | |
" <td>2014-12-08</td>\n", | |
" <td>48</td>\n", | |
" <td>16412</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9061</th>\n", | |
" <td>2014-12-08</td>\n", | |
" <td>99</td>\n", | |
" <td>16412</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9062</th>\n", | |
" <td>2014-12-08</td>\n", | |
" <td>81</td>\n", | |
" <td>16412</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9063</th>\n", | |
" <td>2014-12-08</td>\n", | |
" <td>40</td>\n", | |
" <td>16412</td>\n", | |
" <td>97.0</td>\n", | |
" <td>98</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Date PersonId DateEpoch ref test\n", | |
"2331 2011-04-14 83 15078 96.0 97\n", | |
"2332 2011-04-14 91 15078 96.0 97\n", | |
"2333 2011-04-14 33 15078 96.0 97\n", | |
"2393 2011-04-29 95 15093 95.0 96\n", | |
"2394 2011-04-29 70 15093 95.0 96\n", | |
"2395 2011-04-29 65 15093 95.0 96\n", | |
"2396 2011-04-29 59 15093 95.0 96\n", | |
"2397 2011-04-29 98 15093 95.0 96\n", | |
"2464 2011-05-11 58 15105 94.0 95\n", | |
"2465 2011-05-11 35 15105 94.0 95\n", | |
"2466 2011-05-11 92 15105 94.0 95\n", | |
"2467 2011-05-11 28 15105 94.0 95\n", | |
"3337 2011-10-29 84 15276 97.0 98\n", | |
"3338 2011-10-29 30 15276 97.0 98\n", | |
"3339 2011-10-29 91 15276 97.0 98\n", | |
"3340 2011-10-29 81 15276 97.0 98\n", | |
"3383 2011-11-07 78 15285 96.0 97\n", | |
"3384 2011-11-07 53 15285 96.0 97\n", | |
"3385 2011-11-07 42 15285 96.0 97\n", | |
"3386 2011-11-07 99 15285 96.0 97\n", | |
"3387 2011-11-07 83 15285 96.0 97\n", | |
"3388 2011-11-07 69 15285 96.0 97\n", | |
"3389 2011-11-07 26 15285 96.0 97\n", | |
"3390 2011-11-07 63 15285 96.0 97\n", | |
"6017 2013-04-19 70 15814 98.0 99\n", | |
"6018 2013-04-19 92 15814 98.0 99\n", | |
"6019 2013-04-19 79 15814 98.0 99\n", | |
"6020 2013-04-19 60 15814 98.0 99\n", | |
"6021 2013-04-19 51 15814 98.0 99\n", | |
"6022 2013-04-19 80 15814 98.0 99\n", | |
"6107 2013-05-05 88 15830 97.0 98\n", | |
"6108 2013-05-05 97 15830 97.0 98\n", | |
"6109 2013-05-05 98 15830 97.0 98\n", | |
"6698 2013-09-01 93 15949 97.0 98\n", | |
"6699 2013-09-01 49 15949 97.0 98\n", | |
"6700 2013-09-01 61 15949 97.0 98\n", | |
"6701 2013-09-01 35 15949 97.0 98\n", | |
"6702 2013-09-01 70 15949 97.0 98\n", | |
"6703 2013-09-01 83 15949 97.0 98\n", | |
"6704 2013-09-01 30 15949 97.0 98\n", | |
"6882 2013-10-06 80 15984 96.0 97\n", | |
"6883 2013-10-06 71 15984 96.0 97\n", | |
"6884 2013-10-06 31 15984 96.0 97\n", | |
"6885 2013-10-06 53 15984 96.0 97\n", | |
"6886 2013-10-06 76 15984 96.0 97\n", | |
"6887 2013-10-06 99 15984 96.0 97\n", | |
"6888 2013-10-06 85 15984 96.0 97\n", | |
"6889 2013-10-06 82 15984 96.0 97\n", | |
"7998 2014-05-12 96 16202 97.0 98\n", | |
"7999 2014-05-12 32 16202 97.0 98\n", | |
"8000 2014-05-12 24 16202 97.0 98\n", | |
"8001 2014-05-12 29 16202 97.0 98\n", | |
"9057 2014-12-08 73 16412 97.0 98\n", | |
"9058 2014-12-08 66 16412 97.0 98\n", | |
"9059 2014-12-08 37 16412 97.0 98\n", | |
"9060 2014-12-08 48 16412 97.0 98\n", | |
"9061 2014-12-08 99 16412 97.0 98\n", | |
"9062 2014-12-08 81 16412 97.0 98\n", | |
"9063 2014-12-08 40 16412 97.0 98" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Show where the calculation doesn't match.\n", | |
"print(\"Where reference ('ref') calculation of number of unique people doesn't match 'test':\")\n", | |
"df['ref'] = ref\n", | |
"df['test'] = test\n", | |
"df.loc[df['ref'] != df['test']]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Solution" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Define a custom function and implement a just-in-time compiler.\n", | |
"@numba.jit(nopython=True)\n", | |
"def windowed_nunique_corrected(dates, pids, window):\n", | |
" r\"\"\"Track number of unique persons in window,\n", | |
" reading through arrays only once.\n", | |
"\n", | |
" Args:\n", | |
" dates (numpy.ndarray): Array of dates as number of days since epoch.\n", | |
" pids (numpy.ndarray): Array of integer person identifiers.\n", | |
" Required: min(pids) >= 0\n", | |
" window (int): Width of window in units of difference of `dates`.\n", | |
" Required: window >= 1\n", | |
"\n", | |
" Returns:\n", | |
" ucts (numpy.ndarray): Array of unique counts.\n", | |
"\n", | |
" Raises:\n", | |
" AssertionError: Raised if not...\n", | |
" * len(dates) == len(pids)\n", | |
" * min(pids) >= 0\n", | |
" * window >= 1\n", | |
"\n", | |
" Notes:\n", | |
" * Matches `pandas.core.window.Rolling`\n", | |
" with a time series alias offset.\n", | |
"\n", | |
" \"\"\"\n", | |
"\n", | |
" # Check arguments.\n", | |
" assert len(dates) == len(pids)\n", | |
" assert np.min(pids) >= 0\n", | |
" assert window >= 1\n", | |
"\n", | |
" # Initialize counters.\n", | |
" idx_min = 0\n", | |
" idx_max = dates.shape[0]\n", | |
" date_min = dates[idx_min]\n", | |
" pid_min = pids[idx_min]\n", | |
" pid_max = np.max(pids) + 1\n", | |
" pid_cts = np.zeros(pid_max, dtype=np.int64)\n", | |
" pid_cts[pid_min] = 1\n", | |
" uct = 1\n", | |
" ucts = np.zeros(idx_max, dtype=np.int64)\n", | |
" ucts[idx_min] = uct\n", | |
" idx = 1\n", | |
"\n", | |
" # For each (date, person)...\n", | |
" while idx < idx_max:\n", | |
"\n", | |
" # Lookup date, person.\n", | |
" date = dates[idx]\n", | |
" pid = pids[idx]\n", | |
"\n", | |
" # If person count went from 0 to 1, increment unique person count.\n", | |
" pid_cts[pid] += 1\n", | |
" if pid_cts[pid] == 1:\n", | |
" uct += 1\n", | |
"\n", | |
" # For past dates outside of window...\n", | |
" # Note: If window=3, it includes day0,day1,day2.\n", | |
" while (date - date_min + 1) > window:\n", | |
"\n", | |
" # If person count went from 1 to 0, decrement unique person count.\n", | |
" pid_cts[pid_min] -= 1\n", | |
" if pid_cts[pid_min] == 0:\n", | |
" uct -= 1\n", | |
" idx_min += 1\n", | |
" date_min = dates[idx_min]\n", | |
" pid_min = pids[idx_min]\n", | |
"\n", | |
" # Record unique person count.\n", | |
" ucts[idx] = uct\n", | |
" idx += 1\n", | |
"\n", | |
" return ucts" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Cast dates to integers.\n", | |
"df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')\n", | |
"df['DateEpoch'] = df['DateEpoch'].astype(int)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"windowed_nunique_corrected(\n", | |
" dates=df['DateEpoch'].values,\n", | |
" pids=df['PersonId'].values,\n", | |
" window=window)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Check accuracy of results.\n", | |
"test = windowed_nunique_corrected(\n", | |
" dates=df['DateEpoch'].values,\n", | |
" pids=df['PersonId'].values,\n", | |
" window=window)\n", | |
"assert all(ref == test)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.0" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Updated with solution.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Posted to Stack Overflow: https://stackoverflow.com/questions/46470743/how-to-efficiently-compute-a-rolling-unique-count-in-a-pandas-time-series