Last active
March 2, 2018 16:47
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Benchmark tsfresh performance\n", | |
"\n", | |
"In this example we benchmark three approaches:\n", | |
" * A. applying tsfresh directly to extract features from a columnar pd.DataFrame (uncompressed CSV of ~10MB)\n", | |
"\n", | |
"versus first aggregating the DataFrame by user and date to convert it to a labeled array (with xarray), and then either\n", | |
" * B. computing individual features manually with `_do_extraction_on_chunk`, or\n", | |
" * C. re-implementing a few metrics in a vectorized fashion and applying them to the xarray directly" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/home/datageek/anaconda2/envs/ts-env/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n", | |
" from pandas.core import datetools\n" | |
] | |
} | |
], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import xarray as xr\n", | |
"from tqdm import tqdm\n", | |
"from IPython.display import display\n", | |
"\n", | |
"from tsfresh import extract_features\n", | |
"from tsfresh.feature_extraction.extraction import _do_extraction_on_chunk" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here we load a randomly generated dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df = pd.read_csv('https://github.com/blue-yonder/tsfresh/files/1751897/sample_dataset.csv.gz')\n", | |
"df['t'] = pd.to_datetime(df.t)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Shape: (200000, 3)\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>amount</th>\n", | |
" <th>t</th>\n", | |
" <th>uid</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>-28.00</td>\n", | |
" <td>2013-01-01</td>\n", | |
" <td>5b3ecda7b4f48aa7fad7ceb2ae6b11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>-7.99</td>\n", | |
" <td>2013-01-01</td>\n", | |
" <td>020c7a57c3393ea13d6a0c30eee62e</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1.79</td>\n", | |
" <td>2013-01-01</td>\n", | |
" <td>f8618d79da85a037f52221517e6147</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0.89</td>\n", | |
" <td>2013-01-01</td>\n", | |
" <td>f8618d79da85a037f52221517e6147</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>103.00</td>\n", | |
" <td>2013-01-01</td>\n", | |
" <td>5acbd7dac6c86bc773a5689b38489d</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" amount t uid\n", | |
"0 -28.00 2013-01-01 5b3ecda7b4f48aa7fad7ceb2ae6b11\n", | |
"1 -7.99 2013-01-01 020c7a57c3393ea13d6a0c30eee62e\n", | |
"2 1.79 2013-01-01 f8618d79da85a037f52221517e6147\n", | |
"3 0.89 2013-01-01 f8618d79da85a037f52221517e6147\n", | |
"4 103.00 2013-01-01 5acbd7dac6c86bc773a5689b38489d" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"print('Shape:', df.shape)\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"amount 24738\n", | |
"t 601\n", | |
"uid 250\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.nunique()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Evaluate conversion to xarray**\n", | |
"\n", | |
"Next we aggregate by `uid` and `t` (time), and convert the DataFrame to an xarray.\n", | |
"\n", | |
"**Note:** here the dates already have daily precision, so grouping on `t` directly is enough. To do this properly in general, we should use `pd.Grouper(freq=<some_freq>)` to aggregate with a specific time frequency." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<xarray.Dataset>\n", | |
"Dimensions: (t: 601, uid: 250)\n", | |
"Coordinates:\n", | |
" * uid (uid) object '00c052da3b927126b4553e96c4083c' ...\n", | |
" * t (t) datetime64[ns] 2013-01-01 2013-01-02 2013-01-03 2013-01-04 ...\n", | |
"Data variables:\n", | |
" amount (uid, t) float64 24.25 25.49 -1.98 254.7 9.94 0.0 202.2 -230.2 ..." | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 84 ms, sys: 12 ms, total: 96 ms\n", | |
"Wall time: 96.7 ms\n" | |
] | |
} | |
], | |
"source": [ | |
"%%time\n", | |
"\n", | |
"X = xr.Dataset.from_dataframe(df.groupby(['uid', 't']).sum()).fillna(0.0)\n", | |
"display(X)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## A. Feature extraction with pandas.DataFrame input\n", | |
"\n", | |
"This is the default tsfresh approach." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 156 ms, sys: 24 ms, total: 180 ms\n", | |
"Wall time: 235 ms\n" | |
] | |
} | |
], | |
"source": [ | |
"%%time\n", | |
"\n", | |
"fc_params = {'abs_energy': None, 'absolute_sum_of_changes': None}\n", | |
"\n", | |
"F_A = extract_features(df, column_id=\"uid\", column_sort=\"t\",\n", | |
" default_fc_parameters=fc_params,\n", | |
" disable_progressbar=True)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## B. Feature extraction with xarray input (tsfresh)\n", | |
"\n", | |
"We manually apply `_do_extraction_on_chunk` on the rows of the aggregated matrix." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 160 ms, sys: 12 ms, total: 172 ms\n", | |
"Wall time: 170 ms\n" | |
] | |
} | |
], | |
"source": [ | |
"%%time\n", | |
"\n", | |
"res = []\n", | |
"for row in X['amount']:\n", | |
" idx = row.coords['uid'].values.item()  # np.asscalar was removed in recent NumPy\n", | |
" \n", | |
" res_row = _do_extraction_on_chunk((idx, 'amount', pd.Series(row.values, index=X.coords['t'].values)),\n", | |
" fc_params, None)\n", | |
" res += res_row\n", | |
"\n", | |
"F_B = pd.DataFrame(res).groupby(['id', 'variable']).value.sum().unstack()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## C. Feature extraction with xarray input (vectorized)\n", | |
"\n", | |
"Here we reimplement a few feature extraction functions so that they work directly on the whole xarray using vectorized numpy functions." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 12 ms, sys: 4 ms, total: 16 ms\n", | |
"Wall time: 21.5 ms\n" | |
] | |
} | |
], | |
"source": [ | |
"%%time\n", | |
"\n", | |
"\n", | |
"def abs_energy(X):\n", | |
" return xr.apply_ufunc(np.linalg.norm, X,\n", | |
" input_core_dims=[['t']],\n", | |
" kwargs={'ord': 2, 'axis': -1})**2\n", | |
"\n", | |
"\n", | |
"def absolute_sum_of_changes(X):\n", | |
" return np.abs(X.diff('t')).sum('t')\n", | |
"\n", | |
"\n", | |
"res = []\n", | |
"for name, func in [('abs_energy', abs_energy),\n", | |
" ('absolute_sum_of_changes', absolute_sum_of_changes)]:\n", | |
" y = func(X['amount'])\n", | |
" y.coords['variable'] = \"amount__\" + name\n", | |
" res.append(y)\n", | |
"F_C = xr.concat(res, dim='variable').to_dataframe()['amount'].unstack(0)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th>variable</th>\n", | |
" <th>amount__abs_energy</th>\n", | |
" <th>amount__absolute_sum_of_changes</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>id</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>00c052da3b927126b4553e96c4083c</th>\n", | |
" <td>1.236324e+07</td>\n", | |
" <td>68395.37</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0195b9bdf14913d11d95d294e1afd9</th>\n", | |
" <td>1.026265e+07</td>\n", | |
" <td>61450.25</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>01f06920099b3c97bbc5b7e6c0dcb4</th>\n", | |
" <td>2.046349e+07</td>\n", | |
" <td>105852.06</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>020c7a57c3393ea13d6a0c30eee62e</th>\n", | |
" <td>1.015636e+07</td>\n", | |
" <td>49664.44</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0227d2368280e2754ce05a15c32f5c</th>\n", | |
" <td>1.253965e+07</td>\n", | |
" <td>93912.65</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
"variable amount__abs_energy \\\n", | |
"id \n", | |
"00c052da3b927126b4553e96c4083c 1.236324e+07 \n", | |
"0195b9bdf14913d11d95d294e1afd9 1.026265e+07 \n", | |
"01f06920099b3c97bbc5b7e6c0dcb4 2.046349e+07 \n", | |
"020c7a57c3393ea13d6a0c30eee62e 1.015636e+07 \n", | |
"0227d2368280e2754ce05a15c32f5c 1.253965e+07 \n", | |
"\n", | |
"variable amount__absolute_sum_of_changes \n", | |
"id \n", | |
"00c052da3b927126b4553e96c4083c 68395.37 \n", | |
"0195b9bdf14913d11d95d294e1afd9 61450.25 \n", | |
"01f06920099b3c97bbc5b7e6c0dcb4 105852.06 \n", | |
"020c7a57c3393ea13d6a0c30eee62e 49664.44 \n", | |
"0227d2368280e2754ce05a15c32f5c 93912.65 " | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"F_A.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th>variable</th>\n", | |
" <th>amount__abs_energy</th>\n", | |
" <th>amount__absolute_sum_of_changes</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>id</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>00c052da3b927126b4553e96c4083c</th>\n", | |
" <td>1.387087e+07</td>\n", | |
" <td>57329.05</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0195b9bdf14913d11d95d294e1afd9</th>\n", | |
" <td>1.143035e+07</td>\n", | |
" <td>56149.97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>01f06920099b3c97bbc5b7e6c0dcb4</th>\n", | |
" <td>1.865184e+07</td>\n", | |
" <td>77300.61</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>020c7a57c3393ea13d6a0c30eee62e</th>\n", | |
" <td>9.911486e+06</td>\n", | |
" <td>45632.03</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0227d2368280e2754ce05a15c32f5c</th>\n", | |
" <td>1.201602e+07</td>\n", | |
" <td>67769.26</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
"variable amount__abs_energy \\\n", | |
"id \n", | |
"00c052da3b927126b4553e96c4083c 1.387087e+07 \n", | |
"0195b9bdf14913d11d95d294e1afd9 1.143035e+07 \n", | |
"01f06920099b3c97bbc5b7e6c0dcb4 1.865184e+07 \n", | |
"020c7a57c3393ea13d6a0c30eee62e 9.911486e+06 \n", | |
"0227d2368280e2754ce05a15c32f5c 1.201602e+07 \n", | |
"\n", | |
"variable amount__absolute_sum_of_changes \n", | |
"id \n", | |
"00c052da3b927126b4553e96c4083c 57329.05 \n", | |
"0195b9bdf14913d11d95d294e1afd9 56149.97 \n", | |
"01f06920099b3c97bbc5b7e6c0dcb4 77300.61 \n", | |
"020c7a57c3393ea13d6a0c30eee62e 45632.03 \n", | |
"0227d2368280e2754ce05a15c32f5c 67769.26 " | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"F_B.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th>variable</th>\n", | |
" <th>amount__abs_energy</th>\n", | |
" <th>amount__absolute_sum_of_changes</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>uid</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>00c052da3b927126b4553e96c4083c</th>\n", | |
" <td>1.387087e+07</td>\n", | |
" <td>57329.05</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0195b9bdf14913d11d95d294e1afd9</th>\n", | |
" <td>1.143035e+07</td>\n", | |
" <td>56149.97</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>01f06920099b3c97bbc5b7e6c0dcb4</th>\n", | |
" <td>1.865184e+07</td>\n", | |
" <td>77300.61</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>020c7a57c3393ea13d6a0c30eee62e</th>\n", | |
" <td>9.911486e+06</td>\n", | |
" <td>45632.03</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0227d2368280e2754ce05a15c32f5c</th>\n", | |
" <td>1.201602e+07</td>\n", | |
" <td>67769.26</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
"variable amount__abs_energy \\\n", | |
"uid \n", | |
"00c052da3b927126b4553e96c4083c 1.387087e+07 \n", | |
"0195b9bdf14913d11d95d294e1afd9 1.143035e+07 \n", | |
"01f06920099b3c97bbc5b7e6c0dcb4 1.865184e+07 \n", | |
"020c7a57c3393ea13d6a0c30eee62e 9.911486e+06 \n", | |
"0227d2368280e2754ce05a15c32f5c 1.201602e+07 \n", | |
"\n", | |
"variable amount__absolute_sum_of_changes \n", | |
"uid \n", | |
"00c052da3b927126b4553e96c4083c 57329.05 \n", | |
"0195b9bdf14913d11d95d294e1afd9 56149.97 \n", | |
"01f06920099b3c97bbc5b7e6c0dcb4 77300.61 \n", | |
"020c7a57c3393ea13d6a0c30eee62e 45632.03 \n", | |
"0227d2368280e2754ce05a15c32f5c 67769.26 " | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"F_C.head()" | |
] | |
}, | |
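{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Check that approaches B and C produce identical results (up to a relative tolerance of 1e-9). The values from approach A differ, which is expected since it operates on the raw rows rather than on the daily aggregates." | |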
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert ((np.abs(F_B - F_C) / F_B) < 1e-9).values.all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Conclusion\n", | |
"\n", | |
"\n", | |
"The total run time for method A is ~200 ms. Comparing with the run time of method B (~170 ms), it seems likely that ~100-150 ms are spent on the feature extraction proper.\n", | |
"\n", | |
"\n", | |
"The cost of converting to xarray is ~100 ms. If we use the vectorized implementations, computing `abs_energy` and `absolute_sum_of_changes` is ~10x faster.\n", | |
"\n", | |
"On a much larger dataset (i.e. 14M rows instead of 0.2M rows) the same operations have the following run times:\n", | |
" * conversion to xarray: 6.6 s\n", | |
" * method A: 10.8 s\n", | |
" * method B: 5.5 s\n", | |
" * method C: 160 ms\n", | |
"\n", | |
"so purely on the feature extraction we seem to get an improvement of ~30x. Conversion to xarray has a fixed cost, but when computing hundreds of features it will be negligible compared to running feature extraction in tsfresh." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
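
As a companion to the notebook above, here is a minimal sketch of the two ideas it relies on: aggregating with an explicit time frequency via `pd.Grouper`, and computing `abs_energy` and `absolute_sum_of_changes` in a vectorized way over a dense (uid, t) matrix. It uses plain NumPy and pandas instead of xarray and tsfresh, and the synthetic data, the user names `u1`..`u3` and all variable names are assumptions made for illustration only, not part of the original benchmark.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's dataset (all values are made up).
rng = np.random.default_rng(0)
n_rows = 1000
df = pd.DataFrame({
    "uid": rng.choice(["u1", "u2", "u3"], size=n_rows),
    "t": pd.Timestamp("2013-01-01")
         + pd.to_timedelta(rng.integers(0, 30 * 24, size=n_rows), unit="h"),
    "amount": rng.normal(scale=50.0, size=n_rows).round(2),
})

# Aggregate per user and per day; pd.Grouper makes the frequency explicit.
daily = (
    df.groupby(["uid", pd.Grouper(key="t", freq="D")])["amount"]
      .sum()
      .unstack("t", fill_value=0.0)  # dense (uid, t) table, missing days -> 0
)
M = daily.to_numpy()

# Vectorized features, reducing over the time axis (axis=1).
abs_energy = (M ** 2).sum(axis=1)
absolute_sum_of_changes = np.abs(np.diff(M, axis=1)).sum(axis=1)

# Reference: the same features computed one row (user) at a time.
ref_energy = np.array([np.dot(row, row) for row in M])
ref_changes = np.array([np.abs(np.diff(row)).sum() for row in M])

print(np.allclose(abs_energy, ref_energy),
      np.allclose(absolute_sum_of_changes, ref_changes))
```

The per-row loop plays the role of method B and the array expressions the role of method C; on a matrix this small the timing difference is not meaningful, so the sketch only checks that both give the same values.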