Last active
May 28, 2020 19:04
-
-
Save jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Explore impact of pandas' Block consolidation" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"See https://github.com/pandas-dev/pandas/issues/10556 for context. \n", | |
"\n", | |
"This notebook is using a branch of pandas that adds a \"consolidation policy\" which lets us explore the differences a bit (https://github.com/jorisvandenbossche/pandas/commit/a6c744d38405b6645e076f63ac34517b7e960345, and in addition I also added the Floating extension array changes from https://github.com/pandas-dev/pandas/pull/34307)\n", | |
"\n", | |
"**Note** This notebook includes some benchmarks. However, the results of the timings are highly dependent on the characteristics of the data that is being used. I am now using a DataFrame with 100 columns and a 1000:1 rows/column ratio, which I think is a quite typical scenario. But when tweaking this (more or less columns, different ratio), this easily changes the results / interpretation of the benchmarks." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pd.options.compute.use_numexpr = False\n", | |
"pd.options.compute.use_bottleneck = False" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%load_ext snakeviz" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%load_ext line_profiler" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"arr = np.random.randn(100_000, 100)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df1_orig = pd.DataFrame(arr, policy=\"block\") # the current default\n", | |
"# additional copy to ensure the memory is column-aligned\n", | |
"df1 = df1_orig.copy()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" <th>5</th>\n", | |
" <th>6</th>\n", | |
" <th>7</th>\n", | |
" <th>8</th>\n", | |
" <th>9</th>\n", | |
" <th>...</th>\n", | |
" <th>90</th>\n", | |
" <th>91</th>\n", | |
" <th>92</th>\n", | |
" <th>93</th>\n", | |
" <th>94</th>\n", | |
" <th>95</th>\n", | |
" <th>96</th>\n", | |
" <th>97</th>\n", | |
" <th>98</th>\n", | |
" <th>99</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0.371022</td>\n", | |
" <td>-0.242209</td>\n", | |
" <td>2.113007</td>\n", | |
" <td>0.065577</td>\n", | |
" <td>-1.395332</td>\n", | |
" <td>0.345039</td>\n", | |
" <td>0.430615</td>\n", | |
" <td>0.910320</td>\n", | |
" <td>-0.887256</td>\n", | |
" <td>-1.248485</td>\n", | |
" <td>...</td>\n", | |
" <td>0.832916</td>\n", | |
" <td>-0.020408</td>\n", | |
" <td>1.140463</td>\n", | |
" <td>1.465279</td>\n", | |
" <td>0.333421</td>\n", | |
" <td>-1.686267</td>\n", | |
" <td>0.182302</td>\n", | |
" <td>0.260248</td>\n", | |
" <td>-0.056024</td>\n", | |
" <td>-0.645074</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0.813260</td>\n", | |
" <td>0.936914</td>\n", | |
" <td>-0.210771</td>\n", | |
" <td>0.564948</td>\n", | |
" <td>0.019599</td>\n", | |
" <td>-0.166271</td>\n", | |
" <td>0.251355</td>\n", | |
" <td>-0.134027</td>\n", | |
" <td>1.280353</td>\n", | |
" <td>-0.770996</td>\n", | |
" <td>...</td>\n", | |
" <td>0.028680</td>\n", | |
" <td>-0.224805</td>\n", | |
" <td>0.857354</td>\n", | |
" <td>0.462102</td>\n", | |
" <td>-0.786376</td>\n", | |
" <td>2.388081</td>\n", | |
" <td>-1.658082</td>\n", | |
" <td>-0.539284</td>\n", | |
" <td>-0.880693</td>\n", | |
" <td>0.421976</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0.455793</td>\n", | |
" <td>-0.381480</td>\n", | |
" <td>-0.502506</td>\n", | |
" <td>0.602100</td>\n", | |
" <td>0.293811</td>\n", | |
" <td>-0.370855</td>\n", | |
" <td>0.135328</td>\n", | |
" <td>1.038086</td>\n", | |
" <td>0.575656</td>\n", | |
" <td>0.928458</td>\n", | |
" <td>...</td>\n", | |
" <td>0.211374</td>\n", | |
" <td>1.911427</td>\n", | |
" <td>-0.383501</td>\n", | |
" <td>0.712876</td>\n", | |
" <td>-1.345929</td>\n", | |
" <td>-1.054745</td>\n", | |
" <td>-0.502296</td>\n", | |
" <td>0.255767</td>\n", | |
" <td>-0.555215</td>\n", | |
" <td>-0.374579</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0.055037</td>\n", | |
" <td>0.943844</td>\n", | |
" <td>1.186883</td>\n", | |
" <td>-0.097353</td>\n", | |
" <td>0.453953</td>\n", | |
" <td>0.336523</td>\n", | |
" <td>-0.227028</td>\n", | |
" <td>1.804491</td>\n", | |
" <td>-1.255713</td>\n", | |
" <td>-0.974349</td>\n", | |
" <td>...</td>\n", | |
" <td>0.435610</td>\n", | |
" <td>1.250981</td>\n", | |
" <td>1.484357</td>\n", | |
" <td>-0.528372</td>\n", | |
" <td>-0.226552</td>\n", | |
" <td>-0.891205</td>\n", | |
" <td>1.306358</td>\n", | |
" <td>1.130840</td>\n", | |
" <td>0.204551</td>\n", | |
" <td>1.501261</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0.429316</td>\n", | |
" <td>1.937749</td>\n", | |
" <td>0.980708</td>\n", | |
" <td>-0.124727</td>\n", | |
" <td>1.250186</td>\n", | |
" <td>0.181841</td>\n", | |
" <td>-1.260060</td>\n", | |
" <td>2.688526</td>\n", | |
" <td>0.764125</td>\n", | |
" <td>-0.687951</td>\n", | |
" <td>...</td>\n", | |
" <td>-2.689082</td>\n", | |
" <td>1.038927</td>\n", | |
" <td>-3.161205</td>\n", | |
" <td>1.209361</td>\n", | |
" <td>-0.064672</td>\n", | |
" <td>1.074549</td>\n", | |
" <td>0.433635</td>\n", | |
" <td>-0.492203</td>\n", | |
" <td>1.134994</td>\n", | |
" <td>0.922045</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99995</th>\n", | |
" <td>0.650366</td>\n", | |
" <td>0.296693</td>\n", | |
" <td>-0.648191</td>\n", | |
" <td>-0.027173</td>\n", | |
" <td>-1.368694</td>\n", | |
" <td>-1.414733</td>\n", | |
" <td>-0.098120</td>\n", | |
" <td>0.850049</td>\n", | |
" <td>0.931312</td>\n", | |
" <td>0.577836</td>\n", | |
" <td>...</td>\n", | |
" <td>-1.738731</td>\n", | |
" <td>0.826997</td>\n", | |
" <td>1.199955</td>\n", | |
" <td>0.048497</td>\n", | |
" <td>-0.452846</td>\n", | |
" <td>1.593279</td>\n", | |
" <td>-1.461541</td>\n", | |
" <td>1.250952</td>\n", | |
" <td>0.765631</td>\n", | |
" <td>1.259925</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99996</th>\n", | |
" <td>-1.025004</td>\n", | |
" <td>-0.033401</td>\n", | |
" <td>0.352300</td>\n", | |
" <td>2.029560</td>\n", | |
" <td>-1.081474</td>\n", | |
" <td>0.372101</td>\n", | |
" <td>-0.231158</td>\n", | |
" <td>0.129621</td>\n", | |
" <td>0.332448</td>\n", | |
" <td>-1.622283</td>\n", | |
" <td>...</td>\n", | |
" <td>0.491512</td>\n", | |
" <td>-0.619863</td>\n", | |
" <td>0.921266</td>\n", | |
" <td>0.258329</td>\n", | |
" <td>-0.385118</td>\n", | |
" <td>0.865576</td>\n", | |
" <td>0.142066</td>\n", | |
" <td>0.249012</td>\n", | |
" <td>1.083895</td>\n", | |
" <td>0.424546</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99997</th>\n", | |
" <td>-0.799553</td>\n", | |
" <td>-0.302133</td>\n", | |
" <td>-1.528081</td>\n", | |
" <td>0.159766</td>\n", | |
" <td>-0.194533</td>\n", | |
" <td>-0.942236</td>\n", | |
" <td>0.175192</td>\n", | |
" <td>0.295742</td>\n", | |
" <td>2.772861</td>\n", | |
" <td>0.499414</td>\n", | |
" <td>...</td>\n", | |
" <td>0.162933</td>\n", | |
" <td>0.919790</td>\n", | |
" <td>1.617661</td>\n", | |
" <td>-1.441009</td>\n", | |
" <td>-0.930747</td>\n", | |
" <td>-1.212881</td>\n", | |
" <td>-1.595667</td>\n", | |
" <td>-0.378324</td>\n", | |
" <td>-0.644536</td>\n", | |
" <td>-0.979022</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99998</th>\n", | |
" <td>-1.153394</td>\n", | |
" <td>-3.726555</td>\n", | |
" <td>1.006661</td>\n", | |
" <td>1.938871</td>\n", | |
" <td>-2.395232</td>\n", | |
" <td>0.255857</td>\n", | |
" <td>-0.609283</td>\n", | |
" <td>-0.351652</td>\n", | |
" <td>0.443662</td>\n", | |
" <td>-1.992509</td>\n", | |
" <td>...</td>\n", | |
" <td>1.074920</td>\n", | |
" <td>1.201669</td>\n", | |
" <td>0.665734</td>\n", | |
" <td>-0.600312</td>\n", | |
" <td>1.116904</td>\n", | |
" <td>-0.682997</td>\n", | |
" <td>-0.901353</td>\n", | |
" <td>2.203541</td>\n", | |
" <td>0.322509</td>\n", | |
" <td>-0.478056</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99999</th>\n", | |
" <td>1.461720</td>\n", | |
" <td>0.397803</td>\n", | |
" <td>0.186679</td>\n", | |
" <td>-0.155946</td>\n", | |
" <td>-0.427969</td>\n", | |
" <td>-0.385579</td>\n", | |
" <td>-0.359528</td>\n", | |
" <td>-0.491198</td>\n", | |
" <td>0.733553</td>\n", | |
" <td>-0.349641</td>\n", | |
" <td>...</td>\n", | |
" <td>0.370914</td>\n", | |
" <td>0.988451</td>\n", | |
" <td>1.326340</td>\n", | |
" <td>0.467480</td>\n", | |
" <td>0.635124</td>\n", | |
" <td>0.622167</td>\n", | |
" <td>0.377851</td>\n", | |
" <td>0.057745</td>\n", | |
" <td>0.370907</td>\n", | |
" <td>0.117684</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>100000 rows × 100 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2 3 4 5 6 \\\n", | |
"0 0.371022 -0.242209 2.113007 0.065577 -1.395332 0.345039 0.430615 \n", | |
"1 0.813260 0.936914 -0.210771 0.564948 0.019599 -0.166271 0.251355 \n", | |
"2 0.455793 -0.381480 -0.502506 0.602100 0.293811 -0.370855 0.135328 \n", | |
"3 0.055037 0.943844 1.186883 -0.097353 0.453953 0.336523 -0.227028 \n", | |
"4 0.429316 1.937749 0.980708 -0.124727 1.250186 0.181841 -1.260060 \n", | |
"... ... ... ... ... ... ... ... \n", | |
"99995 0.650366 0.296693 -0.648191 -0.027173 -1.368694 -1.414733 -0.098120 \n", | |
"99996 -1.025004 -0.033401 0.352300 2.029560 -1.081474 0.372101 -0.231158 \n", | |
"99997 -0.799553 -0.302133 -1.528081 0.159766 -0.194533 -0.942236 0.175192 \n", | |
"99998 -1.153394 -3.726555 1.006661 1.938871 -2.395232 0.255857 -0.609283 \n", | |
"99999 1.461720 0.397803 0.186679 -0.155946 -0.427969 -0.385579 -0.359528 \n", | |
"\n", | |
" 7 8 9 ... 90 91 92 \\\n", | |
"0 0.910320 -0.887256 -1.248485 ... 0.832916 -0.020408 1.140463 \n", | |
"1 -0.134027 1.280353 -0.770996 ... 0.028680 -0.224805 0.857354 \n", | |
"2 1.038086 0.575656 0.928458 ... 0.211374 1.911427 -0.383501 \n", | |
"3 1.804491 -1.255713 -0.974349 ... 0.435610 1.250981 1.484357 \n", | |
"4 2.688526 0.764125 -0.687951 ... -2.689082 1.038927 -3.161205 \n", | |
"... ... ... ... ... ... ... ... \n", | |
"99995 0.850049 0.931312 0.577836 ... -1.738731 0.826997 1.199955 \n", | |
"99996 0.129621 0.332448 -1.622283 ... 0.491512 -0.619863 0.921266 \n", | |
"99997 0.295742 2.772861 0.499414 ... 0.162933 0.919790 1.617661 \n", | |
"99998 -0.351652 0.443662 -1.992509 ... 1.074920 1.201669 0.665734 \n", | |
"99999 -0.491198 0.733553 -0.349641 ... 0.370914 0.988451 1.326340 \n", | |
"\n", | |
" 93 94 95 96 97 98 99 \n", | |
"0 1.465279 0.333421 -1.686267 0.182302 0.260248 -0.056024 -0.645074 \n", | |
"1 0.462102 -0.786376 2.388081 -1.658082 -0.539284 -0.880693 0.421976 \n", | |
"2 0.712876 -1.345929 -1.054745 -0.502296 0.255767 -0.555215 -0.374579 \n", | |
"3 -0.528372 -0.226552 -0.891205 1.306358 1.130840 0.204551 1.501261 \n", | |
"4 1.209361 -0.064672 1.074549 0.433635 -0.492203 1.134994 0.922045 \n", | |
"... ... ... ... ... ... ... ... \n", | |
"99995 0.048497 -0.452846 1.593279 -1.461541 1.250952 0.765631 1.259925 \n", | |
"99996 0.258329 -0.385118 0.865576 0.142066 0.249012 1.083895 0.424546 \n", | |
"99997 -1.441009 -0.930747 -1.212881 -1.595667 -0.378324 -0.644536 -0.979022 \n", | |
"99998 -0.600312 1.116904 -0.682997 -0.901353 2.203541 0.322509 -0.478056 \n", | |
"99999 0.467480 0.635124 0.622167 0.377851 0.057745 0.370907 0.117684 \n", | |
"\n", | |
"[100000 rows x 100 columns]" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len(df1._data.blocks)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df2 = pd.DataFrame(arr, policy=\"split\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" <th>5</th>\n", | |
" <th>6</th>\n", | |
" <th>7</th>\n", | |
" <th>8</th>\n", | |
" <th>9</th>\n", | |
" <th>...</th>\n", | |
" <th>90</th>\n", | |
" <th>91</th>\n", | |
" <th>92</th>\n", | |
" <th>93</th>\n", | |
" <th>94</th>\n", | |
" <th>95</th>\n", | |
" <th>96</th>\n", | |
" <th>97</th>\n", | |
" <th>98</th>\n", | |
" <th>99</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0.371022</td>\n", | |
" <td>-0.242209</td>\n", | |
" <td>2.113007</td>\n", | |
" <td>0.065577</td>\n", | |
" <td>-1.395332</td>\n", | |
" <td>0.345039</td>\n", | |
" <td>0.430615</td>\n", | |
" <td>0.910320</td>\n", | |
" <td>-0.887256</td>\n", | |
" <td>-1.248485</td>\n", | |
" <td>...</td>\n", | |
" <td>0.832916</td>\n", | |
" <td>-0.020408</td>\n", | |
" <td>1.140463</td>\n", | |
" <td>1.465279</td>\n", | |
" <td>0.333421</td>\n", | |
" <td>-1.686267</td>\n", | |
" <td>0.182302</td>\n", | |
" <td>0.260248</td>\n", | |
" <td>-0.056024</td>\n", | |
" <td>-0.645074</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0.813260</td>\n", | |
" <td>0.936914</td>\n", | |
" <td>-0.210771</td>\n", | |
" <td>0.564948</td>\n", | |
" <td>0.019599</td>\n", | |
" <td>-0.166271</td>\n", | |
" <td>0.251355</td>\n", | |
" <td>-0.134027</td>\n", | |
" <td>1.280353</td>\n", | |
" <td>-0.770996</td>\n", | |
" <td>...</td>\n", | |
" <td>0.028680</td>\n", | |
" <td>-0.224805</td>\n", | |
" <td>0.857354</td>\n", | |
" <td>0.462102</td>\n", | |
" <td>-0.786376</td>\n", | |
" <td>2.388081</td>\n", | |
" <td>-1.658082</td>\n", | |
" <td>-0.539284</td>\n", | |
" <td>-0.880693</td>\n", | |
" <td>0.421976</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0.455793</td>\n", | |
" <td>-0.381480</td>\n", | |
" <td>-0.502506</td>\n", | |
" <td>0.602100</td>\n", | |
" <td>0.293811</td>\n", | |
" <td>-0.370855</td>\n", | |
" <td>0.135328</td>\n", | |
" <td>1.038086</td>\n", | |
" <td>0.575656</td>\n", | |
" <td>0.928458</td>\n", | |
" <td>...</td>\n", | |
" <td>0.211374</td>\n", | |
" <td>1.911427</td>\n", | |
" <td>-0.383501</td>\n", | |
" <td>0.712876</td>\n", | |
" <td>-1.345929</td>\n", | |
" <td>-1.054745</td>\n", | |
" <td>-0.502296</td>\n", | |
" <td>0.255767</td>\n", | |
" <td>-0.555215</td>\n", | |
" <td>-0.374579</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0.055037</td>\n", | |
" <td>0.943844</td>\n", | |
" <td>1.186883</td>\n", | |
" <td>-0.097353</td>\n", | |
" <td>0.453953</td>\n", | |
" <td>0.336523</td>\n", | |
" <td>-0.227028</td>\n", | |
" <td>1.804491</td>\n", | |
" <td>-1.255713</td>\n", | |
" <td>-0.974349</td>\n", | |
" <td>...</td>\n", | |
" <td>0.435610</td>\n", | |
" <td>1.250981</td>\n", | |
" <td>1.484357</td>\n", | |
" <td>-0.528372</td>\n", | |
" <td>-0.226552</td>\n", | |
" <td>-0.891205</td>\n", | |
" <td>1.306358</td>\n", | |
" <td>1.130840</td>\n", | |
" <td>0.204551</td>\n", | |
" <td>1.501261</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0.429316</td>\n", | |
" <td>1.937749</td>\n", | |
" <td>0.980708</td>\n", | |
" <td>-0.124727</td>\n", | |
" <td>1.250186</td>\n", | |
" <td>0.181841</td>\n", | |
" <td>-1.260060</td>\n", | |
" <td>2.688526</td>\n", | |
" <td>0.764125</td>\n", | |
" <td>-0.687951</td>\n", | |
" <td>...</td>\n", | |
" <td>-2.689082</td>\n", | |
" <td>1.038927</td>\n", | |
" <td>-3.161205</td>\n", | |
" <td>1.209361</td>\n", | |
" <td>-0.064672</td>\n", | |
" <td>1.074549</td>\n", | |
" <td>0.433635</td>\n", | |
" <td>-0.492203</td>\n", | |
" <td>1.134994</td>\n", | |
" <td>0.922045</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99995</th>\n", | |
" <td>0.650366</td>\n", | |
" <td>0.296693</td>\n", | |
" <td>-0.648191</td>\n", | |
" <td>-0.027173</td>\n", | |
" <td>-1.368694</td>\n", | |
" <td>-1.414733</td>\n", | |
" <td>-0.098120</td>\n", | |
" <td>0.850049</td>\n", | |
" <td>0.931312</td>\n", | |
" <td>0.577836</td>\n", | |
" <td>...</td>\n", | |
" <td>-1.738731</td>\n", | |
" <td>0.826997</td>\n", | |
" <td>1.199955</td>\n", | |
" <td>0.048497</td>\n", | |
" <td>-0.452846</td>\n", | |
" <td>1.593279</td>\n", | |
" <td>-1.461541</td>\n", | |
" <td>1.250952</td>\n", | |
" <td>0.765631</td>\n", | |
" <td>1.259925</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99996</th>\n", | |
" <td>-1.025004</td>\n", | |
" <td>-0.033401</td>\n", | |
" <td>0.352300</td>\n", | |
" <td>2.029560</td>\n", | |
" <td>-1.081474</td>\n", | |
" <td>0.372101</td>\n", | |
" <td>-0.231158</td>\n", | |
" <td>0.129621</td>\n", | |
" <td>0.332448</td>\n", | |
" <td>-1.622283</td>\n", | |
" <td>...</td>\n", | |
" <td>0.491512</td>\n", | |
" <td>-0.619863</td>\n", | |
" <td>0.921266</td>\n", | |
" <td>0.258329</td>\n", | |
" <td>-0.385118</td>\n", | |
" <td>0.865576</td>\n", | |
" <td>0.142066</td>\n", | |
" <td>0.249012</td>\n", | |
" <td>1.083895</td>\n", | |
" <td>0.424546</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99997</th>\n", | |
" <td>-0.799553</td>\n", | |
" <td>-0.302133</td>\n", | |
" <td>-1.528081</td>\n", | |
" <td>0.159766</td>\n", | |
" <td>-0.194533</td>\n", | |
" <td>-0.942236</td>\n", | |
" <td>0.175192</td>\n", | |
" <td>0.295742</td>\n", | |
" <td>2.772861</td>\n", | |
" <td>0.499414</td>\n", | |
" <td>...</td>\n", | |
" <td>0.162933</td>\n", | |
" <td>0.919790</td>\n", | |
" <td>1.617661</td>\n", | |
" <td>-1.441009</td>\n", | |
" <td>-0.930747</td>\n", | |
" <td>-1.212881</td>\n", | |
" <td>-1.595667</td>\n", | |
" <td>-0.378324</td>\n", | |
" <td>-0.644536</td>\n", | |
" <td>-0.979022</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99998</th>\n", | |
" <td>-1.153394</td>\n", | |
" <td>-3.726555</td>\n", | |
" <td>1.006661</td>\n", | |
" <td>1.938871</td>\n", | |
" <td>-2.395232</td>\n", | |
" <td>0.255857</td>\n", | |
" <td>-0.609283</td>\n", | |
" <td>-0.351652</td>\n", | |
" <td>0.443662</td>\n", | |
" <td>-1.992509</td>\n", | |
" <td>...</td>\n", | |
" <td>1.074920</td>\n", | |
" <td>1.201669</td>\n", | |
" <td>0.665734</td>\n", | |
" <td>-0.600312</td>\n", | |
" <td>1.116904</td>\n", | |
" <td>-0.682997</td>\n", | |
" <td>-0.901353</td>\n", | |
" <td>2.203541</td>\n", | |
" <td>0.322509</td>\n", | |
" <td>-0.478056</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99999</th>\n", | |
" <td>1.461720</td>\n", | |
" <td>0.397803</td>\n", | |
" <td>0.186679</td>\n", | |
" <td>-0.155946</td>\n", | |
" <td>-0.427969</td>\n", | |
" <td>-0.385579</td>\n", | |
" <td>-0.359528</td>\n", | |
" <td>-0.491198</td>\n", | |
" <td>0.733553</td>\n", | |
" <td>-0.349641</td>\n", | |
" <td>...</td>\n", | |
" <td>0.370914</td>\n", | |
" <td>0.988451</td>\n", | |
" <td>1.326340</td>\n", | |
" <td>0.467480</td>\n", | |
" <td>0.635124</td>\n", | |
" <td>0.622167</td>\n", | |
" <td>0.377851</td>\n", | |
" <td>0.057745</td>\n", | |
" <td>0.370907</td>\n", | |
" <td>0.117684</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>100000 rows × 100 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2 3 4 5 6 \\\n", | |
"0 0.371022 -0.242209 2.113007 0.065577 -1.395332 0.345039 0.430615 \n", | |
"1 0.813260 0.936914 -0.210771 0.564948 0.019599 -0.166271 0.251355 \n", | |
"2 0.455793 -0.381480 -0.502506 0.602100 0.293811 -0.370855 0.135328 \n", | |
"3 0.055037 0.943844 1.186883 -0.097353 0.453953 0.336523 -0.227028 \n", | |
"4 0.429316 1.937749 0.980708 -0.124727 1.250186 0.181841 -1.260060 \n", | |
"... ... ... ... ... ... ... ... \n", | |
"99995 0.650366 0.296693 -0.648191 -0.027173 -1.368694 -1.414733 -0.098120 \n", | |
"99996 -1.025004 -0.033401 0.352300 2.029560 -1.081474 0.372101 -0.231158 \n", | |
"99997 -0.799553 -0.302133 -1.528081 0.159766 -0.194533 -0.942236 0.175192 \n", | |
"99998 -1.153394 -3.726555 1.006661 1.938871 -2.395232 0.255857 -0.609283 \n", | |
"99999 1.461720 0.397803 0.186679 -0.155946 -0.427969 -0.385579 -0.359528 \n", | |
"\n", | |
" 7 8 9 ... 90 91 92 \\\n", | |
"0 0.910320 -0.887256 -1.248485 ... 0.832916 -0.020408 1.140463 \n", | |
"1 -0.134027 1.280353 -0.770996 ... 0.028680 -0.224805 0.857354 \n", | |
"2 1.038086 0.575656 0.928458 ... 0.211374 1.911427 -0.383501 \n", | |
"3 1.804491 -1.255713 -0.974349 ... 0.435610 1.250981 1.484357 \n", | |
"4 2.688526 0.764125 -0.687951 ... -2.689082 1.038927 -3.161205 \n", | |
"... ... ... ... ... ... ... ... \n", | |
"99995 0.850049 0.931312 0.577836 ... -1.738731 0.826997 1.199955 \n", | |
"99996 0.129621 0.332448 -1.622283 ... 0.491512 -0.619863 0.921266 \n", | |
"99997 0.295742 2.772861 0.499414 ... 0.162933 0.919790 1.617661 \n", | |
"99998 -0.351652 0.443662 -1.992509 ... 1.074920 1.201669 0.665734 \n", | |
"99999 -0.491198 0.733553 -0.349641 ... 0.370914 0.988451 1.326340 \n", | |
"\n", | |
" 93 94 95 96 97 98 99 \n", | |
"0 1.465279 0.333421 -1.686267 0.182302 0.260248 -0.056024 -0.645074 \n", | |
"1 0.462102 -0.786376 2.388081 -1.658082 -0.539284 -0.880693 0.421976 \n", | |
"2 0.712876 -1.345929 -1.054745 -0.502296 0.255767 -0.555215 -0.374579 \n", | |
"3 -0.528372 -0.226552 -0.891205 1.306358 1.130840 0.204551 1.501261 \n", | |
"4 1.209361 -0.064672 1.074549 0.433635 -0.492203 1.134994 0.922045 \n", | |
"... ... ... ... ... ... ... ... \n", | |
"99995 0.048497 -0.452846 1.593279 -1.461541 1.250952 0.765631 1.259925 \n", | |
"99996 0.258329 -0.385118 0.865576 0.142066 0.249012 1.083895 0.424546 \n", | |
"99997 -1.441009 -0.930747 -1.212881 -1.595667 -0.378324 -0.644536 -0.979022 \n", | |
"99998 -0.600312 1.116904 -0.682997 -0.901353 2.203541 0.322509 -0.478056 \n", | |
"99999 0.467480 0.635124 0.622167 0.377851 0.057745 0.370907 0.117684 \n", | |
"\n", | |
"[100000 rows x 100 columns]" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"100" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len(df2._data.blocks)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Arithmetic with a scalar" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Check that this operation preserves the (non-)consolidation of the dataframe:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"res1 = df1 + 1\n", | |
"len(res1._data.blocks)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"100" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"res2 = df2 + 1\n", | |
"len(res2._data.blocks)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Timing:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"27 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df1 + 1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"29.1 ms ± 445 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2 + 1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So this is a bit slower (as expected, as the numpy operation is run on multiple arrays instead of on 1), but it's also not a huge difference." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Arithmetic with other dataframes" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len((df1 + df1)._data.blocks)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"100" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len((df2 + df2)._data.blocks)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"28.1 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df1 + df1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"37 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2 + df2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Again, a bit slower. A larger difference in this case, but if we look at the profile of this:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" " | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
" 208414 function calls (206374 primitive calls) in 0.383 seconds\n", | |
"\n", | |
" Ordered by: cumulative time\n", | |
" List reduced from 123 to 20 due to restriction <20>\n", | |
"\n", | |
" ncalls tottime percall cumtime percall filename:lineno(function)\n", | |
" 1 0.000 0.000 0.383 0.383 {built-in method builtins.exec}\n", | |
" 1 0.027 0.027 0.383 0.383 <string>:1(<module>)\n", | |
" 1 0.000 0.000 0.357 0.357 <string>:1(<listcomp>)\n", | |
" 10 0.000 0.000 0.357 0.036 __init__.py:615(f)\n", | |
" 10 0.000 0.000 0.356 0.036 frame.py:5751(_combine_frame)\n", | |
" 10 0.000 0.000 0.356 0.036 __init__.py:242(dispatch_to_series)\n", | |
" 10 0.000 0.000 0.356 0.036 managers.py:374(operate_blockwise)\n", | |
" 10 0.007 0.001 0.356 0.036 ops.py:12(operate_blockwise)\n", | |
" 1000 0.004 0.000 0.264 0.000 array_ops.py:159(arithmetic_op)\n", | |
" 1000 0.003 0.000 0.243 0.000 array_ops.py:118(na_arithmetic_op)\n", | |
" 1000 0.001 0.000 0.240 0.000 expressions.py:214(evaluate)\n", | |
" 1000 0.003 0.000 0.236 0.000 expressions.py:61(_evaluate_standard)\n", | |
" 1000 0.225 0.000 0.225 0.000 {built-in method _operator.add}\n", | |
" 1000 0.013 0.000 0.036 0.000 managers.py:1290(_slice_take_blocks_ax0)\n", | |
" 1000 0.001 0.000 0.034 0.000 blocks.py:344(_split_op_result)\n", | |
" 1000 0.001 0.000 0.030 0.000 blocks.py:236(make_block)\n", | |
" 1000 0.002 0.000 0.028 0.000 blocks.py:2712(make_block)\n", | |
" 1000 0.003 0.000 0.022 0.000 blocks.py:2666(get_block_type)\n", | |
" 39250 0.008 0.000 0.013 0.000 {built-in method builtins.isinstance}\n", | |
" 4000 0.006 0.000 0.013 0.000 _ufunc_config.py:39(seterr)" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"%prun -l 20 -s cumtime [df2 + df2 for _ in range(10)]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that the functions to deal with aligning the blocks (`_slice_take_blocks_ax0` and `_split_op_result`; their cumtime can be summed since they are not nested) take about 20% of the total time. This should easily be eliminated in the case of 1D blocks, since the blocks are always aligned." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We now check the performance difference with mixed dtypes (since the non-consolidated block could potentially be faster here, since the blocks don't need to be splitted to align for the operation). A few different cases:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# make a second frame where half of it are integer columns\n", | |
"df1_b = df1.copy()\n", | |
"df1_b.iloc[:, 50] = df1_b.iloc[:, 50].astype(int)\n", | |
"df1_b.iloc[:, 51:] = df1_b.iloc[:, 51:].astype(int)\n", | |
"df1_b._consolidate_inplace()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"34.3 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df1 + df1_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df2_b = df2.copy()\n", | |
"df2_b._policy = \"split\"\n", | |
"df2_b.iloc[:, 50] = df2_b.iloc[:, 50].astype(int)\n", | |
"df2_b.iloc[:, 51:] = df2_b.iloc[:, 51:].astype(int)\n", | |
"df2_b._consolidate_inplace()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"39.3 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2 + df2_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"res1 = df1 + df1_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"res2 = df2 + df2_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pd.testing.assert_frame_equal(res1, res2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Again only a small difference (and would need to look at a profile if it could be easily avoided).\n", | |
"\n", | |
"Second example with and integer dataframe more some random columns as floats:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"arr2 = np.random.randint(0, 1000, (100_000, 100))\n", | |
"dfint1 = pd.DataFrame(arr2, policy=\"block\") # the current default\n", | |
"dfint1 = dfint1.copy()\n", | |
"dfint2 = pd.DataFrame(arr2, policy=\"split\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"dfint1_a = dfint1.copy()\n", | |
"dfint1_a.iloc[0, [10, 15, 26]] = np.nan\n", | |
"\n", | |
"dfint1_b = dfint1.copy()\n", | |
"dfint1_b.iloc[0, [13, 52]] = np.nan" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"53 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit dfint1_a + dfint1_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"dfint2_a = dfint2.copy()\n", | |
"dfint2_a._policy = \"split\"\n", | |
"dfint2_a.iloc[0, [10, 15, 26]] = np.nan\n", | |
"\n", | |
"dfint2_b = dfint2.copy()\n", | |
"dfint2_b._policy = \"split\"\n", | |
"dfint2_b.iloc[0, [13, 52]] = np.nan" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"37 ms ± 430 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit dfint2_a + dfint2_b" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In this case, column-wise is actually faster." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Column-wise reduction" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"72.7 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df1.sum()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"22.6 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2.sum()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"52.4 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df1_orig.sum() # the dataframe with the original array layout memory" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The fact that column-wise is faster here (and a lot) is a bit strange (also the fact that `df1_orig.sum()` is faster is strange, that might indicate a performance bug in the current implementation). Based on a quick profile with `%snakeviz`, it seems that the `_get_values` function in `nanops.py` is taking a larger portion for the block case (so something to investigate if we keep the block consolidation)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Additionally, we can also check the extension array version, which takes a more efficient column-wise code path, and has a more efficient sum implementation, which together gives a large speedup compared to the original block-wise dataframe:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df3 = df2.astype(\"Float64\")\n", | |
"df3._policy = \"split\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"9.78 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df3.sum()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Row-wise reduction" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If we look at a row-wise reduction instead, the block-wise dataframe is of course faster:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"64.6 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df1.sum(axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"93 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2.sum(axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"However, if we wanted, this is probably also something that could be optimized, if we think that is worth it (it's certainly a much less common operation as column-wise reductions). \n", | |
"\n", | |
"A very crude appraoch for this specific case of sum gives the same result but is much faster:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"7.92 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit \n", | |
"arrays = df2._iter_column_arrays()\n", | |
"res = next(arrays)\n", | |
"\n", | |
"for a in arrays:\n", | |
" res = res + a\n", | |
"\n", | |
"res = pd.Series(res, index=df2.index)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's ensure it gives the same result:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"arrays = df2._iter_column_arrays()\n", | |
"res = next(arrays)\n", | |
"\n", | |
"for a in arrays:\n", | |
" res = res + a\n", | |
"\n", | |
"res = pd.Series(res, index=df2.index)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pd.testing.assert_series_equal(df2.sum(axis=1), res)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Slicing rows" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"21 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df1.iloc[0:1000]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"243 µs ± 9.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2.iloc[0:1000]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" \n", | |
"*** Profile stats marshalled to file '/tmp/tmph1qm7s01'. \n", | |
"Opening SnakeViz in a new tab...\n" | |
] | |
} | |
], | |
"source": [ | |
"%snakeviz -t df2.iloc[0:1000]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is a lot slower (at least relatively, it are also much smaller timings as the other operations in this notebook).\n", | |
"\n", | |
"But, let's implement a simple version of this to see why this is the case:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def slice_rows(df):\n", | |
" from pandas.core.internals import create_block_manager_from_blocks\n", | |
"\n", | |
" res = []\n", | |
"\n", | |
" for block in df._data.blocks:\n", | |
" new_values = block.values[:, 0:1000]\n", | |
" new_block = block.make_block_same_class(new_values)\n", | |
" res.append(new_block)\n", | |
"\n", | |
" new_index = df.index[0:1000]\n", | |
" mgr = create_block_manager_from_blocks(res, [df.columns, new_index], policy=\"split\")\n", | |
" return pd.DataFrame(mgr)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"226 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit slice_rows(df2)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Timer unit: 1e-06 s\n", | |
"\n", | |
"Total time: 0.930681 s\n", | |
"File: <ipython-input-45-e7729d6d7e88>\n", | |
"Function: slice_rows at line 1\n", | |
"\n", | |
"Line # Hits Time Per Hit % Time Line Contents\n", | |
"==============================================================\n", | |
" 1 def slice_rows(df):\n", | |
" 2 1000 4093.0 4.1 0.4 from pandas.core.internals import create_block_manager_from_blocks\n", | |
" 3 \n", | |
" 4 1000 440.0 0.4 0.0 res = []\n", | |
" 5 \n", | |
" 6 101000 36873.0 0.4 4.0 for block in df._data.blocks:\n", | |
" 7 100000 119689.0 1.2 12.9 new_values = block.values[:, 0:1000]\n", | |
" 8 100000 399894.0 4.0 43.0 new_block = block.make_block_same_class(new_values)\n", | |
" 9 100000 48970.0 0.5 5.3 res.append(new_block)\n", | |
" 10 \n", | |
" 11 1000 11357.0 11.4 1.2 new_index = df.index[0:1000]\n", | |
" 12 1000 300365.0 300.4 32.3 mgr = create_block_manager_from_blocks(res, [df.columns, new_index], policy=\"split\")\n", | |
" 13 1000 9000.0 9.0 1.0 return pd.DataFrame(mgr)" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"%lprun -f slice_rows [slice_rows(df2) for _ in range(1000)]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So we notice that the actual slicing operation on all columns (`new_values = block.values[:, 0:1000]`) is only ca 10-15% of the overall timing. If we can trim down the overhead on creating new blocks / creating a blockmanager from those blocks (which should be possible in a simple case like this where the dtypes or the number of columns doesn't change), we should also be able to get this performance closer to the block-wise slicing." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Adding a column" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I am including the `_consolidate_inplace` here, as this is something that would otherwise often happen in a next operation:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"23.4 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit df = pd.DataFrame(arr, policy=\"block\")\n", | |
"df[100] = 1.0\n", | |
"df._consolidate_inplace()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"108 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit df = pd.DataFrame(arr, policy=\"split\")\n", | |
"df[100] = 1.0\n", | |
"df._consolidate_inplace()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"When including the consolidation, it is of course much cheaper to add a column for a non-consolidated dataframe. \n", | |
"It's not necessarily fully fair comparison, as you typically only add a column once / a few times, and may do many operations on the dataframe afterwards (so the consolidation might pay off), but it's nevertheless a use case that would improve a lot with non-consolidated dataframes." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python (dev)", | |
"language": "python", | |
"name": "dev" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment