{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Explore impact of pandas' Block consolidation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See https://github.com/pandas-dev/pandas/issues/10556 for context. \n",
"\n",
"This notebook is using a branch of pandas that adds a \"consolidation policy\" which lets us explore the differences a bit (https://github.com/jorisvandenbossche/pandas/commit/a6c744d38405b6645e076f63ac34517b7e960345, and in addition I also added the Floating extension array changes from https://github.com/pandas-dev/pandas/pull/34307)\n",
"\n",
"**Note** This notebook includes some benchmarks. However, the results of the timings are highly dependent on the characteristics of the data that is being used. I am now using a DataFrame with 100 columns and a 1000:1 rows/column ratio, which I think is a quite typical scenario. But when tweaking this (more or less columns, different ratio), this easily changes the results / interpretation of the benchmarks."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pd.options.compute.use_numexpr = False\n",
"pd.options.compute.use_bottleneck = False"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%load_ext snakeviz"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%load_ext line_profiler"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"arr = np.random.randn(100_000, 100)"
]
},
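{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted in the introduction, the timings depend on the shape of the data. The small helper below (an illustrative convenience added here, not part of the pandas branch) rebuilds the same pair of frames that the next cells create, for an arbitrary rows/columns ratio, so the benchmarks could be re-run with other shapes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def make_frames(n_rows=100_000, n_cols=100, seed=None):\n",
"    # Build one frame with the consolidated \"block\" policy and one with the\n",
"    # column-per-array \"split\" policy (the policy keyword comes from the\n",
"    # branch used in this notebook).\n",
"    rng = np.random.default_rng(seed)\n",
"    values = rng.standard_normal((n_rows, n_cols))\n",
"    df_block = pd.DataFrame(values, policy=\"block\").copy()\n",
"    df_split = pd.DataFrame(values, policy=\"split\")\n",
"    return df_block, df_split\n",
"\n",
"# Example: df1_alt, df2_alt = make_frames(10_000, 1_000)"
]
},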
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df1_orig = pd.DataFrame(arr, policy=\"block\") # the current default\n",
"# additional copy to ensure the memory is column-aligned\n",
"df1 = df1_orig.copy()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>90</th>\n",
" <th>91</th>\n",
" <th>92</th>\n",
" <th>93</th>\n",
" <th>94</th>\n",
" <th>95</th>\n",
" <th>96</th>\n",
" <th>97</th>\n",
" <th>98</th>\n",
" <th>99</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.371022</td>\n",
" <td>-0.242209</td>\n",
" <td>2.113007</td>\n",
" <td>0.065577</td>\n",
" <td>-1.395332</td>\n",
" <td>0.345039</td>\n",
" <td>0.430615</td>\n",
" <td>0.910320</td>\n",
" <td>-0.887256</td>\n",
" <td>-1.248485</td>\n",
" <td>...</td>\n",
" <td>0.832916</td>\n",
" <td>-0.020408</td>\n",
" <td>1.140463</td>\n",
" <td>1.465279</td>\n",
" <td>0.333421</td>\n",
" <td>-1.686267</td>\n",
" <td>0.182302</td>\n",
" <td>0.260248</td>\n",
" <td>-0.056024</td>\n",
" <td>-0.645074</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.813260</td>\n",
" <td>0.936914</td>\n",
" <td>-0.210771</td>\n",
" <td>0.564948</td>\n",
" <td>0.019599</td>\n",
" <td>-0.166271</td>\n",
" <td>0.251355</td>\n",
" <td>-0.134027</td>\n",
" <td>1.280353</td>\n",
" <td>-0.770996</td>\n",
" <td>...</td>\n",
" <td>0.028680</td>\n",
" <td>-0.224805</td>\n",
" <td>0.857354</td>\n",
" <td>0.462102</td>\n",
" <td>-0.786376</td>\n",
" <td>2.388081</td>\n",
" <td>-1.658082</td>\n",
" <td>-0.539284</td>\n",
" <td>-0.880693</td>\n",
" <td>0.421976</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.455793</td>\n",
" <td>-0.381480</td>\n",
" <td>-0.502506</td>\n",
" <td>0.602100</td>\n",
" <td>0.293811</td>\n",
" <td>-0.370855</td>\n",
" <td>0.135328</td>\n",
" <td>1.038086</td>\n",
" <td>0.575656</td>\n",
" <td>0.928458</td>\n",
" <td>...</td>\n",
" <td>0.211374</td>\n",
" <td>1.911427</td>\n",
" <td>-0.383501</td>\n",
" <td>0.712876</td>\n",
" <td>-1.345929</td>\n",
" <td>-1.054745</td>\n",
" <td>-0.502296</td>\n",
" <td>0.255767</td>\n",
" <td>-0.555215</td>\n",
" <td>-0.374579</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.055037</td>\n",
" <td>0.943844</td>\n",
" <td>1.186883</td>\n",
" <td>-0.097353</td>\n",
" <td>0.453953</td>\n",
" <td>0.336523</td>\n",
" <td>-0.227028</td>\n",
" <td>1.804491</td>\n",
" <td>-1.255713</td>\n",
" <td>-0.974349</td>\n",
" <td>...</td>\n",
" <td>0.435610</td>\n",
" <td>1.250981</td>\n",
" <td>1.484357</td>\n",
" <td>-0.528372</td>\n",
" <td>-0.226552</td>\n",
" <td>-0.891205</td>\n",
" <td>1.306358</td>\n",
" <td>1.130840</td>\n",
" <td>0.204551</td>\n",
" <td>1.501261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.429316</td>\n",
" <td>1.937749</td>\n",
" <td>0.980708</td>\n",
" <td>-0.124727</td>\n",
" <td>1.250186</td>\n",
" <td>0.181841</td>\n",
" <td>-1.260060</td>\n",
" <td>2.688526</td>\n",
" <td>0.764125</td>\n",
" <td>-0.687951</td>\n",
" <td>...</td>\n",
" <td>-2.689082</td>\n",
" <td>1.038927</td>\n",
" <td>-3.161205</td>\n",
" <td>1.209361</td>\n",
" <td>-0.064672</td>\n",
" <td>1.074549</td>\n",
" <td>0.433635</td>\n",
" <td>-0.492203</td>\n",
" <td>1.134994</td>\n",
" <td>0.922045</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99995</th>\n",
" <td>0.650366</td>\n",
" <td>0.296693</td>\n",
" <td>-0.648191</td>\n",
" <td>-0.027173</td>\n",
" <td>-1.368694</td>\n",
" <td>-1.414733</td>\n",
" <td>-0.098120</td>\n",
" <td>0.850049</td>\n",
" <td>0.931312</td>\n",
" <td>0.577836</td>\n",
" <td>...</td>\n",
" <td>-1.738731</td>\n",
" <td>0.826997</td>\n",
" <td>1.199955</td>\n",
" <td>0.048497</td>\n",
" <td>-0.452846</td>\n",
" <td>1.593279</td>\n",
" <td>-1.461541</td>\n",
" <td>1.250952</td>\n",
" <td>0.765631</td>\n",
" <td>1.259925</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99996</th>\n",
" <td>-1.025004</td>\n",
" <td>-0.033401</td>\n",
" <td>0.352300</td>\n",
" <td>2.029560</td>\n",
" <td>-1.081474</td>\n",
" <td>0.372101</td>\n",
" <td>-0.231158</td>\n",
" <td>0.129621</td>\n",
" <td>0.332448</td>\n",
" <td>-1.622283</td>\n",
" <td>...</td>\n",
" <td>0.491512</td>\n",
" <td>-0.619863</td>\n",
" <td>0.921266</td>\n",
" <td>0.258329</td>\n",
" <td>-0.385118</td>\n",
" <td>0.865576</td>\n",
" <td>0.142066</td>\n",
" <td>0.249012</td>\n",
" <td>1.083895</td>\n",
" <td>0.424546</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99997</th>\n",
" <td>-0.799553</td>\n",
" <td>-0.302133</td>\n",
" <td>-1.528081</td>\n",
" <td>0.159766</td>\n",
" <td>-0.194533</td>\n",
" <td>-0.942236</td>\n",
" <td>0.175192</td>\n",
" <td>0.295742</td>\n",
" <td>2.772861</td>\n",
" <td>0.499414</td>\n",
" <td>...</td>\n",
" <td>0.162933</td>\n",
" <td>0.919790</td>\n",
" <td>1.617661</td>\n",
" <td>-1.441009</td>\n",
" <td>-0.930747</td>\n",
" <td>-1.212881</td>\n",
" <td>-1.595667</td>\n",
" <td>-0.378324</td>\n",
" <td>-0.644536</td>\n",
" <td>-0.979022</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99998</th>\n",
" <td>-1.153394</td>\n",
" <td>-3.726555</td>\n",
" <td>1.006661</td>\n",
" <td>1.938871</td>\n",
" <td>-2.395232</td>\n",
" <td>0.255857</td>\n",
" <td>-0.609283</td>\n",
" <td>-0.351652</td>\n",
" <td>0.443662</td>\n",
" <td>-1.992509</td>\n",
" <td>...</td>\n",
" <td>1.074920</td>\n",
" <td>1.201669</td>\n",
" <td>0.665734</td>\n",
" <td>-0.600312</td>\n",
" <td>1.116904</td>\n",
" <td>-0.682997</td>\n",
" <td>-0.901353</td>\n",
" <td>2.203541</td>\n",
" <td>0.322509</td>\n",
" <td>-0.478056</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99999</th>\n",
" <td>1.461720</td>\n",
" <td>0.397803</td>\n",
" <td>0.186679</td>\n",
" <td>-0.155946</td>\n",
" <td>-0.427969</td>\n",
" <td>-0.385579</td>\n",
" <td>-0.359528</td>\n",
" <td>-0.491198</td>\n",
" <td>0.733553</td>\n",
" <td>-0.349641</td>\n",
" <td>...</td>\n",
" <td>0.370914</td>\n",
" <td>0.988451</td>\n",
" <td>1.326340</td>\n",
" <td>0.467480</td>\n",
" <td>0.635124</td>\n",
" <td>0.622167</td>\n",
" <td>0.377851</td>\n",
" <td>0.057745</td>\n",
" <td>0.370907</td>\n",
" <td>0.117684</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>100000 rows × 100 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 0.371022 -0.242209 2.113007 0.065577 -1.395332 0.345039 0.430615 \n",
"1 0.813260 0.936914 -0.210771 0.564948 0.019599 -0.166271 0.251355 \n",
"2 0.455793 -0.381480 -0.502506 0.602100 0.293811 -0.370855 0.135328 \n",
"3 0.055037 0.943844 1.186883 -0.097353 0.453953 0.336523 -0.227028 \n",
"4 0.429316 1.937749 0.980708 -0.124727 1.250186 0.181841 -1.260060 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.650366 0.296693 -0.648191 -0.027173 -1.368694 -1.414733 -0.098120 \n",
"99996 -1.025004 -0.033401 0.352300 2.029560 -1.081474 0.372101 -0.231158 \n",
"99997 -0.799553 -0.302133 -1.528081 0.159766 -0.194533 -0.942236 0.175192 \n",
"99998 -1.153394 -3.726555 1.006661 1.938871 -2.395232 0.255857 -0.609283 \n",
"99999 1.461720 0.397803 0.186679 -0.155946 -0.427969 -0.385579 -0.359528 \n",
"\n",
" 7 8 9 ... 90 91 92 \\\n",
"0 0.910320 -0.887256 -1.248485 ... 0.832916 -0.020408 1.140463 \n",
"1 -0.134027 1.280353 -0.770996 ... 0.028680 -0.224805 0.857354 \n",
"2 1.038086 0.575656 0.928458 ... 0.211374 1.911427 -0.383501 \n",
"3 1.804491 -1.255713 -0.974349 ... 0.435610 1.250981 1.484357 \n",
"4 2.688526 0.764125 -0.687951 ... -2.689082 1.038927 -3.161205 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.850049 0.931312 0.577836 ... -1.738731 0.826997 1.199955 \n",
"99996 0.129621 0.332448 -1.622283 ... 0.491512 -0.619863 0.921266 \n",
"99997 0.295742 2.772861 0.499414 ... 0.162933 0.919790 1.617661 \n",
"99998 -0.351652 0.443662 -1.992509 ... 1.074920 1.201669 0.665734 \n",
"99999 -0.491198 0.733553 -0.349641 ... 0.370914 0.988451 1.326340 \n",
"\n",
" 93 94 95 96 97 98 99 \n",
"0 1.465279 0.333421 -1.686267 0.182302 0.260248 -0.056024 -0.645074 \n",
"1 0.462102 -0.786376 2.388081 -1.658082 -0.539284 -0.880693 0.421976 \n",
"2 0.712876 -1.345929 -1.054745 -0.502296 0.255767 -0.555215 -0.374579 \n",
"3 -0.528372 -0.226552 -0.891205 1.306358 1.130840 0.204551 1.501261 \n",
"4 1.209361 -0.064672 1.074549 0.433635 -0.492203 1.134994 0.922045 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.048497 -0.452846 1.593279 -1.461541 1.250952 0.765631 1.259925 \n",
"99996 0.258329 -0.385118 0.865576 0.142066 0.249012 1.083895 0.424546 \n",
"99997 -1.441009 -0.930747 -1.212881 -1.595667 -0.378324 -0.644536 -0.979022 \n",
"99998 -0.600312 1.116904 -0.682997 -0.901353 2.203541 0.322509 -0.478056 \n",
"99999 0.467480 0.635124 0.622167 0.377851 0.057745 0.370907 0.117684 \n",
"\n",
"[100000 rows x 100 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df1._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(arr, policy=\"split\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>90</th>\n",
" <th>91</th>\n",
" <th>92</th>\n",
" <th>93</th>\n",
" <th>94</th>\n",
" <th>95</th>\n",
" <th>96</th>\n",
" <th>97</th>\n",
" <th>98</th>\n",
" <th>99</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.371022</td>\n",
" <td>-0.242209</td>\n",
" <td>2.113007</td>\n",
" <td>0.065577</td>\n",
" <td>-1.395332</td>\n",
" <td>0.345039</td>\n",
" <td>0.430615</td>\n",
" <td>0.910320</td>\n",
" <td>-0.887256</td>\n",
" <td>-1.248485</td>\n",
" <td>...</td>\n",
" <td>0.832916</td>\n",
" <td>-0.020408</td>\n",
" <td>1.140463</td>\n",
" <td>1.465279</td>\n",
" <td>0.333421</td>\n",
" <td>-1.686267</td>\n",
" <td>0.182302</td>\n",
" <td>0.260248</td>\n",
" <td>-0.056024</td>\n",
" <td>-0.645074</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.813260</td>\n",
" <td>0.936914</td>\n",
" <td>-0.210771</td>\n",
" <td>0.564948</td>\n",
" <td>0.019599</td>\n",
" <td>-0.166271</td>\n",
" <td>0.251355</td>\n",
" <td>-0.134027</td>\n",
" <td>1.280353</td>\n",
" <td>-0.770996</td>\n",
" <td>...</td>\n",
" <td>0.028680</td>\n",
" <td>-0.224805</td>\n",
" <td>0.857354</td>\n",
" <td>0.462102</td>\n",
" <td>-0.786376</td>\n",
" <td>2.388081</td>\n",
" <td>-1.658082</td>\n",
" <td>-0.539284</td>\n",
" <td>-0.880693</td>\n",
" <td>0.421976</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.455793</td>\n",
" <td>-0.381480</td>\n",
" <td>-0.502506</td>\n",
" <td>0.602100</td>\n",
" <td>0.293811</td>\n",
" <td>-0.370855</td>\n",
" <td>0.135328</td>\n",
" <td>1.038086</td>\n",
" <td>0.575656</td>\n",
" <td>0.928458</td>\n",
" <td>...</td>\n",
" <td>0.211374</td>\n",
" <td>1.911427</td>\n",
" <td>-0.383501</td>\n",
" <td>0.712876</td>\n",
" <td>-1.345929</td>\n",
" <td>-1.054745</td>\n",
" <td>-0.502296</td>\n",
" <td>0.255767</td>\n",
" <td>-0.555215</td>\n",
" <td>-0.374579</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.055037</td>\n",
" <td>0.943844</td>\n",
" <td>1.186883</td>\n",
" <td>-0.097353</td>\n",
" <td>0.453953</td>\n",
" <td>0.336523</td>\n",
" <td>-0.227028</td>\n",
" <td>1.804491</td>\n",
" <td>-1.255713</td>\n",
" <td>-0.974349</td>\n",
" <td>...</td>\n",
" <td>0.435610</td>\n",
" <td>1.250981</td>\n",
" <td>1.484357</td>\n",
" <td>-0.528372</td>\n",
" <td>-0.226552</td>\n",
" <td>-0.891205</td>\n",
" <td>1.306358</td>\n",
" <td>1.130840</td>\n",
" <td>0.204551</td>\n",
" <td>1.501261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.429316</td>\n",
" <td>1.937749</td>\n",
" <td>0.980708</td>\n",
" <td>-0.124727</td>\n",
" <td>1.250186</td>\n",
" <td>0.181841</td>\n",
" <td>-1.260060</td>\n",
" <td>2.688526</td>\n",
" <td>0.764125</td>\n",
" <td>-0.687951</td>\n",
" <td>...</td>\n",
" <td>-2.689082</td>\n",
" <td>1.038927</td>\n",
" <td>-3.161205</td>\n",
" <td>1.209361</td>\n",
" <td>-0.064672</td>\n",
" <td>1.074549</td>\n",
" <td>0.433635</td>\n",
" <td>-0.492203</td>\n",
" <td>1.134994</td>\n",
" <td>0.922045</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99995</th>\n",
" <td>0.650366</td>\n",
" <td>0.296693</td>\n",
" <td>-0.648191</td>\n",
" <td>-0.027173</td>\n",
" <td>-1.368694</td>\n",
" <td>-1.414733</td>\n",
" <td>-0.098120</td>\n",
" <td>0.850049</td>\n",
" <td>0.931312</td>\n",
" <td>0.577836</td>\n",
" <td>...</td>\n",
" <td>-1.738731</td>\n",
" <td>0.826997</td>\n",
" <td>1.199955</td>\n",
" <td>0.048497</td>\n",
" <td>-0.452846</td>\n",
" <td>1.593279</td>\n",
" <td>-1.461541</td>\n",
" <td>1.250952</td>\n",
" <td>0.765631</td>\n",
" <td>1.259925</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99996</th>\n",
" <td>-1.025004</td>\n",
" <td>-0.033401</td>\n",
" <td>0.352300</td>\n",
" <td>2.029560</td>\n",
" <td>-1.081474</td>\n",
" <td>0.372101</td>\n",
" <td>-0.231158</td>\n",
" <td>0.129621</td>\n",
" <td>0.332448</td>\n",
" <td>-1.622283</td>\n",
" <td>...</td>\n",
" <td>0.491512</td>\n",
" <td>-0.619863</td>\n",
" <td>0.921266</td>\n",
" <td>0.258329</td>\n",
" <td>-0.385118</td>\n",
" <td>0.865576</td>\n",
" <td>0.142066</td>\n",
" <td>0.249012</td>\n",
" <td>1.083895</td>\n",
" <td>0.424546</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99997</th>\n",
" <td>-0.799553</td>\n",
" <td>-0.302133</td>\n",
" <td>-1.528081</td>\n",
" <td>0.159766</td>\n",
" <td>-0.194533</td>\n",
" <td>-0.942236</td>\n",
" <td>0.175192</td>\n",
" <td>0.295742</td>\n",
" <td>2.772861</td>\n",
" <td>0.499414</td>\n",
" <td>...</td>\n",
" <td>0.162933</td>\n",
" <td>0.919790</td>\n",
" <td>1.617661</td>\n",
" <td>-1.441009</td>\n",
" <td>-0.930747</td>\n",
" <td>-1.212881</td>\n",
" <td>-1.595667</td>\n",
" <td>-0.378324</td>\n",
" <td>-0.644536</td>\n",
" <td>-0.979022</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99998</th>\n",
" <td>-1.153394</td>\n",
" <td>-3.726555</td>\n",
" <td>1.006661</td>\n",
" <td>1.938871</td>\n",
" <td>-2.395232</td>\n",
" <td>0.255857</td>\n",
" <td>-0.609283</td>\n",
" <td>-0.351652</td>\n",
" <td>0.443662</td>\n",
" <td>-1.992509</td>\n",
" <td>...</td>\n",
" <td>1.074920</td>\n",
" <td>1.201669</td>\n",
" <td>0.665734</td>\n",
" <td>-0.600312</td>\n",
" <td>1.116904</td>\n",
" <td>-0.682997</td>\n",
" <td>-0.901353</td>\n",
" <td>2.203541</td>\n",
" <td>0.322509</td>\n",
" <td>-0.478056</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99999</th>\n",
" <td>1.461720</td>\n",
" <td>0.397803</td>\n",
" <td>0.186679</td>\n",
" <td>-0.155946</td>\n",
" <td>-0.427969</td>\n",
" <td>-0.385579</td>\n",
" <td>-0.359528</td>\n",
" <td>-0.491198</td>\n",
" <td>0.733553</td>\n",
" <td>-0.349641</td>\n",
" <td>...</td>\n",
" <td>0.370914</td>\n",
" <td>0.988451</td>\n",
" <td>1.326340</td>\n",
" <td>0.467480</td>\n",
" <td>0.635124</td>\n",
" <td>0.622167</td>\n",
" <td>0.377851</td>\n",
" <td>0.057745</td>\n",
" <td>0.370907</td>\n",
" <td>0.117684</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>100000 rows × 100 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 0.371022 -0.242209 2.113007 0.065577 -1.395332 0.345039 0.430615 \n",
"1 0.813260 0.936914 -0.210771 0.564948 0.019599 -0.166271 0.251355 \n",
"2 0.455793 -0.381480 -0.502506 0.602100 0.293811 -0.370855 0.135328 \n",
"3 0.055037 0.943844 1.186883 -0.097353 0.453953 0.336523 -0.227028 \n",
"4 0.429316 1.937749 0.980708 -0.124727 1.250186 0.181841 -1.260060 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.650366 0.296693 -0.648191 -0.027173 -1.368694 -1.414733 -0.098120 \n",
"99996 -1.025004 -0.033401 0.352300 2.029560 -1.081474 0.372101 -0.231158 \n",
"99997 -0.799553 -0.302133 -1.528081 0.159766 -0.194533 -0.942236 0.175192 \n",
"99998 -1.153394 -3.726555 1.006661 1.938871 -2.395232 0.255857 -0.609283 \n",
"99999 1.461720 0.397803 0.186679 -0.155946 -0.427969 -0.385579 -0.359528 \n",
"\n",
" 7 8 9 ... 90 91 92 \\\n",
"0 0.910320 -0.887256 -1.248485 ... 0.832916 -0.020408 1.140463 \n",
"1 -0.134027 1.280353 -0.770996 ... 0.028680 -0.224805 0.857354 \n",
"2 1.038086 0.575656 0.928458 ... 0.211374 1.911427 -0.383501 \n",
"3 1.804491 -1.255713 -0.974349 ... 0.435610 1.250981 1.484357 \n",
"4 2.688526 0.764125 -0.687951 ... -2.689082 1.038927 -3.161205 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.850049 0.931312 0.577836 ... -1.738731 0.826997 1.199955 \n",
"99996 0.129621 0.332448 -1.622283 ... 0.491512 -0.619863 0.921266 \n",
"99997 0.295742 2.772861 0.499414 ... 0.162933 0.919790 1.617661 \n",
"99998 -0.351652 0.443662 -1.992509 ... 1.074920 1.201669 0.665734 \n",
"99999 -0.491198 0.733553 -0.349641 ... 0.370914 0.988451 1.326340 \n",
"\n",
" 93 94 95 96 97 98 99 \n",
"0 1.465279 0.333421 -1.686267 0.182302 0.260248 -0.056024 -0.645074 \n",
"1 0.462102 -0.786376 2.388081 -1.658082 -0.539284 -0.880693 0.421976 \n",
"2 0.712876 -1.345929 -1.054745 -0.502296 0.255767 -0.555215 -0.374579 \n",
"3 -0.528372 -0.226552 -0.891205 1.306358 1.130840 0.204551 1.501261 \n",
"4 1.209361 -0.064672 1.074549 0.433635 -0.492203 1.134994 0.922045 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.048497 -0.452846 1.593279 -1.461541 1.250952 0.765631 1.259925 \n",
"99996 0.258329 -0.385118 0.865576 0.142066 0.249012 1.083895 0.424546 \n",
"99997 -1.441009 -0.930747 -1.212881 -1.595667 -0.378324 -0.644536 -0.979022 \n",
"99998 -0.600312 1.116904 -0.682997 -0.901353 2.203541 0.322509 -0.478056 \n",
"99999 0.467480 0.635124 0.622167 0.377851 0.057745 0.370907 0.117684 \n",
"\n",
"[100000 rows x 100 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df2._data.blocks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arithmetic with a scalar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that this operation preserves the (non-)consolidation of the dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res1 = df1 + 1\n",
"len(res1._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res2 = df2 + 1\n",
"len(res2._data.blocks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Timing:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"27 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1 + 1"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"29.1 ms ± 445 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2 + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So this is a bit slower (as expected, as the numpy operation is run on multiple arrays instead of on 1), but it's also not a huge difference."
]
},
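{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for where that per-column overhead comes from, here is a rough pure-numpy comparison (added for illustration, outside of pandas): adding a scalar to the full 2D block versus to each of the 100 column arrays separately, which pays the numpy call overhead once per column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One big 2D array vs. a list of 100 separate 1D column arrays\n",
"columns = [arr[:, i].copy() for i in range(arr.shape[1])]\n",
"\n",
"%timeit arr + 1\n",
"%timeit [col + 1 for col in columns]"
]
},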
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arithmetic with other dataframes"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len((df1 + df1)._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len((df2 + df2)._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"28.1 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1 + df1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"37 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2 + df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, a bit slower. A larger difference in this case, but if we look at the profile of this:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" "
]
},
{
"data": {
"text/plain": [
" 208414 function calls (206374 primitive calls) in 0.383 seconds\n",
"\n",
" Ordered by: cumulative time\n",
" List reduced from 123 to 20 due to restriction <20>\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 1 0.000 0.000 0.383 0.383 {built-in method builtins.exec}\n",
" 1 0.027 0.027 0.383 0.383 <string>:1(<module>)\n",
" 1 0.000 0.000 0.357 0.357 <string>:1(<listcomp>)\n",
" 10 0.000 0.000 0.357 0.036 __init__.py:615(f)\n",
" 10 0.000 0.000 0.356 0.036 frame.py:5751(_combine_frame)\n",
" 10 0.000 0.000 0.356 0.036 __init__.py:242(dispatch_to_series)\n",
" 10 0.000 0.000 0.356 0.036 managers.py:374(operate_blockwise)\n",
" 10 0.007 0.001 0.356 0.036 ops.py:12(operate_blockwise)\n",
" 1000 0.004 0.000 0.264 0.000 array_ops.py:159(arithmetic_op)\n",
" 1000 0.003 0.000 0.243 0.000 array_ops.py:118(na_arithmetic_op)\n",
" 1000 0.001 0.000 0.240 0.000 expressions.py:214(evaluate)\n",
" 1000 0.003 0.000 0.236 0.000 expressions.py:61(_evaluate_standard)\n",
" 1000 0.225 0.000 0.225 0.000 {built-in method _operator.add}\n",
" 1000 0.013 0.000 0.036 0.000 managers.py:1290(_slice_take_blocks_ax0)\n",
" 1000 0.001 0.000 0.034 0.000 blocks.py:344(_split_op_result)\n",
" 1000 0.001 0.000 0.030 0.000 blocks.py:236(make_block)\n",
" 1000 0.002 0.000 0.028 0.000 blocks.py:2712(make_block)\n",
" 1000 0.003 0.000 0.022 0.000 blocks.py:2666(get_block_type)\n",
" 39250 0.008 0.000 0.013 0.000 {built-in method builtins.isinstance}\n",
" 4000 0.006 0.000 0.013 0.000 _ufunc_config.py:39(seterr)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%prun -l 20 -s cumtime [df2 + df2 for _ in range(10)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the functions to deal with aligning the blocks (`_slice_take_blocks_ax0` and `_split_op_result`; their cumtime can be summed since they are not nested) take about 20% of the total time. This should easily be eliminated in the case of 1D blocks, since the blocks are always aligned."
]
},
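{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what such a shortcut could look like (a sketch added here, not the actual pandas internals): when both frames store one 1D array per column, the operation can simply be zipped over the column arrays, with no block splitting or re-alignment at all. It uses `_iter_column_arrays`, the private helper that is also used further down in this notebook; the result is rebuilt with the plain DataFrame constructor purely for the correctness check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def columnwise_add(left, right):\n",
"    # Sketch only: assumes identical column labels and row index on both\n",
"    # frames, so no alignment or block splitting is needed.\n",
"    arrays = [a + b for a, b in zip(left._iter_column_arrays(),\n",
"                                    right._iter_column_arrays())]\n",
"    return pd.DataFrame(dict(zip(left.columns, arrays)), index=left.index)\n",
"\n",
"pd.testing.assert_frame_equal(columnwise_add(df2, df2), df2 + df2)"
]
},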
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now check the performance difference with mixed dtypes (since the non-consolidated block could potentially be faster here, since the blocks don't need to be splitted to align for the operation). A few different cases:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# make a second frame where half of it are integer columns\n",
"df1_b = df1.copy()\n",
"df1_b.iloc[:, 50] = df1_b.iloc[:, 50].astype(int)\n",
"df1_b.iloc[:, 51:] = df1_b.iloc[:, 51:].astype(int)\n",
"df1_b._consolidate_inplace()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"34.3 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1 + df1_b"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"df2_b = df2.copy()\n",
"df2_b._policy = \"split\"\n",
"df2_b.iloc[:, 50] = df2_b.iloc[:, 50].astype(int)\n",
"df2_b.iloc[:, 51:] = df2_b.iloc[:, 51:].astype(int)\n",
"df2_b._consolidate_inplace()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"39.3 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2 + df2_b"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"res1 = df1 + df1_b"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"res2 = df2 + df2_b"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"pd.testing.assert_frame_equal(res1, res2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again only a small difference (and would need to look at a profile if it could be easily avoided).\n",
"\n",
"Second example with and integer dataframe more some random columns as floats:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"arr2 = np.random.randint(0, 1000, (100_000, 100))\n",
"dfint1 = pd.DataFrame(arr2, policy=\"block\") # the current default\n",
"dfint1 = dfint1.copy()\n",
"dfint2 = pd.DataFrame(arr2, policy=\"split\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"dfint1_a = dfint1.copy()\n",
"dfint1_a.iloc[0, [10, 15, 26]] = np.nan\n",
"\n",
"dfint1_b = dfint1.copy()\n",
"dfint1_b.iloc[0, [13, 52]] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"53 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit dfint1_a + dfint1_b"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"dfint2_a = dfint2.copy()\n",
"dfint2_a._policy = \"split\"\n",
"dfint2_a.iloc[0, [10, 15, 26]] = np.nan\n",
"\n",
"dfint2_b = dfint2.copy()\n",
"dfint2_b._policy = \"split\"\n",
"dfint2_b.iloc[0, [13, 52]] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"37 ms ± 430 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit dfint2_a + dfint2_b"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, column-wise is actually faster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Column-wise reduction"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"72.7 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1.sum()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"22.6 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2.sum()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"52.4 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1_orig.sum() # the dataframe with the original array layout memory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fact that column-wise is faster here (and a lot) is a bit strange (also the fact that `df1_orig.sum()` is faster is strange, that might indicate a performance bug in the current implementation). Based on a quick profile with `%snakeviz`, it seems that the `_get_values` function in `nanops.py` is taking a larger portion for the block case (so something to investigate if we keep the block consolidation)."
]
},
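{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a quick way to double-check this without snakeviz (cell added for illustration, not executed here) is a plain `%prun` of the block-wise sum; if the note above is right, `_get_values` from `nanops.py` should show up prominently:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Profile a few repetitions of the block-wise column reduction\n",
"%prun -l 15 -s cumtime [df1.sum() for _ in range(5)]"
]
},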
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Additionally, we can also check the extension array version, which takes a more efficient column-wise code path, and has a more efficient sum implementation, which together gives a large speedup compared to the original block-wise dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"df3 = df2.astype(\"Float64\")\n",
"df3._policy = \"split\""
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9.78 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit df3.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Row-wise reduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we look at a row-wise reduction instead, the block-wise dataframe is of course faster:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"64.6 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1.sum(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"93 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, if we wanted, this is probably also something that could be optimized, if we think that is worth it (it's certainly a much less common operation as column-wise reductions). \n",
"\n",
"A very crude appraoch for this specific case of sum gives the same result but is much faster:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.92 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit \n",
"arrays = df2._iter_column_arrays()\n",
"res = next(arrays)\n",
"\n",
"for a in arrays:\n",
" res = res + a\n",
"\n",
"res = pd.Series(res, index=df2.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's ensure it gives the same result:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"arrays = df2._iter_column_arrays()\n",
"res = next(arrays)\n",
"\n",
"for a in arrays:\n",
" res = res + a\n",
"\n",
"res = pd.Series(res, index=df2.index)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"pd.testing.assert_series_equal(df2.sum(axis=1), res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Slicing rows"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"21 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
"source": [
"%timeit df1.iloc[0:1000]"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"243 µs ± 9.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
]
}
],
"source": [
"%timeit df2.iloc[0:1000]"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
"*** Profile stats marshalled to file '/tmp/tmph1qm7s01'. \n",
"Opening SnakeViz in a new tab...\n"
]
}
],
"source": [
"%snakeviz -t df2.iloc[0:1000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a lot slower (at least relatively, it are also much smaller timings as the other operations in this notebook).\n",
"\n",
"But, let's implement a simple version of this to see why this is the case:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"def slice_rows(df):\n",
" from pandas.core.internals import create_block_manager_from_blocks\n",
"\n",
" res = []\n",
"\n",
" for block in df._data.blocks:\n",
" new_values = block.values[:, 0:1000]\n",
" new_block = block.make_block_same_class(new_values)\n",
" res.append(new_block)\n",
"\n",
" new_index = df.index[0:1000]\n",
" mgr = create_block_manager_from_blocks(res, [df.columns, new_index], policy=\"split\")\n",
" return pd.DataFrame(mgr)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"226 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
]
}
],
"source": [
"%timeit slice_rows(df2)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timer unit: 1e-06 s\n",
"\n",
"Total time: 0.930681 s\n",
"File: <ipython-input-45-e7729d6d7e88>\n",
"Function: slice_rows at line 1\n",
"\n",
"Line # Hits Time Per Hit % Time Line Contents\n",
"==============================================================\n",
" 1 def slice_rows(df):\n",
" 2 1000 4093.0 4.1 0.4 from pandas.core.internals import create_block_manager_from_blocks\n",
" 3 \n",
" 4 1000 440.0 0.4 0.0 res = []\n",
" 5 \n",
" 6 101000 36873.0 0.4 4.0 for block in df._data.blocks:\n",
" 7 100000 119689.0 1.2 12.9 new_values = block.values[:, 0:1000]\n",
" 8 100000 399894.0 4.0 43.0 new_block = block.make_block_same_class(new_values)\n",
" 9 100000 48970.0 0.5 5.3 res.append(new_block)\n",
" 10 \n",
" 11 1000 11357.0 11.4 1.2 new_index = df.index[0:1000]\n",
" 12 1000 300365.0 300.4 32.3 mgr = create_block_manager_from_blocks(res, [df.columns, new_index], policy=\"split\")\n",
" 13 1000 9000.0 9.0 1.0 return pd.DataFrame(mgr)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%lprun -f slice_rows [slice_rows(df2) for _ in range(1000)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we notice that the actual slicing operation on all columns (`new_values = block.values[:, 0:1000]`) is only ca 10-15% of the overall timing. If we can trim down the overhead on creating new blocks / creating a blockmanager from those blocks (which should be possible in a simple case like this where the dtypes or the number of columns doesn't change), we should also be able to get this performance closer to the block-wise slicing."
]
},
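{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check of that 10-15% estimate (cell added for illustration), we can time only the raw per-column array slicing, skipping the creation of new blocks and of a new BlockManager; this gives a rough lower bound for what a tuned implementation could reach:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def slice_arrays_only(df, stop=1000):\n",
"    # Only slice the underlying arrays; skip make_block_same_class and\n",
"    # create_block_manager_from_blocks entirely.\n",
"    return [block.values[:, 0:stop] for block in df._data.blocks]\n",
"\n",
"%timeit slice_arrays_only(df2)"
]
},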
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding a column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am including the `_consolidate_inplace` here, as this is something that would otherwise often happen in a next operation:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"23.4 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%%timeit df = pd.DataFrame(arr, policy=\"block\")\n",
"df[100] = 1.0\n",
"df._consolidate_inplace()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"108 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
"source": [
"%%timeit df = pd.DataFrame(arr, policy=\"split\")\n",
"df[100] = 1.0\n",
"df._consolidate_inplace()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When including the consolidation, it is of course much cheaper to add a column for a non-consolidated dataframe. \n",
"It's not necessarily fully fair comparison, as you typically only add a column once / a few times, and may do many operations on the dataframe afterwards (so the consolidation might pay off), but it's nevertheless a use case that would improve a lot with non-consolidated dataframes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (dev)",
"language": "python",
"name": "dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}