Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Select an option

  • Save jorisvandenbossche/f917d4301d21069e2be2e3b7c7aa4d07 to your computer and use it in GitHub Desktop.

Select an option

Save jorisvandenbossche/f917d4301d21069e2be2e3b7c7aa4d07 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# POC 1D ArrayManager vs consolidated BlockManager"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See https://github.com/pandas-dev/pandas/issues/10556 and mailing list discussion at https://mail.python.org/pipermail/pandas-dev/2020-May/001219.html for context. \n",
"\n",
"This notebook is using a branch of pandas that adds an experimental \"array manager\".\n",
"\n",
"**Note** This notebook includes some benchmarks. However, the results of the timings are highly dependent on the characteristics of the data that is being used. I am now using a DataFrame with 100 columns and a 1000:1 rows/column ratio, which I think is a quite typical scenario. But when tweaking this (more or less columns, different ratio), this easily changes the results / interpretation of the benchmarks."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pd.options.compute.use_numexpr = False\n",
"pd.options.compute.use_bottleneck = False"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%load_ext snakeviz"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%load_ext line_profiler"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"arr = np.random.randn(100_000, 100)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df1_orig = pd.DataFrame(arr, manager=\"block\") # the current default\n",
"# additional copy to ensure the memory is column-aligned\n",
"df1 = df1_orig.copy()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>90</th>\n",
" <th>91</th>\n",
" <th>92</th>\n",
" <th>93</th>\n",
" <th>94</th>\n",
" <th>95</th>\n",
" <th>96</th>\n",
" <th>97</th>\n",
" <th>98</th>\n",
" <th>99</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.677230</td>\n",
" <td>0.682754</td>\n",
" <td>-0.315458</td>\n",
" <td>1.432242</td>\n",
" <td>2.391720</td>\n",
" <td>0.644951</td>\n",
" <td>1.791661</td>\n",
" <td>0.491711</td>\n",
" <td>-0.358679</td>\n",
" <td>-1.394864</td>\n",
" <td>...</td>\n",
" <td>0.407030</td>\n",
" <td>-0.713274</td>\n",
" <td>1.718921</td>\n",
" <td>0.345061</td>\n",
" <td>-2.146100</td>\n",
" <td>-0.738093</td>\n",
" <td>-0.159594</td>\n",
" <td>0.520909</td>\n",
" <td>1.133554</td>\n",
" <td>0.585371</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-0.559448</td>\n",
" <td>-1.280697</td>\n",
" <td>0.402636</td>\n",
" <td>-0.583523</td>\n",
" <td>-0.266790</td>\n",
" <td>0.418998</td>\n",
" <td>0.994489</td>\n",
" <td>-0.976886</td>\n",
" <td>-1.696476</td>\n",
" <td>0.926231</td>\n",
" <td>...</td>\n",
" <td>-0.855396</td>\n",
" <td>-2.048210</td>\n",
" <td>-1.265783</td>\n",
" <td>2.394670</td>\n",
" <td>-2.258890</td>\n",
" <td>1.229371</td>\n",
" <td>1.750581</td>\n",
" <td>0.139859</td>\n",
" <td>0.636832</td>\n",
" <td>0.385954</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-0.104786</td>\n",
" <td>-0.235949</td>\n",
" <td>0.001328</td>\n",
" <td>0.771150</td>\n",
" <td>-0.003778</td>\n",
" <td>-1.039505</td>\n",
" <td>0.879596</td>\n",
" <td>-0.625585</td>\n",
" <td>-0.979214</td>\n",
" <td>0.022012</td>\n",
" <td>...</td>\n",
" <td>-0.994192</td>\n",
" <td>0.661030</td>\n",
" <td>-0.010726</td>\n",
" <td>-0.468801</td>\n",
" <td>-0.605885</td>\n",
" <td>1.311313</td>\n",
" <td>0.339141</td>\n",
" <td>-0.211478</td>\n",
" <td>-0.717916</td>\n",
" <td>1.467792</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.062078</td>\n",
" <td>0.095454</td>\n",
" <td>-0.557974</td>\n",
" <td>0.701680</td>\n",
" <td>0.193023</td>\n",
" <td>-1.002278</td>\n",
" <td>-0.223521</td>\n",
" <td>-0.926172</td>\n",
" <td>-0.494398</td>\n",
" <td>-0.363959</td>\n",
" <td>...</td>\n",
" <td>0.020915</td>\n",
" <td>-0.800234</td>\n",
" <td>-0.047158</td>\n",
" <td>-0.410241</td>\n",
" <td>-1.296799</td>\n",
" <td>-1.308874</td>\n",
" <td>0.467894</td>\n",
" <td>2.335768</td>\n",
" <td>0.298389</td>\n",
" <td>-0.855592</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.504247</td>\n",
" <td>0.312447</td>\n",
" <td>-0.698603</td>\n",
" <td>-0.616552</td>\n",
" <td>1.530180</td>\n",
" <td>-0.647969</td>\n",
" <td>1.494239</td>\n",
" <td>-0.680018</td>\n",
" <td>-0.205344</td>\n",
" <td>0.757280</td>\n",
" <td>...</td>\n",
" <td>0.468555</td>\n",
" <td>1.204949</td>\n",
" <td>1.374184</td>\n",
" <td>0.153511</td>\n",
" <td>-0.450651</td>\n",
" <td>0.877156</td>\n",
" <td>0.892300</td>\n",
" <td>1.151173</td>\n",
" <td>-0.230626</td>\n",
" <td>0.605697</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99995</th>\n",
" <td>0.539733</td>\n",
" <td>2.411698</td>\n",
" <td>-0.455420</td>\n",
" <td>0.538498</td>\n",
" <td>-0.956580</td>\n",
" <td>-0.384449</td>\n",
" <td>-1.338329</td>\n",
" <td>1.211452</td>\n",
" <td>-0.462413</td>\n",
" <td>-1.330185</td>\n",
" <td>...</td>\n",
" <td>1.651992</td>\n",
" <td>0.476358</td>\n",
" <td>0.441226</td>\n",
" <td>0.704206</td>\n",
" <td>-0.270060</td>\n",
" <td>1.247129</td>\n",
" <td>1.247639</td>\n",
" <td>-0.143032</td>\n",
" <td>0.044815</td>\n",
" <td>-0.631406</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99996</th>\n",
" <td>0.104250</td>\n",
" <td>1.415562</td>\n",
" <td>0.139757</td>\n",
" <td>-0.621568</td>\n",
" <td>-0.249787</td>\n",
" <td>0.358591</td>\n",
" <td>0.238794</td>\n",
" <td>1.232123</td>\n",
" <td>0.139006</td>\n",
" <td>0.684821</td>\n",
" <td>...</td>\n",
" <td>-0.207111</td>\n",
" <td>0.549576</td>\n",
" <td>2.155070</td>\n",
" <td>0.780223</td>\n",
" <td>0.281913</td>\n",
" <td>0.053108</td>\n",
" <td>0.433405</td>\n",
" <td>1.200421</td>\n",
" <td>1.526589</td>\n",
" <td>-0.269244</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99997</th>\n",
" <td>1.014513</td>\n",
" <td>0.349818</td>\n",
" <td>-0.482681</td>\n",
" <td>0.130529</td>\n",
" <td>-0.084422</td>\n",
" <td>1.623366</td>\n",
" <td>-1.069759</td>\n",
" <td>0.403931</td>\n",
" <td>-0.946770</td>\n",
" <td>-0.127743</td>\n",
" <td>...</td>\n",
" <td>0.245094</td>\n",
" <td>-0.188509</td>\n",
" <td>-0.425818</td>\n",
" <td>0.425148</td>\n",
" <td>-1.733082</td>\n",
" <td>0.366996</td>\n",
" <td>0.278778</td>\n",
" <td>-0.457803</td>\n",
" <td>-1.976963</td>\n",
" <td>1.287677</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99998</th>\n",
" <td>1.280127</td>\n",
" <td>2.488734</td>\n",
" <td>-1.073581</td>\n",
" <td>-0.636677</td>\n",
" <td>-0.909728</td>\n",
" <td>-1.016732</td>\n",
" <td>1.018966</td>\n",
" <td>-1.430341</td>\n",
" <td>-0.922263</td>\n",
" <td>1.865314</td>\n",
" <td>...</td>\n",
" <td>-0.027662</td>\n",
" <td>0.511552</td>\n",
" <td>-0.559474</td>\n",
" <td>-2.757027</td>\n",
" <td>-1.092821</td>\n",
" <td>0.437376</td>\n",
" <td>-1.051840</td>\n",
" <td>1.263932</td>\n",
" <td>0.042789</td>\n",
" <td>0.152446</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99999</th>\n",
" <td>0.782564</td>\n",
" <td>-0.388404</td>\n",
" <td>-0.247960</td>\n",
" <td>1.381064</td>\n",
" <td>-1.605046</td>\n",
" <td>0.974790</td>\n",
" <td>-1.178229</td>\n",
" <td>-0.396106</td>\n",
" <td>-0.368758</td>\n",
" <td>-1.187883</td>\n",
" <td>...</td>\n",
" <td>-1.746631</td>\n",
" <td>0.410091</td>\n",
" <td>-0.249438</td>\n",
" <td>0.042342</td>\n",
" <td>-0.597671</td>\n",
" <td>0.467246</td>\n",
" <td>1.492825</td>\n",
" <td>2.039514</td>\n",
" <td>-1.024533</td>\n",
" <td>2.571223</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>100000 rows × 100 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 -0.677230 0.682754 -0.315458 1.432242 2.391720 0.644951 1.791661 \n",
"1 -0.559448 -1.280697 0.402636 -0.583523 -0.266790 0.418998 0.994489 \n",
"2 -0.104786 -0.235949 0.001328 0.771150 -0.003778 -1.039505 0.879596 \n",
"3 -0.062078 0.095454 -0.557974 0.701680 0.193023 -1.002278 -0.223521 \n",
"4 1.504247 0.312447 -0.698603 -0.616552 1.530180 -0.647969 1.494239 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.539733 2.411698 -0.455420 0.538498 -0.956580 -0.384449 -1.338329 \n",
"99996 0.104250 1.415562 0.139757 -0.621568 -0.249787 0.358591 0.238794 \n",
"99997 1.014513 0.349818 -0.482681 0.130529 -0.084422 1.623366 -1.069759 \n",
"99998 1.280127 2.488734 -1.073581 -0.636677 -0.909728 -1.016732 1.018966 \n",
"99999 0.782564 -0.388404 -0.247960 1.381064 -1.605046 0.974790 -1.178229 \n",
"\n",
" 7 8 9 ... 90 91 92 \\\n",
"0 0.491711 -0.358679 -1.394864 ... 0.407030 -0.713274 1.718921 \n",
"1 -0.976886 -1.696476 0.926231 ... -0.855396 -2.048210 -1.265783 \n",
"2 -0.625585 -0.979214 0.022012 ... -0.994192 0.661030 -0.010726 \n",
"3 -0.926172 -0.494398 -0.363959 ... 0.020915 -0.800234 -0.047158 \n",
"4 -0.680018 -0.205344 0.757280 ... 0.468555 1.204949 1.374184 \n",
"... ... ... ... ... ... ... ... \n",
"99995 1.211452 -0.462413 -1.330185 ... 1.651992 0.476358 0.441226 \n",
"99996 1.232123 0.139006 0.684821 ... -0.207111 0.549576 2.155070 \n",
"99997 0.403931 -0.946770 -0.127743 ... 0.245094 -0.188509 -0.425818 \n",
"99998 -1.430341 -0.922263 1.865314 ... -0.027662 0.511552 -0.559474 \n",
"99999 -0.396106 -0.368758 -1.187883 ... -1.746631 0.410091 -0.249438 \n",
"\n",
" 93 94 95 96 97 98 99 \n",
"0 0.345061 -2.146100 -0.738093 -0.159594 0.520909 1.133554 0.585371 \n",
"1 2.394670 -2.258890 1.229371 1.750581 0.139859 0.636832 0.385954 \n",
"2 -0.468801 -0.605885 1.311313 0.339141 -0.211478 -0.717916 1.467792 \n",
"3 -0.410241 -1.296799 -1.308874 0.467894 2.335768 0.298389 -0.855592 \n",
"4 0.153511 -0.450651 0.877156 0.892300 1.151173 -0.230626 0.605697 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.704206 -0.270060 1.247129 1.247639 -0.143032 0.044815 -0.631406 \n",
"99996 0.780223 0.281913 0.053108 0.433405 1.200421 1.526589 -0.269244 \n",
"99997 0.425148 -1.733082 0.366996 0.278778 -0.457803 -1.976963 1.287677 \n",
"99998 -2.757027 -1.092821 0.437376 -1.051840 1.263932 0.042789 0.152446 \n",
"99999 0.042342 -0.597671 0.467246 1.492825 2.039514 -1.024533 2.571223 \n",
"\n",
"[100000 rows x 100 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df1._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(arr, manager=\"array\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>90</th>\n",
" <th>91</th>\n",
" <th>92</th>\n",
" <th>93</th>\n",
" <th>94</th>\n",
" <th>95</th>\n",
" <th>96</th>\n",
" <th>97</th>\n",
" <th>98</th>\n",
" <th>99</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.677230</td>\n",
" <td>0.682754</td>\n",
" <td>-0.315458</td>\n",
" <td>1.432242</td>\n",
" <td>2.391720</td>\n",
" <td>0.644951</td>\n",
" <td>1.791661</td>\n",
" <td>0.491711</td>\n",
" <td>-0.358679</td>\n",
" <td>-1.394864</td>\n",
" <td>...</td>\n",
" <td>0.407030</td>\n",
" <td>-0.713274</td>\n",
" <td>1.718921</td>\n",
" <td>0.345061</td>\n",
" <td>-2.146100</td>\n",
" <td>-0.738093</td>\n",
" <td>-0.159594</td>\n",
" <td>0.520909</td>\n",
" <td>1.133554</td>\n",
" <td>0.585371</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-0.559448</td>\n",
" <td>-1.280697</td>\n",
" <td>0.402636</td>\n",
" <td>-0.583523</td>\n",
" <td>-0.266790</td>\n",
" <td>0.418998</td>\n",
" <td>0.994489</td>\n",
" <td>-0.976886</td>\n",
" <td>-1.696476</td>\n",
" <td>0.926231</td>\n",
" <td>...</td>\n",
" <td>-0.855396</td>\n",
" <td>-2.048210</td>\n",
" <td>-1.265783</td>\n",
" <td>2.394670</td>\n",
" <td>-2.258890</td>\n",
" <td>1.229371</td>\n",
" <td>1.750581</td>\n",
" <td>0.139859</td>\n",
" <td>0.636832</td>\n",
" <td>0.385954</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-0.104786</td>\n",
" <td>-0.235949</td>\n",
" <td>0.001328</td>\n",
" <td>0.771150</td>\n",
" <td>-0.003778</td>\n",
" <td>-1.039505</td>\n",
" <td>0.879596</td>\n",
" <td>-0.625585</td>\n",
" <td>-0.979214</td>\n",
" <td>0.022012</td>\n",
" <td>...</td>\n",
" <td>-0.994192</td>\n",
" <td>0.661030</td>\n",
" <td>-0.010726</td>\n",
" <td>-0.468801</td>\n",
" <td>-0.605885</td>\n",
" <td>1.311313</td>\n",
" <td>0.339141</td>\n",
" <td>-0.211478</td>\n",
" <td>-0.717916</td>\n",
" <td>1.467792</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.062078</td>\n",
" <td>0.095454</td>\n",
" <td>-0.557974</td>\n",
" <td>0.701680</td>\n",
" <td>0.193023</td>\n",
" <td>-1.002278</td>\n",
" <td>-0.223521</td>\n",
" <td>-0.926172</td>\n",
" <td>-0.494398</td>\n",
" <td>-0.363959</td>\n",
" <td>...</td>\n",
" <td>0.020915</td>\n",
" <td>-0.800234</td>\n",
" <td>-0.047158</td>\n",
" <td>-0.410241</td>\n",
" <td>-1.296799</td>\n",
" <td>-1.308874</td>\n",
" <td>0.467894</td>\n",
" <td>2.335768</td>\n",
" <td>0.298389</td>\n",
" <td>-0.855592</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.504247</td>\n",
" <td>0.312447</td>\n",
" <td>-0.698603</td>\n",
" <td>-0.616552</td>\n",
" <td>1.530180</td>\n",
" <td>-0.647969</td>\n",
" <td>1.494239</td>\n",
" <td>-0.680018</td>\n",
" <td>-0.205344</td>\n",
" <td>0.757280</td>\n",
" <td>...</td>\n",
" <td>0.468555</td>\n",
" <td>1.204949</td>\n",
" <td>1.374184</td>\n",
" <td>0.153511</td>\n",
" <td>-0.450651</td>\n",
" <td>0.877156</td>\n",
" <td>0.892300</td>\n",
" <td>1.151173</td>\n",
" <td>-0.230626</td>\n",
" <td>0.605697</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99995</th>\n",
" <td>0.539733</td>\n",
" <td>2.411698</td>\n",
" <td>-0.455420</td>\n",
" <td>0.538498</td>\n",
" <td>-0.956580</td>\n",
" <td>-0.384449</td>\n",
" <td>-1.338329</td>\n",
" <td>1.211452</td>\n",
" <td>-0.462413</td>\n",
" <td>-1.330185</td>\n",
" <td>...</td>\n",
" <td>1.651992</td>\n",
" <td>0.476358</td>\n",
" <td>0.441226</td>\n",
" <td>0.704206</td>\n",
" <td>-0.270060</td>\n",
" <td>1.247129</td>\n",
" <td>1.247639</td>\n",
" <td>-0.143032</td>\n",
" <td>0.044815</td>\n",
" <td>-0.631406</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99996</th>\n",
" <td>0.104250</td>\n",
" <td>1.415562</td>\n",
" <td>0.139757</td>\n",
" <td>-0.621568</td>\n",
" <td>-0.249787</td>\n",
" <td>0.358591</td>\n",
" <td>0.238794</td>\n",
" <td>1.232123</td>\n",
" <td>0.139006</td>\n",
" <td>0.684821</td>\n",
" <td>...</td>\n",
" <td>-0.207111</td>\n",
" <td>0.549576</td>\n",
" <td>2.155070</td>\n",
" <td>0.780223</td>\n",
" <td>0.281913</td>\n",
" <td>0.053108</td>\n",
" <td>0.433405</td>\n",
" <td>1.200421</td>\n",
" <td>1.526589</td>\n",
" <td>-0.269244</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99997</th>\n",
" <td>1.014513</td>\n",
" <td>0.349818</td>\n",
" <td>-0.482681</td>\n",
" <td>0.130529</td>\n",
" <td>-0.084422</td>\n",
" <td>1.623366</td>\n",
" <td>-1.069759</td>\n",
" <td>0.403931</td>\n",
" <td>-0.946770</td>\n",
" <td>-0.127743</td>\n",
" <td>...</td>\n",
" <td>0.245094</td>\n",
" <td>-0.188509</td>\n",
" <td>-0.425818</td>\n",
" <td>0.425148</td>\n",
" <td>-1.733082</td>\n",
" <td>0.366996</td>\n",
" <td>0.278778</td>\n",
" <td>-0.457803</td>\n",
" <td>-1.976963</td>\n",
" <td>1.287677</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99998</th>\n",
" <td>1.280127</td>\n",
" <td>2.488734</td>\n",
" <td>-1.073581</td>\n",
" <td>-0.636677</td>\n",
" <td>-0.909728</td>\n",
" <td>-1.016732</td>\n",
" <td>1.018966</td>\n",
" <td>-1.430341</td>\n",
" <td>-0.922263</td>\n",
" <td>1.865314</td>\n",
" <td>...</td>\n",
" <td>-0.027662</td>\n",
" <td>0.511552</td>\n",
" <td>-0.559474</td>\n",
" <td>-2.757027</td>\n",
" <td>-1.092821</td>\n",
" <td>0.437376</td>\n",
" <td>-1.051840</td>\n",
" <td>1.263932</td>\n",
" <td>0.042789</td>\n",
" <td>0.152446</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99999</th>\n",
" <td>0.782564</td>\n",
" <td>-0.388404</td>\n",
" <td>-0.247960</td>\n",
" <td>1.381064</td>\n",
" <td>-1.605046</td>\n",
" <td>0.974790</td>\n",
" <td>-1.178229</td>\n",
" <td>-0.396106</td>\n",
" <td>-0.368758</td>\n",
" <td>-1.187883</td>\n",
" <td>...</td>\n",
" <td>-1.746631</td>\n",
" <td>0.410091</td>\n",
" <td>-0.249438</td>\n",
" <td>0.042342</td>\n",
" <td>-0.597671</td>\n",
" <td>0.467246</td>\n",
" <td>1.492825</td>\n",
" <td>2.039514</td>\n",
" <td>-1.024533</td>\n",
" <td>2.571223</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>100000 rows × 100 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 -0.677230 0.682754 -0.315458 1.432242 2.391720 0.644951 1.791661 \n",
"1 -0.559448 -1.280697 0.402636 -0.583523 -0.266790 0.418998 0.994489 \n",
"2 -0.104786 -0.235949 0.001328 0.771150 -0.003778 -1.039505 0.879596 \n",
"3 -0.062078 0.095454 -0.557974 0.701680 0.193023 -1.002278 -0.223521 \n",
"4 1.504247 0.312447 -0.698603 -0.616552 1.530180 -0.647969 1.494239 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.539733 2.411698 -0.455420 0.538498 -0.956580 -0.384449 -1.338329 \n",
"99996 0.104250 1.415562 0.139757 -0.621568 -0.249787 0.358591 0.238794 \n",
"99997 1.014513 0.349818 -0.482681 0.130529 -0.084422 1.623366 -1.069759 \n",
"99998 1.280127 2.488734 -1.073581 -0.636677 -0.909728 -1.016732 1.018966 \n",
"99999 0.782564 -0.388404 -0.247960 1.381064 -1.605046 0.974790 -1.178229 \n",
"\n",
" 7 8 9 ... 90 91 92 \\\n",
"0 0.491711 -0.358679 -1.394864 ... 0.407030 -0.713274 1.718921 \n",
"1 -0.976886 -1.696476 0.926231 ... -0.855396 -2.048210 -1.265783 \n",
"2 -0.625585 -0.979214 0.022012 ... -0.994192 0.661030 -0.010726 \n",
"3 -0.926172 -0.494398 -0.363959 ... 0.020915 -0.800234 -0.047158 \n",
"4 -0.680018 -0.205344 0.757280 ... 0.468555 1.204949 1.374184 \n",
"... ... ... ... ... ... ... ... \n",
"99995 1.211452 -0.462413 -1.330185 ... 1.651992 0.476358 0.441226 \n",
"99996 1.232123 0.139006 0.684821 ... -0.207111 0.549576 2.155070 \n",
"99997 0.403931 -0.946770 -0.127743 ... 0.245094 -0.188509 -0.425818 \n",
"99998 -1.430341 -0.922263 1.865314 ... -0.027662 0.511552 -0.559474 \n",
"99999 -0.396106 -0.368758 -1.187883 ... -1.746631 0.410091 -0.249438 \n",
"\n",
" 93 94 95 96 97 98 99 \n",
"0 0.345061 -2.146100 -0.738093 -0.159594 0.520909 1.133554 0.585371 \n",
"1 2.394670 -2.258890 1.229371 1.750581 0.139859 0.636832 0.385954 \n",
"2 -0.468801 -0.605885 1.311313 0.339141 -0.211478 -0.717916 1.467792 \n",
"3 -0.410241 -1.296799 -1.308874 0.467894 2.335768 0.298389 -0.855592 \n",
"4 0.153511 -0.450651 0.877156 0.892300 1.151173 -0.230626 0.605697 \n",
"... ... ... ... ... ... ... ... \n",
"99995 0.704206 -0.270060 1.247129 1.247639 -0.143032 0.044815 -0.631406 \n",
"99996 0.780223 0.281913 0.053108 0.433405 1.200421 1.526589 -0.269244 \n",
"99997 0.425148 -1.733082 0.366996 0.278778 -0.457803 -1.976963 1.287677 \n",
"99998 -2.757027 -1.092821 0.437376 -1.051840 1.263932 0.042789 0.152446 \n",
"99999 0.042342 -0.597671 0.467246 1.492825 2.039514 -1.024533 2.571223 \n",
"\n",
"[100000 rows x 100 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df2._data.arrays)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arithmetic with a scalar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that this operation preserves the (non-)consolidation of the dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res1 = df1 + 1\n",
"len(res1._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res2 = df2 + 1\n",
"len(res2._data.arrays)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Timing:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"28.3 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1 + 1"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2 + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arithmetic with other dataframes"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len((df1 + df1)._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len((df2 + df2)._data.arrays)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30.8 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1 + df1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"33.5 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2 + df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now check the performance difference with mixed dtypes (since the non-consolidated block could potentially be faster here, since the blocks don't need to be splitted to align for the operation). A few different cases:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# make a second frame where half of it are integer columns\n",
"df1_b = df1.copy()\n",
"df1_b.iloc[:, 50] = df1_b.iloc[:, 50].astype(int)\n",
"df1_b.iloc[:, 51:] = df1_b.iloc[:, 51:].astype(int)\n",
"df1_b._consolidate_inplace()\n",
"len(df1_b._data.blocks)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"32.7 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1 + df1_b"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2_b = df2.copy()\n",
"df2_b.iloc[:, 50] = df2_b.iloc[:, 50].astype(int)\n",
"df2_b.iloc[:, 51:] = df2_b.iloc[:, 51:].astype(int)\n",
"df2_b._consolidate_inplace()\n",
"len(df2_b._data.arrays)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30.1 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2 + df2_b"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"res1 = df1 + df1_b\n",
"res2 = df2 + df2_b\n",
"pd.testing.assert_frame_equal(res1, res2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Second example with and integer dataframe more some random columns as floats:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"arr2 = np.random.randint(0, 1000, (100_000, 100))\n",
"dfint1 = pd.DataFrame(arr2, manager=\"block\") # the current default\n",
"dfint1 = dfint1.copy()\n",
"dfint2 = pd.DataFrame(arr2, manager=\"array\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"dfint1_a = dfint1.copy()\n",
"dfint1_a.iloc[0, [10, 15, 26]] = np.nan\n",
"\n",
"dfint1_b = dfint1.copy()\n",
"dfint1_b.iloc[0, [13, 52]] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"53.9 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit dfint1_a + dfint1_b"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"dfint2_a = dfint2.copy()\n",
"dfint2_a.iloc[0, [10, 15, 26]] = np.nan\n",
"\n",
"dfint2_b = dfint2.copy()\n",
"dfint2_b.iloc[0, [13, 52]] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"34.5 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit dfint2_a + dfint2_b"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, column-wise is significantly (not within error range) faster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Column-wise reduction"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"67.5 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1.sum()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"24.1 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df2.sum()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"pd.testing.assert_series_equal(df1.sum(), df2.sum())"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"59.4 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1_orig.sum() # the dataframe with the original array layout memory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fact that column-wise is faster here (and a lot) is a bit strange (also the fact that `df1_orig.sum()` is faster is strange, that might indicate a performance bug in the current implementation). Based on a quick profile with `%snakeviz`, it seems that the `_get_values` function in `nanops.py` is taking a larger portion for the block case (so something to investigate if we keep the block consolidation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Row-wise reduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we look at a row-wise reduction instead, the block-wise dataframe is of course faster:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"73.3 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1.sum(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"189 ms ± 18.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit df2.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, if we wanted, this is probably also something that could be optimized, if we think that is worth it (it's certainly a much less common operation as column-wise reductions). \n",
"\n",
"A very crude appraoch for this specific case of sum gives the same result but is much faster:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9.81 ms ± 810 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit \n",
"arrays = df2._iter_column_arrays()\n",
"res = next(arrays)\n",
"\n",
"for a in arrays:\n",
" res = res + a\n",
"\n",
"res = pd.Series(res, index=df2.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's ensure it gives the same result:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"arrays = df2._iter_column_arrays()\n",
"res = next(arrays)\n",
"\n",
"for a in arrays:\n",
" res = res + a\n",
"\n",
"res = pd.Series(res, index=df2.index)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"pd.testing.assert_series_equal(df2.sum(axis=1), res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Slicing rows"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16.9 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
]
}
],
"source": [
"%timeit df1.iloc[0:1000]"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"33.5 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
"source": [
"%timeit df2.iloc[0:1000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here column-wise is a slower (as expected, since it needs to slice multiple arrays)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding a column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am including the `_consolidate_inplace` here, as this is something that would otherwise often happen in a next operation:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"246 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit df = pd.DataFrame(arr, manager=\"block\")\n",
"df[100] = 1.0\n",
"df._consolidate_inplace()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"68.6 µs ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
"source": [
"%%timeit df = pd.DataFrame(arr, manager=\"array\")\n",
"df[100] = 1.0\n",
"df._consolidate_inplace()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When including the consolidation, it is of course much cheaper to add a column for a non-consolidated dataframe. \n",
"It's not necessarily fully fair comparison, as you typically only add a column once / a few times, and may do many operations on the dataframe afterwards (so the consolidation might pay off), but it's nevertheless a use case that would improve a lot with non-consolidated dataframes."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"124 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
"source": [
"%%timeit df = pd.DataFrame(arr, manager=\"block\")\n",
"df[100] = 1.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (dev)",
"language": "python",
"name": "dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment