Created
August 31, 2020 14:16
-
-
Save jorisvandenbossche/f917d4301d21069e2be2e3b7c7aa4d07 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# POC 1D ArrayManager vs consolidated BlockManager" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "See https://github.com/pandas-dev/pandas/issues/10556 and mailing list discussion at https://mail.python.org/pipermail/pandas-dev/2020-May/001219.html for context. \n", | |
| "\n", | |
| "This notebook is using a branch of pandas that adds an experimental \"array manager\".\n", | |
| "\n", | |
| "**Note** This notebook includes some benchmarks. However, the results of the timings are highly dependent on the characteristics of the data that is being used. I am now using a DataFrame with 100 columns and a 1000:1 rows/column ratio, which I think is a quite typical scenario. But when tweaking this (more or less columns, different ratio), this easily changes the results / interpretation of the benchmarks." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import pandas as pd" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "pd.options.compute.use_numexpr = False\n", | |
| "pd.options.compute.use_bottleneck = False" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%load_ext snakeviz" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%load_ext line_profiler" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "arr = np.random.randn(100_000, 100)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "df1_orig = pd.DataFrame(arr, manager=\"block\") # the current default\n", | |
| "# additional copy to ensure the memory is column-aligned\n", | |
| "df1 = df1_orig.copy()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>0</th>\n", | |
| " <th>1</th>\n", | |
| " <th>2</th>\n", | |
| " <th>3</th>\n", | |
| " <th>4</th>\n", | |
| " <th>5</th>\n", | |
| " <th>6</th>\n", | |
| " <th>7</th>\n", | |
| " <th>8</th>\n", | |
| " <th>9</th>\n", | |
| " <th>...</th>\n", | |
| " <th>90</th>\n", | |
| " <th>91</th>\n", | |
| " <th>92</th>\n", | |
| " <th>93</th>\n", | |
| " <th>94</th>\n", | |
| " <th>95</th>\n", | |
| " <th>96</th>\n", | |
| " <th>97</th>\n", | |
| " <th>98</th>\n", | |
| " <th>99</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>0</th>\n", | |
| " <td>-0.677230</td>\n", | |
| " <td>0.682754</td>\n", | |
| " <td>-0.315458</td>\n", | |
| " <td>1.432242</td>\n", | |
| " <td>2.391720</td>\n", | |
| " <td>0.644951</td>\n", | |
| " <td>1.791661</td>\n", | |
| " <td>0.491711</td>\n", | |
| " <td>-0.358679</td>\n", | |
| " <td>-1.394864</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.407030</td>\n", | |
| " <td>-0.713274</td>\n", | |
| " <td>1.718921</td>\n", | |
| " <td>0.345061</td>\n", | |
| " <td>-2.146100</td>\n", | |
| " <td>-0.738093</td>\n", | |
| " <td>-0.159594</td>\n", | |
| " <td>0.520909</td>\n", | |
| " <td>1.133554</td>\n", | |
| " <td>0.585371</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1</th>\n", | |
| " <td>-0.559448</td>\n", | |
| " <td>-1.280697</td>\n", | |
| " <td>0.402636</td>\n", | |
| " <td>-0.583523</td>\n", | |
| " <td>-0.266790</td>\n", | |
| " <td>0.418998</td>\n", | |
| " <td>0.994489</td>\n", | |
| " <td>-0.976886</td>\n", | |
| " <td>-1.696476</td>\n", | |
| " <td>0.926231</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.855396</td>\n", | |
| " <td>-2.048210</td>\n", | |
| " <td>-1.265783</td>\n", | |
| " <td>2.394670</td>\n", | |
| " <td>-2.258890</td>\n", | |
| " <td>1.229371</td>\n", | |
| " <td>1.750581</td>\n", | |
| " <td>0.139859</td>\n", | |
| " <td>0.636832</td>\n", | |
| " <td>0.385954</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>2</th>\n", | |
| " <td>-0.104786</td>\n", | |
| " <td>-0.235949</td>\n", | |
| " <td>0.001328</td>\n", | |
| " <td>0.771150</td>\n", | |
| " <td>-0.003778</td>\n", | |
| " <td>-1.039505</td>\n", | |
| " <td>0.879596</td>\n", | |
| " <td>-0.625585</td>\n", | |
| " <td>-0.979214</td>\n", | |
| " <td>0.022012</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.994192</td>\n", | |
| " <td>0.661030</td>\n", | |
| " <td>-0.010726</td>\n", | |
| " <td>-0.468801</td>\n", | |
| " <td>-0.605885</td>\n", | |
| " <td>1.311313</td>\n", | |
| " <td>0.339141</td>\n", | |
| " <td>-0.211478</td>\n", | |
| " <td>-0.717916</td>\n", | |
| " <td>1.467792</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>3</th>\n", | |
| " <td>-0.062078</td>\n", | |
| " <td>0.095454</td>\n", | |
| " <td>-0.557974</td>\n", | |
| " <td>0.701680</td>\n", | |
| " <td>0.193023</td>\n", | |
| " <td>-1.002278</td>\n", | |
| " <td>-0.223521</td>\n", | |
| " <td>-0.926172</td>\n", | |
| " <td>-0.494398</td>\n", | |
| " <td>-0.363959</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.020915</td>\n", | |
| " <td>-0.800234</td>\n", | |
| " <td>-0.047158</td>\n", | |
| " <td>-0.410241</td>\n", | |
| " <td>-1.296799</td>\n", | |
| " <td>-1.308874</td>\n", | |
| " <td>0.467894</td>\n", | |
| " <td>2.335768</td>\n", | |
| " <td>0.298389</td>\n", | |
| " <td>-0.855592</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>4</th>\n", | |
| " <td>1.504247</td>\n", | |
| " <td>0.312447</td>\n", | |
| " <td>-0.698603</td>\n", | |
| " <td>-0.616552</td>\n", | |
| " <td>1.530180</td>\n", | |
| " <td>-0.647969</td>\n", | |
| " <td>1.494239</td>\n", | |
| " <td>-0.680018</td>\n", | |
| " <td>-0.205344</td>\n", | |
| " <td>0.757280</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.468555</td>\n", | |
| " <td>1.204949</td>\n", | |
| " <td>1.374184</td>\n", | |
| " <td>0.153511</td>\n", | |
| " <td>-0.450651</td>\n", | |
| " <td>0.877156</td>\n", | |
| " <td>0.892300</td>\n", | |
| " <td>1.151173</td>\n", | |
| " <td>-0.230626</td>\n", | |
| " <td>0.605697</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>...</th>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99995</th>\n", | |
| " <td>0.539733</td>\n", | |
| " <td>2.411698</td>\n", | |
| " <td>-0.455420</td>\n", | |
| " <td>0.538498</td>\n", | |
| " <td>-0.956580</td>\n", | |
| " <td>-0.384449</td>\n", | |
| " <td>-1.338329</td>\n", | |
| " <td>1.211452</td>\n", | |
| " <td>-0.462413</td>\n", | |
| " <td>-1.330185</td>\n", | |
| " <td>...</td>\n", | |
| " <td>1.651992</td>\n", | |
| " <td>0.476358</td>\n", | |
| " <td>0.441226</td>\n", | |
| " <td>0.704206</td>\n", | |
| " <td>-0.270060</td>\n", | |
| " <td>1.247129</td>\n", | |
| " <td>1.247639</td>\n", | |
| " <td>-0.143032</td>\n", | |
| " <td>0.044815</td>\n", | |
| " <td>-0.631406</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99996</th>\n", | |
| " <td>0.104250</td>\n", | |
| " <td>1.415562</td>\n", | |
| " <td>0.139757</td>\n", | |
| " <td>-0.621568</td>\n", | |
| " <td>-0.249787</td>\n", | |
| " <td>0.358591</td>\n", | |
| " <td>0.238794</td>\n", | |
| " <td>1.232123</td>\n", | |
| " <td>0.139006</td>\n", | |
| " <td>0.684821</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.207111</td>\n", | |
| " <td>0.549576</td>\n", | |
| " <td>2.155070</td>\n", | |
| " <td>0.780223</td>\n", | |
| " <td>0.281913</td>\n", | |
| " <td>0.053108</td>\n", | |
| " <td>0.433405</td>\n", | |
| " <td>1.200421</td>\n", | |
| " <td>1.526589</td>\n", | |
| " <td>-0.269244</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99997</th>\n", | |
| " <td>1.014513</td>\n", | |
| " <td>0.349818</td>\n", | |
| " <td>-0.482681</td>\n", | |
| " <td>0.130529</td>\n", | |
| " <td>-0.084422</td>\n", | |
| " <td>1.623366</td>\n", | |
| " <td>-1.069759</td>\n", | |
| " <td>0.403931</td>\n", | |
| " <td>-0.946770</td>\n", | |
| " <td>-0.127743</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.245094</td>\n", | |
| " <td>-0.188509</td>\n", | |
| " <td>-0.425818</td>\n", | |
| " <td>0.425148</td>\n", | |
| " <td>-1.733082</td>\n", | |
| " <td>0.366996</td>\n", | |
| " <td>0.278778</td>\n", | |
| " <td>-0.457803</td>\n", | |
| " <td>-1.976963</td>\n", | |
| " <td>1.287677</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99998</th>\n", | |
| " <td>1.280127</td>\n", | |
| " <td>2.488734</td>\n", | |
| " <td>-1.073581</td>\n", | |
| " <td>-0.636677</td>\n", | |
| " <td>-0.909728</td>\n", | |
| " <td>-1.016732</td>\n", | |
| " <td>1.018966</td>\n", | |
| " <td>-1.430341</td>\n", | |
| " <td>-0.922263</td>\n", | |
| " <td>1.865314</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.027662</td>\n", | |
| " <td>0.511552</td>\n", | |
| " <td>-0.559474</td>\n", | |
| " <td>-2.757027</td>\n", | |
| " <td>-1.092821</td>\n", | |
| " <td>0.437376</td>\n", | |
| " <td>-1.051840</td>\n", | |
| " <td>1.263932</td>\n", | |
| " <td>0.042789</td>\n", | |
| " <td>0.152446</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99999</th>\n", | |
| " <td>0.782564</td>\n", | |
| " <td>-0.388404</td>\n", | |
| " <td>-0.247960</td>\n", | |
| " <td>1.381064</td>\n", | |
| " <td>-1.605046</td>\n", | |
| " <td>0.974790</td>\n", | |
| " <td>-1.178229</td>\n", | |
| " <td>-0.396106</td>\n", | |
| " <td>-0.368758</td>\n", | |
| " <td>-1.187883</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-1.746631</td>\n", | |
| " <td>0.410091</td>\n", | |
| " <td>-0.249438</td>\n", | |
| " <td>0.042342</td>\n", | |
| " <td>-0.597671</td>\n", | |
| " <td>0.467246</td>\n", | |
| " <td>1.492825</td>\n", | |
| " <td>2.039514</td>\n", | |
| " <td>-1.024533</td>\n", | |
| " <td>2.571223</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "<p>100000 rows × 100 columns</p>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " 0 1 2 3 4 5 6 \\\n", | |
| "0 -0.677230 0.682754 -0.315458 1.432242 2.391720 0.644951 1.791661 \n", | |
| "1 -0.559448 -1.280697 0.402636 -0.583523 -0.266790 0.418998 0.994489 \n", | |
| "2 -0.104786 -0.235949 0.001328 0.771150 -0.003778 -1.039505 0.879596 \n", | |
| "3 -0.062078 0.095454 -0.557974 0.701680 0.193023 -1.002278 -0.223521 \n", | |
| "4 1.504247 0.312447 -0.698603 -0.616552 1.530180 -0.647969 1.494239 \n", | |
| "... ... ... ... ... ... ... ... \n", | |
| "99995 0.539733 2.411698 -0.455420 0.538498 -0.956580 -0.384449 -1.338329 \n", | |
| "99996 0.104250 1.415562 0.139757 -0.621568 -0.249787 0.358591 0.238794 \n", | |
| "99997 1.014513 0.349818 -0.482681 0.130529 -0.084422 1.623366 -1.069759 \n", | |
| "99998 1.280127 2.488734 -1.073581 -0.636677 -0.909728 -1.016732 1.018966 \n", | |
| "99999 0.782564 -0.388404 -0.247960 1.381064 -1.605046 0.974790 -1.178229 \n", | |
| "\n", | |
| " 7 8 9 ... 90 91 92 \\\n", | |
| "0 0.491711 -0.358679 -1.394864 ... 0.407030 -0.713274 1.718921 \n", | |
| "1 -0.976886 -1.696476 0.926231 ... -0.855396 -2.048210 -1.265783 \n", | |
| "2 -0.625585 -0.979214 0.022012 ... -0.994192 0.661030 -0.010726 \n", | |
| "3 -0.926172 -0.494398 -0.363959 ... 0.020915 -0.800234 -0.047158 \n", | |
| "4 -0.680018 -0.205344 0.757280 ... 0.468555 1.204949 1.374184 \n", | |
| "... ... ... ... ... ... ... ... \n", | |
| "99995 1.211452 -0.462413 -1.330185 ... 1.651992 0.476358 0.441226 \n", | |
| "99996 1.232123 0.139006 0.684821 ... -0.207111 0.549576 2.155070 \n", | |
| "99997 0.403931 -0.946770 -0.127743 ... 0.245094 -0.188509 -0.425818 \n", | |
| "99998 -1.430341 -0.922263 1.865314 ... -0.027662 0.511552 -0.559474 \n", | |
| "99999 -0.396106 -0.368758 -1.187883 ... -1.746631 0.410091 -0.249438 \n", | |
| "\n", | |
| " 93 94 95 96 97 98 99 \n", | |
| "0 0.345061 -2.146100 -0.738093 -0.159594 0.520909 1.133554 0.585371 \n", | |
| "1 2.394670 -2.258890 1.229371 1.750581 0.139859 0.636832 0.385954 \n", | |
| "2 -0.468801 -0.605885 1.311313 0.339141 -0.211478 -0.717916 1.467792 \n", | |
| "3 -0.410241 -1.296799 -1.308874 0.467894 2.335768 0.298389 -0.855592 \n", | |
| "4 0.153511 -0.450651 0.877156 0.892300 1.151173 -0.230626 0.605697 \n", | |
| "... ... ... ... ... ... ... ... \n", | |
| "99995 0.704206 -0.270060 1.247129 1.247639 -0.143032 0.044815 -0.631406 \n", | |
| "99996 0.780223 0.281913 0.053108 0.433405 1.200421 1.526589 -0.269244 \n", | |
| "99997 0.425148 -1.733082 0.366996 0.278778 -0.457803 -1.976963 1.287677 \n", | |
| "99998 -2.757027 -1.092821 0.437376 -1.051840 1.263932 0.042789 0.152446 \n", | |
| "99999 0.042342 -0.597671 0.467246 1.492825 2.039514 -1.024533 2.571223 \n", | |
| "\n", | |
| "[100000 rows x 100 columns]" | |
| ] | |
| }, | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "df1" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "1" | |
| ] | |
| }, | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "len(df1._data.blocks)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "df2 = pd.DataFrame(arr, manager=\"array\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>0</th>\n", | |
| " <th>1</th>\n", | |
| " <th>2</th>\n", | |
| " <th>3</th>\n", | |
| " <th>4</th>\n", | |
| " <th>5</th>\n", | |
| " <th>6</th>\n", | |
| " <th>7</th>\n", | |
| " <th>8</th>\n", | |
| " <th>9</th>\n", | |
| " <th>...</th>\n", | |
| " <th>90</th>\n", | |
| " <th>91</th>\n", | |
| " <th>92</th>\n", | |
| " <th>93</th>\n", | |
| " <th>94</th>\n", | |
| " <th>95</th>\n", | |
| " <th>96</th>\n", | |
| " <th>97</th>\n", | |
| " <th>98</th>\n", | |
| " <th>99</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>0</th>\n", | |
| " <td>-0.677230</td>\n", | |
| " <td>0.682754</td>\n", | |
| " <td>-0.315458</td>\n", | |
| " <td>1.432242</td>\n", | |
| " <td>2.391720</td>\n", | |
| " <td>0.644951</td>\n", | |
| " <td>1.791661</td>\n", | |
| " <td>0.491711</td>\n", | |
| " <td>-0.358679</td>\n", | |
| " <td>-1.394864</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.407030</td>\n", | |
| " <td>-0.713274</td>\n", | |
| " <td>1.718921</td>\n", | |
| " <td>0.345061</td>\n", | |
| " <td>-2.146100</td>\n", | |
| " <td>-0.738093</td>\n", | |
| " <td>-0.159594</td>\n", | |
| " <td>0.520909</td>\n", | |
| " <td>1.133554</td>\n", | |
| " <td>0.585371</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1</th>\n", | |
| " <td>-0.559448</td>\n", | |
| " <td>-1.280697</td>\n", | |
| " <td>0.402636</td>\n", | |
| " <td>-0.583523</td>\n", | |
| " <td>-0.266790</td>\n", | |
| " <td>0.418998</td>\n", | |
| " <td>0.994489</td>\n", | |
| " <td>-0.976886</td>\n", | |
| " <td>-1.696476</td>\n", | |
| " <td>0.926231</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.855396</td>\n", | |
| " <td>-2.048210</td>\n", | |
| " <td>-1.265783</td>\n", | |
| " <td>2.394670</td>\n", | |
| " <td>-2.258890</td>\n", | |
| " <td>1.229371</td>\n", | |
| " <td>1.750581</td>\n", | |
| " <td>0.139859</td>\n", | |
| " <td>0.636832</td>\n", | |
| " <td>0.385954</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>2</th>\n", | |
| " <td>-0.104786</td>\n", | |
| " <td>-0.235949</td>\n", | |
| " <td>0.001328</td>\n", | |
| " <td>0.771150</td>\n", | |
| " <td>-0.003778</td>\n", | |
| " <td>-1.039505</td>\n", | |
| " <td>0.879596</td>\n", | |
| " <td>-0.625585</td>\n", | |
| " <td>-0.979214</td>\n", | |
| " <td>0.022012</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.994192</td>\n", | |
| " <td>0.661030</td>\n", | |
| " <td>-0.010726</td>\n", | |
| " <td>-0.468801</td>\n", | |
| " <td>-0.605885</td>\n", | |
| " <td>1.311313</td>\n", | |
| " <td>0.339141</td>\n", | |
| " <td>-0.211478</td>\n", | |
| " <td>-0.717916</td>\n", | |
| " <td>1.467792</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>3</th>\n", | |
| " <td>-0.062078</td>\n", | |
| " <td>0.095454</td>\n", | |
| " <td>-0.557974</td>\n", | |
| " <td>0.701680</td>\n", | |
| " <td>0.193023</td>\n", | |
| " <td>-1.002278</td>\n", | |
| " <td>-0.223521</td>\n", | |
| " <td>-0.926172</td>\n", | |
| " <td>-0.494398</td>\n", | |
| " <td>-0.363959</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.020915</td>\n", | |
| " <td>-0.800234</td>\n", | |
| " <td>-0.047158</td>\n", | |
| " <td>-0.410241</td>\n", | |
| " <td>-1.296799</td>\n", | |
| " <td>-1.308874</td>\n", | |
| " <td>0.467894</td>\n", | |
| " <td>2.335768</td>\n", | |
| " <td>0.298389</td>\n", | |
| " <td>-0.855592</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>4</th>\n", | |
| " <td>1.504247</td>\n", | |
| " <td>0.312447</td>\n", | |
| " <td>-0.698603</td>\n", | |
| " <td>-0.616552</td>\n", | |
| " <td>1.530180</td>\n", | |
| " <td>-0.647969</td>\n", | |
| " <td>1.494239</td>\n", | |
| " <td>-0.680018</td>\n", | |
| " <td>-0.205344</td>\n", | |
| " <td>0.757280</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.468555</td>\n", | |
| " <td>1.204949</td>\n", | |
| " <td>1.374184</td>\n", | |
| " <td>0.153511</td>\n", | |
| " <td>-0.450651</td>\n", | |
| " <td>0.877156</td>\n", | |
| " <td>0.892300</td>\n", | |
| " <td>1.151173</td>\n", | |
| " <td>-0.230626</td>\n", | |
| " <td>0.605697</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>...</th>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " <td>...</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99995</th>\n", | |
| " <td>0.539733</td>\n", | |
| " <td>2.411698</td>\n", | |
| " <td>-0.455420</td>\n", | |
| " <td>0.538498</td>\n", | |
| " <td>-0.956580</td>\n", | |
| " <td>-0.384449</td>\n", | |
| " <td>-1.338329</td>\n", | |
| " <td>1.211452</td>\n", | |
| " <td>-0.462413</td>\n", | |
| " <td>-1.330185</td>\n", | |
| " <td>...</td>\n", | |
| " <td>1.651992</td>\n", | |
| " <td>0.476358</td>\n", | |
| " <td>0.441226</td>\n", | |
| " <td>0.704206</td>\n", | |
| " <td>-0.270060</td>\n", | |
| " <td>1.247129</td>\n", | |
| " <td>1.247639</td>\n", | |
| " <td>-0.143032</td>\n", | |
| " <td>0.044815</td>\n", | |
| " <td>-0.631406</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99996</th>\n", | |
| " <td>0.104250</td>\n", | |
| " <td>1.415562</td>\n", | |
| " <td>0.139757</td>\n", | |
| " <td>-0.621568</td>\n", | |
| " <td>-0.249787</td>\n", | |
| " <td>0.358591</td>\n", | |
| " <td>0.238794</td>\n", | |
| " <td>1.232123</td>\n", | |
| " <td>0.139006</td>\n", | |
| " <td>0.684821</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.207111</td>\n", | |
| " <td>0.549576</td>\n", | |
| " <td>2.155070</td>\n", | |
| " <td>0.780223</td>\n", | |
| " <td>0.281913</td>\n", | |
| " <td>0.053108</td>\n", | |
| " <td>0.433405</td>\n", | |
| " <td>1.200421</td>\n", | |
| " <td>1.526589</td>\n", | |
| " <td>-0.269244</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99997</th>\n", | |
| " <td>1.014513</td>\n", | |
| " <td>0.349818</td>\n", | |
| " <td>-0.482681</td>\n", | |
| " <td>0.130529</td>\n", | |
| " <td>-0.084422</td>\n", | |
| " <td>1.623366</td>\n", | |
| " <td>-1.069759</td>\n", | |
| " <td>0.403931</td>\n", | |
| " <td>-0.946770</td>\n", | |
| " <td>-0.127743</td>\n", | |
| " <td>...</td>\n", | |
| " <td>0.245094</td>\n", | |
| " <td>-0.188509</td>\n", | |
| " <td>-0.425818</td>\n", | |
| " <td>0.425148</td>\n", | |
| " <td>-1.733082</td>\n", | |
| " <td>0.366996</td>\n", | |
| " <td>0.278778</td>\n", | |
| " <td>-0.457803</td>\n", | |
| " <td>-1.976963</td>\n", | |
| " <td>1.287677</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99998</th>\n", | |
| " <td>1.280127</td>\n", | |
| " <td>2.488734</td>\n", | |
| " <td>-1.073581</td>\n", | |
| " <td>-0.636677</td>\n", | |
| " <td>-0.909728</td>\n", | |
| " <td>-1.016732</td>\n", | |
| " <td>1.018966</td>\n", | |
| " <td>-1.430341</td>\n", | |
| " <td>-0.922263</td>\n", | |
| " <td>1.865314</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-0.027662</td>\n", | |
| " <td>0.511552</td>\n", | |
| " <td>-0.559474</td>\n", | |
| " <td>-2.757027</td>\n", | |
| " <td>-1.092821</td>\n", | |
| " <td>0.437376</td>\n", | |
| " <td>-1.051840</td>\n", | |
| " <td>1.263932</td>\n", | |
| " <td>0.042789</td>\n", | |
| " <td>0.152446</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>99999</th>\n", | |
| " <td>0.782564</td>\n", | |
| " <td>-0.388404</td>\n", | |
| " <td>-0.247960</td>\n", | |
| " <td>1.381064</td>\n", | |
| " <td>-1.605046</td>\n", | |
| " <td>0.974790</td>\n", | |
| " <td>-1.178229</td>\n", | |
| " <td>-0.396106</td>\n", | |
| " <td>-0.368758</td>\n", | |
| " <td>-1.187883</td>\n", | |
| " <td>...</td>\n", | |
| " <td>-1.746631</td>\n", | |
| " <td>0.410091</td>\n", | |
| " <td>-0.249438</td>\n", | |
| " <td>0.042342</td>\n", | |
| " <td>-0.597671</td>\n", | |
| " <td>0.467246</td>\n", | |
| " <td>1.492825</td>\n", | |
| " <td>2.039514</td>\n", | |
| " <td>-1.024533</td>\n", | |
| " <td>2.571223</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "<p>100000 rows × 100 columns</p>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " 0 1 2 3 4 5 6 \\\n", | |
| "0 -0.677230 0.682754 -0.315458 1.432242 2.391720 0.644951 1.791661 \n", | |
| "1 -0.559448 -1.280697 0.402636 -0.583523 -0.266790 0.418998 0.994489 \n", | |
| "2 -0.104786 -0.235949 0.001328 0.771150 -0.003778 -1.039505 0.879596 \n", | |
| "3 -0.062078 0.095454 -0.557974 0.701680 0.193023 -1.002278 -0.223521 \n", | |
| "4 1.504247 0.312447 -0.698603 -0.616552 1.530180 -0.647969 1.494239 \n", | |
| "... ... ... ... ... ... ... ... \n", | |
| "99995 0.539733 2.411698 -0.455420 0.538498 -0.956580 -0.384449 -1.338329 \n", | |
| "99996 0.104250 1.415562 0.139757 -0.621568 -0.249787 0.358591 0.238794 \n", | |
| "99997 1.014513 0.349818 -0.482681 0.130529 -0.084422 1.623366 -1.069759 \n", | |
| "99998 1.280127 2.488734 -1.073581 -0.636677 -0.909728 -1.016732 1.018966 \n", | |
| "99999 0.782564 -0.388404 -0.247960 1.381064 -1.605046 0.974790 -1.178229 \n", | |
| "\n", | |
| " 7 8 9 ... 90 91 92 \\\n", | |
| "0 0.491711 -0.358679 -1.394864 ... 0.407030 -0.713274 1.718921 \n", | |
| "1 -0.976886 -1.696476 0.926231 ... -0.855396 -2.048210 -1.265783 \n", | |
| "2 -0.625585 -0.979214 0.022012 ... -0.994192 0.661030 -0.010726 \n", | |
| "3 -0.926172 -0.494398 -0.363959 ... 0.020915 -0.800234 -0.047158 \n", | |
| "4 -0.680018 -0.205344 0.757280 ... 0.468555 1.204949 1.374184 \n", | |
| "... ... ... ... ... ... ... ... \n", | |
| "99995 1.211452 -0.462413 -1.330185 ... 1.651992 0.476358 0.441226 \n", | |
| "99996 1.232123 0.139006 0.684821 ... -0.207111 0.549576 2.155070 \n", | |
| "99997 0.403931 -0.946770 -0.127743 ... 0.245094 -0.188509 -0.425818 \n", | |
| "99998 -1.430341 -0.922263 1.865314 ... -0.027662 0.511552 -0.559474 \n", | |
| "99999 -0.396106 -0.368758 -1.187883 ... -1.746631 0.410091 -0.249438 \n", | |
| "\n", | |
| " 93 94 95 96 97 98 99 \n", | |
| "0 0.345061 -2.146100 -0.738093 -0.159594 0.520909 1.133554 0.585371 \n", | |
| "1 2.394670 -2.258890 1.229371 1.750581 0.139859 0.636832 0.385954 \n", | |
| "2 -0.468801 -0.605885 1.311313 0.339141 -0.211478 -0.717916 1.467792 \n", | |
| "3 -0.410241 -1.296799 -1.308874 0.467894 2.335768 0.298389 -0.855592 \n", | |
| "4 0.153511 -0.450651 0.877156 0.892300 1.151173 -0.230626 0.605697 \n", | |
| "... ... ... ... ... ... ... ... \n", | |
| "99995 0.704206 -0.270060 1.247129 1.247639 -0.143032 0.044815 -0.631406 \n", | |
| "99996 0.780223 0.281913 0.053108 0.433405 1.200421 1.526589 -0.269244 \n", | |
| "99997 0.425148 -1.733082 0.366996 0.278778 -0.457803 -1.976963 1.287677 \n", | |
| "99998 -2.757027 -1.092821 0.437376 -1.051840 1.263932 0.042789 0.152446 \n", | |
| "99999 0.042342 -0.597671 0.467246 1.492825 2.039514 -1.024533 2.571223 \n", | |
| "\n", | |
| "[100000 rows x 100 columns]" | |
| ] | |
| }, | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "df2" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "100" | |
| ] | |
| }, | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "len(df2._data.arrays)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Arithmetic with a scalar" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Check that this operation preserves the (non-)consolidation of the dataframe:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "1" | |
| ] | |
| }, | |
| "execution_count": 12, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "res1 = df1 + 1\n", | |
| "len(res1._data.blocks)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "100" | |
| ] | |
| }, | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "res2 = df2 + 1\n", | |
| "len(res2._data.arrays)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Timing:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 14, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "28.3 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df1 + 1" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 15, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "30.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df2 + 1" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Arithmetic with other dataframes" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 16, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "1" | |
| ] | |
| }, | |
| "execution_count": 16, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "len((df1 + df1)._data.blocks)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "100" | |
| ] | |
| }, | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "len((df2 + df2)._data.arrays)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 18, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "30.8 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df1 + df1" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 19, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "33.5 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df2 + df2" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "We now check the performance difference with mixed dtypes (since the non-consolidated block could potentially be faster here, since the blocks don't need to be splitted to align for the operation). A few different cases:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 20, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "2" | |
| ] | |
| }, | |
| "execution_count": 20, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "# make a second frame where half of it are integer columns\n", | |
| "df1_b = df1.copy()\n", | |
| "df1_b.iloc[:, 50] = df1_b.iloc[:, 50].astype(int)\n", | |
| "df1_b.iloc[:, 51:] = df1_b.iloc[:, 51:].astype(int)\n", | |
| "df1_b._consolidate_inplace()\n", | |
| "len(df1_b._data.blocks)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 21, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "32.7 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df1 + df1_b" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 22, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "100" | |
| ] | |
| }, | |
| "execution_count": 22, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "df2_b = df2.copy()\n", | |
| "df2_b.iloc[:, 50] = df2_b.iloc[:, 50].astype(int)\n", | |
| "df2_b.iloc[:, 51:] = df2_b.iloc[:, 51:].astype(int)\n", | |
| "df2_b._consolidate_inplace()\n", | |
| "len(df2_b._data.arrays)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 23, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "30.1 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df2 + df2_b" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 24, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "res1 = df1 + df1_b\n", | |
| "res2 = df2 + df2_b\n", | |
| "pd.testing.assert_frame_equal(res1, res2)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Second example with and integer dataframe more some random columns as floats:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 25, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "arr2 = np.random.randint(0, 1000, (100_000, 100))\n", | |
| "dfint1 = pd.DataFrame(arr2, manager=\"block\") # the current default\n", | |
| "dfint1 = dfint1.copy()\n", | |
| "dfint2 = pd.DataFrame(arr2, manager=\"array\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 26, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "dfint1_a = dfint1.copy()\n", | |
| "dfint1_a.iloc[0, [10, 15, 26]] = np.nan\n", | |
| "\n", | |
| "dfint1_b = dfint1.copy()\n", | |
| "dfint1_b.iloc[0, [13, 52]] = np.nan" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 27, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "53.9 ms ± 5.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit dfint1_a + dfint1_b" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 28, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "dfint2_a = dfint2.copy()\n", | |
| "dfint2_a.iloc[0, [10, 15, 26]] = np.nan\n", | |
| "\n", | |
| "dfint2_b = dfint2.copy()\n", | |
| "dfint2_b.iloc[0, [13, 52]] = np.nan" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 29, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "34.5 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit dfint2_a + dfint2_b" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "In this case, column-wise is significantly (not within error range) faster." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Column-wise reduction" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 30, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "67.5 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df1.sum()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 31, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "24.1 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df2.sum()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 32, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "pd.testing.assert_series_equal(df1.sum(), df2.sum())" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 33, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "59.4 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df1_orig.sum() # the dataframe with the original array layout memory" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The fact that column-wise is faster here (and a lot) is a bit strange (also the fact that `df1_orig.sum()` is faster is strange, that might indicate a performance bug in the current implementation). Based on a quick profile with `%snakeviz`, it seems that the `_get_values` function in `nanops.py` is taking a larger portion for the block case (so something to investigate if we keep the block consolidation)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Row-wise reduction" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "If we look at a row-wise reduction instead, the block-wise dataframe is of course faster:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 34, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "73.3 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df1.sum(axis=1)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 35, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "189 ms ± 18.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df2.sum(axis=1)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "However, if we wanted, this is probably also something that could be optimized, if we think that is worth it (it's certainly a much less common operation as column-wise reductions). \n", | |
| "\n", | |
| "A very crude appraoch for this specific case of sum gives the same result but is much faster:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 36, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "9.81 ms ± 810 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%%timeit \n", | |
| "arrays = df2._iter_column_arrays()\n", | |
| "res = next(arrays)\n", | |
| "\n", | |
| "for a in arrays:\n", | |
| " res = res + a\n", | |
| "\n", | |
| "res = pd.Series(res, index=df2.index)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Let's ensure it gives the same result:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 37, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "arrays = df2._iter_column_arrays()\n", | |
| "res = next(arrays)\n", | |
| "\n", | |
| "for a in arrays:\n", | |
| " res = res + a\n", | |
| "\n", | |
| "res = pd.Series(res, index=df2.index)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 38, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "pd.testing.assert_series_equal(df2.sum(axis=1), res)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Slicing rows" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 39, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "16.9 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df1.iloc[0:1000]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 40, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "33.5 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%timeit df2.iloc[0:1000]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Here column-wise is a slower (as expected, since it needs to slice multiple arrays)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Adding a column" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "I am including the `_consolidate_inplace` here, as this is something that would otherwise often happen in a next operation:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 41, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "246 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%%timeit df = pd.DataFrame(arr, manager=\"block\")\n", | |
| "df[100] = 1.0\n", | |
| "df._consolidate_inplace()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 42, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "68.6 µs ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%%timeit df = pd.DataFrame(arr, manager=\"array\")\n", | |
| "df[100] = 1.0\n", | |
| "df._consolidate_inplace()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "When including the consolidation, it is of course much cheaper to add a column for a non-consolidated dataframe. \n", | |
| "It's not necessarily fully fair comparison, as you typically only add a column once / a few times, and may do many operations on the dataframe afterwards (so the consolidation might pay off), but it's nevertheless a use case that would improve a lot with non-consolidated dataframes." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 43, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "124 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "%%timeit df = pd.DataFrame(arr, manager=\"block\")\n", | |
| "df[100] = 1.0" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python (dev)", | |
| "language": "python", | |
| "name": "dev" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.7.6" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 4 | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment