{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring the performance of consolidated vs non-consolidated DataFrames for a `df + 1` operation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am using here the same branch as shown in https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c (and discussed in https://github.com/pandas-dev/pandas/issues/10556) that enables a choice between consolidated and non-consolidated blocks.\n",
"\n",
"I am testing this with a wide DataFrame of 5000 rows and 5000 columns, all floats (no heterogeneous data types):"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.DataFrame(1, index=range(5000), columns=range(5000), dtype=float, policy=\"block\").copy()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(1, index=range(5000), columns=range(5000), dtype=float, policy=\"split\")"
]
},
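{
"cell_type": "markdown",
"metadata": {},
"source": [
"(As a quick sanity check, and assuming the branch keeps the standard `BlockManager` internals, the difference should be visible in the number of blocks: `len(df1._data.blocks)` should give 1, while `len(df2._data.blocks)` should give 5000.)"
]
},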
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's time the `df + 1` operation for both dataframes:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30.8 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit df1 + 1"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"294 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit df2 + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the non-consolidated DataFrame, we see a 10x slowdown, which is of course not good.\n",
"\n",
"Looking at a profile (using `%snakeviz`), it becomes immediately clear that several expensive steps are repeated for every column, while they could in principle perfectly well be done once for the full DataFrame:\n",
"\n",
"- we set the numpy error state (`with np.errstate(all=\"ignore\"):`) again when doing the operation for each column, while this could be done once before performing all the column-wise ops\n",
"- we check the right-hand side (the scalar 1) for every column, while this again only needs to be done once\n",
"- we check whether `numexpr` can be used for each column, and since this is based on the number of elements in the array, this could again be checked once and then passed as a parameter to the underlying array operation (see the sketch below)\n",
"\n",
"I quickly disabled those checks (for an actual change we would of course need to do this once for the full DataFrame, but for the quick benchmark, this should be a good indication, as now we are calling all those checks 5000 times, so doing those 0 times (as in the benchmark below) or 1 time shouldn't matter too much performance-wise.\n", | |
"\n", | |
"The diff that I applied before running the code cells below:\n", | |
"<details>\n", | |
"\n", | |
"```diff\n", | |
"diff --git a/pandas/core/computation/expressions.py b/pandas/core/computation/expressions.py\n", | |
"index 0e9077e6d..af6103226 100644\n", | |
"--- a/pandas/core/computation/expressions.py\n", | |
"+++ b/pandas/core/computation/expressions.py\n", | |
"@@ -64,8 +64,8 @@ def _evaluate_standard(op, op_str, a, b):\n", | |
" \"\"\"\n", | |
" if _TEST_MODE:\n", | |
" _store_test_result(False)\n", | |
"- with np.errstate(all=\"ignore\"):\n", | |
"- return op(a, b)\n", | |
"+ # with np.errstate(all=\"ignore\"):\n", | |
"+ return op(a, b)\n", | |
" \n", | |
" \n", | |
" def _can_use_numexpr(op, op_str, a, b, dtype_check):\n", | |
"@@ -97,7 +97,7 @@ def _can_use_numexpr(op, op_str, a, b, dtype_check):\n", | |
" def _evaluate_numexpr(op, op_str, a, b):\n", | |
" result = None\n", | |
" \n", | |
"- if _can_use_numexpr(op, op_str, a, b, \"evaluate\"):\n", | |
"+ if False: # _can_use_numexpr(op, op_str, a, b, \"evaluate\"):\n", | |
" is_reversed = op.__name__.strip(\"_\").startswith(\"r\")\n", | |
" if is_reversed:\n", | |
" # we were originally called by a reversed op method\n", | |
"@@ -225,7 +225,7 @@ def evaluate(op, a, b, use_numexpr: bool = True):\n", | |
" \"\"\"\n", | |
" op_str = _op_str_mapping[op]\n", | |
" if op_str is not None:\n", | |
"- use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)\n", | |
"+ # use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)\n", | |
" if use_numexpr:\n", | |
" return _evaluate(op, op_str, a, b) # type: ignore\n", | |
" return _evaluate_standard(op, op_str, a, b)\n", | |
"diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py\n", | |
"index 23afc7b85..ee8bcb23a 100644\n", | |
"--- a/pandas/core/internals/blocks.py\n", | |
"+++ b/pandas/core/internals/blocks.py\n", | |
"@@ -336,8 +336,8 @@ class Block(PandasObject):\n", | |
" apply the function to my values; return a block if we are not\n", | |
" one\n", | |
" \"\"\"\n", | |
"- with np.errstate(all=\"ignore\"):\n", | |
"- result = func(self.values, **kwargs)\n", | |
"+ # with np.errstate(all=\"ignore\"):\n", | |
"+ result = func(self.values, **kwargs)\n", | |
" \n", | |
" return self._split_op_result(result)\n", | |
" \n", | |
"diff --git a/pandas/core/ops/array_ops.py b/pandas/core/ops/array_ops.py\n", | |
"index 3379ee56b..a21ea90cd 100644\n", | |
"--- a/pandas/core/ops/array_ops.py\n", | |
"+++ b/pandas/core/ops/array_ops.py\n", | |
"@@ -29,6 +29,7 @@ from pandas.core.dtypes.common import (\n", | |
" from pandas.core.dtypes.generic import ABCExtensionArray, ABCIndex, ABCSeries\n", | |
" from pandas.core.dtypes.missing import isna, notna\n", | |
" \n", | |
"+import pandas.core.computation.expressions as expressions\n", | |
" from pandas.core.ops import missing\n", | |
" from pandas.core.ops.dispatch import should_extension_dispatch\n", | |
" from pandas.core.ops.invalid import invalid_comparison\n", | |
"@@ -136,8 +137,6 @@ def na_arithmetic_op(left, right, op, is_cmp: bool = False):\n", | |
" ------\n", | |
" TypeError : invalid operation\n", | |
" \"\"\"\n", | |
"- import pandas.core.computation.expressions as expressions\n", | |
"-\n", | |
" try:\n", | |
" result = expressions.evaluate(op, left, right)\n", | |
" except TypeError:\n", | |
"@@ -176,17 +175,19 @@ def arithmetic_op(left: ArrayLike, right: Any, op):\n", | |
" \n", | |
" # NB: We assume that extract_array has already been called\n", | |
" # on `left` and `right`.\n", | |
"- lvalues = maybe_upcast_datetimelike_array(left)\n", | |
"- rvalues = maybe_upcast_datetimelike_array(right)\n", | |
"- rvalues = maybe_upcast_for_op(rvalues, lvalues.shape)\n", | |
"+ # lvalues = maybe_upcast_datetimelike_array(left)\n", | |
"+ # rvalues = maybe_upcast_datetimelike_array(right)\n", | |
"+ # rvalues = maybe_upcast_for_op(rvalues, lvalues.shape)\n", | |
"+ lvalues = left\n", | |
"+ rvalues = right\n", | |
" \n", | |
"- if should_extension_dispatch(lvalues, rvalues) or isinstance(rvalues, Timedelta):\n", | |
"- # Timedelta is included because numexpr will fail on it, see GH#31457\n", | |
"- res_values = op(lvalues, rvalues)\n", | |
"+ # if should_extension_dispatch(lvalues, rvalues) or isinstance(rvalues, Timedelta):\n", | |
"+ # # Timedelta is included because numexpr will fail on it, see GH#31457\n", | |
"+ # res_values = op(lvalues, rvalues)\n", | |
" \n", | |
"- else:\n", | |
"- with np.errstate(all=\"ignore\"):\n", | |
"- res_values = na_arithmetic_op(lvalues, rvalues, op)\n", | |
"+ # else:\n", | |
"+ # # with np.errstate(all=\"ignore\"):\n", | |
"+ res_values = na_arithmetic_op(lvalues, rvalues, op)\n", | |
" \n", | |
" return res_values\n", | |
"```\n", | |
"\n", | |
"</details>\n", | |
"\n", | |
"\n", | |
"\n", | |
"So with this small change, let's try again:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"129 ms ± 3.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2 + 1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So with just those changes, the runtime was cut in more then half. Of course, it' still 4-5x slower as the block-wise operation. \n", | |
"\n", | |
"Let's look at a profile to check what could be done further:" | |
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" "
]
},
{
"data": {
"text/plain": [
"         5250973 function calls (5250933 primitive calls) in 2.141 seconds\n",
"\n",
"   Ordered by: cumulative time\n",
"   List reduced from 79 to 20 due to restriction <20>\n",
"\n",
"   ncalls  tottime  percall  cumtime  percall filename:lineno(function)\n",
"        1    0.000    0.000    2.141    2.141 {built-in method builtins.exec}\n",
"        1    0.049    0.049    2.141    2.141 <string>:3(<module>)\n",
"       10    0.000    0.000    2.092    0.209 __init__.py:615(f)\n",
"       10    0.000    0.000    2.092    0.209 __init__.py:242(dispatch_to_series)\n",
"       10    0.051    0.005    2.091    0.209 managers.py:380(apply)\n",
"    50000    0.042    0.000    1.951    0.000 blocks.py:334(apply)\n",
"    50000    0.040    0.000    1.239    0.000 blocks.py:345(_split_op_result)\n",
"    50000    0.034    0.000    1.108    0.000 blocks.py:236(make_block)\n",
"    50000    0.062    0.000    1.067    0.000 blocks.py:2713(make_block)\n",
"    50000    0.105    0.000    0.847    0.000 blocks.py:2667(get_block_type)\n",
"    50000    0.017    0.000    0.670    0.000 array_ops.py:158(arithmetic_op)\n",
"    50000    0.040    0.000    0.653    0.000 array_ops.py:119(na_arithmetic_op)\n",
"    50000    0.022    0.000    0.593    0.000 expressions.py:214(evaluate)\n",
"    50000    0.019    0.000    0.571    0.000 expressions.py:97(_evaluate_numexpr)\n",
"    50000    0.015    0.000    0.552    0.000 expressions.py:61(_evaluate_standard)\n",
"    50000    0.537    0.000    0.537    0.000 {built-in method _operator.add}\n",
"  1200170    0.228    0.000    0.397    0.000 {built-in method builtins.isinstance}\n",
"   200000    0.086    0.000    0.380    0.000 base.py:256(is_dtype)\n",
"   650070    0.125    0.000    0.170    0.000 generic.py:10(_check)\n",
"   100000    0.060    0.000    0.160    0.000 common.py:1462(is_extension_array_dtype)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%prun -s cumtime -l 20\n",
"# repeat 10 times to get more stable results\n",
"for _ in range(10):\n",
"    df2 + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see here that a large part of the remaining time is spent in `blocks.py` (`_split_op_result`, `make_block`, `get_block_type`). This take a lot of time now because the consolidated Block could potentially be split up by the operation, could have changed of Block type, etc. But with all 1D blocks without different block types, all this would not be needed. \n", | |
"\n", | |
"So again as a way to benchmark it (it's not robust with the actual code, although it works for this `df2 + 1` operation, but it should indicative of what could be done in a simplified internals), I am applying a small patch:\n", | |
"\n", | |
"```diff\n", | |
"diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py\n", | |
"index ee8bcb23a..08435769a 100644\n", | |
"--- a/pandas/core/internals/blocks.py\n", | |
"+++ b/pandas/core/internals/blocks.py\n", | |
"@@ -339,7 +339,8 @@ class Block(PandasObject):\n", | |
" # with np.errstate(all=\"ignore\"):\n", | |
" result = func(self.values, **kwargs)\n", | |
" \n", | |
"- return self._split_op_result(result)\n", | |
"+ return self.make_block_same_class(result)\n", | |
" \n", | |
" def _split_op_result(self, result) -> List[\"Block\"]:\n", | |
"```\n", | |
"\n", | |
"With that patch, let's time again:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"79 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit df2 + 1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So reduced with another third. Now it is still 2-3x times slower as the block-wise operation. Let's take another look at a profile:" | |
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" "
]
},
{
"data": {
"text/plain": [
"         13009703 function calls (13009303 primitive calls) in 10.783 seconds\n",
"\n",
"   Ordered by: cumulative time\n",
"   List reduced from 63 to 20 due to restriction <20>\n",
"\n",
"   ncalls  tottime  percall  cumtime  percall filename:lineno(function)\n",
"        1    0.000    0.000   10.783   10.783 {built-in method builtins.exec}\n",
"        1    0.526    0.526   10.783   10.783 <string>:3(<module>)\n",
"      100    0.001    0.000   10.257    0.103 __init__.py:615(f)\n",
"      100    0.003    0.000   10.250    0.103 __init__.py:242(dispatch_to_series)\n",
"      100    0.532    0.005   10.245    0.102 managers.py:380(apply)\n",
"   500000    0.448    0.000    8.820    0.000 blocks.py:334(apply)\n",
"   500000    0.159    0.000    6.377    0.000 array_ops.py:158(arithmetic_op)\n",
"   500000    0.407    0.000    6.218    0.000 array_ops.py:119(na_arithmetic_op)\n",
"   500000    0.213    0.000    5.618    0.000 expressions.py:214(evaluate)\n",
"   500000    0.195    0.000    5.405    0.000 expressions.py:97(_evaluate_numexpr)\n",
"   500000    0.129    0.000    5.210    0.000 expressions.py:61(_evaluate_standard)\n",
"   500000    5.081    0.000    5.081    0.000 {built-in method _operator.add}\n",
"   500000    0.747    0.000    1.995    0.000 blocks.py:248(make_block_same_class)\n",
"   500000    0.584    0.000    1.170    0.000 blocks.py:112(__init__)\n",
"   500000    0.268    0.000    0.447    0.000 blocks.py:2740(_extend_blocks)\n",
"      100    0.000    0.000    0.400    0.004 managers.py:152(from_blocks)\n",
"      100    0.083    0.001    0.400    0.004 managers.py:123(__init__)\n",
"   500000    0.177    0.000    0.239    0.000 blocks.py:229(mgr_locs)\n",
"      100    0.134    0.001    0.221    0.002 managers.py:1892(_split_blocks)\n",
" 1501100/1500700    0.218    0.000    0.218    0.000 {built-in method builtins.len}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%prun -s cumtime -l 20\n",
"# repeat 100 times to get more stable results\n",
"for _ in range(100):\n",
"    df2 + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that the `make_block_same_class` takes ca 20% of the time, and when not having the different (non)-consolidated block classes and complicated \"block placement\", we should also be able to almost fully eliminate. \n", | |
"\n", | |
"With the above, we already get closer to the block-wise performance (starting from a 10x slowdown). And this is only with some changes that should be relatively straightforward to implement within a simplified BlockManager in pure Python. We should be able to look for further optimizations.\n", | |
"\n", | |
"And also note that things rapidly change with eg only taking 10x the number of rows, where now the non-consolidated frame becomes as fast as the consolidated one:" | |
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.DataFrame(1, index=range(50000), columns=range(5000), dtype=float, policy=\"block\").copy()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(1, index=range(50000), columns=range(5000), dtype=float, policy=\"split\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's time the `df + 1` operation for both dataframes:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"694 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit df1 + 1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"485 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit df2 + 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (dev)",
"language": "python",
"name": "dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
} |