drorata · August 11, 2022 06:29
diff --git a/benchmarking_columns_operation.ipynb b/benchmarking_columns_operation.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I recently ran the following experiment.\n",
    "The reason was my need to perform operations on two columns.\n",
    "Think for example in terms of features engineering; you want to produce an new feature (i.e. column) which is the ratio of the some other two columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(np.random.random(size=(10000,2)), columns=['foo', 'bar'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>foo</th>\n",
       "      <th>bar</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.065648</td>\n",
       "      <td>0.593402</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.592051</td>\n",
       "      <td>0.123502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.621030</td>\n",
       "      <td>0.629470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.210630</td>\n",
       "      <td>0.462535</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.263708</td>\n",
       "      <td>0.807304</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        foo       bar\n",
       "0  0.065648  0.593402\n",
       "1  0.592051  0.123502\n",
       "2  0.621030  0.629470\n",
       "3  0.210630  0.462535\n",
       "4  0.263708  0.807304"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Iterating over the rows\n",
    "\n",
    "Here we iterate over the rows, and access the indexes of the `pd.Series` which represents the row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "518 ms ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "pd.Series([\n",
    "    x[1]['foo'] * x[1]['bar'] for x in df.iterrows()\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applying a function\n",
    "\n",
    "Here we use `pd.DataFrame.apply` and provide the axis.\n",
    "This is actually, the solution I had to use in a case which I now longer remember its details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "299 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "df.apply(lambda x: x['foo'] * x['bar'], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Columns operations\n",
    "\n",
    "Best and fastest approach; at least for this simple case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "97.2 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "df.foo.mul(df.bar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "Obviously, for the simple multiplication, the last option is the most pythonic (and the fastest as well).\n",
    "But, still due to an edge case I decided to run this test.\n",
    "Maybe someone would find it helpful."
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"I recently ran the following experiment.\n",
	"The reason was my need to perform operations on two columns.\n",
	"Think for example in terms of features engineering; you want to produce an new feature (i.e. column) which is the ratio of the some other two columns."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import pandas as pd\n",
	"import numpy as np"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"df = pd.DataFrame(np.random.random(size=(10000,2)), columns=['foo', 'bar'])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style>\n",
	" .dataframe thead tr:only-child th {\n",
	" text-align: right;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: left;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>foo</th>\n",
	" <th>bar</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>0.065648</td>\n",
	" <td>0.593402</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>0.592051</td>\n",
	" <td>0.123502</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>0.621030</td>\n",
	" <td>0.629470</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>0.210630</td>\n",
	" <td>0.462535</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>0.263708</td>\n",
	" <td>0.807304</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" foo bar\n",
	"0 0.065648 0.593402\n",
	"1 0.592051 0.123502\n",
	"2 0.621030 0.629470\n",
	"3 0.210630 0.462535\n",
	"4 0.263708 0.807304"
	]
	},
	"execution_count": 3,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"df.head()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Iterating over the rows\n",
	"\n",
	"Here we iterate over the rows, and access the indexes of the `pd.Series` which represents the row."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"518 ms ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%%timeit\n",
	"pd.Series([\n",
	" x[1]['foo'] * x[1]['bar'] for x in df.iterrows()\n",
	"])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Applying a function\n",
	"\n",
	"Here we use `pd.DataFrame.apply` and provide the axis.\n",
	"This is actually, the solution I had to use in a case which I now longer remember its details."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"299 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	]
	}
	],
	"source": [
	"%%timeit\n",
	"df.apply(lambda x: x['foo'] * x['bar'], axis=1)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Columns operations\n",
	"\n",
	"Best and fastest approach; at least for this simple case."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"97.2 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
	]
	}
	],
	"source": [
	"%%timeit\n",
	"df.foo.mul(df.bar)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Summary\n",
	"\n",
	"Obviously, for the simple multiplication, the last option is the most pythonic (and the fastest as well).\n",
	"But, still due to an edge case I decided to run this test.\n",
	"Maybe someone would find it helpful."
	]
	}
	],
	"metadata": {
	"hide_input": false,
	"kernelspec": {
	"display_name": "Python [conda root]",
	"language": "python",
	"name": "conda-root-py"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.1"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}