-
-
Save ChronoJon/cee8de9e8e34c37d58d288ff0f97e6ab to your computer and use it in GitHub Desktop.
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Exploring assignments with multiindex column dataframes and subset assignment" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead tr th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th colspan=\"2\" halign=\"left\">a</th>\n", | |
" <th>b</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th>A</th>\n", | |
" <th>B</th>\n", | |
" <th>C</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>a-1</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>b-2</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>c</td>\n", | |
" <td>5.0</td>\n", | |
" <td>6.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" a b\n", | |
" A B C\n", | |
"0 a-1 1.0 2.0\n", | |
"1 b-2 3.0 4.0\n", | |
"2 c 5.0 6.0\n", | |
"3 1 NaN NaN" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import pandas as pd\n", | |
"from IPython.display import display\n", | |
"\n", | |
"# example adapted from https://pandas.pydata.org/docs/reference/api/pandas.melt.html\n", | |
"df = pd.DataFrame(\n", | |
" {\n", | |
" (\"a\", \"A\"): {0: \"a-1\", 1: \"b-2\", 2: \"c\", 3: 1},\n", | |
" (\"a\", \"B\"): {0: 1, 1: 3, 2: 5},\n", | |
" (\"b\", \"C\"): {0: 2, 1: 4, 2: 6},\n", | |
" }\n", | |
")\n", | |
"df\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Assignment with multiindex-columns\n", | |
"\n", | |
"You can't properly assign to multiindex columns using the `.assign` method. \n", | |
"A lot of people seem to shy away from multiindex columns but they have their uses." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead tr th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th colspan=\"2\" halign=\"left\">a</th>\n", | |
" <th>b</th>\n", | |
" <th>D</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th>A</th>\n", | |
" <th>B</th>\n", | |
" <th>C</th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>a-1</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>b-2</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>c</td>\n", | |
" <td>5.0</td>\n", | |
" <td>6.0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" a b D\n", | |
" A B C \n", | |
"0 a-1 1.0 2.0 5\n", | |
"1 b-2 3.0 4.0 5\n", | |
"2 c 5.0 6.0 5\n", | |
"3 1 NaN NaN 5" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"MultiIndex([('a', 'A'),\n", | |
" ('a', 'B'),\n", | |
" ('b', 'C'),\n", | |
" ('D', '')],\n", | |
" )" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"df_new = df.assign(D=5)\n", | |
"\n", | |
"display(\n", | |
" df_new, \n", | |
" df_new.columns\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"ename": "TypeError", | |
"evalue": "keywords must be strings", | |
"output_type": "error", | |
"traceback": [ | |
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", | |
"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", | |
"\u001b[1;32md:\\dev\\testing\\pandas_stuff\\testing_assignments.ipynb Cell 4'\u001b[0m in \u001b[0;36m<cell line: 1>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> <a href='vscode-notebook-cell:/d%3A/dev/testing/pandas_stuff/testing_assignments.ipynb#ch0000003?line=0'>1</a>\u001b[0m df_new2 \u001b[39m=\u001b[39m df\u001b[39m.\u001b[39massign(\u001b[39m*\u001b[39m\u001b[39m*\u001b[39m{(\u001b[39m\"\u001b[39m\u001b[39mb\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mD\u001b[39m\u001b[39m\"\u001b[39m):\u001b[39m5\u001b[39m})\n", | |
"\u001b[1;31mTypeError\u001b[0m: keywords must be strings" | |
] | |
} | |
], | |
"source": [ | |
"df_new2 = df.assign(**{(\"b\", \"D\"):5})" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Proper assignment with multiindex columns is not possible. You have to resort do direct mutation with \n", | |
"`.loc` or similar." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead tr th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th colspan=\"2\" halign=\"left\">a</th>\n", | |
" <th colspan=\"2\" halign=\"left\">b</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th>A</th>\n", | |
" <th>B</th>\n", | |
" <th>C</th>\n", | |
" <th>D</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>a-1</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>b-2</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>c</td>\n", | |
" <td>5.0</td>\n", | |
" <td>6.0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" a b \n", | |
" A B C D\n", | |
"0 a-1 1.0 2.0 5\n", | |
"1 b-2 3.0 4.0 5\n", | |
"2 c 5.0 6.0 5\n", | |
"3 1 NaN NaN 5" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"MultiIndex([('a', 'A'),\n", | |
" ('a', 'B'),\n", | |
" ('b', 'C'),\n", | |
" ('b', 'D')],\n", | |
" )" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"df_new2 = df.copy()\n", | |
"df_new2[(\"b\", \"D\")] = 5\n", | |
"display(\n", | |
" df_new2, \n", | |
" df_new2.columns\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Assignment to subset using `.mask`\n", | |
"\n", | |
"I typically use the `.mask` method for assignment to dataframe subsets in pandas call chains. \n", | |
"I usually assign a default value and calculate different values for different subsets.\n", | |
"\n", | |
"But it doesn't work, if the calculation would produce an error on part of data that doesn't belong to the subset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Before\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>A</th>\n", | |
" <th>B</th>\n", | |
" <th>C</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>a-1</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>b-2</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>c</td>\n", | |
" <td>5.0</td>\n", | |
" <td>6.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" A B C\n", | |
"0 a-1 1.0 2.0\n", | |
"1 b-2 3.0 4.0\n", | |
"2 c 5.0 6.0\n", | |
"3 1 NaN NaN" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" ...\n", | |
" Long cryptic error message\n", | |
" ...\n", | |
" \n" | |
] | |
} | |
], | |
"source": [ | |
"df_mask = df.droplevel(axis=1, level=0)\n", | |
"print(\"Before\")\n", | |
"display(df_mask)\n", | |
"try:\n", | |
" display(\n", | |
" df_mask\n", | |
" .assign(subassign=-1)\n", | |
" .mask(\n", | |
" lambda df: df.B < 5,\n", | |
" lambda df: df\n", | |
" .assign(\n", | |
" subassign=lambda df: df.A.str.split(\"-\").str[-1].astype(int)\n", | |
" )\n", | |
" )\n", | |
" )\n", | |
"except ValueError:\n", | |
" print(\"\"\"\\\n", | |
" ...\n", | |
" Long cryptic error message\n", | |
" ...\n", | |
" \"\"\")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>A</th>\n", | |
" <th>B</th>\n", | |
" <th>C</th>\n", | |
" <th>subassign</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>a-1</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>b-2</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.0</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>c</td>\n", | |
" <td>5.0</td>\n", | |
" <td>6.0</td>\n", | |
" <td>-1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>-1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" A B C subassign\n", | |
"0 a-1 1.0 2.0 1\n", | |
"1 b-2 3.0 4.0 2\n", | |
"2 c 5.0 6.0 -1\n", | |
"3 1 NaN NaN -1" | |
] | |
}, | |
"execution_count": 35, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"def subassignment(df: pd.DataFrame) -> pd.DataFrame:\n", | |
" df = df.copy().assign(subassign=-1)\n", | |
" mask = df.B < 5\n", | |
"\n", | |
" df.loc[mask, \"subassign\"] = df.loc[mask, \"A\"].str.split(\"-\").str[-1].astype(int)\n", | |
" return df\n", | |
"\n", | |
"df_mask.pipe(subassignment)" | |
] | |
} | |
], | |
"metadata": { | |
"interpreter": { | |
"hash": "57538d13acb4f8c4f0f08404b3c2a92e5e461901282c6f6756900542d0cf9832" | |
}, | |
"kernelspec": { | |
"display_name": "Python 3.9.10 ('.venv': venv)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.10" | |
}, | |
"orig_nbformat": 4 | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
This is how I would do it. I have resorted to only using .where
(or np.where
or np.select
):
(df
.droplevel(axis='columns', level=0)
.assign(subassign=-1)
.assign(subassign=lambda df_: df_.subassign.where(
(df_.B >= 5) | (df_.B.isna()),
df_.A.str.split('-').str[-1]).astype(int))
)
WRT .assign
on hierarchical columns. You are correct, it doesn't appear that .assign
can create the inner column name. I guess I've never ran into this as I try to flatten columns ASAP. 🤷♀️
This is how I would do it. I have resorted to only using
.where
(ornp.where
ornp.select
):(df .droplevel(axis='columns', level=0) .assign(subassign=-1) .assign(subassign=lambda df_: df_.subassign.where( (df_.B >= 5) | (df_.B.isna()), df_.A.str.split('-').str[-1] ) .astype(int) ) )
Ok, but you just have postponed the .astype
operation after the .mask
method call is finished. mask
and where
are just the same method with inverse logic.
This would be equivalent:
df_mask
.assign(subassign=-1)
.mask(
lambda df: df.B < 5,
lambda df: df.assign(subassign=lambda df_: df_.A.str.split("-").str[-1])
)
.astype(dict(subassign=int))
This was just an example of a problem that occurs, when using chain operations, where you want to change a subset of the data with a transformation, that would result in an error on data not part of that subset.
Furthermore, the transformation is called on the whole dataframe, even if you change a relatively small part of it and most the transformation would be thrown away. This is especially wasteful with any kind of string operation as showcased here (because you leave numpy land and are working in the python domain).
Direct mutation is clearly superior here:
- No unnecessary calculations are performed
- You don't have to change the operation because it throws an error in unrelated parts of the dataframe. Thus you only have to think about the parts you want to change.
WRT
.assign
on hierarchical columns. You are correct, it doesn't appear that.assign
can create the inner column name. I guess I've never ran into this as I try to flatten columns ASAP. 🤷♀️
I really don't understand this sentiment, but it is the not first time, I've read it. Hierarchical columns can be useful for grouping related data. Otherwise you would have to use multiple dataframes and SQL
like association dataframes or resort to ugly filter calls to select these groups.
In my view, the only problem with it is, that it's not well supported in panda's functional API. One could provide something like an .assign_map
method (analogous to str.format_map
) with pandas_flavor or similar.
My sentiment is that I've (sample size 1, but consulted with Pandas, used Pandas for years, and taught Pandas to thousands) never had a need for this. I'm not saying it might not happen. But perhaps that is why support is lacking... 🤷♀️
Also, with the
.mask
method the calculations are performed for parts of the dataframe, which don't end up being used. If the calls are expensive, mask is very wasteful. You then have to write your own pipe, where you do proper assignments or just do it outside of a call chain.