Skip to content

Instantly share code, notes, and snippets.

@ChronoJon
Created April 12, 2022 22:58
Show Gist options
  • Save ChronoJon/cee8de9e8e34c37d58d288ff0f97e6ab to your computer and use it in GitHub Desktop.
Save ChronoJon/cee8de9e8e34c37d58d288ff0f97e6ab to your computer and use it in GitHub Desktop.
Exploring assignments with multiindex column dataframes and subset assignment
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring assignments with multiindex column dataframes and subset assignment"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"2\" halign=\"left\">a</th>\n",
" <th>b</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a-1</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b-2</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>5.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b\n",
" A B C\n",
"0 a-1 1.0 2.0\n",
"1 b-2 3.0 4.0\n",
"2 c 5.0 6.0\n",
"3 1 NaN NaN"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from IPython.display import display\n",
"\n",
"# example adapted from https://pandas.pydata.org/docs/reference/api/pandas.melt.html\n",
"df = pd.DataFrame(\n",
" {\n",
" (\"a\", \"A\"): {0: \"a-1\", 1: \"b-2\", 2: \"c\", 3: 1},\n",
" (\"a\", \"B\"): {0: 1, 1: 3, 2: 5},\n",
" (\"b\", \"C\"): {0: 2, 1: 4, 2: 6},\n",
" }\n",
")\n",
"df\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assignment with multiindex-columns\n",
"\n",
"You can't properly assign to multiindex columns using the `.assign` method. \n",
"A lot of people seem to shy away from multiindex columns but they have their uses."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"2\" halign=\"left\">a</th>\n",
" <th>b</th>\n",
" <th>D</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a-1</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b-2</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>5.0</td>\n",
" <td>6.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b D\n",
" A B C \n",
"0 a-1 1.0 2.0 5\n",
"1 b-2 3.0 4.0 5\n",
"2 c 5.0 6.0 5\n",
"3 1 NaN NaN 5"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"MultiIndex([('a', 'A'),\n",
" ('a', 'B'),\n",
" ('b', 'C'),\n",
" ('D', '')],\n",
" )"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_new = df.assign(D=5)\n",
"\n",
"display(\n",
" df_new, \n",
" df_new.columns\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "keywords must be strings",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32md:\\dev\\testing\\pandas_stuff\\testing_assignments.ipynb Cell 4'\u001b[0m in \u001b[0;36m<cell line: 1>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> <a href='vscode-notebook-cell:/d%3A/dev/testing/pandas_stuff/testing_assignments.ipynb#ch0000003?line=0'>1</a>\u001b[0m df_new2 \u001b[39m=\u001b[39m df\u001b[39m.\u001b[39massign(\u001b[39m*\u001b[39m\u001b[39m*\u001b[39m{(\u001b[39m\"\u001b[39m\u001b[39mb\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mD\u001b[39m\u001b[39m\"\u001b[39m):\u001b[39m5\u001b[39m})\n",
"\u001b[1;31mTypeError\u001b[0m: keywords must be strings"
]
}
],
"source": [
"df_new2 = df.assign(**{(\"b\", \"D\"):5})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Proper assignment with multiindex columns is not possible. You have to resort do direct mutation with \n",
"`.loc` or similar."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"2\" halign=\"left\">a</th>\n",
" <th colspan=\"2\" halign=\"left\">b</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a-1</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b-2</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>5.0</td>\n",
" <td>6.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b \n",
" A B C D\n",
"0 a-1 1.0 2.0 5\n",
"1 b-2 3.0 4.0 5\n",
"2 c 5.0 6.0 5\n",
"3 1 NaN NaN 5"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"MultiIndex([('a', 'A'),\n",
" ('a', 'B'),\n",
" ('b', 'C'),\n",
" ('b', 'D')],\n",
" )"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_new2 = df.copy()\n",
"df_new2[(\"b\", \"D\")] = 5\n",
"display(\n",
" df_new2, \n",
" df_new2.columns\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assignment to subset using `.mask`\n",
"\n",
"I typically use the `.mask` method for assignment to dataframe subsets in pandas call chains. \n",
"I usually assign a default value and calculate different values for different subsets.\n",
"\n",
"But it doesn't work, if the calculation would produce an error on part of data that doesn't belong to the subset."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Before\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a-1</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b-2</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>5.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C\n",
"0 a-1 1.0 2.0\n",
"1 b-2 3.0 4.0\n",
"2 c 5.0 6.0\n",
"3 1 NaN NaN"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" ...\n",
" Long cryptic error message\n",
" ...\n",
" \n"
]
}
],
"source": [
"df_mask = df.droplevel(axis=1, level=0)\n",
"print(\"Before\")\n",
"display(df_mask)\n",
"try:\n",
" display(\n",
" df_mask\n",
" .assign(subassign=-1)\n",
" .mask(\n",
" lambda df: df.B < 5,\n",
" lambda df: df\n",
" .assign(\n",
" subassign=lambda df: df.A.str.split(\"-\").str[-1].astype(int)\n",
" )\n",
" )\n",
" )\n",
"except ValueError:\n",
" print(\"\"\"\\\n",
" ...\n",
" Long cryptic error message\n",
" ...\n",
" \"\"\")\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>subassign</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a-1</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b-2</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>5.0</td>\n",
" <td>6.0</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C subassign\n",
"0 a-1 1.0 2.0 1\n",
"1 b-2 3.0 4.0 2\n",
"2 c 5.0 6.0 -1\n",
"3 1 NaN NaN -1"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def subassignment(df: pd.DataFrame) -> pd.DataFrame:\n",
" df = df.copy().assign(subassign=-1)\n",
" mask = df.B < 5\n",
"\n",
" df.loc[mask, \"subassign\"] = df.loc[mask, \"A\"].str.split(\"-\").str[-1].astype(int)\n",
" return df\n",
"\n",
"df_mask.pipe(subassignment)"
]
}
],
"metadata": {
"interpreter": {
"hash": "57538d13acb4f8c4f0f08404b3c2a92e5e461901282c6f6756900542d0cf9832"
},
"kernelspec": {
"display_name": "Python 3.9.10 ('.venv': venv)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
@ChronoJon
Copy link
Author

Also, with the .mask method the calculations are performed for parts of the dataframe, which don't end up being used. If the calls are expensive, mask is very wasteful. You then have to write your own pipe, where you do proper assignments or just do it outside of a call chain.

@mattharrison
Copy link

This is how I would do it. I have resorted to only using .where (or np.where or np.select):

(df
 .droplevel(axis='columns', level=0)
 .assign(subassign=-1)
 .assign(subassign=lambda df_: df_.subassign.where(
     (df_.B >= 5) | (df_.B.isna()), 
     df_.A.str.split('-').str[-1]).astype(int))
)

@mattharrison
Copy link

WRT .assign on hierarchical columns. You are correct, it doesn't appear that .assign can create the inner column name. I guess I've never ran into this as I try to flatten columns ASAP. 🤷‍♀️

@ChronoJon
Copy link
Author

This is how I would do it. I have resorted to only using .where (or np.where or np.select):

(df
 .droplevel(axis='columns', level=0)
 .assign(subassign=-1)
 .assign(subassign=lambda df_: df_.subassign.where(
     (df_.B >= 5) | (df_.B.isna()), 
     df_.A.str.split('-').str[-1]
   )
   .astype(int)
 )
)

Ok, but you just have postponed the .astype operation after the .mask method call is finished. mask and where are just the same method with inverse logic.
This would be equivalent:

df_mask
    .assign(subassign=-1)
    .mask(
        lambda df: df.B < 5,
        lambda df: df.assign(subassign=lambda df_: df_.A.str.split("-").str[-1])
    )
    .astype(dict(subassign=int))

This was just an example of a problem that occurs, when using chain operations, where you want to change a subset of the data with a transformation, that would result in an error on data not part of that subset.

Furthermore, the transformation is called on the whole dataframe, even if you change a relatively small part of it and most the transformation would be thrown away. This is especially wasteful with any kind of string operation as showcased here (because you leave numpy land and are working in the python domain).
Direct mutation is clearly superior here:

  1. No unnecessary calculations are performed
  2. You don't have to change the operation because it throws an error in unrelated parts of the dataframe. Thus you only have to think about the parts you want to change.

@ChronoJon
Copy link
Author

WRT .assign on hierarchical columns. You are correct, it doesn't appear that .assign can create the inner column name. I guess I've never ran into this as I try to flatten columns ASAP. 🤷‍♀️

I really don't understand this sentiment, but it is the not first time, I've read it. Hierarchical columns can be useful for grouping related data. Otherwise you would have to use multiple dataframes and SQL like association dataframes or resort to ugly filter calls to select these groups.

In my view, the only problem with it is, that it's not well supported in panda's functional API. One could provide something like an .assign_map method (analogous to str.format_map) with pandas_flavor or similar.

@mattharrison
Copy link

My sentiment is that I've (sample size 1, but consulted with Pandas, used Pandas for years, and taught Pandas to thousands) never had a need for this. I'm not saying it might not happen. But perhaps that is why support is lacking... 🤷‍♀️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment