@ChronoJon
Created April 12, 2022 22:58
Exploring assignments with multiindex column dataframes and subset assignment
@mattharrison

WRT .assign on hierarchical columns. You are correct, it doesn't appear that .assign can create the inner column name. I guess I've never run into this as I try to flatten columns ASAP. 🤷‍♀️

@ChronoJon
Author

This is how I would do it. I have resorted to only using .where (or np.where or np.select):

(df
 .droplevel(axis='columns', level=0)
 .assign(subassign=-1)
 .assign(subassign=lambda df_: df_.subassign.where(
     (df_.B >= 5) | (df_.B.isna()), 
     df_.A.str.split('-').str[-1]
   )
   .astype(int)
 )
)

OK, but you have just postponed the .astype operation until after the .mask method call has finished. mask and where are the same method with inverted logic.
This would be equivalent:

(df_mask
    .assign(subassign=-1)
    .mask(
        lambda df: df.B < 5,
        lambda df: df.assign(subassign=lambda df_: df_.A.str.split("-").str[-1])
    )
    .astype(dict(subassign=int))
)
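
To make the where/mask relationship concrete, here is a minimal sketch on a made-up Series (the data and variable names are only for illustration): where keeps values where the condition holds, mask replaces them, so negating the condition turns one into the other.

```python
import pandas as pd

# Made-up example data, not taken from the gist's notebook.
s = pd.Series([1, 7, 3, 9])
cond = s >= 5

kept_where = s.where(cond, -1)   # keep values where cond is True, else -1
kept_mask = s.mask(~cond, -1)    # replace values where ~cond is True with -1

# Both calls produce the same Series.
assert kept_where.equals(kept_mask)
```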

This was just an example of a problem that occurs with chained operations: you want to change a subset of the data with a transformation that would raise an error on data outside that subset.

Furthermore, the transformation is called on the whole dataframe, even if you change only a relatively small part of it and most of the transformation's result is thrown away. This is especially wasteful with any kind of string operation as showcased here (because you leave numpy land and are working in the Python domain).
Direct mutation is clearly superior here:

  1. No unnecessary calculations are performed
  2. You don't have to change the operation because it throws an error in unrelated parts of the dataframe. Thus you only have to think about the parts you want to change.
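
As a rough sketch of what I mean by direct mutation, assuming a small made-up frame with the same A and B column names as above (the values are hypothetical), the string split and the integer cast run only on the selected rows:

```python
import pandas as pd

# Hypothetical data: only rows with B < 5 carry a numeric suffix in A.
df = pd.DataFrame({
    "A": ["x-1", "no suffix", "y-3", "other"],
    "B": [2.0, 7.0, 4.0, None],
})

subset = df["B"] < 5   # NaN compares as False, so that row is left alone

result = df.copy()     # work on a copy so the original frame stays untouched
result["subassign"] = -1
# The split and the int cast only ever see the rows in the subset,
# so "no suffix" and "other" are never parsed.
result.loc[subset, "subassign"] = (
    result.loc[subset, "A"].str.split("-").str[-1].astype(int)
)
```

Applying the same split and cast to the whole column would fail on "no suffix", which is why the chained version has to postpone the cast.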

@ChronoJon
Author

WRT .assign on hierarchical columns. You are correct, it doesn't appear that .assign can create the inner column name. I guess I've never run into this as I try to flatten columns ASAP. 🤷‍♀️

I really don't understand this sentiment, but it is not the first time I've read it. Hierarchical columns can be useful for grouping related data. Otherwise you would have to use multiple dataframes and SQL-like association dataframes, or resort to ugly filter calls to select these groups.

In my view, the only problem is that it's not well supported in pandas' functional API. One could provide something like an .assign_map method (analogous to str.format_map) with pandas_flavor or similar.
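
A minimal sketch of what such a helper could look like, written as a plain function used through .pipe rather than registered with pandas_flavor. The name assign_map and its behaviour are my own assumption, not an existing pandas API. The dict is needed because .assign takes column names as keyword arguments, and Python requires keyword names to be strings, so a tuple key cannot be passed that way.

```python
import pandas as pd

def assign_map(df, mapping):
    """Hypothetical helper: like .assign, but keys may be column tuples."""
    out = df.copy()
    for key, value in mapping.items():
        # Mirror .assign: callables receive the frame built so far.
        out[key] = value(out) if callable(value) else value
    return out

# Made-up frame with two-level columns.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("grp", "a"), ("grp", "b")]),
)

# Create the inner column "total" under the existing outer level "grp".
result = df.pipe(
    assign_map,
    {("grp", "total"): lambda d: d[("grp", "a")] + d[("grp", "b")]},
)
```

This keeps the chain style while still addressing the inner level by its full tuple key.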

@mattharrison

My sentiment is that I've (sample size 1, but consulted with Pandas, used Pandas for years, and taught Pandas to thousands) never had a need for this. I'm not saying it might not happen. But perhaps that is why support is lacking... 🤷‍♀️
