Exploring assignments with multiindex column dataframes and subset assignment

mattharrison commented Apr 13, 2022

WRT .assign on hierarchical columns. You are correct, it doesn't appear that .assign can create the inner column name. I guess I've never run into this as I try to flatten columns ASAP. 🤷‍♀️

Author

ChronoJon commented Apr 14, 2022

This is how I would do it. I have resorted to only using .where (or np.where or np.select):

(df
 .droplevel(axis='columns', level=0)
 .assign(subassign=-1)
 .assign(subassign=lambda df_: df_.subassign.where(
     (df_.B >= 5) | (df_.B.isna()), 
     df_.A.str.split('-').str[-1]
   )
   .astype(int)
 )
)

OK, but you have just postponed the .astype operation until after the .mask method call has finished. .mask and .where are the same method with inverse logic.
This would be equivalent:

(df_mask
    .assign(subassign=-1)
    .mask(
        lambda df: df.B < 5,
        lambda df: df.assign(subassign=lambda df_: df_.A.str.split("-").str[-1])
    )
    .astype(dict(subassign=int))
)
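To make the "inverse logic" point concrete, here is a minimal sketch on a toy Series (the data and the names s and cond are made up for illustration): .mask replaces values where the condition is True, .where replaces them where it is False, so negating the condition turns one into the other.

```python
import pandas as pd

# Toy data, assumed purely for illustration.
s = pd.Series([1, 7, 3, 9])
cond = s < 5

# .mask replaces values where cond is True ...
masked = s.mask(cond, -1)
# ... while .where replaces values where its condition is False,
# so passing the negated condition gives the same result.
where_inverted = s.where(~cond, -1)

print(masked.equals(where_inverted))
```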

This was just an example of a problem that occurs in chained operations when you want to change a subset of the data with a transformation that would raise an error on data outside that subset.

Furthermore, the transformation is called on the whole dataframe, even if you change a relatively small part of it and most of the transformation's result is thrown away. This is especially wasteful with any kind of string operation as showcased here (because you leave NumPy land and are working in the Python domain).
Direct mutation is clearly superior here:

No unnecessary calculations are performed.
You don't have to change the operation because it throws an error in unrelated parts of the dataframe, so you only have to think about the parts you want to change.
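A minimal sketch of what direct mutation on a subset could look like, assuming a toy frame with the same column names A and B as above (the values, including the "no_number" row, are invented for illustration). The string split and the int conversion only ever see the selected rows, so the row that would break .astype(int) is never touched.

```python
import pandas as pd

# Toy data, assumed for illustration; "no_number" would break astype(int)
# if the transformation ran on the whole column.
df = pd.DataFrame({
    "A": ["x-1", "y-2", "no_number", "z-4"],
    "B": [1.0, 3.0, 7.0, None],
})

out = df.assign(subassign=-1)   # work on a copy, default value -1
subset = out.B < 5              # NaN compares as False, so that row is excluded

# Only the selected rows are split and converted.
out.loc[subset, "subassign"] = (
    out.loc[subset, "A"].str.split("-").str[-1].astype(int)
)

print(out)
```

The copy created by .assign keeps the original df untouched, so the mutation stays local to out.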

Author

ChronoJon commented Apr 14, 2022

WRT .assign on hierarchical columns. You are correct, it doesn't appear that .assign can create the inner column name. I guess I've never run into this as I try to flatten columns ASAP. 🤷‍♀️

I really don't understand this sentiment, but it is not the first time I've read it. Hierarchical columns can be useful for grouping related data. Otherwise you would have to use multiple dataframes and SQL-like association dataframes, or resort to ugly filter calls to select these groups.

In my view, the only problem is that it's not well supported in pandas' functional API. One could provide something like an .assign_map method (analogous to str.format_map) with pandas_flavor or similar.
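As a rough sketch of that idea: assign_map below is a hypothetical helper, not an existing pandas or pandas_flavor API, and the toy frame with top level "g" is invented for illustration. Keyword arguments must be strings, so plain .assign cannot be given a tuple key; taking a mapping instead sidesteps that, and the helper is used through .pipe to stay chainable.

```python
import pandas as pd

# Toy frame with hierarchical columns, assumed for illustration.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("g", "a"), ("g", "b")]),
)

# Plain .assign only accepts string keywords, so the new column ends up
# under a new top-level label with an empty inner label.
flat = df.assign(c=lambda d: d[("g", "a")] + d[("g", "b")])

def assign_map(frame, mapping):
    """Hypothetical helper: like .assign, but keys may be tuples."""
    out = frame.copy()
    for key, value in mapping.items():
        # Callables receive the frame built so far, as in .assign.
        out[key] = value(out) if callable(value) else value
    return out

nested = df.pipe(
    assign_map,
    {("g", "c"): lambda d: d[("g", "a")] + d[("g", "b")]},
)

print(list(flat.columns))
print(list(nested.columns))
```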

mattharrison commented May 21, 2022

My sentiment is that I've (sample size 1, but consulted with Pandas, used Pandas for years, and taught Pandas to thousands) never had a need for this. I'm not saying it might not happen. But perhaps that is why support is lacking... 🤷‍♀️

ChronoJon/pandas_assignment.ipynb
