I come across many great libraries every day, but unfortunately most of them are not well suited for enterprise or medical projects, because they make it hard to define proper interfaces or to use the refactoring capabilities of your IDE.
This is especially true when it comes to libraries in the data science area. Just remember your last df['my_feature'] = df['my_feature'] * 2, where the column name is just a string that no IDE can check or rename.
And unfortunately, exactly these libraries are also the ones that are written for super fast computations.
Well, it seems that we have the choice between the super fast untyped option and a bunch of slow, crappy other implementations...
Hard life...
But wait, I have heard about a small wrapper around a pandas DataFrame that allows for something like typing?
Right, it's called a TypedDataFrame.
Let's dive into it!
A TypedDataFrame builds on top of pandas and dataclasses. To create one, we simply define our own class and inherit from TypedDataFrame. In the example below we want a table of flight delays:
from .typed_df import TypedDataFrame


class FlightDelayDf(TypedDataFrame):
    origin: str
    destination: str
    delay_sec: int
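If you wonder how a wrapper like this can work under the hood, here is a minimal sketch. It is not the actual TypedDataFrame implementation: the names SketchTypedDataFrame and FlightDelaySketch are made up for illustration, and I assume that only column names (not dtypes) are checked and that a mismatch raises a ValueError.

```python
import pandas as pd


class SketchTypedDataFrame:
    """Hypothetical stand-in for TypedDataFrame, for illustration only."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Expose every annotated field as a class attribute holding its own
        # name, so that FlightDelaySketch.delay_sec evaluates to 'delay_sec'.
        for name in getattr(cls, '__annotations__', {}):
            setattr(cls, name, name)

    def __init__(self, df: pd.DataFrame):
        # Assumption: the check covers column names only, not dtypes.
        expected = list(type(self).__annotations__)
        missing = [col for col in expected if col not in df.columns]
        if missing:
            raise ValueError(f'missing columns: {missing}')
        self.df = df


class FlightDelaySketch(SketchTypedDataFrame):
    origin: str
    destination: str
    delay_sec: int


good = pd.DataFrame([['BRN', 'BXO', 10]], columns=['origin', 'destination', 'delay_sec'])
typed = FlightDelaySketch(good)

# A DataFrame with a renamed column is rejected at construction time.
try:
    FlightDelaySketch(good.rename(columns={'delay_sec': 'delay'}))
except ValueError as err:
    print(err)
```

The point of turning the annotations into class attributes is that a column can then be referenced as FlightDelaySketch.delay_sec instead of the bare string 'delay_sec', which is something an IDE can find and rename.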
We now have a custom type that we can use for method annotations like:
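import pandas as pd


def load_flight_delays() -> FlightDelayDf:
    # Each row is [origin, destination, delay in seconds].
    flight_delays = [
        ['BRN', 'BXO', 10],
        ['GVA', 'LUG', 22],
        ['SIR', 'LUG', 34],
        ['ZRH', 'BRN', 65]
    ]
    df = pd.DataFrame(flight_delays, columns=['origin', 'destination', 'delay_sec'])
    # Wrap the raw DataFrame in the typed interface.
    return FlightDelayDf(df)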
Yay, have you seen it? No? We have just defined an interface between our load_flight_delays() function and the outside world and still have retained the advantages of a pandas DataFrame. You don't believe me? See for yourself...
flight_delays = load_flight_delays()
flight_delays.df[FlightDelayDf.delay_sec].hist()
Convinced? And the best thing about it is that it would have told you if the columns of the original DataFrame had not matched the definition of FlightDelayDf. Want to see another amazing feature? You can return the rows of your underlying DataFrame as a list of dataclass objects. See here...
flight_delay_obj = flight_delays.objects[0]
print(f'From {flight_delay_obj.origin} to {flight_delay_obj.destination} with {flight_delay_obj.delay_sec} sec delay.')
This is sometimes super helpful for debugging.
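In case you are curious how rows can become dataclass objects, here is a rough sketch using only pandas and the standard library. Again, this is my own illustration and not the code behind the objects property: rows_as_objects, FLIGHT_DELAY_FIELDS and FlightDelayRow are hypothetical names.

```python
from dataclasses import make_dataclass

import pandas as pd

# Hypothetical schema mirroring the FlightDelayDf fields from this post.
FLIGHT_DELAY_FIELDS = [('origin', str), ('destination', str), ('delay_sec', int)]


def rows_as_objects(df: pd.DataFrame, fields):
    """Convert each DataFrame row into an instance of a generated dataclass."""
    row_cls = make_dataclass('FlightDelayRow', fields)
    names = [name for name, _ in fields]
    # Select the columns in schema order, then iterate over plain tuples.
    rows = df[names].itertuples(index=False, name=None)
    return [row_cls(*row) for row in rows]


df = pd.DataFrame(
    [['BRN', 'BXO', 10], ['GVA', 'LUG', 22]],
    columns=['origin', 'destination', 'delay_sec'],
)
first = rows_as_objects(df, FLIGHT_DELAY_FIELDS)[0]
print(f'From {first.origin} to {first.destination} with {first.delay_sec} sec delay.')
```

Keep in mind that building one Python object per row is slow on large tables, so a conversion like this is better suited to debugging and small samples than to the hot path.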
Now let's have fun with it! If you have questions or you're interested in developing this concept further, please get in touch at any time.
Cheers,
Iwan