-
-
Save sainathadapa/eb3303975196d15c73bac5b92d8a210f to your computer and use it in GitHub Desktop.
import pandas as pd | |
def anti_join(x, y, on): | |
"""Return rows in x which are not present in y""" | |
ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on) | |
ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge') | |
return ans | |
def anti_join_all_cols(x, y): | |
"""Return rows in x which are not present in y""" | |
assert set(x.columns.values) == set(y.columns.values) | |
return anti_join(x, y, x.columns.tolist()) |
You can make anti_join
a one-liner (warning, I typed it in directly):
def anti_join(x, y, on):
"""Return rows in x which are not present in y"""
return pd.merge(left=x, right=y, how='left', indicator=True, on=on).query("_merge == 'left_only'").drop(columns='_merge')
Helped me in my project. Thank you @sainathadapa for sharing this gist!
And thank you @mgmarino for this one, too!
You can make
anti_join
a one-liner (warning, I typed it in directly):def anti_join(x, y, on): """Return rows in x which are not present in y""" return pd.merge(left=x, right=y, how='left', indicator=True, on=on).query("_merge == 'left_only'").drop(columns='_merge')
Awesome, thank you! Saved me a lot of time
Thank you so much, you saved mine time too.
Thank you!
After using this answer for a while (appreciate the original PO), I found out it runs into issues when there are duplicate ids in the DataFrame (if column on
has non-unique values).
So I developed some different functions that don't rely on pd.DataFrame.merge()
but rather the more error-proof, super fast, and Pythonic set operations.
Below are my functions, and anyone is welcome to use or read my StackOverflow post or GitHub repo for more details.
df_diff()
does "anti-join"df_overlap()
does "intersection"
import pandas as pd
def df_diff(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
"""
Function: Compare DataFrame "A" and "B" to find rows only in "A" but not in "B"
Input:
df_A: DataFrame "A" ("left table")
df_B: DataFrame "B" ("right table")
on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
Output:
DataFrame containing diff result (all rows only in df_A but not in df_B, and same columns as df_A)
If find zero rows, will return a DataFrame of 0 row and same columns as df_A (can be checked by `df_output.empty and df_output.shape[1] != 0`)
If input is not valid DataFrame, will return a DataFrame of 0 row and 0 column (can be checked by `df_output.empty and df_output.shape[1] == 0`)
Dependency: `import pandas as pd`
History: 2022-02-07 Developed by frank-yifei-wang@GitHub
"""
if type(df_A) != pd.core.frame.DataFrame or type(df_B) != pd.core.frame.DataFrame:
return pd.DataFrame()
if on_A != "" and on_A in df_A.columns:
id_col_A = df_A[on_A]
else:
id_col_A = df_A.index
if on_B != "" and on_B in df_B.columns:
id_col_B = df_B[on_B]
else:
id_col_B = df_B.index
id_set_A = set(id_col_A)
id_set_B = set(id_col_B)
id_set_diff = id_set_A.difference(id_set_B)
df_output = df_A[id_col_A.isin(id_set_diff)].copy()
return df_output
def df_overlap(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
"""
Function: Compare DataFrame "A" and "B" to find rows in "A" and also in "B"
Input:
df_A: DataFrame "A" ("left table")
df_B: DataFrame "B" ("right table")
on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
Output:
DataFrame containing overlap result (all rows in df_A and also in df_B, and same columns as df_A)
Note: result of df_overlap(df_A, df_B) (= a slice of df_A) is different from df_overlap(df_B, df_A) (= a slice of df_B)
If find zero rows, will return a DataFrame of 0 row and same columns as df_A (can be checked by `df_output.empty and df_output.shape[1] != 0`)
If input is not valid DataFrame, will return a DataFrame of 0 row and 0 column (can be checked by `df_output.empty and df_output.shape[1] == 0`)
Dependency: `import pandas as pd`
History: 2022-02-07 Developed by frank-yifei-wang@GitHub
"""
if type(df_A) != pd.core.frame.DataFrame or type(df_B) != pd.core.frame.DataFrame:
return pd.DataFrame()
if on_A != "" and on_A in df_A.columns:
id_col_A = df_A[on_A]
else:
id_col_A = df_A.index
if on_B != "" and on_B in df_B.columns:
id_col_B = df_B[on_B]
else:
id_col_B = df_B.index
id_set_A = set(id_col_A)
id_set_B = set(id_col_B)
id_set_overlap = id_set_A.intersection(id_set_B)
df_output = df_A[id_col_A.isin(id_set_overlap)].copy()
return df_output
Thank you!
Thanks, really helpful.