sainathadapa/anti_join.py

gajanankathar · 2021-07-27T17:19:26Z

Thank you so much, you saved me time too.

dnovai · 2021-11-03T18:00:41Z

Thank you!

frank-yifei-wang · 2022-02-12T01:19:37Z

After using this answer for a while (thanks to the original poster), I found that it runs into issues when there are duplicate ids in the DataFrame, that is, when the column passed as on has non-unique values.

So I developed alternative functions that rely not on pd.DataFrame.merge() but on set operations, which are more error-proof, very fast, and Pythonic.

Below are my functions. Anyone is welcome to use them; see my StackOverflow post or GitHub repo for more details.

df_diff() does "anti-join"
df_overlap() does "intersection"

import pandas as pd

def df_diff(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
    """
    Function: Compare DataFrame "A" and "B" to find rows only in "A" but not in "B"
    Input:
        df_A: DataFrame "A" ("left table")
        df_B: DataFrame "B" ("right table")
        on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
        on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
    
    Output:
        DataFrame containing diff result (all rows only in df_A but not in df_B, and same columns as df_A)
        If no rows are found, returns a DataFrame with 0 rows and the same columns as df_A (can be checked with `df_output.empty and df_output.shape[1] != 0`)
        If an input is not a valid DataFrame, returns a DataFrame with 0 rows and 0 columns (can be checked with `df_output.empty and df_output.shape[1] == 0`)
    
    Dependency: `import pandas as pd`
    History: 2022-02-07 Developed by frank-yifei-wang@GitHub
    """
    
    if not isinstance(df_A, pd.DataFrame) or not isinstance(df_B, pd.DataFrame):
        return pd.DataFrame()
    
    if on_A != "" and on_A in df_A.columns: 
        id_col_A = df_A[on_A]
    else:
        id_col_A = df_A.index
    if on_B != "" and on_B in df_B.columns: 
        id_col_B = df_B[on_B]
    else:
        id_col_B = df_B.index
    
    id_set_A = set(id_col_A)
    id_set_B = set(id_col_B)

    id_set_diff = id_set_A.difference(id_set_B)
    df_output = df_A[id_col_A.isin(id_set_diff)].copy()

    return df_output

def df_overlap(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
    """
    Function: Compare DataFrame "A" and "B" to find rows in "A" and also in "B"
    Input:
        df_A: DataFrame "A" ("left table")
        df_B: DataFrame "B" ("right table")
        on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
        on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
    
    Output:
        DataFrame containing overlap result (all rows in df_A and also in df_B, and same columns as df_A)
        Note: result of df_overlap(df_A, df_B) (= a slice of df_A) is different from df_overlap(df_B, df_A) (= a slice of df_B)
        If no rows are found, returns a DataFrame with 0 rows and the same columns as df_A (can be checked with `df_output.empty and df_output.shape[1] != 0`)
        If an input is not a valid DataFrame, returns a DataFrame with 0 rows and 0 columns (can be checked with `df_output.empty and df_output.shape[1] == 0`)
    
    Dependency: `import pandas as pd`
    History: 2022-02-07 Developed by frank-yifei-wang@GitHub
    """
    
    if not isinstance(df_A, pd.DataFrame) or not isinstance(df_B, pd.DataFrame):
        return pd.DataFrame()
    
    if on_A != "" and on_A in df_A.columns: 
        id_col_A = df_A[on_A]
    else:
        id_col_A = df_A.index
    if on_B != "" and on_B in df_B.columns: 
        id_col_B = df_B[on_B]
    else:
        id_col_B = df_B.index
    
    id_set_A = set(id_col_A)
    id_set_B = set(id_col_B)

    id_set_overlap = id_set_A.intersection(id_set_B)
    df_output = df_A[id_col_A.isin(id_set_overlap)].copy()

    return df_output
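
For readers who want to see the set-based idea at a glance, here is a condensed sketch of the same technique, not a replacement for the functions above. The sample data and the names only_in_A and also_in_B are made up for illustration; the id column of df_A is deliberately non-unique.

```python
import pandas as pd

# Hypothetical sample data: "id" is deliberately non-unique in df_A.
df_A = pd.DataFrame({"id": [1, 1, 2, 3], "val": ["a", "b", "c", "d"]})
df_B = pd.DataFrame({"key": [1, 4]})

# Collect the right-hand ids into a set, then filter df_A with a boolean mask.
id_set_B = set(df_B["key"])
mask = df_A["id"].isin(id_set_B)

only_in_A = df_A[~mask].copy()  # anti-join: the rows with id 2 and 3
also_in_B = df_A[mask].copy()   # overlap: both rows with id 1

print(only_in_A)
print(also_in_B)
```

Because the result is a slice of df_A, it keeps df_A's columns and original index labels, and duplicate ids in df_A are kept or dropped together.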

echennh-zz · 2022-04-25T19:50:31Z

Thank you!

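import pandas as pd


def anti_join(x, y, on):
    """Return rows in x which are not present in y"""
    # indicator=True adds a '_merge' column marking where each row came from
    ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on)
    # Keep only rows found in x alone, then drop the helper column
    ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge')
    return ans


def anti_join_all_cols(x, y):
    """Return rows in x which are not present in y"""
    # Both frames must have the same set of columns to compare on all of them
    assert set(x.columns.values) == set(y.columns.values)
    return anti_join(x, y, x.columns.tolist())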
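
A short usage sketch of the merge-based approach may help. The function is copied here so the snippet runs on its own; the sample frames and the choice to pass only the key column of y are assumptions for illustration, not part of the gist.

```python
import pandas as pd


def anti_join(x, y, on):
    """Return rows in x which are not present in y"""
    ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on)
    ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge')
    return ans


# Hypothetical sample data: y repeats id 2 and carries an extra column.
x = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
y = pd.DataFrame({"id": [2, 2], "other": ["p", "q"]})

# Passing only the key column of y keeps y's other columns out of the result.
result = anti_join(x, y[["id"]], on="id")
print(result)
```

Note that the merge builds a new index, so the returned rows do not carry the index labels of x. Passing y unrestricted would also pull its non-key columns (here, other) into the output.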
