Skip to content

Instantly share code, notes, and snippets.

@sainathadapa
Created May 9, 2018 10:00
Show Gist options
  • Save sainathadapa/eb3303975196d15c73bac5b92d8a210f to your computer and use it in GitHub Desktop.
Save sainathadapa/eb3303975196d15c73bac5b92d8a210f to your computer and use it in GitHub Desktop.
anti-join-pandas
import pandas as pd
def anti_join(x, y, on):
"""Return rows in x which are not present in y"""
ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on)
ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge')
return ans
def anti_join_all_cols(x, y):
"""Return rows in x which are not present in y"""
assert set(x.columns.values) == set(y.columns.values)
return anti_join(x, y, x.columns.tolist())
@christinahaig
Copy link

Awesome, thank you!

@carlosevi94
Copy link

Very useful, thanks!

@shrikantsoni88
Copy link

This is so useful, thanks a lot :)

@nader-ld
Copy link

Thanks, really helpful.

@mgmarino
Copy link

mgmarino commented Sep 7, 2020

You can make anti_join a one-liner (warning, I typed it in directly):

def anti_join(x, y, on):
    """Return rows in x which are not present in y"""
    return pd.merge(left=x, right=y, how='left', indicator=True, on=on).query("_merge == 'left_only'").drop(columns='_merge')

@frank-yifei-wang
Copy link

Helped me in my project. Thank you @sainathadapa for sharing this gist!

And thank you @mgmarino for this one, too!

You can make anti_join a one-liner (warning, I typed it in directly):

def anti_join(x, y, on):
    """Return rows in x which are not present in y"""
    return pd.merge(left=x, right=y, how='left', indicator=True, on=on).query("_merge == 'left_only'").drop(columns='_merge')

@tuterbatuhan
Copy link

Awesome, thank you! Saved me a lot of time

@gajanankathar
Copy link

Thank you so much, you saved mine time too.

@dnovai
Copy link

dnovai commented Nov 3, 2021

Thank you!

@frank-yifei-wang
Copy link

After using this answer for a while (appreciate the original PO), I found out it runs into issues when there are duplicate ids in the DataFrame (if column on has non-unique values).

So I developed some different functions that don't rely on pd.DataFrame.merge() but rather the more error-proof, super fast, and Pythonic set operations.

Below are my functions, and anyone is welcome to use or read my StackOverflow post or GitHub repo for more details.

  • df_diff() does "anti-join"
  • df_overlap() does "intersection"
import pandas as pd

def df_diff(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
    """
    Function: Compare DataFrame "A" and "B" to find rows only in "A" but not in "B"
    Input:
        df_A: DataFrame "A" ("left table")
        df_B: DataFrame "B" ("right table")
        on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
        on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
    
    Output:
        DataFrame containing diff result (all rows only in df_A but not in df_B, and same columns as df_A)
        If find zero rows, will return a DataFrame of 0 row and same columns as df_A (can be checked by `df_output.empty and df_output.shape[1] != 0`)
        If input is not valid DataFrame, will return a DataFrame of 0 row and 0 column (can be checked by `df_output.empty and df_output.shape[1] == 0`)
    
    Dependency: `import pandas as pd`
    History: 2022-02-07 Developed by frank-yifei-wang@GitHub
    """
    
    if type(df_A) != pd.core.frame.DataFrame or type(df_B) != pd.core.frame.DataFrame: 
        return pd.DataFrame()
    
    if on_A != "" and on_A in df_A.columns: 
        id_col_A = df_A[on_A]
    else:
        id_col_A = df_A.index
    if on_B != "" and on_B in df_B.columns: 
        id_col_B = df_B[on_B]
    else:
        id_col_B = df_B.index
    
    id_set_A = set(id_col_A)
    id_set_B = set(id_col_B)

    id_set_diff = id_set_A.difference(id_set_B)
    df_output = df_A[id_col_A.isin(id_set_diff)].copy()

    return df_output

def df_overlap(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
    """
    Function: Compare DataFrame "A" and "B" to find rows in "A" and also in "B"
    Input:
        df_A: DataFrame "A" ("left table")
        df_B: DataFrame "B" ("right table")
        on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
        on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
    
    Output:
        DataFrame containing overlap result (all rows in df_A and also in df_B, and same columns as df_A)
        Note: result of df_overlap(df_A, df_B) (= a slice of df_A) is different from df_overlap(df_B, df_A) (= a slice of df_B)
        If find zero rows, will return a DataFrame of 0 row and same columns as df_A (can be checked by `df_output.empty and df_output.shape[1] != 0`)
        If input is not valid DataFrame, will return a DataFrame of 0 row and 0 column (can be checked by `df_output.empty and df_output.shape[1] == 0`)
    
    Dependency: `import pandas as pd`
    History: 2022-02-07 Developed by frank-yifei-wang@GitHub
    """
    
    if type(df_A) != pd.core.frame.DataFrame or type(df_B) != pd.core.frame.DataFrame: 
        return pd.DataFrame()
    
    if on_A != "" and on_A in df_A.columns: 
        id_col_A = df_A[on_A]
    else:
        id_col_A = df_A.index
    if on_B != "" and on_B in df_B.columns: 
        id_col_B = df_B[on_B]
    else:
        id_col_B = df_B.index
    
    id_set_A = set(id_col_A)
    id_set_B = set(id_col_B)

    id_set_overlap = id_set_A.intersection(id_set_B)
    df_output = df_A[id_col_A.isin(id_set_overlap)].copy()

    return df_output

@echennh-zz
Copy link

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment