@aialenti
Created September 13, 2020 21:34
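from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the app name is an arbitrary placeholder
spark = SparkSession.builder.appName("left_anti_join_example").getOrCreate()

# Read the source tables in Parquet format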
sales_table = spark.read.parquet("./data/sales_parquet")
sellers_table = spark.read.parquet("./data/sellers_parquet")
'''
SELECT *
FROM sales_table
WHERE seller_id NOT IN (SELECT seller_id FROM sellers_table)
'''
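# A left anti join keeps the rows of sales_table that have no match in sellers_table.
# It expresses the NOT IN query above as long as seller_id contains no NULLs:
# with a NULL in the subquery, SQL's NOT IN returns no rows, whereas the
# anti join behaves like NOT EXISTS.
anti_join_execution_plan = sales_table.join(
    sellers_table,
    on=sales_table["seller_id"] == sellers_table["seller_id"],
    how="left_anti",
)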
anti_join_execution_plan.show()
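
The SQL in the snippet can be tried without a Spark cluster. The sketch below is not part of the original gist: it uses Python's standard-library `sqlite3` instead of Spark, and the three sales rows and the `order_id` column are made-up sample data. It compares the `NOT IN` query with a `NOT EXISTS` query, which is the SQL form closest to a left anti join, on a sellers table that contains a NULL.

```python
import sqlite3

# In-memory database with tiny, made-up sample tables (assumption: not the gist's data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_table (order_id INTEGER, seller_id INTEGER);
    CREATE TABLE sellers_table (seller_id INTEGER);
    INSERT INTO sales_table VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO sellers_table VALUES (10), (NULL);
""")

# The NOT IN form from the gist's SQL comment
not_in_rows = conn.execute("""
    SELECT order_id FROM sales_table
    WHERE seller_id NOT IN (SELECT seller_id FROM sellers_table)
    ORDER BY order_id
""").fetchall()

# The NOT EXISTS form, which matches anti-join semantics
anti_join_rows = conn.execute("""
    SELECT order_id FROM sales_table AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM sellers_table AS t WHERE t.seller_id = s.seller_id
    )
    ORDER BY order_id
""").fetchall()

print("NOT IN:    ", not_in_rows)
print("NOT EXISTS:", anti_join_rows)
```

With the NULL present, the `NOT IN` query returns no rows while the `NOT EXISTS` query returns orders 2 and 3. If the NULL row is removed from `sellers_table`, both queries return the same rows, which is the case where the two forms are interchangeable.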