@aialenti
Last active September 13, 2020 21:33
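from pyspark.sql import SparkSession

# Reuse the active Spark session, or create one when run as a standalone script
spark = SparkSession.builder.getOrCreate()

# Read the source tables in Parquet format
sales_table = spark.read.parquet("./data/sales_parquet")
sellers_table = spark.read.parquet("./data/sellers_parquet")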
'''
SELECT *
FROM sales_table
WHERE seller_id IN (SELECT seller_id FROM sellers_table)
'''
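# A left semi join expresses the SQL IN filter above: it keeps the rows of
# sales_table whose seller_id exists in sellers_table, and it returns only
# the columns of sales_table
semi_join_execution_plan = sales_table.join(
    sellers_table,
    on=sales_table["seller_id"] == sellers_table["seller_id"],
    how="left_semi",
)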
semi_join_execution_plan.show()
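
To make the semantics of the left semi join concrete without a Spark cluster, here is a minimal plain-Python sketch of what it computes, next to an inner join for comparison. The helper names `left_semi_join` and `inner_join` and the sample rows are hypothetical, invented for illustration; this models the behavior and is not how Spark implements the join.

```python
# Plain-Python model of a left semi join. The rows and helper names below
# are made up for illustration; this is not Spark's implementation.

def left_semi_join(left_rows, right_rows, key):
    """Keep each left row whose key appears in at least one right row."""
    # None is skipped to mirror SQL, where NULL keys never match
    right_keys = {row[key] for row in right_rows if row[key] is not None}
    return [row for row in left_rows if row[key] in right_keys]


def inner_join(left_rows, right_rows, key):
    """Pair every left row with every matching right row (for comparison)."""
    return [
        {**right, **left}
        for left in left_rows
        for right in right_rows
        if left[key] is not None and left[key] == right[key]
    ]


sales = [
    {"order_id": 1, "seller_id": 10},
    {"order_id": 2, "seller_id": 20},
    {"order_id": 3, "seller_id": 99},  # no matching seller
]
sellers = [
    {"seller_id": 10, "seller_name": "a"},
    {"seller_id": 10, "seller_name": "a_duplicate"},  # duplicate key on the right
    {"seller_id": 20, "seller_name": "b"},
]

semi = left_semi_join(sales, sellers, "seller_id")
inner = inner_join(sales, sellers, "seller_id")

print(semi)        # sales rows only, each at most once
print(len(inner))  # the duplicate seller row multiplies the matching sale
```

The two differences from an inner join are visible here: the semi join never duplicates a sales row when several sellers share the same key, and its output carries only the columns of the left table.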