@aialenti
Created September 13, 2020 15:29
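# Imports the snippet relies on: SparkSession for the session, col for column expressions
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse the active Spark session or create one (assumed setup; the gist does not show it)
spark = SparkSession.builder.getOrCreate()

# Read the source table in Parquet format
sales_table = spark.read.parquet("./data/sales_parquet")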
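# SQL equivalent of the DataFrame code that follows (reference only; this string is never executed)
'''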
CREATE TABLE part_1 AS
SELECT *
FROM sales_table
WHERE num_pieces_sold > 50;
CREATE TABLE part_2 AS
SELECT *
FROM sales_table
WHERE num_pieces_sold <= 50;
SELECT *
FROM part_1
UNION ALL
SELECT *
FROM part_2
'''
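# Part 1: rows with more than 50 pieces sold
sales_table_execution_plan_part_1 = sales_table.where(col("num_pieces_sold") > 50)
# Part 2: rows with 50 pieces sold or fewer
sales_table_execution_plan_part_2 = sales_table.where(col("num_pieces_sold") <= 50)
# Union the two parts back together. unionByName matches columns by name and,
# like UNION ALL, keeps duplicate rows
sales_table_execution_plan = sales_table_execution_plan_part_1.unionByName(sales_table_execution_plan_part_2)
# Print the physical plan Spark generates for the union
sales_table_execution_plan.explain()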