Troubleshooting for dividing data into 2 separate groups with subtract vs left join

This approach leads to the duplications in `df_holdout` and `df_target`

subtract() is row-based and requires exact row match
If df has duplicate rows, subtract() doesn't guarantee it removes just one instance.

# Filter 20% of the data for the holdout group
df_holdout = df.sample(fraction=0.2, seed=42)
    
df_holdout.display()
logger.info(f"Total holdout loyalty member id: {df_holdout.count():,}") #263,232


# Remove the holdout group from the original DataFrame to get the remaining 80%
df_target = df.subtract(df_holdout)
df_target.display()
logger.info(f"Total remaining loyalty member id: {df_target.count():,}") #1,051,368

Try this approach to avoid the duplicate records in `df_holdout` and `df_target`

# Filter 20% of the data for the holdout group
df_holdout = df.sample(fraction=0.2, seed=42)
    
df_holdout.display()
logger.info(f"Total holdout loyalty member id: {df_holdout.count():,}") #263,232


# Remove the holdout group from the original DataFrame to get the remaining 80%
df_target = df.join(df_holdout, on="BR_ID", how="leftanti")
df_target.display()
logger.info(f"Total remaining loyalty member id: {df_target.count():,}") #1,051,368

Optimal solution

df_target, df_holdout = df.randomSplit([0.8, 0.2], seed=42)

Root Cause	Does `sample()` cause it?	Fix
Duplicate rows	✅ Yes	Use `randomSplit()` or add row IDs
`subtract()` not safe	✅ Yes	Use `join(..., how='left_anti')`
Overlap in splits	✅ Possible	Prefer `randomSplit()`

dvu4/subtract_vs_left_join_in_pyspark.md

This approach leads to the duplications in df_holdout and df_target

Try this approach to avoid the duplicate records in df_holdout and df_target

Optimal solution

This approach leads to the duplications in `df_holdout` and `df_target`

Try this approach to avoid the duplicate records in `df_holdout` and `df_target`