Skip to content

Instantly share code, notes, and snippets.

@dvu4
Last active April 23, 2025 19:30
Show Gist options
  • Save dvu4/76441c29517b2a8e66e3888d93a889b9 to your computer and use it in GitHub Desktop.
Save dvu4/76441c29517b2a8e66e3888d93a889b9 to your computer and use it in GitHub Desktop.
Troubleshooting for dividing data into 2 separate groups with subtract vs left join

This approach leads to the duplications in df_holdout and df_target

  • subtract() is row-based and requires exact row match

  • If df has duplicate rows, subtract() doesn't guarantee it removes just one instance.

# Filter 20% of the data for the holdout group
df_holdout = df.sample(fraction=0.2, seed=42)
    
df_holdout.display()
logger.info(f"Total holdout loyalty member id: {df_holdout.count():,}") #263,232


# Remove the holdout group from the original DataFrame to get the remaining 80%
df_target = df.subtract(df_holdout)
df_target.display()
logger.info(f"Total remaining loyalty member id: {df_target.count():,}") #1,051,368

Try this approach to avoid the duplicate records in df_holdout and df_target

# Filter 20% of the data for the holdout group
df_holdout = df.sample(fraction=0.2, seed=42)
    
df_holdout.display()
logger.info(f"Total holdout loyalty member id: {df_holdout.count():,}") #263,232


# Remove the holdout group from the original DataFrame to get the remaining 80%
df_target = df.join(df_holdout, on="BR_ID", how="leftanti")
df_target.display()
logger.info(f"Total remaining loyalty member id: {df_target.count():,}") #1,051,368

Optimal solution

df_target, df_holdout = df.randomSplit([0.8, 0.2], seed=42)
Root Cause Does sample() cause it? Fix
Duplicate rows ✅ Yes Use randomSplit() or add row IDs
subtract() not safe ✅ Yes Use join(..., how='left_anti')
Overlap in splits ✅ Possible Prefer randomSplit()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment