Troubleshooting writing the total row count to a .ok file in PySpark

Write the record count to a .ok file

The code below is functionally correct, but it is inefficient and overly complex for what it does: writing a single integer (the record count) to a file.

import logging

from pyspark.sql import functions as F

logger = logging.getLogger(__name__)

# Assumes a Databricks notebook, where `spark` and `dbutils` are in scope.
def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):
    ok_output_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/tmp_ok_output/"
    final_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/result/"

    # Build a one-row DataFrame holding the record count
    data = [(df.count(),)]  # wrap the integer in a tuple
    schema = ["total_record"]

    df_count = spark.createDataFrame(data, schema)
    df_count = df_count.withColumn("total_record", F.col("total_record").cast("string"))
    df_count.coalesce(1).write.format("csv").option("header", "false").mode("overwrite").save(ok_output_path)

    # Locate the part file Spark just wrote to Azure blob storage (it ends with 'csv')
    ok_output_paths = [file_name.path for file_name in dbutils.fs.ls(ok_output_path) if file_name.name.endswith("csv")]
    dbutils.fs.mv(ok_output_paths[0], final_path + f"{out_name}{audit_prefix}")

    # Remove the temporary output folder
    dbutils.fs.rm(ok_output_path, True)
    logger.info(f"{out_name}{audit_prefix} has been created in {storage_account}")

  • Unnecessary Spark overhead

    • spark.createDataFrame(...) kicks off distributed processing just to handle one number.

    • Creating a DataFrame, applying a transformation, coalescing, and writing as CSV is overkill for a single value.

  • Slow due to Spark's I/O pipeline

    • Even with .coalesce(1), Spark still writes a folder with a part-*.csv file inside it.

    • This adds latency and requires post-processing to end up with a .ok file.

  • Does not produce a simple .ok file directly

    • Spark writes a folder, not a flat file like abc.ok, so it is not ideal when other systems expect a single .ok file (see the sketch after this list).

  • Unnecessary extra casting step

    • withColumn("total_record", F.col(...).cast("string")) adds a transformation you don't need if you write the string directly.
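
To make the folder-vs-flat-file point concrete, here is a minimal sketch that lists what the Spark CSV write leaves behind. It assumes a Databricks notebook (dbutils in scope); the container and account names are hypothetical:

# Sketch: inspect what the Spark CSV write above actually produces.
# The path is hypothetical; adjust container/account to your environment.
ok_output_path = "abfss://mycontainer@myaccount.dfs.core.windows.net/tmp_ok_output/"

for f in dbutils.fs.ls(ok_output_path):
    print(f.name)

# This typically prints Spark bookkeeping files plus the data file, e.g.
# _SUCCESS and part-00000-<uuid>.csv, rather than the single flat .ok file
# downstream systems expect, hence the extra ls/mv/rm steps in the function.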

Instead, try this approach, which is faster and cleaner:

def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):
    final_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/result/"

    # Count the rows in the DataFrame (this triggers a full scan)
    record_count = df.count()

    # Write the count straight to the corresponding .ok file as plain text
    ok_output_path = final_path + f"{out_name}{audit_prefix}"

    dbutils.fs.put(ok_output_path, str(record_count), overwrite=True)
    logger.info(f"{out_name}{audit_prefix} has been created in {storage_account}")

  • df.count() computes the number of rows (this triggers a full scan, which is unavoidable here).

  • dbutils.fs.put(...) writes the count directly to the specified path as a small flat file.

  • This avoids the overhead of DataFrame creation and Spark's distributed write path (see the usage sketch below).
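
A minimal usage sketch follows. It assumes a Databricks notebook where spark and dbutils are in scope; the input path, container, and account names are hypothetical:

import logging

logger = logging.getLogger(__name__)

# Hypothetical input table
df = spark.read.parquet("abfss://mycontainer@myaccount.dfs.core.windows.net/input/orders/")

write_table_to_file(
    df,
    container="mycontainer",        # hypothetical
    storage_account="myaccount",    # hypothetical
    out_name="orders",
)

# Verify: the .ok file should contain just the row count as plain text.
print(dbutils.fs.head("abfss://mycontainer@myaccount.dfs.core.windows.net/result/orders.ok"))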
