Troubleshooting writing the total row count to a .ok file in PySpark

Write the record count to a .ok file

The code below is functionally correct, but it is inefficient and overly complex for what it does: writing a single integer (the record count) to a file.

import logging

from pyspark.sql import functions as F

logger = logging.getLogger(__name__)

# Assumes a Databricks notebook, where `spark` and `dbutils` are in scope.
def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):
    ok_output_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/tmp_ok_output/"
    final_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/result/"

    # Build a one-row DataFrame holding the record count
    data = [(df.count(),)]  # wrap the integer in a tuple
    schema = ["total_record"]

    df_count = spark.createDataFrame(data, schema)
    df_count = df_count.withColumn("total_record", F.col("total_record").cast("string"))
    df_count.coalesce(1).write.format("csv").option("header", "false").mode("overwrite").save(ok_output_path)

    # Locate the part file Spark just wrote to Azure blob storage (it ends with 'csv')
    ok_output_paths = [file_name.path for file_name in dbutils.fs.ls(ok_output_path) if file_name.name.endswith("csv")]
    dbutils.fs.mv(ok_output_paths[0], final_path + f"{out_name}{audit_prefix}")

    # Remove the temporary output folder
    dbutils.fs.rm(ok_output_path, True)
    logger.info(f"{out_name}{audit_prefix} has been created in {storage_account}")

  • Unnecessary Spark overhead

    • spark.createDataFrame(...) kicks off distributed processing just to handle one number.

    • Creating a DataFrame, applying a transformation, coalescing, and writing as CSV is overkill for a single value.

  • Slow due to Spark's I/O pipeline

    • Even with .coalesce(1), Spark still writes a folder with a part-*.csv file inside it.

    • This adds latency and requires post-processing to end up with a .ok file.

  • Does not produce a simple .ok file directly

    • Spark writes a folder, not a flat file like abc.ok, so it is not ideal when other systems expect a single .ok file (see the sketch after this list).

  • Unnecessary extra casting step

    • withColumn("total_record", F.col(...).cast("string")) adds a transformation you don't need if you write the string directly.
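
To make the folder-vs-flat-file point concrete, here is a minimal sketch that lists what the Spark CSV write leaves behind. It assumes a Databricks notebook (dbutils in scope); the container and account names are hypothetical:

# Sketch: inspect what the Spark CSV write above actually produces.
# The path is hypothetical; adjust container/account to your environment.
ok_output_path = "abfss://mycontainer@myaccount.dfs.core.windows.net/tmp_ok_output/"

for f in dbutils.fs.ls(ok_output_path):
    print(f.name)

# This typically prints Spark bookkeeping files plus the data file, e.g.
# _SUCCESS and part-00000-<uuid>.csv, rather than the single flat .ok file
# downstream systems expect, hence the extra ls/mv/rm steps in the function.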

Instead, try this approach, which is faster and cleaner:

def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):
    final_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/result/"

    # Count the rows in the DataFrame (this triggers a full scan)
    record_count = df.count()

    # Write the count straight to the corresponding .ok file as plain text
    ok_output_path = final_path + f"{out_name}{audit_prefix}"

    dbutils.fs.put(ok_output_path, str(record_count), overwrite=True)
    logger.info(f"{out_name}{audit_prefix} has been created in {storage_account}")

  • df.count() computes the number of rows (this triggers a full scan, which is unavoidable here).

  • dbutils.fs.put(...) writes the count directly to the specified path as a small flat file.

  • This avoids the overhead of DataFrame creation and Spark's distributed write path (see the usage sketch below).
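
A minimal usage sketch follows. It assumes a Databricks notebook where spark and dbutils are in scope; the input path, container, and account names are hypothetical:

import logging

logger = logging.getLogger(__name__)

# Hypothetical input table
df = spark.read.parquet("abfss://mycontainer@myaccount.dfs.core.windows.net/input/orders/")

write_table_to_file(
    df,
    container="mycontainer",        # hypothetical
    storage_account="myaccount",    # hypothetical
    out_name="orders",
)

# Verify: the .ok file should contain just the row count as plain text.
print(dbutils.fs.head("abfss://mycontainer@myaccount.dfs.core.windows.net/result/orders.ok"))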
