This code is functionally correct, but it's inefficient and overly complex for what it does: writing a single integer (record count) to a file.
```python
def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):
    ok_output_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/tmp_ok_output/"
    final_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/result/"
    # Write .ok file
    data = [(df.count(),)]  # Wrap the integer in a tuple
    schema = ["total_record"]
    df_count = spark.createDataFrame(data, schema)
    df_count = df_count.withColumn("total_record", F.col("total_record").cast("string"))
    df_count.coalesce(1).write.format("csv").option("header", "false").mode("overwrite").save(ok_output_path)
    # Get the location of the CSV file that was just saved to Azure Blob Storage (it ends with 'csv')
    ok_output_paths = [file_name.path for file_name in dbutils.fs.ls(ok_output_path) if file_name.name.endswith("csv")]
    dbutils.fs.mv(ok_output_paths[0], final_path + f"{out_name}{audit_prefix}")
    # Remove the temporary output folder
    dbutils.fs.rm(ok_output_path, True)
    logger.info(f"{out_name}{audit_prefix} has been created in {storage_account}")
```
- **Unnecessary Spark overhead**
  - `spark.createDataFrame(...)` kicks off distributed processing just to handle one number.
  - Creating a DataFrame, applying a transformation, coalescing, and writing as CSV is overkill for a single value.
- **Slow due to Spark's I/O pipeline**
  - Even with `.coalesce(1)`, Spark still writes a folder with a `part-*.csv` file inside it (see the sketch below).
  - This adds latency and requires post-processing if you want a `.ok` file.
- **Not producing a simple `.ok` file directly**
  - Spark writes a folder, not a flat file like `abc.ok`, so it's not ideal if other systems expect a single `.ok` file.
- **Unnecessary casting step**
  - `withColumn("total_record", F.col(...).cast("string"))` adds a transformation you don't need if you're writing a string directly.
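To make the folder problem concrete, here is a minimal sketch of what the CSV write above actually leaves behind. It reuses `df_count`, `ok_output_path`, and the `dbutils` notebook global from the snippet above; the exact part-file name varies per run, and any extra commit-marker files depend on the writer in use:

```python
# Sketch only: list what Spark wrote, even with coalesce(1).
df_count.coalesce(1).write.format("csv").option("header", "false").mode("overwrite").save(ok_output_path)

for f in dbutils.fs.ls(ok_output_path):
    print(f.name)

# Typical listing (names are illustrative):
#   _SUCCESS                     <- job-completion marker
#   part-00000-<uuid>-c000.csv   <- the one file you actually want
```

This is why the original function needs the extra `ls`/`mv`/`rm` dance just to end up with a single flat file.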
A simpler approach skips Spark's writer entirely and writes the count straight to the final path with `dbutils.fs.put`:

```python
def write_table_to_file(df, container, storage_account, out_name=None, delimiter="|", audit_prefix=".ok"):
    final_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/result/"
    # Count the rows in the DataFrame
    record_count = df.count()
    # Create the corresponding .ok file with the record count
    ok_output_path = final_path + f"{out_name}{audit_prefix}"
    dbutils.fs.put(ok_output_path, str(record_count), overwrite=True)
    logger.info(f"{out_name}{audit_prefix} has been created in {storage_account}")
```
- `df.count()` computes the number of rows (this does trigger a full scan, which is unavoidable if you need the count).
- `dbutils.fs.put(...)` writes the count directly to the specified path as a single small file.
- This avoids the overhead of creating a Spark DataFrame and going through the distributed write path.
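For completeness, a usage sketch; the table, container, and storage-account names here are hypothetical, and `spark`, `dbutils`, and `logger` are assumed to be the usual Databricks notebook globals:

```python
# Hypothetical names throughout; substitute your own.
df = spark.table("sales.orders")

write_table_to_file(
    df,
    container="landing",              # hypothetical container
    storage_account="mystorageacct",  # hypothetical storage account
    out_name="orders",
)
# Result: a single flat file result/orders.ok whose contents are the row count.
```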