I mocked up a 60 MB XML file by taking all the small samples in your original ZIP archive and copying them 200 times over, which yielded more than 425k `tok` elements.
I then profiled your code and found one really bad culprit chewing up the time.
Processing that XML took about 35 seconds:
Thu Jun 29 10:50:59 2023 profile.stats
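For reference, here's a minimal sketch of how such a profile can be captured and inspected with the standard library (the main() entry point is a placeholder for the conversion code):

```python
import cProfile
import pstats

# Profile the conversion and dump the raw timings to profile.stats
cProfile.run("main()", "profile.stats")  # main() is a placeholder

# Load the dump and show the 10 most expensive calls by cumulative time
stats = pstats.Stats("profile.stats")
stats.sort_stats("cumulative").print_stats(10)
```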
The simplest and (probably) most efficient way to read and write any-sized CSV in Python is, bar none, to use the standard csv module.
Its reader provides a very simple interface for iterating the CSV one row at a time (so never more than one row's worth of memory is consumed), and each row can be passed directly to the writer (which will probably be buffered, so minimal syscalls). But the documentation doesn't show you this pattern, even though it's so simple:
reader → row → process(row) → writerow(row)
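A minimal sketch of that pipeline (process() and the file names are placeholders):

```python
import csv

def process(row: list[str]) -> list[str]:
    # Placeholder transform: strip surrounding whitespace from every field
    return [field.strip() for field in row]

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:  # only one row in memory at a time
        writer.writerow(process(row))
```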
The following code has revealed different behavior over time. When I first tested it, all 50 requests were kicked off at once and every one completed (with a 200), but each subsequent request took longer and longer to respond. The output also showed a minimum response time of 1.5s, which, as I recall, suggests that Airtable's dynamic back-offs were sticky:
Submitting 50 requests "at once"
All 50 are sent in the span of one second, but each subsequent request takes progressively longer to respond:
started request 42 at Jul 21 11:23:27.565001, ended in 1.50s
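For context, a minimal sketch of the kind of harness that produces output like the above, using a thread pool and the requests library (the URL is a placeholder, not Airtable's actual endpoint):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

URL = "https://api.example.com/endpoint"  # placeholder

def timed_get(i: int) -> None:
    start = time.perf_counter()
    resp = requests.get(URL)
    elapsed = time.perf_counter() - start
    print(f"request {i}: status {resp.status_code}, ended in {elapsed:.2f}s")

print('Submitting 50 requests "at once"')
with ThreadPoolExecutor(max_workers=50) as pool:
    for i in range(50):
        pool.submit(timed_get, i)
```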
- input.xml: a sample of OP's XML. The downloaded XML incorrectly declares its encoding as ISO-8859-1; it is really encoded as Windows-1252 (see the decoding sketch after this list). I've tried viewing the Raw representation in this Gist and copy-pasting it over my original file; git doesn't report any modifications, so I presume the copy-paste preserves the Windows-1252 encoding.
- main.py: OP's original program with some small tweaks for style and type correctness, plus a fix for the bug of not iterating the rupture nodes.
- output.csv: what main.py generates given input.xml
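As noted above, a minimal sketch of working around the wrong declaration by decoding the bytes as Windows-1252 before parsing (a workaround I'd reach for, not necessarily what main.py does):

```python
import xml.etree.ElementTree as ET

# The XML declaration claims ISO-8859-1, but the bytes are really Windows-1252.
with open("input.xml", "rb") as f:
    text = f.read().decode("windows-1252")

# Strip the XML declaration so the parser doesn't trust the wrong encoding
if text.startswith("<?xml"):
    text = text[text.index("?>") + 2:]

root = ET.fromstring(text)
```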
For CSV to dataclass...
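A minimal sketch of the CSV-to-dataclass idea (the Record fields and input file are hypothetical):

```python
import csv
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    qty: int

with open("input.csv", newline="") as f:
    # Build one Record per CSV row, converting fields as needed
    records = [
        Record(name=row["name"], qty=int(row["qty"]))
        for row in csv.DictReader(f)
    ]
```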
I originally had this logic to check whether any row of a CSV contained blank values:
n_cols = len(rows[0])
for row in rows:
    if len([x for x in row if x]) != n_cols:
        ...  # a blank (or missing) value was found in this row
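Since empty strings are falsy, the same intent can be written more directly with all(); a sketch that also calls out short rows, which the length-based check lumped in with blanks:

```python
n_cols = len(rows[0])
for row in rows:
    if len(row) != n_cols or not all(row):
        # short row, or at least one blank value
        print("bad row:", row)
```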
- Download all the .py scripts and run.sh
- `pip install json-stream`
- `sh run.sh`
run.sh calls the run_*.py scripts, which will run gen_json.py to generate three JSON test files of varying size.
The generated JSON looks like:
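For reference, a minimal sketch of consuming such a file lazily with json-stream (the top-level-array shape, file name, and id field are assumptions, not necessarily what gen_json.py emits):

```python
import json_stream

with open("test-small.json") as f:  # placeholder file name
    # json_stream.load parses lazily instead of building the whole document
    for record in json_stream.load(f):  # assumes a top-level JSON array
        print(record["id"])  # hypothetical field
```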
- gen.py: pass N as a command-line argument for the number of rows to create and save as test-N.csv, incrementing a date and a time column by 1 hour for each row (a rough sketch follows the timings table below)
- filter.py: pass N as a command-line argument to filter test-N.csv by some date criteria and write test-N-out.csv
- run_test.py: run gen and filter together for a few Ns and get their timings:
When I run `python3 run_test.py` I get:
| Test N | row specs | DT Start | DT End | time (s) |
|---|---|---|---|---|
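For reference, the core of gen.py might look something like this sketch (the header names and start datetime are assumptions, not the actual script):

```python
import csv
import sys
from datetime import datetime, timedelta

def gen(n: int) -> None:
    """Write test-N.csv, stepping the date and time columns by 1 hour per row."""
    dt = datetime(2000, 1, 1)  # hypothetical start
    with open(f"test-{n}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "time", "value"])  # hypothetical header
        for i in range(n):
            writer.writerow([dt.date().isoformat(), dt.time().isoformat(), i])
            dt += timedelta(hours=1)

if __name__ == "__main__":
    gen(int(sys.argv[1]))
```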
#!/usr/bin/env python3
import csv
import re
import subprocess
from typing import TypedDict
import glob
import os
import sys
package main

// Print directory tree from first arg.
//
// https://gist.github.com/zacharysyoung/64b6593f7d0314d0eb29bbc9ef121f1e

import (
    "fmt"
    "log"
    "os"
)