I mocked up a 60 MB XML file by taking all the small samples in your original ZIP archive and copying them 200 times over, which yielded more than 425k `tok` elements.
I then profiled your code and found one really bad culprit chewing up the time.
Processing that XML took about 35 seconds:
Thu Jun 29 10:50:59 2023 profile.stats
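For reference, here's a minimal sketch of how such a profile can be captured and inspected with the standard library (the main() entry point is a placeholder for the conversion code):

```python
import cProfile
import pstats

# Profile the conversion and dump the raw timings to profile.stats
cProfile.run("main()", "profile.stats")  # main() is a placeholder

# Load the dump and show the 10 most expensive calls by cumulative time
stats = pstats.Stats("profile.stats")
stats.sort_stats("cumulative").print_stats(10)
```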
The simplest and (probably) most efficient way to read and write any-sized CSV in Python is, bar none, to use the standard csv module.
Its reader provides a very simple interface for iterating the CSV one row at a time (so never more than one row's worth of memory is consumed), and each row can be passed directly to the writer (which will probably be buffered, so minimal syscalls). But the documentation doesn't show you this pattern, even though it's so simple:
reader → row → process(row) → writerow(row)
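A minimal sketch of that pipeline (process() and the file names are placeholders):

```python
import csv

def process(row: list[str]) -> list[str]:
    # Placeholder transform: strip surrounding whitespace from every field
    return [field.strip() for field in row]

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:  # only one row in memory at a time
        writer.writerow(process(row))
```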
The following code has revealed different behavior over time. When I first tested it, all 50 requests were kicked off at once and every one completed (with a 200), but each subsequent request took longer and longer to respond. The output also showed a minimum response time of 1.5s, which, as I recall, suggests that Airtable's dynamic back-offs were sticky:
Submitting 50 requests "at once"
All 50 are sent in the span of one second, but each subsequent request takes progressively longer to respond:
started request 42 at Jul 21 11:23:27.565001, ended in 1.50s
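For context, a minimal sketch of the kind of harness that produces output like the above, using a thread pool and the requests library (the URL is a placeholder, not Airtable's actual endpoint):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

URL = "https://api.example.com/endpoint"  # placeholder

def timed_get(i: int) -> None:
    start = time.perf_counter()
    resp = requests.get(URL)
    elapsed = time.perf_counter() - start
    print(f"request {i}: status {resp.status_code}, ended in {elapsed:.2f}s")

print('Submitting 50 requests "at once"')
with ThreadPoolExecutor(max_workers=50) as pool:
    for i in range(50):
        pool.submit(timed_get, i)
```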
- input.xml: a sample of OP's XML. The downloaded XML incorrectly declares its encoding as ISO-8859-1; it is really encoded as Windows-1252 (see the decoding sketch after this list). I've tried viewing the Raw representation in this Gist and copy-pasting it over my original file; git doesn't report any modifications, so I presume the copy-paste preserves the Windows-1252 encoding.
- main.py: OP's original program with some small tweaks for style and type correctness, plus a fix for the bug of not iterating the rupture nodes.
- output.csv: what main.py generates given input.xml
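As noted above, a minimal sketch of working around the wrong declaration by decoding the bytes as Windows-1252 before parsing (a workaround I'd reach for, not necessarily what main.py does):

```python
import xml.etree.ElementTree as ET

# The XML declaration claims ISO-8859-1, but the bytes are really Windows-1252.
with open("input.xml", "rb") as f:
    text = f.read().decode("windows-1252")

# Strip the XML declaration so the parser doesn't trust the wrong encoding
if text.startswith("<?xml"):
    text = text[text.index("?>") + 2:]

root = ET.fromstring(text)
```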
For CSV to dataclass...
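A minimal sketch of the CSV-to-dataclass idea (the Record fields and input file are hypothetical):

```python
import csv
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    qty: int

with open("input.csv", newline="") as f:
    # Build one Record per CSV row, converting fields as needed
    records = [
        Record(name=row["name"], qty=int(row["qty"]))
        for row in csv.DictReader(f)
    ]
```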
I originally had this logic to check whether any row of a CSV contained blank values:
n_cols = len(rows[0])
for row in rows:
    if len([x for x in row if x]) != n_cols:
        ...  # a blank (or missing) value was found in this row
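Since empty strings are falsy, the same intent can be written more directly with all(); a sketch that also calls out short rows, which the length-based check lumped in with blanks:

```python
n_cols = len(rows[0])
for row in rows:
    if len(row) != n_cols or not all(row):
        # short row, or at least one blank value
        print("bad row:", row)
```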
- Download all the .py scripts and run.sh
- `pip install json-stream`
- `sh run.sh`
run.sh calls the run_*.py scripts, which will run gen_json.py to generate three JSON test files of varying size.
The generated JSON looks like:
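For reference, a minimal sketch of consuming such a file lazily with json-stream (the top-level-array shape, file name, and id field are assumptions, not necessarily what gen_json.py emits):

```python
import json_stream

with open("test-small.json") as f:  # placeholder file name
    # json_stream.load parses lazily instead of building the whole document
    for record in json_stream.load(f):  # assumes a top-level JSON array
        print(record["id"])  # hypothetical field
```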
- gen.py: pass N as a command-line argument for the number of rows to create and save as test-N.csv, incrementing a date and a time column by 1 hour for each row (a rough sketch follows the timings table below)
- filter.py: pass N as a command-line argument to filter test-N.csv by some date criteria and write test-N-out.csv
- run_test.py: run gen and filter together for a few Ns and get their timings:
When I run `python3 run_test.py` I get:
| Test N | row specs | DT Start | DT End | time (s) |
|---|---|---|---|---|
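For reference, the core of gen.py might look something like this sketch (the header names and start datetime are assumptions, not the actual script):

```python
import csv
import sys
from datetime import datetime, timedelta

def gen(n: int) -> None:
    """Write test-N.csv, stepping the date and time columns by 1 hour per row."""
    dt = datetime(2000, 1, 1)  # hypothetical start
    with open(f"test-{n}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "time", "value"])  # hypothetical header
        for i in range(n):
            writer.writerow([dt.date().isoformat(), dt.time().isoformat(), i])
            dt += timedelta(hours=1)

if __name__ == "__main__":
    gen(int(sys.argv[1]))
```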
#!/usr/bin/env python3
import csv
import re
import subprocess
from typing import TypedDict
import glob
import os
import sys
package main

// Print directory tree from first arg.
//
// https://gist.github.com/zacharysyoung/64b6593f7d0314d0eb29bbc9ef121f1e

import (
    "fmt"
    "log"
    "os"
)