- gen.py: pass N as a cmd arg for the number of rows to create and save as test-N.csv, incrementing a date and time column for each row by 1 hour
- filter.py: pass N as cmd arg to filter test-N.csv by some date criteria and write test-N-out.csv
- run_test.py: run gen and filter together for a few Ns and get their timings (a rough sketch is shown below)
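Here is a minimal sketch of what run_test.py boils down to, assuming gen.py and filter.py each take N as their only command-line argument; the real script also formats the table shown below:

```python
import subprocess
import time

# Rough sketch: run gen.py and then filter.py for a few values of N,
# timing each step (assumes both scripts take N as their only argument).
for n in (100_000, 1_000_000, 10_000_000):
    for script in ("gen.py", "filter.py"):
        start = time.perf_counter()
        subprocess.run(["python3", script, str(n)], check=True)
        elapsed = time.perf_counter() - start
        print(f"{script} N={n:_}: {elapsed:.2f} s")
```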
When I run `python3 run_test.py` I get:
File | rows | DT start | DT end | time (s) |
---|---|---|---|---|
test-100000.csv | generated 100_000 rows | 1900-01-01 00:00:00 | 1911-05-30 15:00:00 | 0.31 |
test-100000-out.csv | filtered 100_000 rows | 2000-01-01 11:00:00 | 2010-01-01 14:00:00 | 0.29 |
test-1000000.csv | generated 1_000_000 rows | 1900-01-01 00:00:00 | 2014-01-29 15:00:00 | 2.97 |
test-1000000-out.csv | filtered 912_324 rows | 2000-01-01 11:00:00 | 2010-01-01 14:00:00 | 2.93 |
test-10000000.csv | generated 10_000_000 rows | 1900-01-01 00:00:00 | 3040-10-17 15:00:00 | 29.96 |
test-10000000-out.csv | filtered 9_912_324 rows | 2000-01-01 11:00:00 | 2010-01-01 14:00:00 | 29.70 |
Reading the table we can see:

- For the 100-thousand-row run, all 100_000 rows were filtered out, because test-100000.csv stops in 1911 (checked below) and the filter only keeps rows between 2000-01-01 11:00:00 and 2010-01-01 14:00:00. The 1-million- and 10-million-row CSVs extend into the filter's range, so some rows are written to their test-N-out.csv files.
- A 10x increase in row count leads to a 10x increase in run time: 0.3 s → 3 s → 30 s.
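As a quick check on that first point, the last row of the 100_000-row file is the base datetime plus 99_999 one-hour steps, which indeed lands in 1911:

```python
from datetime import datetime, timedelta

# Last timestamp in test-100000.csv: 1900-01-01 00:00:00 plus 99_999 hours
print(datetime(1900, 1, 1) + timedelta(hours=99_999))  # 1911-05-30 15:00:00
```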
The three test-N.csv input files have the following sizes:

```
1.8 M  test-100000.csv
 18 M  test-1000000.csv
181 M  test-10000000.csv
```
Extrapolating linearly, I estimate that generating a test file with 1 billion rows would produce a file about 20 GB in size, and filter.py should take about 3000 seconds (50 minutes) to process it.
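That estimate is just the linear scaling from the table and the file listing, using the rounded ~30 s timing for the 10-million-row run:

```python
# Linear extrapolation from the 10-million-row run (181 MB, ~30 s) to 1 billion rows
size_10m_mb = 181
time_10m_s = 30
scale = 1_000_000_000 // 10_000_000                 # 100x more rows

print(f"~{size_10m_mb * scale / 1024:.0f} GB")      # ~18 GB
print(f"~{time_10m_s * scale} s (~{time_10m_s * scale // 60} min)")  # ~3000 s (~50 min)
```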
As for memory, this method of reading a row and immediately writing it out uses almost none:

```
/usr/bin/time -l ./filter.py 10_000_000
       29.32 real        29.11 user         0.18 sys
... ...
  7963648 peak memory footprint
```

That is only about 8 MB to filter the 10-million-row CSV, and Python 3 itself takes roughly 8 MB to start on my system, so the filtering adds virtually no memory of its own.
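If you want the same number from inside Python instead of via /usr/bin/time, the standard-library resource module (Unix only) exposes the peak RSS; note the units differ by platform:

```python
import resource

# Peak resident set size of the current process:
# reported in bytes on macOS, in kilobytes on Linux.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak:_}")
```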
The actual code for gen and filter is a little messier than I wanted, because of all the timing and printing.
Here is the essence of both, to highlight the basics of the problem I think you're trying to solve:
- gen.py:

```python
import csv
from datetime import datetime, timedelta

BASE_DT = datetime(1900, 1, 1)

# Write one row per hour starting at 1900-01-01 00:00:00:
# a date column (YYYYMMDD) and a time column (HH:MM:SS),
# so the first row is 19000101,00:00:00.
with open("test-1_000_000.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(1_000_000):
        dt = BASE_DT + timedelta(hours=i)
        ymd = dt.strftime(r"%Y%m%d")
        hms = dt.strftime(r"%H:%M:%S")
        writer.writerow([ymd, hms])
```
- filter.py:

```python
import csv
from datetime import datetime

START_DT = datetime(2000, 1, 1, 11)
END_DT = datetime(2010, 1, 1, 14)

def meets_my_condition(row):
    # row[0] is YYYYMMDD and row[1] is HH:MM:SS; keep rows inside the window
    dt = datetime.strptime(row[0] + row[1], r"%Y%m%d%H:%M:%S")
    return dt >= START_DT and dt <= END_DT

f_in = open("test-1_000_000.csv", newline="")
f_out = open("test-1_000_000-out.csv", "w", newline="")
reader = csv.reader(f_in)
writer = csv.writer(f_out)

# Stream one row at a time: read, test, write; nothing accumulates in memory
for row in reader:
    if meets_my_condition(row):
        writer.writerow(row)

f_in.close()
f_out.close()
```