Trying to help OP solve SO-75413294

Establish CSV read/process/write baseline

  • gen.py: pass N as a cmd arg for the number of rows to create and save as test-N.csv, incrementing a date column and a time column by 1 hour for each row (sample rows below)
  • filter.py: pass N as a cmd arg to filter test-N.csv by some date criteria and write test-N-out.csv
  • run_test.py: run gen and filter together for a few Ns and get their timings
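
For reference, based on gen.py's format strings, the first few rows of a generated test-N.csv should look like this (a %Y%m%d date column and an %H:%M:%S time column, one row per hour starting at 1900-01-01):

    19000101,00:00:00
    19000101,01:00:00
    19000101,02:00:00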

When I run python3 run_test.py I get:

| Test N | row specs | DT Start | DT End | time (s) |
| ------ | --------- | -------- | ------ | -------- |
| test-100000.csv | generated 100_000 rows | 1900-01-01 00:00:00 | 1911-05-30 15:00:00 | 0.31 |
| test-100000-out.csv | filtered 100_000 rows | 2000-01-01 11:00:00 | 2010-01-01 14:00:00 | 0.29 |
| test-1000000.csv | generated 1_000_000 rows | 1900-01-01 00:00:00 | 2014-01-29 15:00:00 | 2.97 |
| test-1000000-out.csv | filtered 912_324 rows | 2000-01-01 11:00:00 | 2010-01-01 14:00:00 | 2.93 |
| test-10000000.csv | generated 10_000_000 rows | 1900-01-01 00:00:00 | 3040-10-17 15:00:00 | 29.96 |
| test-10000000-out.csv | filtered 9_912_324 rows | 2000-01-01 11:00:00 | 2010-01-01 14:00:00 | 29.70 |

Reading the table we can see:

  • For the 100-thousand-row run, all 100_000 rows were filtered out because test-100000.csv stops at 1911 and the filter only keeps rows between 2000-01-01 11:00:00 and 2010-01-01 14:00:00.

    The 1-million and 10-million-row CSVs extend into the filter's range, so some rows are written to their test-N-out.csv files.

  • A 10x increase in row count leads to a 10x increase in run time: 0.3s → 3s → 30s

The three test-N.csv input files have the sizes:

      1.8M  test-100000.csv
       18M  test-1000000.csv
      181M  test-10000000.csv

I estimate that generating a test file with 1 billion rows would make a file about 20GB in size, and should take about 3000 seconds (50 minutes) to process with filter.py.
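
That estimate is just a linear scale-up of the 10-million-row numbers (181 M on disk, about 30 s to filter); a quick back-of-the-envelope check:

    # Linear extrapolation from the 10-million-row run to 1 billion rows
    rows_measured = 10_000_000
    size_measured_mb = 181  # from the file listing above
    time_measured_s = 30    # from the timing table above

    scale = 1_000_000_000 / rows_measured  # 100x

    print(f"~{size_measured_mb * scale / 1024:.0f} GB")                # ~18 GB, call it 20 GB
    print(f"~{time_measured_s * scale:.0f} s "
          f"(~{time_measured_s * scale / 60:.0f} min)")                # ~3000 s, ~50 min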

As for memory, this method of reading a row and immediately writing the row uses almost no memory:

/usr/bin/time -l ./filter.py 10_000_000
       29.32 real        29.11 user         0.18 sys
                 ...  ...
             7963648  peak memory footprint

Only about 8 MB to filter the 10-million-row CSV; Python 3 itself takes about 8 MB to run on my system, so the filtering adds virtually no memory of its own.
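
As an aside, if you want that peak-memory number from inside the script rather than from /usr/bin/time, the standard-library resource module (Unix-only) can report it; note that ru_maxrss is in bytes on macOS but kilobytes on Linux. A minimal sketch, not part of the gist's scripts:

    import resource

    # Peak resident set size of the current process so far.
    # Units: bytes on macOS, kilobytes on Linux.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"peak RSS: {peak:_}")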

Code for the actual problem at hand

The actual code for gen and filter is a little messier than I wanted, because of all the timing and printing; the full files are included at the end of this gist.

Here is the essence of both, to highlight the basics of the problem I think you're trying to solve:

  • gen.py:

    import csv
    
    from datetime import datetime, timedelta
    
    
    BASE_DT = datetime(1900, 1, 1)
    
    with open("test-1_000_000.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(1_000_000):
            dt = BASE_DT + timedelta(hours=i)
            ymd = dt.strftime(r"%Y%m%d")
            hms = dt.strftime(r"%H:%M:%S")
    
            writer.writerow([ymd, hms])
  • filter.py:

    import csv
    
    from datetime import datetime
    
    
    START_DT = datetime(2000, 1, 1, 11)
    END_DT = datetime(2010, 1, 1, 14)
    
    def meets_my_condition(row):
        dt = datetime.strptime(row[0] + row[1], r"%Y%m%d%H:%M:%S")
        return dt >= START_DT and dt <= END_DT
    
    
    f_in = open("test-1_000_000.csv", newline="")
    f_out = open("test-1_000_000-out.csv", "w", newline="")
    
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    
    for row in reader:
        if meets_my_condition(row):
            writer.writerow(row)
    
    f_in.close()
    f_out.close()
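
One optional tweak before the full, timed versions below: the explicit open()/close() pairs in the essence above can be replaced with with blocks, so both files are closed even if the loop raises partway through. Same streaming behavior (one row read, tested, and maybe written at a time, so memory stays flat); this is just a sketch of the alternative:

    import csv

    from datetime import datetime


    START_DT = datetime(2000, 1, 1, 11)
    END_DT = datetime(2010, 1, 1, 14)


    def meets_my_condition(row):
        dt = datetime.strptime(row[0] + row[1], r"%Y%m%d%H:%M:%S")
        return START_DT <= dt <= END_DT


    # Both files are closed automatically, even on an exception
    with open("test-1_000_000.csv", newline="") as f_in, \
         open("test-1_000_000-out.csv", "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            if meets_my_condition(row):
                writer.writerow(row)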
filter.py (full version, with counting and timing):

    import csv
    import sys
    import time

    from datetime import datetime

    START_DT = datetime(2000, 1, 1, 11)
    END_DT = datetime(2010, 1, 1, 14)


    def meets_my_condition(row):
        dt = datetime.strptime(row[0] + row[1], r"%Y%m%d%H:%M:%S")
        return dt >= START_DT and dt <= END_DT


    n = int(sys.argv[1])

    in_ct = 0
    out_ct = 0

    fname_in = f"test-{n}.csv"
    fname_out = f"test-{n}-out.csv"

    beg_t = time.perf_counter()

    f_in = open(fname_in, newline="")
    f_out = open(fname_out, "w", newline="")

    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    for row in reader:
        in_ct += 1
        if meets_my_condition(row):
            writer.writerow(row)
            out_ct += 1

    f_in.close()
    f_out.close()

    end_t = time.perf_counter()

    print(f"| {fname_out} | filtered {in_ct-out_ct:_} rows | {START_DT} | {END_DT} | {round(end_t-beg_t,2)} |")
gen.py (full version, with timing):

    import csv
    import sys
    import time

    from datetime import datetime, timedelta

    # Generate a test file with n rows, starting at Jan 1, 1900 and
    # incrementing each row by an hour

    BASE_DT = datetime(1900, 1, 1)

    n = int(sys.argv[1])

    dt = datetime(1, 1, 1)  # declare value outside loop so we can print its final value at the end

    fname = f"test-{n}.csv"

    beg_t = time.perf_counter()

    with open(fname, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(n):
            dt = BASE_DT + timedelta(hours=i)
            ymd = dt.strftime(r"%Y%m%d")
            hms = dt.strftime(r"%H:%M:%S")

            writer.writerow([ymd, hms])

    end_t = time.perf_counter()

    print(f"| {fname} | generated {n:_} rows | {BASE_DT} | {dt} | {round(end_t-beg_t,2)} |")
run_test.py:

    from subprocess import run

    print("| Test N | row specs | DT Start | DT End | time (s) |")
    print("| ------ | --------- | -------- | ------ | -------- |")

    for x in [100_000, 1_000_000, 10_000_000]:
        n = str(x)
        run(["python3", "gen.py", n])
        run(["python3", "filter.py", n])