@zacharysyoung
Last active April 20, 2023 17:02
Answering SO-76062643

The simplest, and probably most efficient, way to read and write a CSV of any size in Python will always be the standard library's csv module.

Its reader provides a very simple interface for iterating over the CSV one row at a time (so no more than one row's worth of data is ever held in memory), and each row can be passed directly to the writer (whose output will likely be buffered, so system calls stay minimal). The documentation doesn't show you how to do this, even though it's simple:

reader → row → process(row) → writerow(row)

and repeat. You can process each row and filter on its values. The following example assumes the first column holds a number: that number is squared, and the row is written to the output only if the squared value is greater than 100:

import csv

with (
    open("input.csv", newline="") as f_in,
    open("output.csv", "w", newline="") as f_out,
):
    reader = csv.reader(f_in)
    header = next(reader)  # if you have a header

    writer = csv.writer(f_out)
    writer.writerow(header)

    for row in reader:
        # square the number in the first column...
        x = int(row[0])
        row[0] = x * x

        # ...and keep only rows whose squared value exceeds 100
        if row[0] > 100:
            writer.writerow(row)
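
If you'd rather refer to columns by name, the same streaming pattern works with DictReader/DictWriter. This is just a sketch, and the column name "x" is hypothetical:

import csv

with (
    open("input.csv", newline="") as f_in,
    open("output.csv", "w", newline="") as f_out,
):
    # DictReader consumes the header row itself and yields each row as a dict
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()

    for row in reader:
        x = int(row["x"])  # "x" is a hypothetical column name
        row["x"] = x * x
        if row["x"] > 100:
            writer.writerow(row)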

Coming from the plain reader, every row is just a list of strs, and that same list can be passed directly to the writerow() method. If you need to parse and manipulate some numbers, pass the resulting values as-is: the writer accepts a list of anything, as long as each item has its own __str__ method.
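
For example, here is a minimal sketch of writerow() serializing a mix of types (the values are made up for illustration):

import csv
import io

# writerow() converts each non-string field with str(),
# so ints, floats, and strs can be mixed freely
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([144, 3.14, "hello"])
print(buf.getvalue())  # -> 144,3.14,hello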

With this approach, your maximum memory footprint should be around the size of one row plus the runtime's overhead. In practice, garbage collection means it will probably be the size of a row times some small factor, since I don't know how quickly previously read rows are collected. I've processed 30 GB CSVs with this approach and never saw RAM usage (as reported by /usr/bin/time -l) rise above the overhead of the runtime itself, about 8 MB on my machine. Granted, that was millions of rows with only a few dozen columns; you said something about millions of columns (yikes!), so you should expect to use more RAM.
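
If you want to check your own run's peak memory from inside the script, one option is the standard library's resource module (a sketch assuming a Unix-like OS; resource isn't available on Windows):

import resource

# peak resident set size of this process so far;
# reported in bytes on macOS and in kilobytes on Linux
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak}")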
