@jacksmith15
Last active April 22, 2020 20:23

Some examples of CSV operations in Python

Using standard library

The following uses the csv module from the standard library, specifically DictReader, to parse a CSV file into a list of dictionaries.

DictReader accepts the file object and automatically parses the header row for the field names. It can then be iterated over to get each row of the CSV. This happens lazily by default, but since we want the whole thing, we load it into a list (which means we can let the file close).

import csv
from typing import Dict, List

def csv_to_dict(filename: str) -> List[Dict[str, str]]:
    # The csv docs recommend newline="" so the reader handles line endings itself.
    with open(filename, "r", encoding="utf-8", newline="") as file:
        reader = csv.DictReader(file)
        return list(reader)
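
As a quick check of the function above, here is a minimal sketch that writes a small CSV to a temporary file and parses it. The sample data and the column names (name, age) are made up for illustration; they are not from the original gist.

```python
import csv
import os
import tempfile
from typing import Dict, List


def csv_to_dict(filename: str) -> List[Dict[str, str]]:
    with open(filename, "r", encoding="utf-8", newline="") as file:
        return list(csv.DictReader(file))


# Hypothetical sample data, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="utf-8", newline="", delete=False
) as tmp:
    tmp.write("name,age\nAda,36\nGrace,45\n")

rows = csv_to_dict(tmp.name)
os.remove(tmp.name)  # clean up the temporary file

for row in rows:
    # Each row is a dictionary keyed by the header names.
    print(row["name"], row["age"])
```

Note that every value comes back as a string ("36", not 36); the csv module does no type conversion, so any casting is up to you.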

Breakdown of the above

with

In the above, the with statement uses the file returned by open as a "context manager", which is similar to Java's try-with-resources. Context managers define entry and exit logic; for a file, the exit logic closes it. This is roughly equivalent to:

file = open(filename)
try:
    ...
finally:
    file.close()

It's good practice to use context managers when available, as they guarantee the file is closed even if an error occurs inside the block.
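
To make the entry and exit logic concrete, here is a minimal sketch of a hand-written context manager. The Tracked class and the events list are hypothetical names invented for this example; it only records when each step runs, rather than opening a real file.

```python
events = []


class Tracked:
    """Hypothetical context manager that records its entry and exit."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __enter__(self) -> "Tracked":
        # Entry logic: runs when the with block starts.
        events.append(f"enter {self.name}")
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        # Exit logic: runs when the block ends, even if it raised.
        events.append(f"exit {self.name}")
        return False  # False means any exception keeps propagating


try:
    with Tracked("demo") as resource:
        events.append(f"using {resource.name}")
        raise ValueError("something went wrong")
except ValueError:
    events.append("error handled")

print(events)
# ['enter demo', 'using demo', 'exit demo', 'error handled']
```

The exit step runs before the exception reaches the except clause, which is the same guarantee the file object gives you for closing.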

What is reader?

Calling DictReader returns a DictReader instance, which wraps the file object. It exposes the rows of the file as an Iterator. This means it can be interacted with in the following ways:

For loop

for row in reader:
    # do something with the row
    print(row)

List Comprehension

rows_with_missing_values = [
    row
    for row in reader
    if some_condition(row)
]

A key point of this typing is that reader is not an instance of a class called "Iterator"; rather, it implements the "Iterator" protocol, which Python's syntax relies on. This is duck typing in action! We don't care whether it's a list, a tuple or a DictReader - so long as we can iterate over it!

There's a good table summarising these implicit container types here.
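
The protocol can be seen directly with the built-in next(), which is what a for loop calls behind the scenes. The sketch below is illustrative only: io.StringIO stands in for an open file, the sample data is made up, and count_rows is a hypothetical helper.

```python
import csv
import io


def count_rows(rows) -> int:
    """Count the items in anything iterable: a list, a tuple, a DictReader..."""
    return sum(1 for _ in rows)


# io.StringIO behaves like an open text file; the data is invented.
data = io.StringIO("name,age\nAda,36\nGrace,45\n")
reader = csv.DictReader(data)

first = next(reader)  # pulls exactly one row from the reader
print(first["name"])  # Ada

remaining = count_rows(reader)      # only Grace is left
after_exhausted = count_rows(reader)  # nothing left to iterate
from_list = count_rows([first, first])  # a plain list works too

print(remaining, after_exhausted, from_list)  # 1 0 2
```

One consequence worth remembering: unlike a list, an iterator such as reader can only be walked once. If you need to go over the rows again, load them into a list first.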

Iterators in particular are a really nice concept in Python, because they defer iteration until it's needed. Suppose you wanted to transform each row in the CSV somehow:

def load_and_transform_csv(filename):
    with open(filename, newline="") as file:
        reader = csv.DictReader(file)
        result = []
        for row in reader:
            result.append(transform(row))
    return result

Or, more pythonically:

def load_and_transform_csv(filename):
    with open(filename, newline="") as file:
        return [
            transform(row)
            for row in csv.DictReader(file)
        ]

In this example, although the code appears to load the CSV in one step and transform it in another, each row is read from the file and transformed in a single pass. This behaviour is really useful for pushing large streams of data through a pipeline (and is a lightweight version of how "Big Data" platforms work at scale).
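
The single pass can be made visible by logging each step. This is a rough sketch under some assumptions: read_rows, the log list and this particular transform (which casts age to an int) are hypothetical, and io.StringIO stands in for a real file.

```python
import csv
import io

log = []


def read_rows(file):
    """Yield rows one at a time, recording when each one is read."""
    for row in csv.DictReader(file):
        log.append(f"read {row['name']}")
        yield row


def transform(row):
    """Hypothetical transform: convert the age column to an integer."""
    log.append(f"transform {row['name']}")
    return {**row, "age": int(row["age"])}


data = io.StringIO("name,age\nAda,36\nGrace,45\n")

# Building the pipeline does no work yet; generators are lazy.
pipeline = (transform(row) for row in read_rows(data))
steps_before = list(log)

# Consuming the pipeline pulls rows through one at a time.
result = list(pipeline)

print(steps_before)  # []
print(log)
# ['read Ada', 'transform Ada', 'read Grace', 'transform Grace']
```

The interleaved log shows that each row is read and transformed before the next one is touched, so only one row needs to be in memory at a time until the final list is built.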
