The following uses the csv standard library, specifically DictReader, to parse a CSV as a list of dictionaries. DictReader accepts the file object and automatically parses the header row for the field names. It can then be iterated to get each row of the CSV. This happens lazily by default, but since we want the whole thing, we load it into a list (which means we can let the file close).
import csv
from typing import Dict, List


def csv_to_dict(filename: str) -> List[Dict[str, str]]:
    with open(filename, "r", encoding="utf-8") as file:
        reader = csv.DictReader(file)
        # Materialise the lazy reader into a list before the file is closed.
        return list(reader)
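Assuming a CSV with a header row (the file name and columns below are illustrative), calling it looks like:

rows = csv_to_dict("people.csv")
print(rows[0])   # e.g. {"name": "Ada", "age": "36"}; note every value is a string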
In the above, with treats the file object returned by open as a "context manager", which is similar to Java's try-with-resources. Context managers define entry and exit logic, in this case opening and closing the file. This is equivalent to:
file = open(filename)
try:
    ...
finally:
    file.close()
It's good practice to use context managers when available, as they guarantee the clean-up (here, closing the file) runs even if an exception is raised part-way through.
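To make the entry and exit logic concrete, here is a minimal sketch of a hand-rolled context manager (the class name and file handling are illustrative, not part of the csv example):

class ManagedFile:
    """Minimal context manager: opens a file on entry, closes it on exit."""

    def __init__(self, filename: str):
        self.filename = filename
        self.file = None

    def __enter__(self):
        # Entry logic: acquire the resource and hand it to the with-block.
        self.file = open(self.filename, "r", encoding="utf-8")
        return self.file

    def __exit__(self, exc_type, exc_value, traceback):
        # Exit logic: always release the resource, even if an exception occurred.
        self.file.close()
        return False  # returning False means exceptions are not suppressed


# Behaves just like open() inside a with-block.
with ManagedFile("data.csv") as f:
    print(f.readline())

The standard library's contextlib.contextmanager decorator offers a shorter, generator-based way to write the same thing.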
The return value of DictReader is a DictReader instance, which wraps the file object and exposes it as an iterator over rows. This means it can be interacted with in the following ways:
for row in reader:
    # do something with the row
    ...

rows_with_missing_values = [
    row
    for row in reader
    if some_condition(row)
]
A key point of this typing is that reader is not an instance of a class called "Iterator", but rather it implements the iterator protocol that Python's syntax relies on. This is duck typing in action! We don't care whether it's a list, tuple or a DictReader - so long as we can iterate it!
There's a good table summarising these implicit container types here.
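As a hypothetical illustration of that point (count_rows is my own example name, not from the snippet above), the same function happily consumes a list, a tuple, or a DictReader:

import csv
from typing import Dict, Iterable

def count_rows(rows: Iterable[Dict[str, str]]) -> int:
    # Any iterable of dicts will do: list, tuple, generator, DictReader, ...
    return sum(1 for _ in rows)

count_rows([{"a": "1"}, {"a": "2"}])    # a list
count_rows(({"a": "1"}, {"a": "2"}))    # a tuple
with open("data.csv") as f:
    count_rows(csv.DictReader(f))       # a DictReader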
Iterators in particular are a really nice concept in Python, because they defer iteration until it's needed. Suppose you wanted to transform each row of the CSV somehow:
def load_and_transform_csv(filename):
    with open(filename) as file:
        reader = csv.DictReader(file)
        result = []
        for row in reader:
            result.append(transform(row))
        return result
Or, more pythonically:
def load_and_transform_csv(filename):
    with open(filename) as file:
        return [
            transform(row)
            for row in csv.DictReader(file)
        ]
In this example, despite the code creating the reader in one step and transforming rows in another, each row is only read from the file once, at the moment the comprehension asks for it. This behaviour is really useful for pushing large streams of data through a pipeline (and is a lightweight version of how "Big Data" platforms work at scale).
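As a sketch of that pipeline idea (the stage functions and the "name" column are illustrative assumptions, not part of the example above), generators let several stages be chained while only ever holding one row in memory at a time:

import csv
from typing import Dict, Iterator

def read_rows(filename: str) -> Iterator[Dict[str, str]]:
    # Yield rows one at a time; the file stays open only while we iterate.
    with open(filename, encoding="utf-8") as file:
        yield from csv.DictReader(file)

def drop_empty(rows):
    return (row for row in rows if any(row.values()))

def uppercase_names(rows):
    # Assumes the CSV has a "name" column.
    return ({**row, "name": row["name"].upper()} for row in rows)

# Each stage is lazy: a row flows through the whole pipeline
# before the next row is even read from disk.
pipeline = uppercase_names(drop_empty(read_rows("data.csv")))
for row in pipeline:
    print(row)

No intermediate lists are built, and if the loop stops early the remaining rows are never read at all.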