CSV files are a good way to share tables of data, and Python's CSV module makes working with them straightforward.
This guide will quickly move you through all the concepts you need to fill in this basic CSV program:
# Open the CSV
# Read all the data
# Extract the header
# Process the data
# Save the processed data to a new CSV
Python's CSV module provides two ways of reading and writing rows: a row as a list of strings, or a row as a dict. I'm going to show the first way, a row as a list of strings.
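For a quick taste of the difference, here's a minimal sketch of both styles, using an in-memory string (via io.StringIO) standing in for a real file:

```python
import csv
import io

# A tiny two-row CSV held in memory so the example is self-contained
source = "First name,Favorite number\nAlice,8\n"

# List style: every row, including the header, is a list of strings
list_rows = list(csv.reader(io.StringIO(source)))
print(list_rows)   # [['First name', 'Favorite number'], ['Alice', '8']]

# Dict style: csv.DictReader consumes the header row and uses it as keys
dict_rows = list(csv.DictReader(io.StringIO(source)))
print(dict_rows)   # [{'First name': 'Alice', 'Favorite number': '8'}]
```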
Fundamental to reading and writing CSVs with Python is understanding the concept of a row.
A row is just a list of strings.
This CSV source row:
normal value,"value, quoted"
results in this list in Python:
['normal value', 'value, quoted']
CSV allows line-breaks in a row. This single row in CSV:
normal value,"line-split
value",another normal
looks like this in Python:
['normal value', 'line-split\nvalue', 'another normal']
Whether a row in the CSV source is composed of one line of text, or many lines of text, in Python that row is always just one list of strings.
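You can check this yourself without creating a file: this sketch feeds the multi-line row above to csv.reader through io.StringIO, which stands in for an open file:

```python
import csv
import io

# The multi-line CSV row from above, as an in-memory "file"
source = 'normal value,"line-split\nvalue",another normal\n'

rows = list(csv.reader(io.StringIO(source)))
print(rows)  # [['normal value', 'line-split\nvalue', 'another normal']]
```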
Your CSV file will probably have multiple rows, like:
First name,Favorite number
Alice,8
Blythe,0
Chen,-1
Here's how it will look in Python:
[
['First name', 'Favorite number'],
['Alice', '8'],
['Blythe', '0'],
['Chen', '-1']
]
(Did you spot the extra [ and ] at the top and bottom?)
Your whole CSV is just a list of rows, and each row is just a list of strings.
How do you think this looks as CSV and Python? We haven't seen empty cells yet:
Col1 | Col2 | Col3 |
---|---|---|
a | | |
 | b | |
 | | c |
In CSV, a comma with no text represents an empty cell. The "a row" and the "b row" both have trailing commas, which hold the place for their empty, trailing columns:
Col1,Col2,Col3
a,,
,b,
,,c
Python's CSV module uses empty strings ('') to represent empty cells:
[
['Col1','Col2','Col3'],
[ 'a', '', ''],
[ '', 'b', ''],
[ '', '', 'c'],
]
(If you're making your own row from scratch, you can use Python's None value, and the CSV writer will write empty cells all the same.)
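To see that in action, here's a small sketch writing a row that mixes a real value, None, and an empty string into an in-memory buffer:

```python
import csv
import io

out = io.StringIO()
writer = csv.writer(out)

# None and '' both come out as empty cells
writer.writerow(['a', None, ''])
print(repr(out.getvalue()))  # 'a,,\r\n' -- the writer's default line ending is \r\n
```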
Now that you can see how a row maps between the CSV source and Python, it's time to start reading and writing files.
To read your CSV, open the file for reading, and create a reader object:
import csv

with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    ...
Your new reader is an iterator, which gives you the control to read your CSV row-by-row.
(If newline='' doesn't look familiar, it's there to make sure multi-line rows (like above) are read as whole rows. There's more on that at the end of this guide.)
The first approach to iterating your rows requires the fewest lines of code. Use Python's list() function to turn the entire file into a list of rows:
...
my_data = list(reader)
print(my_data)
and you'll see something like:
[
['First name', 'Favorite number'],
['Alice', '8'],
['Blythe', '0'],
['Chen', '-1']
]
You can access each row in my_data like any list in Python. To print the first row:
print(my_data[0])
and you'll see:
['First name', 'Favorite number']
That row, in this case, is a header. If you want to process the data, then the header might get in the way. Here's how to save it for later, and delete it from the data:
my_header = my_data[0]
my_data.remove(my_header)
print(my_data) # the header has been removed from the list of rows
print(my_header) # the header still exists on its own
[
['Alice', '8'],
['Blythe', '0'],
['Chen', '-1']
]
...
['First name', 'Favorite number']
Looking at the example values in my_data, 'Favorite number' refers to a number, but right now each value is just a string in the second field of each row. The numbers are all whole numbers, so use the int() function to convert them:
for row in my_data:
    # Second field is at index 1 for the row
    row[1] = int(row[1])
and now my_data looks like:
[
['Alice', 8],
['Blythe', 0],
['Chen', -1]
]
With the second field actually containing numbers, you can now use them to sort the data correctly by 'Favorite number':
my_data.sort(key=lambda row: row[1])
print(my_data)
[
['Chen', -1],
['Blythe', 0],
['Alice', 8]
]
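Two variations worth knowing, sketched on the same sample data: sorted() returns a new sorted list instead of sorting in place, and reverse=True flips the order.

```python
# The sample data after the int() conversion above
my_data = [['Alice', 8], ['Blythe', 0], ['Chen', -1]]

# sorted() leaves my_data untouched and returns a new list;
# reverse=True puts the largest 'Favorite number' first
by_number_desc = sorted(my_data, key=lambda row: row[1], reverse=True)
print(by_number_desc)  # [['Alice', 8], ['Blythe', 0], ['Chen', -1]]
```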
With the data read in, and processed, it's time to save the processed data back to a CSV file.
To write a CSV, open the file for writing, and create a writer object:
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    ...
Your writer writes single rows with its writerow() method. Use this to add the header back:
...
writer.writerow(my_header)
...
It writes lists of rows with its writerows() method. Use this to write all of my_data:
...
writer.writerows(my_data)
Here's the output CSV:
First name,Favorite number
Chen,-1
Blythe,0
Alice,8
Putting it all together, here's the whole program:
import csv

with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    my_data = list(reader)

# Separate header, save for later
my_header = my_data[0]
my_data.remove(my_header)

for row in my_data:
    # Second field is at index 1 for the row
    row[1] = int(row[1])

my_data.sort(key=lambda row: row[1])

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(my_header)
    writer.writerows(my_data)
At the beginning of the Reading section I mentioned that the reader object gives you control over how you iterate it, and then showed the first of three ways: my_data = list(reader), which resulted in my_data being a list of all the rows.
This is fine for most applications: modern computers have enough RAM, and most CSVs are small enough, that you can easily hold all the data in one place (in one variable). And for sorting, it's necessary to have all the rows so the sorter can compare them.
There are times when you don't want the entire CSV in memory, and you want to process the CSV one row at a time. For that, you'll need the other two methods of iterating a reader: next(reader) and for row in reader:.
If you haven't used next() before, it's the function you use to manually advance an iterator by one element. It only works on iterators, and calls the iterator's hidden __next__() method.
You could iterate an entire CSV by calling next():
...
reader = csv.reader(f)
first_row = next(reader)
second_row = next(reader)
third_row = next(reader)
fourth_row = next(reader)
...
but that's tedious, and more importantly, it practically requires you to know how many rows your CSV has ahead of time. What happens when you call next(reader) after you've read all the rows? Python raises the StopIteration exception to signal that there are no more elements in the container. You could handle that yourself... but it quickly gets awkward.
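One escape hatch worth knowing: next() accepts an optional second argument to return when the iterator is exhausted, instead of raising StopIteration. A sketch, with a plain list iterator standing in for a reader:

```python
# A list iterator behaves just like a csv.reader for this purpose
rows = iter([['Alice', '8'], ['Blythe', '0']])

print(next(rows, None))  # ['Alice', '8']
print(next(rows, None))  # ['Blythe', '0']
print(next(rows, None))  # None -- exhausted, but no StopIteration raised
```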
Enter the for element in iterator: structure.
reader = csv.reader(f)
for row in reader:
    # process row
Behind the scenes, Python is effectively calling row = next(reader), then running the loop body with row. When the iterator is exhausted and StopIteration is raised, Python gracefully handles that exception and stops the loop.
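That description can be spelled out as code. This sketch (with a list iterator standing in for a reader) is roughly what the for-loop does for you:

```python
# Stand-in for csv.reader(f)
rows = iter([['Alice', '8'], ['Blythe', '0']])

collected = []
while True:
    try:
        row = next(rows)        # what the for-loop does on each pass
    except StopIteration:       # raised when the iterator runs out
        break                   # the for-loop stops here, silently
    collected.append(row)       # the for-loop's body runs here

print(collected)  # [['Alice', '8'], ['Blythe', '0']]
```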
You can put those two methods together: read the first row with next() to get (and get past) the header-row, then hand control to the for-loop to read and process the rest of the CSV, the data-rows:
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    my_header = next(reader)
    # we're past the header; row will be data
    for row in reader:
        # process data
Finally, for every row we read, we can write the row immediately to our new file. The following will filter our small sample CSV from before, only writing rows where 'Favorite number' is zero or greater:
with open('output_filtered.csv', 'w', newline='') as out_f:
    writer = csv.writer(out_f)
    with open('input.csv', newline='') as in_f:
        reader = csv.reader(in_f)
        # Transfer header
        my_header = next(reader)
        writer.writerow(my_header)
        for row in reader:
            number = int(row[1])
            if number < 0:
                continue
            writer.writerow(row)
Here's output_filtered.csv:
First name,Favorite number
Alice,8
Blythe,0
This is a very common way to open a file for reading in Python:
with open('file.txt') as my_file:
    # Read a line of text
    first_line = next(my_file)
    # Loop over the rest of the lines
    for line in my_file:
        ...
The open() function always takes a path to the file to be opened. It also takes a mode argument that says whether the file is for reading only, for writing, or other special modes.
When you open a file for writing, a new file is created. If the path points to a file that already exists, that file will be cleared (you'll be "overwriting" it). Pass the 'w' argument to tell Python you want to create a file for writing:
with open('file.txt', 'w') as my_file:
    my_file.write('a new line of text')
One of the modes I called "special" is append. I call it special as an encouragement to stop and consider its use. I've seen a lot of confusion where "append" is mistaken for "edit". The big takeaway I want you to have is: append strictly means adding something new to the end of a file; do not use the 'a' mode if you're trying to modify existing data.
Modifying existing data requires these four steps, and explicitly reading-then-writing:
- read the file to get the existing data
- find the data you want to change
- change that data
- finally, overwrite the file with the modified data
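Here's one way those four steps can look for a CSV. This is only a sketch: the file name and the specific edit (renaming 'Alice') are made up for the example, and the setup block just creates a file to work on so it's runnable end-to-end.

```python
import csv
import os
import tempfile

# Setup: create a throwaway CSV to edit (hypothetical data)
path = os.path.join(tempfile.mkdtemp(), 'people.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([
        ['First name', 'Favorite number'],
        ['Alice', '8'],
        ['Chen', '-1'],
    ])

# 1. Read the file to get the existing data
with open(path, newline='') as f:
    rows = list(csv.reader(f))

# 2. and 3. Find the data you want to change, and change it
for row in rows[1:]:          # skip the header
    if row[0] == 'Alice':
        row[0] = 'Alicia'

# 4. Overwrite the file with the modified data
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```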
The open() function also decodes text using a default encoding, which depends on your platform's locale settings (often, but not always, UTF-8). If your file has a different encoding, specify it with the encoding='YOUR-FILES-ENCODING' option:
with open('file.txt', encoding='windows-1252') as my_file:
...
When you press the ENTER key in a text editor, it inserts a line-breaking character sequence, which we'll refer to generically as a newline. What that sequence is varies across operating systems, and has changed over the years.
By default, when Python opens a text file for reading or writing, it tries to make all newlines consistent: it picks what it thinks is the appropriate newline for your system, and converts any other newline it finds to match.
But, CSV allows for different newlines: your file might have one newline for line breaks inside a cell, and another newline to mark the end of a row.
When you open a CSV file, it's important to pass the newline='' option. This instructs Python not to make that choice and to leave all newlines as they are; the CSV parser will sort them out.
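Here's a round trip demonstrating it, using a temporary file so the sketch is self-contained. With newline='' on both sides, the line break inside the quoted cell survives intact:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'notes.csv')

# Write a row with a line break inside one cell; the writer quotes it
with open(path, 'w', newline='') as f:
    csv.writer(f).writerow(['normal value', 'line-split\nvalue'])

# Read it back: still one row, with the newline preserved
with open(path, newline='') as f:
    rows = list(csv.reader(f))

print(rows)  # [['normal value', 'line-split\nvalue']]
```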