
Read a large data file with pandas

Below is some working code I used to read a large tab-delimited data file. The file was over 3 GB uncompressed and couldn't be loaded all at once on a laptop with 8 GB of RAM.

There are a lot of different ways to handle insufficient memory problems in pandas. In this case I used the chunksize parameter of read_csv to load the data in chunks, then concatenated the chunks into a single dataframe.

import pandas as pd
. . .
cols = [...]          # list of column names to load
chunksize = 100000    # rows per chunk

# With chunksize set, read_csv returns a TextFileReader that yields
# the file in 100,000-row chunks instead of loading it all at once.
# (iterator=True is redundant when chunksize is given, but harmless.)
dfc = pd.read_csv('nc2020/ncvoter_Statewide.txt', sep='\t', encoding='ISO-8859-1',
                  usecols=cols,
                  chunksize=chunksize,
                  iterator=True,
                  low_memory=False,
                  dtype=str
                  )
# Concatenate the chunks into a single dataframe.
df = pd.concat(dfc, ignore_index=True)
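Note that concatenating the chunks back together still needs enough memory to hold the final dataframe. If even that is too much, a common alternative is to reduce each chunk as it is read and only keep the rows you need. Here is a minimal sketch under the same assumptions as above; the county_desc column and the 'WAKE' filter value are hypothetical, just to show the pattern:

import pandas as pd

cols = [...]  # same column list as above

# Reading with chunksize yields an iterator of dataframes; filtering
# each chunk before keeping it means only the matching rows ever
# accumulate in memory.
reader = pd.read_csv('nc2020/ncvoter_Statewide.txt', sep='\t',
                     encoding='ISO-8859-1', usecols=cols,
                     chunksize=100000, dtype=str)
filtered = [chunk[chunk['county_desc'] == 'WAKE'] for chunk in reader]
df = pd.concat(filtered, ignore_index=True)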

Changing the encoding from the default (UTF-8) to ISO-8859-1 (Latin-1) was required to get past a decode error that was probably the result of the original data generation process. I set the dtype of all fields to str so that pandas wouldn't infer numeric types and display the phone numbers in scientific notation.
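To see why dtype=str matters here: when a numeric-looking column contains blanks, pandas infers float64 (since NaN can't be stored in an integer column) and renders large values in scientific notation. A small self-contained illustration, with made-up sample data:

import pandas as pd
from io import StringIO

data = "name\tphone\nAlice\t9195551234\nBob\t\n"

# Left to infer types, pandas promotes the column to float64 because of
# the missing value, and the number displays as something like 9.195551e+09.
print(pd.read_csv(StringIO(data), sep='\t')['phone'])

# Forcing str keeps the digits exactly as they appear in the file.
print(pd.read_csv(StringIO(data), sep='\t', dtype=str)['phone'])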
