
Read a large data file with pandas

Below is some working code I used to read a large tab-delimited data file. The file was over 3 GB uncompressed and couldn't be loaded all at once on a laptop with 8 GB of RAM.

There are a lot of different ways to handle insufficient memory problems in pandas. In this case I used the chunksize parameter of read_csv to load the data in chunks, then concatenated the chunks into a single dataframe.

import pandas as pd
. . .
cols = [...]          # list of column names to load
chunksize = 100000    # rows per chunk

# With chunksize set, read_csv returns a TextFileReader that yields
# the file in 100,000-row chunks instead of loading it all at once.
# (iterator=True is redundant when chunksize is given, but harmless.)
dfc = pd.read_csv('nc2020/ncvoter_Statewide.txt', sep='\t', encoding='ISO-8859-1',
                  usecols=cols,
                  chunksize=chunksize,
                  iterator=True,
                  low_memory=False,
                  dtype=str
                  )
# Concatenate the chunks into a single dataframe.
df = pd.concat(dfc, ignore_index=True)
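Note that concatenating the chunks back together still needs enough memory to hold the final dataframe. If even that is too much, a common alternative is to reduce each chunk as it is read and only keep the rows you need. Here is a minimal sketch under the same assumptions as above; the county_desc column and the 'WAKE' filter value are hypothetical, just to show the pattern:

import pandas as pd

cols = [...]  # same column list as above

# Reading with chunksize yields an iterator of dataframes; filtering
# each chunk before keeping it means only the matching rows ever
# accumulate in memory.
reader = pd.read_csv('nc2020/ncvoter_Statewide.txt', sep='\t',
                     encoding='ISO-8859-1', usecols=cols,
                     chunksize=100000, dtype=str)
filtered = [chunk[chunk['county_desc'] == 'WAKE'] for chunk in reader]
df = pd.concat(filtered, ignore_index=True)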

Changing the encoding from the default (UTF-8) to ISO-8859-1 (Latin-1) was required to get past a decode error that was probably the result of the original data generation process. I set the dtype of all fields to str so that pandas wouldn't infer numeric types and display the phone numbers in scientific notation.
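To see why dtype=str matters here: when a numeric-looking column contains blanks, pandas infers float64 (since NaN can't be stored in an integer column) and renders large values in scientific notation. A small self-contained illustration, with made-up sample data:

import pandas as pd
from io import StringIO

data = "name\tphone\nAlice\t9195551234\nBob\t\n"

# Left to infer types, pandas promotes the column to float64 because of
# the missing value, and the number displays as something like 9.195551e+09.
print(pd.read_csv(StringIO(data), sep='\t')['phone'])

# Forcing str keeps the digits exactly as they appear in the file.
print(pd.read_csv(StringIO(data), sep='\t', dtype=str)['phone'])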
