Last active: October 8, 2024
Python Script to split CSV files into smaller files based on number of lines
import csv
import sys
import os

# example usage: python split.py example.csv 200
# the above command splits `example.csv` into smaller CSV files of 200 rows each (with the header included)
# if example.csv has 401 rows, for instance, this creates 3 files in the same directory:
# - `example_1.csv` (rows 1 - 200)
# - `example_2.csv` (rows 201 - 400)
# - `example_3.csv` (row 401)

CURRENT_DIR = os.path.dirname(os.path.realpath(__file__))

filename = sys.argv[1]
full_file_path = os.path.join(CURRENT_DIR, filename)
file_name = os.path.splitext(full_file_path)[0]

rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000

with open(filename, newline='') as infile:
    reader = csv.DictReader(infile)
    header = reader.fieldnames
    rows = [row for row in reader]

pages = []
row_count = len(rows)
start_index = 0
# here, we slice the total rows into pages, each page having [rows_per_csv] rows
while start_index < row_count:
    pages.append(rows[start_index: start_index + rows_per_csv])
    start_index += rows_per_csv

for i, page in enumerate(pages):
    with open('{}_{}.csv'.format(file_name, i + 1), 'w+', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=header)
        writer.writeheader()
        for row in page:
            writer.writerow(row)

print('DONE splitting {} into {} files'.format(filename, len(pages)))
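Note that the script above reads every row into memory before writing any output, which is fine for modest files but not for multi-gigabyte ones. As a sketch (not part of the original gist; `split_csv_stream` is a hypothetical helper name), the same splitting can be done in a single streaming pass:

```python
import csv
import os

def split_csv_stream(src, rows_per_file=5000):
    """Split `src` into `<stem>_1.csv`, `<stem>_2.csv`, ... without
    loading the whole file into memory. Returns the number of parts."""
    stem = os.path.splitext(src)[0]
    with open(src, newline='') as infile:
        reader = csv.reader(infile)
        header = next(reader)
        part, outfile, writer, written = 0, None, None, 0
        for row in reader:
            # open a new part file whenever the current one is full (or first row)
            if writer is None or written >= rows_per_file:
                if outfile:
                    outfile.close()
                part += 1
                outfile = open('{}_{}.csv'.format(stem, part), 'w', newline='')
                writer = csv.writer(outfile)
                writer.writerow(header)  # repeat the header in every part
                written = 0
            writer.writerow(row)
            written += 1
        if outfile:
            outfile.close()
    return part
```

Because only one row is held in memory at a time, this variant's memory use does not grow with the input file size.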
This is great code!! It works perfectly fine. But I have a file which is 4 GB, and when I run this code over it, it gives me the following error:
Traceback (most recent call last):
  File "****\Split_CSV_sys.py", line 23, in <module>
    rows = [row for row in reader]
  File "C:\Program Files\Python39\lib\csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)
I would really appreciate it if there's a fix for this!!
I tried running it over other files smaller than this one, and it works just fine.
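(Not from the original thread, but a commonly suggested workaround for this error: the `csv` module caps individual fields at 131072 bytes by default, and a field larger than that raises `_csv.Error: field larger than field limit`. You can raise the limit before reading. `sys.maxsize` can overflow the C long used internally on some platforms, notably 64-bit Windows, so the sketch below backs off until a value is accepted.)

```python
import csv
import sys

# Raise the csv module's per-field size limit (default: 131072 bytes).
# csv.field_size_limit() rejects values that overflow a C long on some
# platforms, so shrink the candidate until one is accepted.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit = int(limit / 10)
```

Run this once at the top of the script, before constructing any reader. Note it only fixes the field-limit error; a 4 GB file may still exhaust memory with the row-buffering approach above.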
Simple version but will work for large files:
title = None
LINES_PER_FILE = 4000
lines_written = 0
file_no = 1
g = None
with open('test.txt', 'r') as f:
    for line in f:
        if title is None:
            title = line  # first line is the header; repeat it in every part
            continue
        if lines_written == 0:
            g = open('test%03d.txt' % file_no, 'w')
            g.write(title)
        g.write(line)
        lines_written += 1
        if lines_written >= LINES_PER_FILE:
            g.close()
            g = None
            file_no += 1
            lines_written = 0
if g is not None:
    g.close()
"Simple version but will work for large files"

Except that it doesn't work for perfectly valid CSV files with multi-line fields.
@badbeef
that is a great question!
For a simple CSV file, it is probably true that you can get away with manipulating lines without using the csv module. However, think about the case where the value of a column is a multi-line string.

If we did not use the csv module, we would need to parse quotations ourselves to "know" where exactly a column value ends in a case like that. Hope this answers your question!
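(The original comment's inline example did not survive extraction; here is an illustrative sketch of the multi-line case, showing why line-based splitting miscounts rows while the csv module does not:)

```python
import csv
import io

# A perfectly valid CSV where one quoted field spans two physical lines.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Naive line splitting sees 4 "rows"...
print(len(data.splitlines()))  # 4

# ...but the csv module correctly yields 3 logical rows,
# keeping the embedded newline inside the quoted field.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
```

A line-based splitter could cut this file between the two physical lines of the quoted field, producing two invalid parts.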
Also, thank you for all the kind comments from everyone :)
This was something I wrote a long time ago (when Python 2.7 was still the "main" runtime version).
I'm glad to see that this is still useful, and working for you folks! 🍺