import csv
import sys
import os

# example usage: python split.py example.csv 200
# above command would split the `example.csv` into smaller CSV files of 200 rows each (with header included)
# if example.csv has 401 rows for instance, this creates 3 files in the same directory:
# - `example_1.csv` (rows 1 - 200)
# - `example_2.csv` (rows 201 - 400)
# - `example_3.csv` (row 401)

CURRENT_DIR = os.path.dirname(os.path.realpath(__file__))
filename = sys.argv[1]
full_file_path = os.path.join(CURRENT_DIR, filename)
file_name = os.path.splitext(full_file_path)[0]
rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000

with open(filename) as infile:
    reader = csv.DictReader(infile)
    header = reader.fieldnames
    rows = [row for row in reader]

# here, we slice the total rows into pages, each page having [rows_per_csv] rows
pages = []
row_count = len(rows)
start_index = 0
while start_index < row_count:
    pages.append(rows[start_index:start_index + rows_per_csv])
    start_index += rows_per_csv

for i, page in enumerate(pages):
    with open('{}_{}.csv'.format(file_name, i + 1), 'w') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=header)
        writer.writeheader()
        for row in page:
            writer.writerow(row)

print('DONE splitting {} into {} files'.format(filename, len(pages)))
There is no error and no output on macOS.
Just what I needed! Thank you! Added it in the same directory as my CSVs, ran the script with the split size and split CSVs were generated beautifully.
Why do the output files contain double quotes after the split?
Dude this is so beautiful, this why I love github :)
thanks Kelvin
Life saver
No offense, but I wonder why this program needs the csv module.
Isn't a CSV file just a schema line plus data lines in a text file? In that case, the program could split a CSV file simply by reading and writing lines, instead of parsing and re-serializing CSV data.
That is a great question!
For a simple CSV file, it is probably true that you can get away with manipulating lines without using the csv module.
However, consider the case where the value of a column is a multi-line string.
For example:
title,body
"Zen of Python, By Tim Peters","Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"
If we did not use the csv module, we would need to parse the quotation marks ourselves to "know" exactly where a column value ends in a case like the one above.
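To make that concrete, here is a small sketch (my own, using a shortened version of the example above) comparing naive line splitting with csv.reader:

```python
import csv
import io

# A record whose "body" field spans two physical lines.
data = '''title,body
"Zen of Python, By Tim Peters","Beautiful is better than ugly.
Explicit is better than implicit."
'''

# Naive line splitting sees three "rows", because the quoted body
# continues onto a second physical line.
print(len(data.strip().splitlines()))  # 3

# csv.reader sees what is really there: a header row plus ONE record.
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))  # 2
print(rows[1][1].count('\n'))  # 1 -- the newline survives inside the field
```

A line-based splitter would cut this record in half whenever the chunk boundary fell between the two physical lines.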
Hope this answers your question!
Also, thank you for all the kind comments from everyone :)
This was something I wrote a long time ago (when Python 2.7 was still the "main" runtime version).
I'm glad to see that this is still useful, and working for you folks! 🍺
This is great code! It works perfectly. But I have a file that is 4 GB, and when I run this code over it, it gives me the following error:
Traceback (most recent call last):
  File "****\Split_CSV_sys.py", line 23, in <module>
    rows = [row for row in reader]
  File "C:\Program Files\Python39\lib\csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)
I would really appreciate a fix for this!
I tried running it over other, smaller files and it works just fine.
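For what it's worth, the "field larger than field limit" error comes from a safety cap inside the csv module, not from the splitting logic itself. A common workaround (not part of the gist) is to raise the limit before constructing the reader:

```python
import csv
import sys

# Workaround for: _csv.Error: field larger than field limit (131072)
# Raise the csv module's per-field cap before creating the reader.
# On some platforms csv.field_size_limit(sys.maxsize) raises OverflowError
# (the limit must fit in a C long), so back off until a value is accepted.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int = int(max_int / 10)
```

Placing this above the `csv.DictReader(infile)` line should let the oversized fields through, though a field that large can also be a sign of a quoting problem in the input file.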
Simple version but will work for large files (note: the version originally posted here dropped the first data line of every chunk after the first, because the line that triggered opening a new file was never written; this version reads the header once up front instead):

LINES_PER_FILE = 4000
lines_written = 0
file_no = 1
g = None
with open('test.txt', 'r') as f:
    title = f.readline()  # the first line is the header
    for line in f:
        if not lines_written:
            # start a new chunk, repeating the header line
            g = open('test%03d.txt' % file_no, 'w')
            g.write(title)
        g.write(line)
        lines_written += 1
        if lines_written >= LINES_PER_FILE:
            g.close()
            file_no += 1
            lines_written = 0
if g is not None and not g.closed:
    g.close()
Regarding "Simple version but will work for large files": except that it doesn't work for perfectly valid CSV files with multi-line fields.
Thank you!