@oleiade
Created January 4, 2013 16:49
Gzip file divider in Python: splits a gzip-compressed file into uncompressed part files holding a fixed number of lines each. Usage: ./divide.py input_file output_dir lines_per_file
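#!/usr/bin/env python
"""Split a gzip-compressed file into plain part files of N lines each."""
import gzip
import os
import sys

fpath = sys.argv[1]
output = sys.argv[2]
parts_size = int(sys.argv[3])

# Fails if the directory already exists, so old parts are never mixed in.
os.mkdir(output)

f = gzip.open(fpath, 'rb')
linenum = 0
filenum = 1
# Binary mode: gzip yields bytes, so lines are copied through unchanged.
outfile = open(os.path.join(output, str(filenum)), 'wb')

for line in f:
    # The current part is full: close it and start the next one.
    if linenum and not linenum % parts_size:
        outfile.close()
        sys.stdout.write("\r%d part files generated" % filenum)
        sys.stdout.flush()
        filenum += 1
        outfile = open(os.path.join(output, str(filenum)), 'wb')
    # Every line is written, including the one that starts a new part.
    outfile.write(line)
    linenum += 1

outfile.close()
f.close()
sys.stdout.write("\r%d part files generated\n" % filenum)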
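
The same lines-per-file split can also be wrapped in a reusable function. The following is only a sketch, not part of the original gist: the name `split_gzip_lines` and its return value are assumptions made for illustration, and it assumes Python 3. It reads each part with `itertools.islice`, so one part's worth of lines is held in memory at a time.

```python
import gzip
import itertools
import os


def split_gzip_lines(src, out_dir, lines_per_file):
    """Split the gzip file `src` into parts of `lines_per_file` lines.

    Parts are written uncompressed as out_dir/1, out_dir/2, ...
    Returns the number of part files written (0 for an empty input).
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    parts = 0
    with gzip.open(src, "rb") as f:
        while True:
            # Take the next chunk of at most lines_per_file lines.
            chunk = list(itertools.islice(f, lines_per_file))
            if not chunk:
                break
            parts += 1
            with open(os.path.join(out_dir, str(parts)), "wb") as out:
                out.writelines(chunk)
    return parts
```

A call such as `split_gzip_lines("input.gz", "parts", 100000)` (file names are placeholders) would mirror `./divide.py input.gz parts 100000`, except that it reuses an existing output directory and writes no part file when the input is empty.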