It splits a large SQL dump file into smaller ones. It preserves SQL queries by splitting only after a ; so no query is cut in half. Write I/O is multi-threaded, and the process is easily resumable.
###########################################################################################################
# Created by :
# PRATIK ANAND
# github.com/pratikone
#
# It splits a large SQL dump file into smaller files.
# It preserves queries by splitting only at ;
# It launches a new thread for each write I/O operation.
# Tested on a SQL dump of size 170 GB.
# It is resumable by filename, since each output file is named after the lines it holds.
# For example, a file containing lines 501 - 2000 is named 500-2000.
# Using this information, setting skip to 2000 will start the script from line 2001 the next time you launch it.
# Timestamp : Nov 1 2017
############################################################################################################
# TODO : Keep the size of each output file in consideration while splitting, and find a better way
#        to calculate it, as sys.getsizeof() doesn't provide accurate results.
import threading

INPUT_FILE = "H:/steam.sql"
OUTPUT_FOLDER = "G:/steam_data"
CHUNK = 500  # minimum number of lines per output file
# SIZE_LIMIT_MB = 1024 * 1024
# SIZE_LIMIT = 1024  # bytes


def write_to_file(file_name, data_to_write):
    print("Starting new thread to write to", file_name)
    # Write UTF-8 explicitly: the platform default encoding (e.g. cp1252 on
    # Windows) can raise UnicodeEncodeError for characters read from the dump.
    with open(OUTPUT_FOLDER + "/" + file_name, "w", encoding="utf-8") as f:
        for item in data_to_write:
            f.write("%s\n" % item)


with open(INPUT_FILE, "r", encoding="utf-8") as fileobject:
    count = 0
    skip = 0  # to avoid re-splitting already processed lines, set this to the last processed line number
    start_count = count
    chunk_check = count
    buffer_space = []
    for line in fileobject:
        count = count + 1
        if count <= skip:
            if count % 100 == 0:
                print("Skipping till line ", count)
            start_count = count
            chunk_check = count  # keep the chunk counter in step while skipping
            continue
        # print("Processing line ", count)
        line = line.strip()
        buffer_space.append(line)
        # Split only after a line containing ; so queries stay intact,
        # and only once the current chunk has grown past CHUNK lines.
        if ";" in line and (count - chunk_check) > CHUNK:
            print("Processed till line {}".format(count))
            file_name = "{}-{}".format(start_count, count)
            # Hand the writer thread a copy of the buffer so the main loop
            # can start refilling it immediately.
            t = threading.Thread(target=write_to_file, args=(file_name, buffer_space[:]))
            # t.daemon = True  # left non-daemon so pending writes finish before exit
            t.start()
            buffer_space = []
            start_count = count
            chunk_check = count

    # last part : flush whatever remains after the final split point
    file_name = "{}-{}".format(start_count, count)
    t = threading.Thread(target=write_to_file, args=(file_name, buffer_space[:]))
    t.start()

print("DONE. Chalo Party karein")  # Hindi: "Let's party"
Thank you for this script.
It seems to work, but I am facing an issue with my data - perhaps there is a special character that is breaking the code:
Processed till line 27618
Starting new thread to write to 27117-27618
Processed till line 28119
Starting new thread to write to 27618-28119
Traceback (most recent call last):
File "I:\testsql\sqlsplitter.py", line 41, in
for line in fileobject:
File "C:\FAST\Python\3.5.2\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 2125: invalid start byte
PS C:\FAST\Python\3.5.2>
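The traceback points to a stray byte in the dump (0xbf) that is not valid UTF-8. A minimal workaround, as a sketch and not a verified fix for this particular dump, is to make the reader tolerant of such bytes:

# Replace undecodable bytes with U+FFFD instead of raising, so a single
# stray byte (like the 0xbf above) cannot abort the whole run.
# This is an assumption about the data, not a confirmed fix.
with open(INPUT_FILE, "r", encoding="utf-8", errors="replace") as fileobject:
    for line in fileobject:
        ...  # same processing loop as in the script above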