Working out how to read in 30 GB of text whose record separator isn't a newline... at least, there wasn't a newline in the short example:
```
28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank'~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@#29760145'~'0'~'The Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@#
```
To make this right:
- replace the field separator `'~'` with `,`
- replace the record separator `#@#@#` with `\n`
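On a string small enough to hold in memory, the whole transformation is just two replacements; here's a minimal in-memory sketch (the sample is abbreviated), which the chunked version further below builds on:

```python
# Minimal in-memory version: fine for small inputs, not for 30 GB.
sample = "28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#"
print(sample.replace("'~'", ",").replace("#@#@#", "\n"), end="")
# 28807644,0,Maun FCU,US
# 28855353,0,WNB Holdings LLC,US
```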
This solution has a number of components:
main.py
The Python solution (uses buffered read/write).
Generate a test file first, link that test file to `my_file.txt`, then run with a chunk-size argument: `python3 main.py [1K|32K|1M]`.
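As a sketch of how the buffered loop can work (the actual main.py may differ; `out.csv` and the size table are assumptions here): the subtle part is that a separator can straddle a chunk boundary, so any trailing bytes that could begin a separator are held back and prepended to the next chunk.

```python
import sys

# Hypothetical sketch of a buffered read/replace/write loop; the real
# main.py may differ. SIZES maps the CLI argument (1K|32K|1M) to bytes.
SIZES = {"1K": 1 << 10, "32K": 32 << 10, "1M": 1 << 20}
FIELD_SEP, RECORD_SEP = b"'~'", b"#@#@#"

def partial_sep_suffix(buf):
    # Length of the longest suffix of buf that is a proper prefix of a
    # separator (at most 4 bytes); those bytes might complete a separator
    # once the next chunk arrives, so they must not be written yet.
    for k in range(min(len(buf), len(RECORD_SEP) - 1), 0, -1):
        if FIELD_SEP.startswith(buf[-k:]) or RECORD_SEP.startswith(buf[-k:]):
            return k
    return 0

def convert(src, dst, chunk_size):
    carry = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            dst.write(carry)  # a trailing partial separator stays as-is
            return
        out = (carry + chunk).replace(FIELD_SEP, b",").replace(RECORD_SEP, b"\n")
        k = partial_sep_suffix(out)
        carry = out[-k:] if k else b""
        dst.write(out[:-k] if k else out)

if __name__ == "__main__":
    with open("my_file.txt", "rb") as src, open("out.csv", "wb") as dst:
        convert(src, dst, SIZES[sys.argv[1]])
```

Replacements only ever produce `,` and `\n`, neither of which occurs in either separator, so a suffix of the replaced buffer that matches a separator prefix is always raw, untransformed input and can safely be carried over and reprocessed with the next chunk.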
gen_test.py
A helper that multiplies the small example above into KBs, MBs, or GBs of test data (it also buffers its writes).
Run `python3 gen_test.py [MAGNITUDE] > test.txt`, where the number of copies of the sample string written is 10 raised to MAGNITUDE.
For example, `python3 gen_test.py 6 > e6test.txt` creates a file with one million copies of the sample string.
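A sketch of how such a generator can buffer its writes (the batch size and the abbreviated sample string are assumptions, not the actual gen_test.py):

```python
import sys

# Hypothetical sketch of gen_test.py: writes 10**MAGNITUDE copies of the
# sample string (abbreviated here) in batches, rather than building one
# giant string or issuing millions of tiny writes.
SAMPLE = "28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#"
BATCH = 10_000  # copies per write

def main(magnitude):
    copies = 10 ** magnitude
    block = SAMPLE * BATCH
    for _ in range(copies // BATCH):
        sys.stdout.write(block)
    sys.stdout.write(SAMPLE * (copies % BATCH))

if __name__ == "__main__":
    main(int(sys.argv[1]))
```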
gen_target.py
Like gen_test.py, but creates what the processor should produce when given a similarly sized input from gen_test.py.
Run `python3 gen_target.py [MAGNITUDE] > target.csv`.
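Under the same assumptions, gen_target.py can be the identical loop with the sample converted to CSV once, up front:

```python
import sys

# Hypothetical sketch of gen_target.py: the sample (abbreviated) is
# transformed once, then repeated exactly as in gen_test.py.
SAMPLE = "28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#"
TARGET = SAMPLE.replace("'~'", ",").replace("#@#@#", "\n")
BATCH = 10_000

def main(magnitude):
    copies = 10 ** magnitude
    block = TARGET * BATCH
    for _ in range(copies // BATCH):
        sys.stdout.write(block)
    sys.stdout.write(TARGET * (copies % BATCH))

if __name__ == "__main__":
    main(int(sys.argv[1]))
```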
test.sh
A simple test harness that measures time and memory usage, and checks correctness against the target files.
Run `sh test.sh CHUNK_SZ`, where CHUNK_SZ is the chunk size in main.py you'd like to evaluate.
Redirect the output to a file such as `results_[CHUNK_SZ].txt` to collect the results, which can then be fed to process_results.py.
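A sketch of what such a harness might look like; GNU time's `-v` flag, the `e6test.txt`/`target.csv`/`out.csv` names, and the symlink step are all assumptions carried over from the sketches above:

```sh
#!/bin/sh
# Hypothetical sketch of test.sh; the real harness may differ.
# Usage: sh test.sh CHUNK_SZ    e.g. sh test.sh 32K
CHUNK_SZ=$1

ln -sf e6test.txt my_file.txt                  # point main.py at the test input
/usr/bin/time -v python3 main.py "$CHUNK_SZ"   # wall time + peak memory (GNU time)

# Correctness: byte-compare the produced CSV against the expected output.
if cmp -s out.csv target.csv; then
    echo "PASS chunk=$CHUNK_SZ"
else
    echo "FAIL chunk=$CHUNK_SZ"
fi
```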
process_results.py
Parses the `results_[CHUNK_SZ].txt` files and produces a CSV of the metrics.
Run `python3 process_results.py [CHUNK_SZ] [[CHUNK_SZ_N...]]`.
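A sketch of the parsing step, assuming each `results_[CHUNK_SZ].txt` contains GNU `time -v` output as in the harness sketch above (adjust the patterns to whatever your harness actually records):

```python
import csv
import re
import sys

# Hypothetical sketch of process_results.py; assumes GNU `time -v`
# output in each results file. The patterns match two of the lines
# GNU time prints.
PATTERNS = {
    "elapsed": re.compile(r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\):\s*(.+)"),
    "max_rss_kb": re.compile(r"Maximum resident set size \(kbytes\):\s*(\d+)"),
}

def parse(path):
    with open(path) as f:
        text = f.read()
    row = {}
    for name, pat in PATTERNS.items():
        m = pat.search(text)
        row[name] = m.group(1) if m else ""
    return row

def main(chunk_sizes):
    writer = csv.writer(sys.stdout)
    writer.writerow(["chunk_sz", "elapsed", "max_rss_kb"])
    for sz in chunk_sizes:
        row = parse(f"results_{sz}.txt")
        writer.writerow([sz, row["elapsed"], row["max_rss_kb"]])

if __name__ == "__main__":
    main(sys.argv[1:])
```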