Last active
June 30, 2020 06:10
-
-
Save phucnm/8f59b6c9ce49435efb0eb28828d68a0f to your computer and use it in GitHub Desktop.
Broken gzip, whereas bzip2 is still stable.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
''' | |
When reading about gzip and bzip2, I found an interesting data pattern | |
that terribly breaks gzip where the compressed file is larged than the | |
uncompressed file. The reason is that the input data is incompressible | |
to gzip, hence gzip has to store the data as uncompressed, adding file | |
header and block headers, resulting in extra bytes. On the other hand, | |
bzip is still quite stable and able to achieve a good compression ratio. | |
The format of data from the below script is as follows: | |
step 1, start 0: 0 1 2 3 ... 255 | |
step 2, start 0: 0 2 4 ... 254 | |
step 2, start 1: 1 3 5 7 ... 255 | |
... | |
Looking at the data, there is no 3 charater set appears more than once | |
which means that gzip (which uses DEFLATE under the hood for compressed | |
blocks) cannot compress anything. To make it worse, I can input reversed | |
data since the reversed byte sequence is also incompressible. | |
The below script generates 1343488 bytes. gzip -9 outputs a "compressed" | |
file of size 1343921. Whereas bzip2 -9 outputs a compressed file of size 186546. | |
''' | |
data = [] | |
f = open('broken.bin', 'wb') | |
for step in range(1, 129): | |
for start in range(0, step): | |
for b in range(start, 256, step): | |
data.append(b) | |
b_data = bytes(data) | |
b_r_data = bytes(reversed(data)) | |
for i in range(20): | |
f.write(b_data) | |
f.write(b_r_data) | |
f.write(b_data) | |
f.close() | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment