Skip to content

Instantly share code, notes, and snippets.

@phucnm
Last active June 30, 2020 06:10
Show Gist options
  • Save phucnm/8f59b6c9ce49435efb0eb28828d68a0f to your computer and use it in GitHub Desktop.
Save phucnm/8f59b6c9ce49435efb0eb28828d68a0f to your computer and use it in GitHub Desktop.
Broken gzip, whereas bzip2 is still stable.
'''
When reading about gzip and bzip2, I found an interesting data pattern
that terribly breaks gzip where the compressed file is larged than the
uncompressed file. The reason is that the input data is incompressible
to gzip, hence gzip has to store the data as uncompressed, adding file
header and block headers, resulting in extra bytes. On the other hand,
bzip is still quite stable and able to achieve a good compression ratio.
The format of data from the below script is as follows:
step 1, start 0: 0 1 2 3 ... 255
step 2, start 0: 0 2 4 ... 254
step 2, start 1: 1 3 5 7 ... 255
...
Looking at the data, there is no 3 charater set appears more than once
which means that gzip (which uses DEFLATE under the hood for compressed
blocks) cannot compress anything. To make it worse, I can input reversed
data since the reversed byte sequence is also incompressible.
The below script generates 1343488 bytes. gzip -9 outputs a "compressed"
file of size 1343921. Whereas bzip2 -9 outputs a compressed file of size 186546.
'''
data = []
f = open('broken.bin', 'wb')
for step in range(1, 129):
for start in range(0, step):
for b in range(start, 256, step):
data.append(b)
b_data = bytes(data)
b_r_data = bytes(reversed(data))
for i in range(20):
f.write(b_data)
f.write(b_r_data)
f.write(b_data)
f.close()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment