phucnm · June 30, 2020 06:10
diff --git a/broken_gzip.py b/broken_gzip.py
 '''
 When reading about gzip and bzip2, I found an interesting data pattern 
 that terribly breaks gzip where the compressed file is larged than the 
 uncompressed file. The reason is that the input data is incompressible 
 to gzip, hence gzip has to store the data as uncompressed, adding file 
 header and block headers, resulting in extra bytes. On the other hand, 
 bzip is still quite stable and able to achieve a good compression ratio.

 The format of data from the below script is as follows:
 step 1, start 0: 0 1 2 3 ... 255 
 step 2, start 0: 0 2 4 ... 254
 step 2, start 1: 1 3 5 7 ... 255
 ...

 Looking at the data, there is no 3 charater set appears more than once 
 which means that gzip (which uses DEFLATE under the hood for compressed 
 blocks) cannot compress anything. To make it worse, I can input reversed 
 data since the reversed byte sequence is also incompressible. 

 The below script generates 1343488 bytes. gzip -9 outputs a "compressed" 
 file of size 1343921. Whereas bzip2 -9 outputs a compressed file of size 186546.
 '''
 data = []
 f = open('broken.bin', 'wb')
 for step in range(1, 129):
    for start in range(0, step):
        for b in range(start, 256, step):
            data.append(b)
 b_data = bytes(data)
 b_r_data = bytes(reversed(data))
 for i in range(20):
    f.write(b_data)
    f.write(b_r_data)
 f.write(b_data)
 f.close()
	'''
	When reading about gzip and bzip2, I found an interesting data pattern
	that terribly breaks gzip where the compressed file is larged than the
	uncompressed file. The reason is that the input data is incompressible
	to gzip, hence gzip has to store the data as uncompressed, adding file
	header and block headers, resulting in extra bytes. On the other hand,
	bzip is still quite stable and able to achieve a good compression ratio.

	The format of data from the below script is as follows:
	step 1, start 0: 0 1 2 3 ... 255
	step 2, start 0: 0 2 4 ... 254
	step 2, start 1: 1 3 5 7 ... 255
	...

	Looking at the data, there is no 3 charater set appears more than once
	which means that gzip (which uses DEFLATE under the hood for compressed
	blocks) cannot compress anything. To make it worse, I can input reversed
	data since the reversed byte sequence is also incompressible.

	The below script generates 1343488 bytes. gzip -9 outputs a "compressed"
	file of size 1343921. Whereas bzip2 -9 outputs a compressed file of size 186546.
	'''
	data = []
	f = open('broken.bin', 'wb')
	for step in range(1, 129):
	for start in range(0, step):
	for b in range(start, 256, step):
	data.append(b)
	b_data = bytes(data)
	b_r_data = bytes(reversed(data))
	for i in range(20):
	f.write(b_data)
	f.write(b_r_data)
	f.write(b_data)
	f.close()