maropu/Snappy+BitShuffle

maropu · 2016-03-05T04:58:13Z

BitShuffle Kiyo Masui has proposed is a novel technique to improve compression of typed binary data. The technique itself is not a compression algorithm and it rearranges input typed and binary data into more compressible ones for LZ-variant algorithms such as LZ4 and Snappy. Apache Kudu, Hadoop storage for fast analysis, has already incorporated bit-shuffling in column encoding (See here for details). Typically, LZ-variants cannot compress a sequence of typed data (e.g., ints and floats) efficiently against various type-specific compression algorithms (e.g., delta coding and fpc) because of having different compression models. BitShuffle solves this issue and the rearranged data BitShuffle outputs become suitable for LZ-variants. To check this, I incorporated bit-shuffling in snappy-java and ran quick benchmarks by using a low-skewed 32-bit integer array;

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
Compress(Lower Skew):               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
vanilla snappy(0.281)                     212 /  258         75.3          13.3       1.0X
snappy + bitshuffle(0.077)                 62 /   78        258.3           3.9       3.4X
parquet encoder(0.072)                    297 /  375         53.9          18.6       0.7X

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
Uncompress(Lower Skew):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
vanilla snappy                             93 /  105        171.8           5.8       1.0X
snappy + bitshuffle                        55 /   59        290.7           3.4       1.7X
parquet encoder                            77 /   78        207.2           4.8       1.2X

The exciting results I got!; in terms of compression ratios, snappy + bitshuffle overcomes vanilla snappy by 4 times and it is almost comparable with parquet encoder having an integer-specific encoder. Moreover, the compression/decompression speeds of snappy + bitshuffle are the fastest among them. Detailed benchmark codes and other experimental results can be found here.

maropu/Snappy+BitShuffle

Select an option

No results found

Select an option

No results found

maropu commented Mar 5, 2016

Uh oh!