Last active
October 19, 2020 06:17
BitShuffle, proposed by Kiyo Masui, is a technique for improving the compression of typed binary data. It is not a compression algorithm itself; instead, it rearranges typed binary input into a form that LZ-variant algorithms such as LZ4 and Snappy can compress more effectively. Apache Kudu, a Hadoop storage engine for fast analytics, already uses bit-shuffling in its column encoding (see here for details). LZ variants typically compress sequences of typed data (e.g., ints and floats) less efficiently than type-specific schemes (e.g., delta coding and FPC) because they rely on a different compression model. BitShuffle addresses this issue: the rearranged data it outputs is well suited to LZ variants. To check this, I incorporated bit-shuffling into snappy-java and ran quick benchmarks on a low-skew 32-bit integer array.
The results are exciting. In terms of compression ratio, snappy + bitshuffle outperforms vanilla snappy by a factor of 4 and is almost comparable to the parquet encoder, which has an integer-specific encoder. Moreover, the compression and decompression speeds of snappy + bitshuffle are the fastest among them. Detailed benchmark code and other experimental results can be found here.
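To make the rearrangement concrete, here is a minimal pure-Python sketch of a bit-level transpose for 32-bit integers. It is not the code used in snappy-java or in the Bitshuffle library, which work block-wise with optimized kernels. The function names `bitshuffle32`, `bitunshuffle32`, and `compressed_sizes` are hypothetical, and `zlib` from the standard library is only a stand-in for an LZ-variant compressor such as Snappy or LZ4.

```python
import struct
import zlib


def bitshuffle32(values):
    """Transpose bits so that plane b holds bit b of every input value.

    Simplified sketch: the whole input is treated as one block and its
    length must be a multiple of 8.
    """
    n = len(values)
    if n % 8 != 0:
        raise ValueError("this sketch needs a multiple of 8 values")
    out = bytearray()
    for bit in range(32):              # plane 0 = least significant bit
        for start in range(0, n, 8):   # pack one bit from 8 values into a byte
            byte = 0
            for k in range(8):
                byte |= ((values[start + k] >> bit) & 1) << k
            out.append(byte)
    return bytes(out)


def bitunshuffle32(data):
    """Inverse of bitshuffle32: rebuild the list of 32-bit values."""
    if len(data) % 32 != 0:
        raise ValueError("shuffled data must contain 32 equal-sized planes")
    plane_len = len(data) // 32
    values = [0] * (plane_len * 8)
    for bit in range(32):
        for j in range(plane_len):
            byte = data[bit * plane_len + j]
            for k in range(8):
                values[j * 8 + k] |= ((byte >> k) & 1) << bit
    return values


def compressed_sizes(values):
    """Return (plain, shuffled) compressed sizes in bytes, using zlib
    as a stand-in for Snappy/LZ4."""
    plain = struct.pack("<%dI" % len(values), *values)
    return len(zlib.compress(plain)), len(zlib.compress(bitshuffle32(values)))


# Small values only use the low bits, so the high bit planes become
# long runs of zero bytes after shuffling.
small_ints = [i % 16 for i in range(64)]
shuffled = bitshuffle32(small_ints)
print(compressed_sizes(small_ints))
```

The point of the layout is that integers with a narrow value range leave most bit planes entirely zero, and long runs of identical bytes are exactly what LZ-style matchers handle well. The sizes printed here depend on the compressor and the data, so they say nothing about the Snappy benchmark figures above.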