This is trying to combine the lessons learned from LZ4, Snappy, S2 and friends.
- LZ4: Allow matches > 65536. More efficient long matches. More efficient short offsets.
- Snappy: Improve the max match length. More efficient longer match offsets.
- S2: More efficient repeat storage, relative offsets. Add 24 bits copy lengths.
Additional interesting:
- Minumim match length 8
- Minimum offset 8
- No match start less than 8 bytes from end.
We stay away from anything that would make decompression significantly slower/more intensive. That means no entropy compression and no ROLZ, no transforms or similar.
Encoding is to remain byte-aligned, with each operation starting at a full byte.
A compressed file is a sequence of operations.
Each operation starts with a tag.
Tags are stored as [ID bits 6...7][VALUE 0...5]
The ID represents the operation to perform.
ID | Command |
---|---|
00 |
Literals |
01 |
Lits + Copy Near |
10 |
Copy |
11 |
Repeat |
The VALUE is the lower 6 bits and represents an unsigned integer. The value has different meaning for each operation type.
The maximum length of an operations, excluding any data to copy for literals is 1+3+3 = 7 bytes. (offset > 132096 and length > 8235)
Value | Represents | Range |
---|---|---|
0-60 | Value = value | 0-60 |
61 | Read 1 byte. Value is 61+uint8(read). | 61-316 |
62 | Read 2 bytes. Value is 317 + uint16(read) | 317-65852 |
63 | Read 3 Bytes. Value is 65853 + uint24(read) | 65853 - 16843068 |
Extra bytes to be read are immediately after the tag.
Values longer than 16843068 cannot be represented. Compressors can split these into multiple statements.
Different operations have different use of the value.
It is not possible to encode invalid values. While the operation may get values that are impossible to satisfy due to the amount currently decoded, invalid values cannot be represented.
out_length = value + 1
Literals are uncompressed data stored directly in the byte stream.
Copy out_length
bytes from the stream to output.
This command will output a number of literals followed by a match copy with an offset <= 65536 and a max copy length of 11 bytes.
copy_len = (value&7) + 4 // 4 -> 11
lit_length = (value >> 3) + 1 // 1 -> 1052691
offset = read_2_bytes() + 1 // 1 -> 65536
copy literals
apply copy
(next operation)
Length depends on literals. The following table shows the total number of bytes (excluding literals) for a given number of literals:
Literals → | 1 -> 8 | 9 -> 40 | 40 -> 8232 | 8233-> 2105384 |
---|---|---|---|---|
Output size | 3 | 4 | 5 | 6 |
If more than 11 copy bytes should be emitted, use a repeat command to emit more.
Copy with long offset, either reading a 1, 2 or 3 byte offset.
Two lowest bits of value determines copy type:
Bits | Operation | Read bytes | Bits from value | Offset Bits | Offset base |
---|---|---|---|---|---|
00 , 01 |
Copy 1 | 1 | 1 | 9 | 1 |
10 |
Copy 2 | 2 | 1 | 17 | 513 |
11 |
Copy 3 | 3 | -2 | 22 | 131585 |
Offsets 1->512. Reads 1 byte.
- Offset = uint8 | value[1] << 8 + 1
- Length = value >> 2 + 4
Length → | 4 -> 19 | 20 -> 83 | 84 -> 16467 | 16468 -> 4210771 |
---|---|---|---|---|
Output size | 2 | 3 | 4 | 5 |
Offsets 513 -> 131584. 17 bits offset.
Read 2 bytes (little endian) as x
.
- Offset =
x
| value[2] << 16 + 513 - Length =
value[3:]
+ 4
Output sizes:
Length → | 4 -> 11 | 12 -> 43 | 44 -> 8235 | 8236 -> 2105388 |
---|---|---|---|---|
Output size | 3 | 4 | 5 | 6 |
Offsets 131585 -> 4325888. 22 bits offset. Cheap length on matches.
Length 4 matches are allowed, but not recommended.
Read 3 byte as x
(little endian order)
- Offset =
x[0:21]
+ 131585 - Length =
x[22:24]
+ (value &~3) + 4
Output sizes:
Length → | 4 -> 64 | 65 -> 320 | 321 -> 65856 | 65856 -> 16843072 |
---|---|---|---|---|
Output size | 4 | 5 | 6 | 7 |
It is not possible to represent a copy longer than 4325888 bytes, ~4 MiB back. This seems like a reasonable limit.
Repeat last copy offset.
out_length = value[2:] + 1 || 4
Offset can be modified. value[0:2]
contains the modifier:
Modifier | Represents | Read Bytes | Offset Range | Length |
---|---|---|---|---|
00 |
Use offset as-is | 0 | 0 | 1-4210768 |
01 |
offset = offset +- 2 | 0 | -2 -> 2 (excl 0) | 4-... |
10 |
offset = offset + read_signed_int() | 1 | -128 -> +127 | 4-4210771 |
11 |
offset = offset + read_signed_int16() | 2 | -32768 -> +32767 | 4-4210771 |
Note how the modifier is added to the length.
This also ensures that a copy can always be followed by a repeat if the length cannot be represented.
Length has pretty cheap encoding:
Offset ↓ / Length → | 1 -> 16 | 17 -> 80 | 80 -> 16464 | 16464-> 4210768 |
---|---|---|---|---|
0 | 1 | 2 | 3 | 4 |
-2 -> 2 | 1 | 2 | 3 | 4 |
-128 -> +127 | 2 | 3 | 4 | 5 |
-32768 -> +32767 | 3 | 4 | 5 | 6 |
Offsets have are bigger when going down the table. Ie -32768 -> +32767
has range 4 -> 19
, 20 -> 83
, etc.
delta := int(offset) - int(lastoffset)
switch {
case delta == 0:
// emit modifier '00'
case delta <= 2 && delta >= -2:
// emit modifier '01' - can be omitted for speed.
case delta >= math.MinInt8 && delta <= math.MaxInt8:
// emit modifier '10'
case delta >= math.MinInt16 && delta <= math.MaxInt16:
// emit modifier '11'
}
Note that encoding can be much simpler for faster compression modes or rely on internal checks to avoid the full decision tree.
if e.MatchLen == 0 {
s.emitLits(e.LitLen)
return
}
// Always at least just as good as the alternatives.
delta := int(e.Offset) - int(s.lastoffset)
if delta >= math.MinInt8 && delta <= math.MaxInt8 {
s.emitLits(e.LitLen)
s.emitRepeat(e.Offset, e.MatchLen)
return
}
bigOffset := e.Offset > shortOffset
canRepeat := delta >= math.MinInt16 && delta <= math.MaxInt16 && bigOffset
// If no literals, we don't have to consider the combination
if e.LitLen == 0 {
if canRepeat {
s.emitRepeat(e.Offset, e.MatchLen)
return
}
s.emitCopy(e)
return
}
// Add combined if possible and we have either very small match or lit length.
if bigOffset && e.Offset <= 65536 && (e.MatchLen <= 11 || e.LitLen < 7) && e.LitLen < 2<<20 {
s.emitLitCopy(e)
return
}
// Emit lits separately
s.emitLits(e.LitLen)
// Repeat if it makes sense.
if canRepeat {
s.emitRepeat(e.Offset, e.MatchLen)
return
}
s.emitCopy(e)
It would be really nice if a tag cannot start with less than 8 bytes of the encoded block ending.
That way we always knows there will be enough data to read the command and we don't have to validate that.
It will however more or less force the encoder to output the last 6 bytes as literals.
We could keep 2 offsets and maybe add 1 bit to repeats.
Tried it. Worse in all cases.
We often have extra bits for literal copies in the value. Can't see anything reasonable to use.