NG byte aligned compression

Intro

This is trying to combine the lessons learned from LZ4, Snappy, S2 and friends.

LZ4: Allow matches > 65536. More efficient long matches. More efficient short offsets.
Snappy: Improve the max match length. More efficient longer match offsets.
S2: More efficient repeat storage, relative offsets. Add 24 bits copy lengths.

Additional interesting ideas:

Minimum match length 8
Minimum offset 8
No match start less than 8 bytes from end.

We stay away from anything that would make decompression significantly slower/more intensive. That means no entropy compression and no ROLZ, no transforms or similar.

Encoding is to remain byte-aligned, with each operation starting at a full byte.

Format

A compressed file is a sequence of operations.

Each operation starts with a tag.

Tags are stored as [ID bits 6...7][VALUE 0...5]

The ID represents the operation to perform.

ID	Command
`00`	Literals
`01`	Lits + Copy Near
`10`	Copy
`11`	Repeat

The VALUE is the lower 6 bits and represents an unsigned integer. The value has different meaning for each operation type.

The maximum length of an operation, excluding any literal data to copy, is 1+3+3 = 7 bytes (offset > 132096 and length > 8235).

Value	Represents	Range
0-60	Value = value	0-60
61	Read 1 byte. Value is 61 + uint8(read).	61-316
62	Read 2 bytes. Value is 317 + uint16(read).	317-65852
63	Read 3 bytes. Value is 65853 + uint24(read).	65853-16843068

Extra bytes to be read are immediately after the tag.

Values larger than 16843068 cannot be represented. Compressors can split these into multiple operations.
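
A minimal Go sketch of the value encoding above, showing how a tag and its extra bytes could be written and read back. The names `appendTag` and `readTag` are hypothetical, and the little-endian order of the extra bytes is an assumption, since the byte order is not stated for the value.

```go
package main

import (
	"errors"
	"fmt"
)

// Bases of the extended ranges, taken from the value table.
const (
	base1    = 61                // tag value 61: 1 extra byte
	base2    = 317               // tag value 62: 2 extra bytes
	base3    = 65853             // tag value 63: 3 extra bytes
	maxValue = base3 + 1<<24 - 1 // 16843068
)

var errShort = errors.New("short input")

// appendTag appends the tag for id (0-3) and value, plus any extra bytes.
// The caller must keep value <= maxValue and split larger values.
func appendTag(dst []byte, id uint8, value uint32) []byte {
	switch {
	case value <= 60:
		return append(dst, id<<6|uint8(value))
	case value < base2:
		return append(dst, id<<6|61, uint8(value-base1))
	case value < base3:
		v := value - base2
		return append(dst, id<<6|62, uint8(v), uint8(v>>8))
	default:
		v := value - base3
		return append(dst, id<<6|63, uint8(v), uint8(v>>8), uint8(v>>16))
	}
}

// readTag decodes a tag from the start of src and returns the ID, the
// value and the number of bytes consumed.
func readTag(src []byte) (id uint8, value uint32, n int, err error) {
	if len(src) == 0 {
		return 0, 0, 0, errShort
	}
	id, value = src[0]>>6, uint32(src[0]&63)
	if value <= 60 {
		return id, value, 1, nil
	}
	extra := int(value) - 60 // 1, 2 or 3 extra bytes
	if len(src) < 1+extra {
		return 0, 0, 0, errShort
	}
	switch extra {
	case 1:
		value = base1 + uint32(src[1])
	case 2:
		value = base2 + (uint32(src[1]) | uint32(src[2])<<8)
	default:
		value = base3 + (uint32(src[1]) | uint32(src[2])<<8 | uint32(src[3])<<16)
	}
	return id, value, 1 + extra, nil
}

func main() {
	b := appendTag(nil, 2, 1000)
	id, v, n, err := readTag(b)
	fmt.Println(id, v, n, err) // 2 1000 3 <nil>
}
```

Because each range starts where the previous one ends, every value has exactly one encoding and no byte sequence decodes to an out-of-range value.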

Different operations have different use of the value.

Every value that can be encoded is valid. An operation may still receive a value that cannot be satisfied given the amount of data decoded so far, but the encoding itself cannot represent an invalid value.

Commands

Copy Literals.

out_length = value + 1

Literals are uncompressed data stored directly in the byte stream.

Copy out_length bytes from the stream to output.
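
As a rough Go illustration of the literal operation, assuming the tag value has already been decoded: `copyLiterals` shows the decoder side, and `literalValues` shows how an encoder might split a run that is too long for one tag. Both names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// maxValue is the largest value a tag can carry.
const maxValue = 16843068

// copyLiterals executes one Literals operation. value is the decoded tag
// value and src starts at the first literal byte.
func copyLiterals(dst, src []byte, value uint32) (out, rest []byte, err error) {
	n := int(value) + 1 // out_length = value + 1
	if n > len(src) {
		return dst, src, errors.New("literals run past end of input")
	}
	return append(dst, src[:n]...), src[n:], nil
}

// literalValues returns the tag values an encoder would emit to cover n
// literal bytes, splitting runs longer than maxValue+1.
func literalValues(n int) []uint32 {
	var vals []uint32
	for n > 0 {
		c := n
		if c > maxValue+1 {
			c = maxValue + 1
		}
		vals = append(vals, uint32(c-1))
		n -= c
	}
	return vals
}

func main() {
	out, rest, err := copyLiterals(nil, []byte("hello!"), 4)
	fmt.Printf("%s %s %v\n", out, rest, err)
	fmt.Println(literalValues(maxValue + 2))
}
```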

Lits + Near Copy

This command will output a number of literals followed by a match copy with an offset <= 65536 and a max copy length of 11 bytes.

copy_len = (value&7) + 4 // 4 -> 11
lit_length = (value >> 3) + 1 // 1 -> 2105384
offset = read_2_bytes() + 1 // 1 -> 65536
copy literals
apply copy
(next operation)

Length depends on literals. The following table shows the total number of bytes (excluding literals) for a given number of literals:

Literals →	1 -> 8	9 -> 40	41 -> 8232	8233 -> 2105384
Output size	3	4	5	6

If more than 11 copy bytes should be emitted, use a repeat command to emit more.
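
The steps of this command can be sketched in Go as follows. This is only an illustration: `litCopy` is a hypothetical name, the 2 offset bytes are assumed to be little endian, and they are assumed to sit before the literals because that is the order of the steps listed above.

```go
package main

import (
	"errors"
	"fmt"
)

// litCopy executes one "Lits + Copy Near" operation. value is the decoded
// tag value; src starts at the 2 offset bytes, followed by the literals
// (assumed layout).
func litCopy(dst, src []byte, value uint32) (out, rest []byte, err error) {
	copyLen := int(value&7) + 4 // 4 -> 11
	litLen := int(value>>3) + 1 // 1 -> 2105384
	if len(src) < 2+litLen {
		return dst, src, errors.New("short input")
	}
	offset := (int(src[0]) | int(src[1])<<8) + 1 // 1 -> 65536
	dst = append(dst, src[2:2+litLen]...)
	if offset > len(dst) {
		return dst, src, errors.New("offset points before start of output")
	}
	// Copy byte by byte so an offset shorter than copyLen repeats the
	// bytes that this copy has just produced.
	for i := 0; i < copyLen; i++ {
		dst = append(dst, dst[len(dst)-offset])
	}
	return dst, src[2+litLen:], nil
}

func main() {
	// 2 literals ("ab"), then copy 6 bytes from offset 2.
	value := uint32(2-1)<<3 | uint32(6-4)
	out, _, err := litCopy(nil, []byte{1, 0, 'a', 'b'}, value)
	fmt.Printf("%s %v\n", out, err) // abababab <nil>
}
```

The offset is checked after the literals are appended, so the copy may reference the literals of the same operation.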

Copy

Copy with long offset, either reading a 1, 2 or 3 byte offset.

The two lowest bits of value determine the copy type:

Bits	Operation	Read bytes	Bits from value	Offset Bits	Offset base
`00`, `01`	Copy 1	1	1	9	1
`10`	Copy 2	2	1	17	513
`11`	Copy 3	3	-2	22	131585

Copy 1

Offsets 1->512. Reads 1 byte.

Offset = (uint8 | value[1] << 8) + 1
Length = (value >> 2) + 4

Length →	4 -> 19	20 -> 83	84 -> 16467	16468 -> 4210771
Output size	2	3	4	5
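
A small Go sketch of how Copy 1 could be packed and unpacked. The function names are hypothetical. It assumes the type bits in the table are listed lowest bit first, so that bit 0 clear marks Copy 1 and bit 1 is free to carry the ninth offset bit, which is what `value[1]` in the formula suggests.

```go
package main

import "fmt"

// encodeCopy1 packs a Copy 1 with offset 1-512 and length >= 4 into the
// tag value and the single offset byte. Bit 0 of value stays 0 (assumed
// Copy 1 marker), bit 1 holds offset bit 8, the rest holds length-4.
func encodeCopy1(offset, length int) (value uint32, b byte) {
	o := offset - 1
	value = uint32(length-4)<<2 | uint32(o>>8)<<1
	return value, byte(o)
}

// decodeCopy1 reverses encodeCopy1, following the formulas above.
func decodeCopy1(value uint32, b byte) (offset, length int) {
	offset = (int(b) | int(value>>1&1)<<8) + 1
	length = int(value>>2) + 4
	return offset, length
}

func main() {
	value, b := encodeCopy1(300, 10)
	fmt.Println(value, b)              // 26 43
	fmt.Println(decodeCopy1(value, b)) // 300 10
}
```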

Copy 2

Offsets 513 -> 131584. 17 bits offset.

Read 2 bytes (little endian) as x.

Offset = (x | value[2] << 16) + 513
Length = value[3:] + 4

Output sizes:

Length →	4 -> 11	12 -> 43	44 -> 8235	8236 -> 2105388
Output size	3	4	5	6
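
For illustration, a Copy 2 decode might look like this in Go. `decodeCopy2` is a hypothetical name, and the marker value used in `main` assumes the same lowest-bit-first reading of the type bits (bit 0 set, bit 1 clear).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeCopy2 decodes a Copy 2. value is the decoded tag value and src
// must hold at least the 2 offset bytes. Bits 0-1 of value are the type
// bits, bit 2 is offset bit 16, bits 3 and up are length-4.
func decodeCopy2(value uint32, src []byte) (offset, length int) {
	x := int(binary.LittleEndian.Uint16(src))
	offset = (x | int(value>>2&1)<<16) + 513
	length = int(value>>3) + 4
	return offset, length
}

func main() {
	// Assumed Copy 2 marker in bit 0, offset bit 16 set, length 7+4.
	value := uint32(1 | 1<<2 | 7<<3)
	fmt.Println(decodeCopy2(value, []byte{0xff, 0xff})) // 131584 11
}
```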

Copy 3

Offsets 131585 -> 4325888. 22 bits offset. Cheap length on matches.

Length 4 matches are allowed, but not recommended.

Read 3 bytes as x (little endian).

Offset = x[0:21] + 131585
Length = x[22:24] + (value &~3) + 4

Output sizes:

Length →	4 -> 64	65 -> 320	321 -> 65856	65856 -> 16843072
Output size	4	5	6	7

It is not possible to represent a copy from more than 4325888 bytes (~4 MiB) back. This seems like a reasonable limit.
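
A hedged Go sketch of the offset limits and of a Copy 3 decode. `copyTypeFor` and `decodeCopy3` are hypothetical names. The length formula is read as: the top 2 bits of x take the place of the two type bits in value.

```go
package main

import (
	"errors"
	"fmt"
)

// Largest offset each copy type can carry, from the sections above.
const (
	maxCopy1 = 512
	maxCopy2 = 131584
	maxCopy3 = 4325888
)

// copyTypeFor returns which copy type (1, 2 or 3) can carry offset.
func copyTypeFor(offset int) (int, error) {
	switch {
	case offset < 1 || offset > maxCopy3:
		return 0, errors.New("offset cannot be represented")
	case offset <= maxCopy1:
		return 1, nil
	case offset <= maxCopy2:
		return 2, nil
	}
	return 3, nil
}

// decodeCopy3 decodes a Copy 3. src must hold at least the 3 bytes read
// as x. The low 22 bits of x are the offset, the top 2 bits are added to
// value with its type bits cleared to form length-4.
func decodeCopy3(value uint32, src []byte) (offset, length int) {
	x := int(src[0]) | int(src[1])<<8 | int(src[2])<<16
	offset = x&(1<<22-1) + 131585
	length = x>>22 + int(value&^3) + 4
	return offset, length
}

func main() {
	fmt.Println(copyTypeFor(70000))                        // 2 <nil>
	fmt.Println(decodeCopy3(3, []byte{0xff, 0xff, 0xff})) // 4325888 7
}
```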

Repeat (with offset)

Repeat last copy offset.

out_length = value[2:] + 1 for modifier `00`, value[2:] + 4 for all other modifiers

Offset can be modified. value[0:2] contains the modifier:

Modifier	Represents	Read Bytes	Offset Range	Length
`00`	Use offset as-is	0	0	1-4210768
`01`	offset = offset +- 2	0	-2 -> 2 (excl 0)	4-...
`10`	offset = offset + read_signed_int()	1	-128 -> +127	4-4210771
`11`	offset = offset + read_signed_int16()	2	-32768 -> +32767	4-4210771

Note how the modifier is added to the length.

This also ensures that a copy can always be followed by a repeat if the length cannot be represented.

Length has pretty cheap encoding:

Offset ↓ / Length →	1 -> 16	17 -> 80	81 -> 16464	16465 -> 4210768
0	1	2	3	4
-2 -> 2	1	2	3	4
-128 -> +127	2	3	4	5
-32768 -> +32767	3	4	5	6

Length ranges shift up by 3 for the rows that modify the offset. I.e. -32768 -> +32767 has ranges 4 -> 19, 20 -> 83, etc.

	delta := int(offset) - int(lastoffset)
	switch {
	case delta == 0:
		// emit modifier '00'
	case delta <= 2 && delta >= -2:
		// emit modifier '01' - can be omitted for speed.
	case delta >= math.MinInt8 && delta <= math.MaxInt8:
		// emit modifier '10'
	case delta >= math.MinInt16 && delta <= math.MaxInt16:
		// emit modifier '11'
	}
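
The decoder side of a repeat could look roughly like the Go sketch below. `decodeRepeat` is a hypothetical name. The modifier is taken as the plain number `value & 3` and the 16-bit delta is read little endian; both are assumptions. Modifier `01` is left out because its exact encoding is not spelled out here.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// decodeRepeat resolves a Repeat operation against the previous copy
// offset. value is the decoded tag value and src holds the bytes after
// the tag. It returns the new offset, the output length and how many
// bytes of src were used.
func decodeRepeat(last int, value uint32, src []byte) (offset, length, n int, err error) {
	length = int(value>>2) + 4
	switch value & 3 {
	case 0: // offset as-is, lengths start at 1 instead of 4
		offset, length = last, length-3
	case 2: // signed 8-bit delta
		if len(src) < 1 {
			return 0, 0, 0, errors.New("short input")
		}
		offset, n = last+int(int8(src[0])), 1
	case 3: // signed 16-bit delta, little endian assumed
		if len(src) < 2 {
			return 0, 0, 0, errors.New("short input")
		}
		offset, n = last+int(int16(binary.LittleEndian.Uint16(src))), 2
	default:
		return 0, 0, 0, errors.New("modifier 01 is left out of this sketch")
	}
	if offset < 1 {
		return 0, 0, 0, errors.New("offset out of range")
	}
	return offset, length, n, nil
}

func main() {
	fmt.Println(decodeRepeat(100, 2, []byte{0xfe})) // 98 4 1 <nil>
}
```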

Encoding decision tree

Note that encoding can be much simpler for faster compression modes or rely on internal checks to avoid the full decision tree.

	if e.MatchLen == 0 {
		s.emitLits(e.LitLen)
		return
	}

	// A repeat with a 1-byte offset delta is always at least as good as the alternatives.
	delta := int(e.Offset) - int(s.lastoffset)
	if delta >= math.MinInt8 && delta <= math.MaxInt8 {
		s.emitLits(e.LitLen)
		s.emitRepeat(e.Offset, e.MatchLen)
		return
	}
	bigOffset := e.Offset > shortOffset
	canRepeat := delta >= math.MinInt16 && delta <= math.MaxInt16 && bigOffset

	// If no literals, we don't have to consider the combination
	if e.LitLen == 0 {
		if canRepeat {
			s.emitRepeat(e.Offset, e.MatchLen)
			return
		}
		s.emitCopy(e)
		return
	}

	// Add combined if possible and we have either very small match or lit length.
	if bigOffset && e.Offset <= 65536 && (e.MatchLen <= 11 || e.LitLen < 7) && e.LitLen < 2<<20 {
		s.emitLitCopy(e)
		return
	}

	// Emit lits separately
	s.emitLits(e.LitLen)

	// Repeat if it makes sense.
	if canRepeat {
		s.emitRepeat(e.Offset, e.MatchLen)
		return
	}
	s.emitCopy(e)

IDEAS AND STUFF

Endgame

It would be really nice if a tag could not start less than 8 bytes from the end of the encoded block.

That way we always know there will be enough data to read the command and we don't have to validate that.

It will however more or less force the encoder to output the last 6 bytes as literals.

Repeats

We could keep 2 offsets and maybe add 1 bit to repeats.

Tried it. Worse in all cases.

Literals

We often have extra bits for literal copies in the value. Can't see anything reasonable to use.

Alternatively we could just change S2 to emit long matches as 3 bytes instead of 4 (S2 sizes adjusted)

    decode_test.go:218: ENWIK9:
    decode_test.go:219: S2 Size: 340160015
    decode_test.go:220: NG Size: 336917924 - Brute Force: 336896076
    decode_test.go:223: Gain: 3242091 bytes (0.95%; brute: 0.96%; 1.15% ex lits)
    decode_test.go:218: RANKS:
    decode_test.go:219: S2 Size: 510767105
    decode_test.go:220: NG Size: 496903025 - Brute Force: 496774758
    decode_test.go:223: Gain: 13864080 bytes (2.71%; brute: 2.74%; 3.88% ex lits)
    decode_test.go:218: GITHUB:
    decode_test.go:219: S2 Size: 781331477
    decode_test.go:220: NG Size: 788315878 - Brute Force: 788176052
    decode_test.go:223: Gain: -6984401 bytes (-0.89%; brute: -0.88%; -1.40% ex lits)
    decode_test.go:218: CONSENSUS:
    decode_test.go:219: S2 Size: 4121907724
    decode_test.go:220: NG Size: 4103624175 - Brute Force: 4102970365
    decode_test.go:223: Gain: 18283549 bytes (0.44%; brute: 0.46%; 2.73% ex lits)
    decode_test.go:218: NYC:
    decode_test.go:219: S2 Size: 688973678
    decode_test.go:220: NG Size: 686325672 - Brute Force: 686303567
    decode_test.go:223: Gain: 2648006 bytes (0.38%; brute: 0.39%; 0.42% ex lits)
    decode_test.go:218: MINT:
    decode_test.go:219: S2 Size: 3661007999
    decode_test.go:220: NG Size: 3619864350 - Brute Force: 3618459691
    decode_test.go:223: Gain: 41143649 bytes (1.12%; brute: 1.16%; 3.65% ex lits)
    decode_test.go:218: SOFIA:
    decode_test.go:219: S2 Size: 3668146768
    decode_test.go:220: NG Size: 3728298101 - Brute Force: 3728268060
    decode_test.go:223: Gain: -60151333 bytes (-1.64%; brute: -1.64%; -1.87% ex lits)

LOL

klauspost/byte-compression.md
