UTF-8: Unmasked

NB: This explanation assumes you are familiar with Unicode and differences between Unicode itself and encodings.

When I tried to understand UTF-8, first I read some decoders sources. They use bitwise masks and control flow, which is good for performance, but kind of bad for understanding.

So, here is UTF-8 explanation without masks. (And with PEG.js grammar.)

Rules

UTF-8 byte string consists of 1 to 4 byte sequences, which must follow the rules:

Bytes	Bits	Byte 1	Byte 2	Byte 3	Byte 4	Range
1	7	`0XXXXXXX`	Unused	Unused	Unused	`[0x0, 0x80)`
2	11	`110XXXXX`	`10XXXXXX`	Unused	Unused	`[0x80, 0x800)`
3	16	`1110XXXX`	`10XXXXXX`	`10XXXXXX`	Unused	`[0x800, 0x10000) \ [0xd800, 0xdfff]`
4	21	`11110XXX`	`10XXXXXX`	`10XXXXXX`	`10XXXXXX`	`[0x10000, 0x10ffff]`

X-bits of each sequence concatenated in left-to-right order and form its value.

If the decoding value does not belong to the specified range, the sequence is not well-formed (may be treated as invalid).

NB: The [0xd800, 0xdfff] is subtracted from 3-byte sequences’ range to avoid Unicode surrogates in UTF-8. Surrogates (surrograte pairs) are used in UTF-16 and absolutely pointless in UTF-8.

PEG.js

To understand these rules better, you can play with them as a grammar defined on '0' and '1' strings with optional whitespace/newlines between “bytes”.

Open PEG.js playground and paste the following source to the grammar field:

utf-8.pegjs

{
    const parse = xs => parseInt(xs.join(''), 2)

    const format = (bytes, ...xs) => {
        const int = parse(xs)

        return [
            'bin: ' + int.toString(2).padStart(21, '0'),
            'dec: ' + int.toString(10).padStart(7, '0'),
            'hex: ' + int.toString(16).padStart(6, '0'),
            'bytes: ' + bytes,
            'string: ' + String.fromCodePoint(int)
        ].join(', ')
    }

    const check = (f, ...xs) => f(parse(xs))
}

Demo = Sequence*

// UTF-8 sequence may consist of 1 to 4 bytes.
// We will parse sequences as strings of `'0'` and `'1'` separated by whitespaces/newlines.
Sequence
    = Sequence1Byte
    / Sequence2Byte
    / Sequence3Byte
    / Sequence4Byte

// Continuation bytes are used in multi-byte sequences.
// Top bits of each continuation byte must be `10`.
// Remaining 6 bits are its value.
ContinuationByte = '10'x:XXXXXX { return x }

Sequence1Byte '1-byte sequence' =
    // The top bit of 1-byte sequence myst be unset.
    // The remaining 7 bits are its value.
    _ '0'a:XXXXXXX _ // Byte 1 (7 bits)
     { return format(1, a) }
     // Total: 7 bits.

Sequence2Byte '2-byte sequence' =
    // Top bits of the 2-byte sequence’s first byte must be `110`.
    // The remaining 5 bits are the sequence’s top bits.
    _ '110'a:XXXXX       _ // Byte 1 (5 bits)
    _ b:ContinuationByte _ // Byte 2 (6 bits)
    &{ return check(x => x >= 0x80, a, b) }
     { return format(2, a, b) }
     // Total: 11 bits.

Sequence3Byte '3-byte sequence' =
    // Top bits of first byte must be `1110`, the value is remaining 4 bits.
    _ '1110'a:XXXX       _ // Byte 1 (4 bits)
    _ b:ContinuationByte _ // Byte 2 (6 bits)
    _ c:ContinuationByte _ // Byte 3 (6 bits)
    &{ return check(x => x >= 0x800 && x < 0xd800 || x > 0xdfff, a, b, c) }
     { return format(3, a, b, c) }
     // Total: 16 bits.

Sequence4Byte '4-byte sequence' =
    // 4-byte sequence’s top bits must be `11110`.
    _ '11110'a:XXX       _ // Byte 1 (3 bits)
    _ b:ContinuationByte _ // Byte 2 (6 bits)
    _ c:ContinuationByte _ // Byte 3 (6 bits)
    _ d:ContinuationByte _ // Byte 4 (6 bits)
    &{ return check(x => x >= 0x010000 && x <= 0x10ffff, a, b, c, d) }
     { return format(4, a, b, c, d) }
     // Total: 21 bits.

XXXXXXX = $(X X X X X X X)
XXXXXX = $(X X X X X X)
XXXXX = $(X X X X X)
XXXX = $(X X X X)
XXX = $(X X X)
XX = $(X X)
X = [01]

_ = [ \n]*

Type some sequences of zeroes and ones into the input field, e.g:

01110001
11110000 10101000 10101101 10001110
11010001 10000110
11100011 10000001 10000010
11001110 10111011
11100010 10000110 10010010
00100011
11110000 10011101 10010011 10010000

To get good sequences to test the grammar, you may use browser’s devtools:

Paste the function:

const encode = x => {
    const encoded = [...x]
        .map(x => [...new TextEncoder().encode(x)].map(x => x.toString(2).padStart(8, '0')).join(' '))
        .join('\n');

    copy(encoded);

    return encoded;
};

This function:

Splits your string into the Unicode characters,
encodes them using built-in encoder,
serializes them as a string of zeroes and ones,
copies serialized value to the clipboard.

Use it:
```
encode('q𨭎цあλ→#𝓐');
```
Paste encoded string into the playground’s input field.

NB: This grammar is completely regular, so it may be written as regular expression. But range checks will become unclear hacks.

Reverse order

Well, is it possible to read UTF-8 string from the end? Yes, absolutely!

Bytes of each sequence must be read in reverse order, but other logic remains the same.

utf-8-reverse.pegjs

{
    const parse = xs => parseInt(xs.join(''), 2)

    const format = (bytes, ...xs) => {
        const int = parse(xs)

        return [
            'bin: ' + int.toString(2).padStart(21, '0'),
            'dec: ' + int.toString(10).padStart(7, '0'),
            'hex: ' + int.toString(16).padStart(6, '0'),
            'bytes: ' + bytes,
            'string: ' + String.fromCodePoint(int)
        ].join(', ')
    }

    const check = (f, ...xs) => f(parse(xs))
}

Demo = Sequence*

Sequence
    = Sequence1Byte
    / Sequence2Byte
    / Sequence3Byte
    / Sequence4Byte

ContinuationByte = '10'x:XXXXXX { return x }

Sequence1Byte '1-byte sequence' =
    _ '0'a:XXXXXXX _
     { return format(1, a) }

Sequence2Byte '2-byte sequence' =
    _ b:ContinuationByte _
    _ '110'a:XXXXX       _
    &{ return check(x => x >= 0x80, a, b) }
     { return format(2, a, b) }

Sequence3Byte '3-byte sequence' =
    _ c:ContinuationByte _
    _ b:ContinuationByte _
    _ '1110'a:XXXX       _
    &{ return check(x => x >= 0x800 && x < 0xd800 || x > 0xdfff, a, b, c) }
     { return format(3, a, b, c) }

Sequence4Byte '4-byte sequence' =
    // 4-byte sequence’s top bits must be `11110`.
    _ d:ContinuationByte _
    _ c:ContinuationByte _
    _ b:ContinuationByte _
    _ '11110'a:XXX       _
    &{ return check(x => x >= 0x010000 && x <= 0x10ffff, a, b, c, d) }
     { return format(4, a, b, c, d) }

XXXXXXX = $(X X X X X X X)
XXXXXX = $(X X X X X X)
XXXXX = $(X X X X X)
XXXX = $(X X X X)
XXX = $(X X X)
XX = $(X X)
X = [01]

_ = [ \n]*

Encoding function becomes:

const encodeReverse = x => {
    const encoded = [...x]
        .map(x =>
            [...new TextEncoder().encode(x)]
                .reverse()
                .map(x => x.toString(2).padStart(8, '0'))
                .join(' '),
        )
        .join('\n');

    copy(encoded);

    return encoded;
};

encodeReverse('q𨭎цあλ→#𝓐');

Or even:

const encodeReverse_ = x => {
    const encoded = [...new TextEncoder().encode(x)]
        .reverse()
        .map(x => x.toString(2).padStart(8, '0'))
        .join(' ');

    copy(encoded);

    return encoded;
};

encodeReverse_('q𨭎цあλ→#𝓐');

Well, that’s almost all about UTF-8. Please note, that when decoding actual byte string you should rather use bitwise masks and control flow, because it’s faster.

As an alternative to bitwise checks (like if ((x & 0xc0) == 0x80) for the continuation bytes) you may use sequences grammar defined in the RFC, but even in that case you should use masks and shifts to get scalar values.

Also there are some rules related to encoding operations, but it may be clearly inferred from the rules and grammars above.

miyaokamarina/UTF-8: Unmasked.md

UTF-8: Unmasked

Rules

PEG.js

Reverse order

maximal commented May 28, 2020

Uh oh!