Skip to content

Instantly share code, notes, and snippets.

@miyaokamarina
Last active December 24, 2020 12:03
Show Gist options
  • Save miyaokamarina/2f9c12edb0224d77f9c45e31e6c37aa1 to your computer and use it in GitHub Desktop.
Save miyaokamarina/2f9c12edb0224d77f9c45e31e6c37aa1 to your computer and use it in GitHub Desktop.
UTF-8: Unmasked

UTF-8: Unmasked

NB: This explanation assumes you are familiar with Unicode and differences between Unicode itself and encodings.

When I tried to understand UTF-8, first I read some decoders sources. They use bitwise masks and control flow, which is good for performance, but kind of bad for understanding.

So, here is UTF-8 explanation without masks. (And with PEG.js grammar.)

Rules

UTF-8 byte string consists of 1 to 4 byte sequences, which must follow the rules:

Bytes Bits Byte 1 Byte 2 Byte 3 Byte 4 Range
1 7 0XXXXXXX Unused Unused Unused [0x0, 0x80)
2 11 110XXXXX 10XXXXXX Unused Unused [0x80, 0x800)
3 16 1110XXXX 10XXXXXX 10XXXXXX Unused [0x800, 0x10000) \ [0xd800, 0xdfff]
4 21 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX [0x10000, 0x10ffff]

X-bits of each sequence concatenated in left-to-right order and form its value.

If the decoding value does not belong to the specified range, the sequence is not well-formed (may be treated as invalid).

NB: The [0xd800, 0xdfff] is subtracted from 3-byte sequences’ range to avoid Unicode surrogates in UTF-8. Surrogates (surrograte pairs) are used in UTF-16 and absolutely pointless in UTF-8.

PEG.js

To understand these rules better, you can play with them as a grammar defined on '0' and '1' strings with optional whitespace/newlines between “bytes”.

  1. Open PEG.js playground and paste the following source to the grammar field:

    utf-8.pegjs
    {
        const parse = xs => parseInt(xs.join(''), 2)
    
        const format = (bytes, ...xs) => {
            const int = parse(xs)
    
            return [
                'bin: ' + int.toString(2).padStart(21, '0'),
                'dec: ' + int.toString(10).padStart(7, '0'),
                'hex: ' + int.toString(16).padStart(6, '0'),
                'bytes: ' + bytes,
                'string: ' + String.fromCodePoint(int)
            ].join(', ')
        }
    
        const check = (f, ...xs) => f(parse(xs))
    }
    
    Demo = Sequence*
    
    // UTF-8 sequence may consist of 1 to 4 bytes.
    // We will parse sequences as strings of `'0'` and `'1'` separated by whitespaces/newlines.
    Sequence
        = Sequence1Byte
        / Sequence2Byte
        / Sequence3Byte
        / Sequence4Byte
    
    // Continuation bytes are used in multi-byte sequences.
    // Top bits of each continuation byte must be `10`.
    // Remaining 6 bits are its value.
    ContinuationByte = '10'x:XXXXXX { return x }
    
    Sequence1Byte '1-byte sequence' =
        // The top bit of 1-byte sequence myst be unset.
        // The remaining 7 bits are its value.
        _ '0'a:XXXXXXX _ // Byte 1 (7 bits)
         { return format(1, a) }
         // Total: 7 bits.
    
    Sequence2Byte '2-byte sequence' =
        // Top bits of the 2-byte sequence’s first byte must be `110`.
        // The remaining 5 bits are the sequence’s top bits.
        _ '110'a:XXXXX       _ // Byte 1 (5 bits)
        _ b:ContinuationByte _ // Byte 2 (6 bits)
        &{ return check(x => x >= 0x80, a, b) }
         { return format(2, a, b) }
         // Total: 11 bits.
    
    Sequence3Byte '3-byte sequence' =
        // Top bits of first byte must be `1110`, the value is remaining 4 bits.
        _ '1110'a:XXXX       _ // Byte 1 (4 bits)
        _ b:ContinuationByte _ // Byte 2 (6 bits)
        _ c:ContinuationByte _ // Byte 3 (6 bits)
        &{ return check(x => x >= 0x800 && x < 0xd800 || x > 0xdfff, a, b, c) }
         { return format(3, a, b, c) }
         // Total: 16 bits.
    
    Sequence4Byte '4-byte sequence' =
        // 4-byte sequence’s top bits must be `11110`.
        _ '11110'a:XXX       _ // Byte 1 (3 bits)
        _ b:ContinuationByte _ // Byte 2 (6 bits)
        _ c:ContinuationByte _ // Byte 3 (6 bits)
        _ d:ContinuationByte _ // Byte 4 (6 bits)
        &{ return check(x => x >= 0x010000 && x <= 0x10ffff, a, b, c, d) }
         { return format(4, a, b, c, d) }
         // Total: 21 bits.
    
    XXXXXXX = $(X X X X X X X)
    XXXXXX = $(X X X X X X)
    XXXXX = $(X X X X X)
    XXXX = $(X X X X)
    XXX = $(X X X)
    XX = $(X X)
    X = [01]
    
    _ = [ \n]*
  2. Type some sequences of zeroes and ones into the input field, e.g:

    01110001
    11110000 10101000 10101101 10001110
    11010001 10000110
    11100011 10000001 10000010
    11001110 10111011
    11100010 10000110 10010010
    00100011
    11110000 10011101 10010011 10010000
    

To get good sequences to test the grammar, you may use browser’s devtools:

  1. Paste the function:

    const encode = x => {
        const encoded = [...x]
            .map(x => [...new TextEncoder().encode(x)].map(x => x.toString(2).padStart(8, '0')).join(' '))
            .join('\n');
    
        copy(encoded);
    
        return encoded;
    };

    This function:

    1. Splits your string into the Unicode characters,
    2. encodes them using built-in encoder,
    3. serializes them as a string of zeroes and ones,
    4. copies serialized value to the clipboard.
  2. Use it:

    encode('q𨭎цあλ→#𝓐');
  3. Paste encoded string into the playground’s input field.

NB: This grammar is completely regular, so it may be written as regular expression. But range checks will become unclear hacks.

Reverse order

Well, is it possible to read UTF-8 string from the end? Yes, absolutely!

Bytes of each sequence must be read in reverse order, but other logic remains the same.

utf-8-reverse.pegjs
{
    const parse = xs => parseInt(xs.join(''), 2)

    const format = (bytes, ...xs) => {
        const int = parse(xs)

        return [
            'bin: ' + int.toString(2).padStart(21, '0'),
            'dec: ' + int.toString(10).padStart(7, '0'),
            'hex: ' + int.toString(16).padStart(6, '0'),
            'bytes: ' + bytes,
            'string: ' + String.fromCodePoint(int)
        ].join(', ')
    }

    const check = (f, ...xs) => f(parse(xs))
}

Demo = Sequence*

Sequence
    = Sequence1Byte
    / Sequence2Byte
    / Sequence3Byte
    / Sequence4Byte

ContinuationByte = '10'x:XXXXXX { return x }

Sequence1Byte '1-byte sequence' =
    _ '0'a:XXXXXXX _
     { return format(1, a) }

Sequence2Byte '2-byte sequence' =
    _ b:ContinuationByte _
    _ '110'a:XXXXX       _
    &{ return check(x => x >= 0x80, a, b) }
     { return format(2, a, b) }

Sequence3Byte '3-byte sequence' =
    _ c:ContinuationByte _
    _ b:ContinuationByte _
    _ '1110'a:XXXX       _
    &{ return check(x => x >= 0x800 && x < 0xd800 || x > 0xdfff, a, b, c) }
     { return format(3, a, b, c) }

Sequence4Byte '4-byte sequence' =
    // 4-byte sequence’s top bits must be `11110`.
    _ d:ContinuationByte _
    _ c:ContinuationByte _
    _ b:ContinuationByte _
    _ '11110'a:XXX       _
    &{ return check(x => x >= 0x010000 && x <= 0x10ffff, a, b, c, d) }
     { return format(4, a, b, c, d) }

XXXXXXX = $(X X X X X X X)
XXXXXX = $(X X X X X X)
XXXXX = $(X X X X X)
XXXX = $(X X X X)
XXX = $(X X X)
XX = $(X X)
X = [01]

_ = [ \n]*

Encoding function becomes:

const encodeReverse = x => {
    const encoded = [...x]
        .map(x =>
            [...new TextEncoder().encode(x)]
                .reverse()
                .map(x => x.toString(2).padStart(8, '0'))
                .join(' '),
        )
        .join('\n');

    copy(encoded);

    return encoded;
};

encodeReverse('q𨭎цあλ→#𝓐');

Or even:

const encodeReverse_ = x => {
    const encoded = [...new TextEncoder().encode(x)]
        .reverse()
        .map(x => x.toString(2).padStart(8, '0'))
        .join(' ');

    copy(encoded);

    return encoded;
};

encodeReverse_('q𨭎цあλ→#𝓐');

Well, that’s almost all about UTF-8. Please note, that when decoding actual byte string you should rather use bitwise masks and control flow, because it’s faster.

As an alternative to bitwise checks (like if ((x & 0xc0) == 0x80) for the continuation bytes) you may use sequences grammar defined in the RFC, but even in that case you should use masks and shifts to get scalar values.

Also there are some rules related to encoding operations, but it may be clearly inferred from the rules and grammars above.

@maximal
Copy link

maximal commented May 28, 2020

Need more explanation on:

NB: The [0xd800, 0xdfff] is subtracted from 3-byte sequences’s range to avoid Unicode surrogates in UTF-8. Surrogates (surrograte pairs) are used in UTF-16 and absolutely pointless in UTF-8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment