How does UTF8 work, anyway?
The following is the Black Female Astronaut emoji as encoded
in UTF8, shown in hex:

F0 9F 91 A9 F0 9F 8F BF E2 80 8D F0 9F 9A 80   byte value
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15   byte number

As you can tell, it's fifteen bytes. If you express the hex
digits in binary you can see how UTF8 encoding works, and
you can see it's made up of four characters (code points).

F0 9F 91 A9 F0 9F 8F BF E2 80 8D F0 9F 9A 80

F0  11110000   character 1
9F  10011111
91  10010001
A9  10101001

F0  11110000   character 2
9F  10011111
8F  10001111
BF  10111111

E2  11100010   character 3
80  10000000
8D  10001101

F0  11110000   character 4
9F  10011111
9A  10011010
80  10000000

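The same breakdown can be reproduced in a few lines of Python. This is only a sketch; the variable names (data, bits, text, code_points) are made up for illustration.

```python
# The fifteen bytes from the hex dump above.
data = bytes.fromhex("F0 9F 91 A9 F0 9F 8F BF E2 80 8D F0 9F 9A 80")

# Each byte written out as eight binary digits.
bits = [format(b, "08b") for b in data]
for b, s in zip(data, bits):
    print(f"{b:02X}  {s}")

# Python's built-in decoder splits the bytes into code points.
text = data.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(len(data), "bytes,", len(text), "code points:", code_points)
```

The last line should report 15 bytes and 4 code points.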
For every multi-byte UTF8 character, the number of leading
1-bits in the first byte tells you how many bytes the
character spans in total: 110 means two bytes, 1110 means
three, 11110 means four. Every following byte of that
character starts with the bits 10. The nul-byte and the 127
other characters of the original 7-bit ASCII set take up
one byte each.

All single-byte UTF8 characters have a 0 for the first bit.

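To make the rule concrete, here is a rough hand-rolled decoder in Python. The function names sequence_length and decode_code_points are invented for this sketch, and it assumes well-formed input: it skips the checks a real decoder needs (truncated input, stray continuation bytes, overlong forms). In practice you'd just call bytes.decode("utf-8").

```python
def sequence_length(first_byte: int) -> int:
    """Total bytes in the character that starts with first_byte."""
    if first_byte & 0b10000000 == 0:
        return 1  # 0xxxxxxx: single byte (ASCII)
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    raise ValueError(f"not a valid first byte: {first_byte:#04x}")


def decode_code_points(data: bytes) -> list:
    """Turn well-formed UTF-8 bytes into a list of code point numbers."""
    result = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 1:
            cp = data[i]
        else:
            # Keep the payload bits of the first byte: drop the n
            # leading 1-bits and the 0 that follows them.
            cp = data[i] & (0xFF >> (n + 1))
            # Each following byte contributes its low six bits.
            for b in data[i + 1 : i + n]:
                cp = (cp << 6) | (b & 0b00111111)
        result.append(cp)
        i += n
    return result


data = bytes.fromhex("F09F91A9F09F8FBFE2808DF09F9A80")
print([hex(cp) for cp in decode_code_points(data)])
# prints ['0x1f469', '0x1f3ff', '0x200d', '0x1f680']
```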
These are the fileformat.info pages for the four characters
shown above:

http://www.fileformat.info/info/unicode/char/1f469/index.htm
http://www.fileformat.info/info/unicode/char/1f3ff/index.htm
http://www.fileformat.info/info/unicode/char/200d/index.htm
http://www.fileformat.info/info/unicode/char/1f680/index.htm

That's woman, Fitzpatrick type-6 modifier*, zero-width
joiner, rocket.

* Fitzpatrick type, you say?
https://en.wikipedia.org/wiki/Fitzpatrick_scale

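If you'd rather not click through, Python's unicodedata module can look up the same four names. A small sketch; the variable names are just for illustration, and the names returned depend on the Unicode version your Python ships with (the skin tone modifiers need Unicode 8.0 or later).

```python
import unicodedata

# The four code points, written as escapes so nothing is hidden.
emoji = "\U0001F469\U0001F3FF\u200D\U0001F680"

names = [unicodedata.name(ch) for ch in emoji]
for ch, name in zip(emoji, names):
    print(f"U+{ord(ch):04X}  {name}")

# Encoding the string gives back the fifteen bytes from the top.
print(emoji.encode("utf-8").hex(" ").upper())
```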
👩🏿‍🚀