Skip to content

Instantly share code, notes, and snippets.

@zyedidia
Last active April 11, 2023 08:04
Show Gist options
  • Save zyedidia/40477f3de0f6f2627d2bc48fe1e7e640 to your computer and use it in GitHub Desktop.
Save zyedidia/40477f3de0f6f2627d2bc48fe1e7e640 to your computer and use it in GitHub Desktop.

Text Encoding

How are text files encoded, and what are the pros and cons of different encodings? This document introduces the ASCII and Unicode encodings to answer these questions. Some short pieces of Go code are provided which you can run yourself to see how strings are being encoded. Go is used because it has built-in support for Unicode, but you should be able to follow without any knowledge of Go.

ASCII

ASCII is a character encoding which uses a 7-bit number to encode a single character. In ASCII, we associate each character with a unique number between 0 and 127. On modern machines, each character is stored as a byte where the most significant bit is always 0. Let's see it in action!

// Go uses UTF-8 for string encoding, but this string will be encoded in ASCII,
// trust me (and keep reading to learn why!).
hello := "Hello!"
fmt.Println([]byte(hello)) // Prints the string hello as an array of bytes
// Prints:
// [72 101 108 108 111 33]

We can deduce from this test program that the character H is encoded as 72, e is 101, etc...

For the full ASCII table showing every character and its corresponding encoding, see man ascii, or http://www.asciitable.com.

The ASCII encoding is very compact, with 1 byte per character, but it only has space for 128 different characters. This is enough for digits, the English alphabet (lowercase and uppercase), certain punctuation, and spacing/control characters. For representing languages other than English, this is not good enough.

Unicode

Enter Unicode, the one code to rule them all. Unicode is a standard for text encoding and representation. The Unicode standard defines a way to represent 1,112,064 possible characters. Of these, there are roughly 140,000 characters in use currently. The Unicode standard defines several different encodings: UTF-32, UTF-16, and UTF-8 (and some other ones we won't discuss). UTF-8 is the most widely used, but we'll talk about UTF-32 first.

Terminology

Things get confusing because it's no longer clear what a "character" means. In C, a char is 1 byte. In text, a character is a symbol, and not necessarily encoded as one byte. To make things clearer we will use the following terminology:

  • A "byte" is one byte (8 bits).
  • A "code point" or "rune" is one of the 1,112,064 possible symbols defined by the Unicode standard. To refer to a code point, we use the notation U+xxxxxx, where xxxxxx is a hexadecimal number referring to that code point's index in the overall list of code points.
  • A "grapheme" is one cell of text. For example, as we will see later certain symbols such as characters with accents can be one grapheme but multiple code points.

UTF-32

Since we only need to encode 1,112,064 possible code points, the simplest choice is to use a 32-bit integer to encode a code point. This is clearly large enough to encode any possible code point. In Go, a "rune" refers to a Unicode code point, and is exactly the same as the "int32" datatype, which means an array of runes is using a UTF-32 encoding.

Let's try it out:

symbols := "αβ"
fmt.Println([]rune(symbols)) // Prints symbols as an array of runes/code points
// Prints:
// [945 946]

Great, we can encode any Unicode code point now! Plus, it is a fixed width encoding, so determining the location of the nth code point in a string is simple. There are some downsides though:

  • Every symbol takes 4 bytes to encode, which means every file encoded in UTF-32 will be 4 times larger than the same file encoded in ASCII (if it's possible to encode the file in ASCII).
  • Any file that was encoded in ASCII before will have to be converted to UTF-32. Likewise, UTF-32 encoded files cannot be viewed by a program that can only decode ASCII (even if the UTF-32 file only contains code points that can be represented by ASCII).

UTF-8

With UTF-8, we will try to fix the two downsides incurred by UTF-32. UTF-8 is a variable-width encoding. This means that some code points are represented using anywhere between 1 and 4 bytes of space. Let's look at how this works (using a table from Wikipedia). The table shows how various ranges of code points are represented in the UTF-8 encoding.

Number of bytes First code point Last code point Byte 1 Byte 2 Byte 3 Byte 4
1 U+0000 U+007F 0xxxxxxx
2 U+0080 U+07FF 110xxxxx 10xxxxxx
3 U+0800 U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
4 U+10000 U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

What does this mean? To encode a code point from U+0000 to U+007F (0 to 127), we just use one byte. This is also extremely convenient because it is the same encoding as ASCII for the first 127 code points, so any ASCII file will also be a valid UTF-8 file.

For later code points, we use prefixes on the first byte to know how many bytes this code point will use for encoding. Since all prefixes uniquely identify the kind of byte being encoded (e.g. first byte of a 2-byte encoded code point, or first byte of a 3-byte encoded code point...) there is never ambiguity during decoding about how many bytes to look at. For code points with multi-byte encodings, each later byte begins with 10. This is to prevent problems where starting to read the text in the middle of a code point (because of an error) would completely throw off the decoding of later code points as well.

Let's see it in action:

symbols := "αβ" // in Go, strings are encoded in UTF-8 by default
fmt.Println([]byte(symbols))
// Prints:
// [206 177 206 178]

For the two symbols αβ, we see that bytes with values greater than 127 are used. For a string like "Hello", we would see one byte per code point, since all of the code points in that string are between U+0000 and U+007F.

This is great because it allows our files to be compact in the common case (most files use English characters), and it is backwards compatible with ASCII! The downside is that figuring out where the nth code point begins requires decoding until that code point is found. Finding the number of code points in a string is also an O(n) algorithm, even if the length in bytes of the string is known.

Graphemes

One difficulty in Unicode is encoding accents. Should we make a separate code point for every character with an accent? For example, should é, á, ´ a, e all have one unique code point? Surely they need to be encoded differently, but dedicating a codepoint for every possible combination of character and accent starts using up many code points. Unicode solves this with the concept of a "combining code point." When the code point for a is followed by the code point for ´, the accent is combined with the character by the program that is displaying the text. This is because the code point for the accent is defined to be a "combining" code point.

Let's look at the bytes in Go:

symbols := "X̿" // in Go, strings are encoded in UTF-8 by default
fmt.Println([]byte(symbols))
fmt.Println([]rune(symbols))
// Prints:
// [88 204 191]
// [88 831]

We convert to UTF-32 by casting to a rune array to look at each individual code point. Here's what we can determine: this string has two code points even though it displays as one cell, with the accent symbol above the symbol encoded before it, and the accent symbol's code point is encoded using two bytes. Cool!

Why did I suddenly pick X̿ for my example instead of á like I was talking about previously? This is because for á the story is more complicated: this accent is more common, so for space efficiency it actually does get a unique code point. This means there are two ways of encoding the character á: using an a and a combining ´, or using the dedicated code point á. Let's see it in Go:

combiningAccent := "á"
fmt.Println("Combining:")
fmt.Println([]byte(combiningAccent))
fmt.Println([]rune(combiningAccent))
// Prints:
// Combining:
// [97 204 129]
// [97 769]

dedicatedAccent := "á"
fmt.Println("Dedicated:")
fmt.Println([]byte(dedicatedAccent))
fmt.Println([]rune(dedicatedAccent))
// Prints:
// Dedicated:
// [195 161]
// [225]

Even though these two strings look exactly the same when displayed by your editor, they are encoded differently! To read more about this, look up "text normalization." If you try pasting this code into your editor, you might see different results if your computer is automatically changing the combined accent to the dedicated one. You can try directly printing the bytes from the combined accent above to see the combining accent in action:

b := []byte{97, 204, 129}
fmt.Println(string(b))
// Prints:
// á

More than one combining character can be applied to form a single grapheme. For example, here is an interesting string:

interesting := "H̼̥̯͇͙̕͘͞e̸̦̞̠̣̰͙̼̥̦̼̖̬͕͕̰̯̫͇̕ĺ̜̠̩̯̯͙̼̭̠͕̮̞͜l̶͓̫̞̮͈͞ͅo̸͔͙̳̠͈̮̼̳͙̥̲͜͠"
fmt.Printf("Number of bytes: %d\n", len(interesting))
fmt.Printf("Number of code points: %d\n", utf8.RuneCountInString(interesting))
fmt.Printf("Number of graphemes: %d\n", uniseg.GraphemeClusterCount(interesting))
// Prints:
// Number of bytes: 132
// Number of code points: 68
// Number of graphemes: 5

Wow! This string is 132 bytes, 68 code points, and only 5 graphemes! To count the number of graphemes, I used the uniseg library, which you can find at https://github.com/rivo/uniseg.

There are other ways characters can be marked to display differently from normal. Some characters are "double-width" which means they take up two text cells when displayed. For example, many chinese characters are double width: .

UTF-16

UTF-16 is an encoding similar to UTF-8 but it uses blocks of 16 bits instead of 8 bits. Back when Unicode only supported fewer than 65536 code points, this worked fairly well, but since then the standard has grown. This means it incurs most of the downsides of both UTF-32 and UTF-8. It still sees use as a legacy encoding format because it existed before UTF-8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment