Perhaps the most prominent, if not the most important, property of any Unicode character is its name. These immutable labels are much more descriptive than their hexadecimal codes.
But these names aren’t used often in most programming languages. Why is that? Perhaps the biggest reason is the sheer size of the list of names: it weighs 1.769 MB, and embedding such a file in every language runtime would impose a large burden on resource-constrained environments such as mobile devices. But can this burden be mitigated with data compression? How compressible is this list? Can we create a succinct data structure from it?
Precedent for this project comes from GNU Uniname, a GNU Project utility written by Bill Poser of British Columbia, which can look up characters by name and names by character. Its source may be read in uniname.c and test-uninames.c. According to its documentation, the compiled program is less than 350 kB in size, which is a good benchmark goal for our project.
- Names-to-points lookup: We want to be able to retrieve the character corresponding to a given Unicode name.
- Points-to-names lookup: We want to retrieve the Unicode names corresponding to characters.
- No false positives: Looking up invalid names has to return a null value; it can’t return a valid code point.
- Named sequences: We want to be able to retrieve both code points and named character sequences.
- Spatial efficiency: We want our lookup data structure to be succinct: to occupy as little storage and memory as possible.
- Temporal efficiency: We want retrieval to be fast, although we care less about speed than about storage and memory.
- No sequential iteration: Sequential iteration over Unicode names – whether by code point or alphabetically by name – is anticipated to be much less common than lookup of individual characters or names.
- No dynamic updates: We will never have to update the data structure dynamically; it need only be rebuilt from scratch when a new version of Unicode is released, roughly once a year.
- No fast compilation: The building of the data structure happens before the program actually runs and so can be much slower than lookup operations.
The majority of Unicode names are strict Unicode character names, formally assigned to characters by the Name (na) property. There are also other Unicode names, including character name aliases and character-sequence names.
No matter their type, all Unicode names are permanent and immutable; each also denotes exactly one string (either a single code point or a sequence of them). They all also follow several simple orthographic rules, declared as clauses R1–R4 in Chapter 4 of The Unicode Standard:
- Names will always contain only (case-insensitive) Latin letters, ASCII number digits, spaces, and ASCII hyphens: that’s 38 total characters. (We’ll call chunks of names separated by spaces “words”. For instance, “Zero Width Space” consists of three words.)
- Words in names will never start with number digits.
- Whole names will never start or end with ASCII hyphens; also, they will never have double hyphens.
- The third rule also applies to spaces: names never begin or end with spaces, and never contain double spaces.
A corollary of these rules is that all whitespace is ignorable. Furthermore, all medial ASCII hyphens (where “medial” means “inside a word, not at its beginning or its end”) are ignorable, with one current exception: the Korean character U+1180 Hangul Jungseong O-E, whose name needs its hyphen to be distinguishable from U+116C Hangul Jungseong OE.
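Taken together, these rules suggest normalizing every name into a canonical key before any lookup. Below is a minimal sketch of such a normalizer; the function name and the exception constant are ours, not from the Standard:
// A sketch of name normalization under the rules above: case, whitespace, and
// medial hyphens are all ignorable, except for the hyphen in the name of
// U+1180 Hangul Jungseong O-E.
const hyphenExceptionKey = 'HANGULJUNGSEONGO-E';
function normalizeName (rawName) {
  const uppercased = rawName.toUpperCase();
  if (uppercased.replace(/ /g, '') === hyphenExceptionKey)
    return hyphenExceptionKey;
  return uppercased
    // First remove medial hyphens (those flanked by letters or digits)…
    .replace(/([A-Z0-9])-(?=[A-Z0-9])/g, '$1')
    // …then remove all spaces.
    .replace(/ /g, '');
}
// normalizeName('Zero Width Space') → 'ZEROWIDTHSPACE'
// normalizeName('Hangul Jungseong O-E') → 'HANGULJUNGSEONGO-E'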
Strict Unicode character names (the Name (na) property) follow the above restrictions, but they’re actually determined by the same chapter’s clauses NR1–NR4:
- Precomposed Korean/Hangul-syllable code points (U+AC00–D7A3) are always named “Hangul Syllable ” followed by the names of certain sounds called jamo. Korean/Hangul-syllable names are discussed more below.
- Most ideographs are named “CJK Unified Ideograph-”, “Tangut Ideograph-”, “Nushu Character-”, or “CJK Compatibility Ideograph-”, followed by the character’s hexadecimal code point. For example, U+4E00 is automatically named “CJK Unified Ideograph-4E00”. This rule applies to: U+3400–4DB5: “CJK Unified Ideograph-”, U+4E00–9FEA: “CJK Unified Ideograph-”, U+F900–FA6D: “CJK Compatibility Ideograph-”, U+FA70–FAD9: “CJK Compatibility Ideograph-”, U+17000–187EC: “Tangut Ideograph-”, U+1B170–1B2FB: “Nushu Character-”, U+20000–2A6D6: “CJK Unified Ideograph-”, U+2A700–2B734: “CJK Unified Ideograph-”, U+2B740–2B81D: “CJK Unified Ideograph-”, U+2B820–2CEA1: “CJK Unified Ideograph-”, U+2CEB0–2EBE0: “CJK Unified Ideograph-”, and U+2F800–2FA1D: “CJK Compatibility Ideograph-”.
- If a character has an entry in the Unicode Character Database’s UnicodeData.txt with a name in its second field, then that’s its name. UnicodeData.txt makes up the bulk of the data that we need to compress.
- If none of the rules above apply, then the character has no strict Unicode name. This applies to control characters, private-use characters, surrogate code points, noncharacters, and reserved characters. However, these characters do have code-point labels, and some of them may have character name aliases.
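Because the second rule is purely mechanical, the names it produces never need to be stored: they can be generated on demand. Here is a minimal sketch of that generation (the function name is ours; the prefixes and ranges are those listed above):
// Sketch of rule NR2: generate an ideographic name from its code point.
// Returns undefined for code points outside the listed ranges.
function generatedIdeographicName (codePoint) {
  const hex = codePoint.toString(16).toUpperCase();
  if ((0x3400 <= codePoint && codePoint <= 0x4db5)
      || (0x4e00 <= codePoint && codePoint <= 0x9fea)
      || (0x20000 <= codePoint && codePoint <= 0x2a6d6)
      || (0x2a700 <= codePoint && codePoint <= 0x2b734)
      || (0x2b740 <= codePoint && codePoint <= 0x2b81d)
      || (0x2b820 <= codePoint && codePoint <= 0x2cea1)
      || (0x2ceb0 <= codePoint && codePoint <= 0x2ebe0))
    return `CJK UNIFIED IDEOGRAPH-${hex}`;
  if ((0xf900 <= codePoint && codePoint <= 0xfa6d)
      || (0xfa70 <= codePoint && codePoint <= 0xfad9)
      || (0x2f800 <= codePoint && codePoint <= 0x2fa1d))
    return `CJK COMPATIBILITY IDEOGRAPH-${hex}`;
  if (0x17000 <= codePoint && codePoint <= 0x187ec)
    return `TANGUT IDEOGRAPH-${hex}`;
  if (0x1b170 <= codePoint && codePoint <= 0x1b2fb)
    return `NUSHU CHARACTER-${hex}`;
  return undefined;
}
// generatedIdeographicName(0x4e00) → 'CJK UNIFIED IDEOGRAPH-4E00'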
Strict Unicode character names therefore form an injective partial function from code points to names: some code points have no name, and no two code points share one.
The longest name is “Arabic Ligature Uighur Kirghiz Yeh with Hamza Above with Alef Maksura Isolated Form” (83 characters long). The shortest name is “Ox”.
The bulk of the names are in the Unicode Character Database’s UnicodeData.txt. It’s a 32292-line text file that looks like this:
0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;
0003;<control>;Cc;0;BN;;;;;N;END OF TEXT;;;;
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
0005;<control>;Cc;0;BN;;;;;N;ENQUIRY;;;;
0006;<control>;Cc;0;BN;;;;;N;ACKNOWLEDGE;;;;
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;
…and then this:
001E;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR TWO;;;;
001F;<control>;Cc;0;S;;;;;N;INFORMATION SEPARATOR ONE;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
…and then this:
004C;LATIN CAPITAL LETTER L;Lu;0;L;;;;;N;;;;006C;
004D;LATIN CAPITAL LETTER M;Lu;0;L;;;;;N;;;;006D;
004E;LATIN CAPITAL LETTER N;Lu;0;L;;;;;N;;;;006E;
004F;LATIN CAPITAL LETTER O;Lu;0;L;;;;;N;;;;006F;
0050;LATIN CAPITAL LETTER P;Lu;0;L;;;;;N;;;;0070;
0051;LATIN CAPITAL LETTER Q;Lu;0;L;;;;;N;;;;0071;
…and then later this:
090D;DEVANAGARI LETTER CANDRA E;Lo;0;L;;;;;N;;;;;
090E;DEVANAGARI LETTER SHORT E;Lo;0;L;;;;;N;;;;;
090F;DEVANAGARI LETTER E;Lo;0;L;;;;;N;;;;;
0910;DEVANAGARI LETTER AI;Lo;0;L;;;;;N;;;;;
0911;DEVANAGARI LETTER CANDRA O;Lo;0;L;;;;;N;;;;;
0912;DEVANAGARI LETTER SHORT O;Lo;0;L;;;;;N;;;;;
…then this:
2F9FD;CJK COMPATIBILITY IDEOGRAPH-2F9FD;Lo;0;L;29496;;;;N;;;;;
2F9FE;CJK COMPATIBILITY IDEOGRAPH-2F9FE;Lo;0;L;980B;;;;N;;;;;
2F9FF;CJK COMPATIBILITY IDEOGRAPH-2F9FF;Lo;0;L;980B;;;;N;;;;;
2FA00;CJK COMPATIBILITY IDEOGRAPH-2FA00;Lo;0;L;9829;;;;N;;;;;
2FA01;CJK COMPATIBILITY IDEOGRAPH-2FA01;Lo;0;L;295B6;;;;N;;;;;
2FA02;CJK COMPATIBILITY IDEOGRAPH-2FA02;Lo;0;L;98E2;;;;N;;;;;
2FA03;CJK COMPATIBILITY IDEOGRAPH-2FA03;Lo;0;L;4B33;;;;N;;;;;
…until getting to the end:
E01EC;VARIATION SELECTOR-253;Mn;0;NSM;;;;;N;;;;;
E01ED;VARIATION SELECTOR-254;Mn;0;NSM;;;;;N;;;;;
E01EE;VARIATION SELECTOR-255;Mn;0;NSM;;;;;N;;;;;
E01EF;VARIATION SELECTOR-256;Mn;0;NSM;;;;;N;;;;;
F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
FFFFD;<Plane 15 Private Use, Last>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
We could write a command-line utility in JavaScript using Node.js that uses this file to look up Unicode names. If we put this code into an executable file in the same directory as a UnicodeData.txt file, then running the file with the arguments get [name] will sequentially read UnicodeData.txt, searching each line for the given name, then print the character and its code point to the console.
#!/usr/bin/env node --no-warnings
const fs = require('fs');
const readline = require('readline');
const { promisify, inspect } = require('util');
// The console is passed as an injected dependency into all functions that need to use it
// in order to make testing the functions easier. It’s set as undefined here to prevent
// accidental direct use of the actual console rather than the injected parameter.
const console = undefined;
const dataSourceFile = 'UnicodeData.txt';
const helpText = `
Looks up the character with a given Unicode name.
Commands:
help Prints this help message.
get [name] Prints the code point(s) of the character with the given name,
followed by the character itself between two delimiters,
or “<none>” if no character with that name exists.
Note that you must surround the name argument with quotes if the
name contains spaces.
`;
const noDataSourceErrorText = `
🛑 Error: No UnicodeData.txt source file was found
in the script’s directory.`;
const unexpectedErrorText = `
🛑 Error: An unexpected internal error occurred:
`;
function isNoDataSourceError (error) {
return error.code === 'ENOENT' && error.path === dataSourceFile;
}
function getDelimiters (character) {
switch (character) {
case '“':
case '”':
return ['[', ']'];
default:
return ['“', '”'];
}
}
function logCharacterMessage (console, codePointHex, character, name) {
  const [openingDelimiter, closingDelimiter] = getDelimiters(character);
  console.log(
    `U+${codePointHex} ${name} ${openingDelimiter}${character}${closingDelimiter}`);
}
async function mainGet (console, targetName) {
  const normalizedTargetName = targetName.toUpperCase();
  const input = fs.createReadStream(dataSourceFile);
  const lineReader = readline.createInterface({input});
  const invalidNameOutput = '<none>';
  for await (const line of lineReader) {
    // UnicodeData.txt’s fields are separated by semicolons; the first field
    // is the code point and the second is the name.
    const [ codePointHex, name ] = line.split(';');
    if (name === normalizedTargetName) {
      // fromCodePoint, not fromCharCode: many names belong to astral characters.
      const character = String.fromCodePoint(parseInt(codePointHex, 16));
      logCharacterMessage(console, codePointHex, character, name);
      return;
    }
  }
  console.log(invalidNameOutput);
}
const globalConsole = global.console;
async function main (argv, {console = globalConsole} = {}) {
try {
const [argument0, argument1] = argv;
switch (argument0) {
case 'get':
await mainGet(console, argument1);
break;
case 'help':
case '--help':
case '-h':
default:
console.error(helpText);
}
} catch (error) {
if (isNoDataSourceError(error))
console.error(noDataSourceErrorText);
else
console.error(`${unexpectedErrorText}${inspect(error)}\n`);
}
}
if (require.main === module)
main(process.argv.slice(2));
Putting this code into an executable script – in the same directory as a copy of UnicodeData.txt – and then entering ./[script-name.js] get 'devanagari letter e' into a terminal will output:
U+090F DEVANAGARI LETTER E “ए”
However, UnicodeData.txt weighs 1.769 MB – in other words, 1769 kB. That is far too large for many resource-constrained computing environments such as mobile devices. But the file obviously has a lot of discardable non-name data and numerous recurring phrases. In addition, finding the character for a name could involve sequentially searching nearly the entirety of UnicodeData.txt in the worst case (for U+E01EF Variation Selector-256). How much can this situation be improved?
Let’s modify the script to generate a UnicodeData.gz file from UnicodeData.txt, then use UnicodeData.gz instead.
#!/usr/bin/env node --no-warnings
const fs = require('fs');
const readline = require('readline');
const zlib = require('zlib');
const { promisify, inspect } = require('util');
// The console is passed as an injected dependency into all functions that need to use it
// in order to make testing the functions easier. It’s set as undefined here to prevent
// accidental direct use of the actual console rather than the injected parameter.
const console = undefined;
const dataSourceFile = 'UnicodeData.txt';
const databaseFile = 'UnicodeData.gz';
const helpText = `
Looks up the character with a given Unicode name.
Commands:
help Prints this help message.
get [name] Prints the code point(s) of the character with the given name,
followed by the character itself between two delimiters,
or "<none>" if no character with that name exists.
Note that you must surround the name argument with quotes if the
name contains spaces.
build Builds the database of names into the script’s directory.
This requires a UnicodeData.txt file in the same directory as the script.
It processes its data and outputs it into a compressed format.
`;
const noDataSourceErrorText = `
🛑 Error: Attempted to build the database using UnicodeData.txt, but no
UnicodeData.txt source file was found.`;
const noDatabaseErrorText = `
🛑 Error: No names database was found.
The database needs to be built using the "build" command.
`;
const unexpectedErrorText = `
🛑 Error: An unexpected internal error occurred:
`;
const buildingStartText = `
Building Unicode-names database...`;
const buildingEndText = `Database building complete.
`;
// # Building
function isNoDataSourceError (error) {
return error.code === 'ENOENT' && error.path === dataSourceFile;
}
async function mainBuild (console) {
  console.log(buildingStartText);
  return new Promise((resolve, reject) => {
    const input = fs.createReadStream(dataSourceFile)
      .on('error', reject);
    const gzipTransform = zlib.createGzip();
    const output = fs.createWriteStream(databaseFile)
      .on('error', reject)
      .on('finish', () => {
        console.log(buildingEndText);
        resolve();
      });
    input.pipe(gzipTransform).pipe(output);
  });
}
// # Getting names’ characters
function isNoDatabaseError (error) {
return error.code === 'ENOENT' && error.path === databaseFile;
}
function getDelimiters (character) {
  switch (character) {
    case '“':
    case '”':
      return ['[', ']'];
    default:
      return ['“', '”'];
  }
}
function logCharacterMessage (console, codePointHex, character, name) {
const [openingDelimiter, closingDelimiter] = getDelimiters(character);
console.log(
`U+${codePointHex} ${name} ${openingDelimiter}${character}${closingDelimiter}`);
}
async function mainGet (console, targetName) {
  return new Promise(async (resolve, reject) => {
    try {
      const input = zlib.createUnzip();
      const fileInput = fs.createReadStream(databaseFile)
        .on('error', reject);
      fileInput.pipe(input);
      const lineReader = readline.createInterface({input});
      const normalizedTargetName = targetName.toUpperCase();
      const invalidNameOutput = '<none>';
      for await (const line of lineReader) {
        // The database is still in UnicodeData.txt’s format:
        // semicolon-separated fields, with the name in the second field.
        const [ codePointHex, name ] = line.split(';');
        if (name === normalizedTargetName) {
          // fromCodePoint, not fromCharCode: many names belong to astral characters.
          const character = String.fromCodePoint(parseInt(codePointHex, 16));
          logCharacterMessage(console, codePointHex, character, name);
          resolve(character);
          return;
        }
      }
      console.log(invalidNameOutput);
      resolve();
    } catch (error) {
      reject(error);
    }
  });
}
// # Main program
const globalConsole = global.console;
async function main (argv, {console = globalConsole} = {}) {
try {
const [argument0, argument1] = argv;
switch (argument0) {
case 'get':
await mainGet(console, argument1);
break;
case 'build':
await mainBuild(console);
break;
case 'help':
case '--help':
case '-h':
default:
console.error(helpText);
}
} catch (error) {
if (isNoDataSourceError(error))
console.error(noDataSourceErrorText);
else if (isNoDatabaseError(error))
console.error(noDatabaseErrorText);
else
console.error(`${unexpectedErrorText}${inspect(error)}\n`);
}
}
if (require.main === module)
main(process.argv.slice(2));
Running ./[script-name.js] build will create a UnicodeData.gz file from any UnicodeData.txt in the same directory. Then – as before – running ./[script-name.js] get 'devanagari letter e' will print that character and its code point.
Compressing UnicodeData.txt using gzip with LZ77 squeezes the file from 1769 kB to 266 kB: a dramatic 6.65× reduction. Brotli would do even better, squeezing it into 174 kB for a 10.2× reduction, but we are not using it here – we’re keeping the implementation simple, and Node.js includes a gzip implementation built in.
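For what it’s worth, newer versions of Node.js do expose Brotli through the same zlib module, so swapping it in later would be a small change. A sketch, assuming a Node.js version whose zlib includes Brotli support:
// Hypothetical variant of the build step, substituting Brotli for gzip.
const brotliTransform = zlib.createBrotliCompress();
input.pipe(brotliTransform).pipe(output);
// Reading it back would use zlib.createBrotliDecompress() instead of createUnzip().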
Let’s modify the script to generate a smaller data file by reading UnicodeData.txt, discarding its extraneous information, then writing the result into a smaller database.gz file.
We’ll add another import:
const stream = require('stream');
…tweak a filename constant:
const databaseFile = 'database.gz';
…create a new helper function readUnicodeData that reads UnicodeData.txt, discards everything but code points and character names, then puts their pairs into one big array:
async function readUnicodeData () {
  return new Promise(async (resolve, reject) => {
    try {
      const input = fs.createReadStream(dataSourceFile)
        .on('error', reject);
      const lineReader = readline.createInterface({input});
      const labelPattern = /^<.+>$/; // Like "<control>".
      const result = [];
      for await (const line of lineReader) {
        const [codePointHex, name] = line.split(';');
        // Parse the hexadecimal code point into a number, so that the
        // serializer (and, later, delta encoding) can do arithmetic on it.
        if (!labelPattern.test(name))
          result.push([parseInt(codePointHex, 16), name]);
      }
      resolve(result);
    } catch (error) {
      reject(error);
    }
  });
}
…and change mainBuild to put readUnicodeData’s array into a serializer transformation (createNameSerializer), whose output is then piped into a database.gz file:
function createNameSerializer (codePoint_name_pairs, console) {
  return new stream.Readable({
    read (size) {
      // Serialize each [codePoint, name] pair as a tab-separated line.
      for (const [codePoint, name] of codePoint_name_pairs) {
        const codePointHex = codePoint.toString(16).toUpperCase();
        const entry = `${codePointHex}\t${name}\n`;
        this.push(entry);
      }
      this.push(null);
    }
  });
}
async function mainBuild (console) {
console.log(buildingStartText);
const codePoint_name_pairs = await readUnicodeData();
const nameSerializer = createNameSerializer(codePoint_name_pairs, console);
const gzipTransform = zlib.createGzip();
const output = fs.createWriteStream(databaseFile);
await promisify(stream.pipeline)(nameSerializer, gzipTransform, output);
output.close();
console.log(buildingEndText);
}
…and then change mainGet
to handle the new format:
async function mainGet (console, targetName) {
return new Promise(async (resolve, reject) => {
try {
const input = zlib.createUnzip();
const fileInput = fs.createReadStream(databaseFile)
.on('error', reject);
fileInput.pipe(input);
const lineReader = readline.createInterface({input});
const normalizedTargetName = targetName.toUpperCase();
const invalidNameOutput = '<none>';
for await (const line of lineReader) {
const [ codePointHex, name ] = line.split('\t');
        // fromCodePoint, not fromCharCode: many names belong to astral characters.
        const character = String.fromCodePoint(parseInt(codePointHex, 16));
if (name === normalizedTargetName) {
logCharacterMessage(console, codePointHex, character, name);
resolve(character);
return;
}
}
console.log(invalidNameOutput);
resolve();
} catch (error) {
reject(error);
}
});
}
When unzipped, database.gz is a 32197-line file that looks like:
0020 SPACE
0021 EXCLAMATION MARK
0022 QUOTATION MARK
0023 NUMBER SIGN
0024 DOLLAR SIGN
0025 PERCENT SIGN
0026 AMPERSAND
0027 APOSTROPHE
…which dramatically reduces its unzipped weight from 1769 kB to 1030 kB: a 1.72× reduction. With gzip it decreases from 266 kB to 203 kB, and with Brotli from 174 kB to 138 kB.
UnicodeData.txt includes 1410 ideographic characters with generated names, because they still need entries for their other properties. These include Han compatibility ideographs and Nushu characters. Since their names can be regenerated on demand by the ideograph naming rule described above, we can discard them:
function isNonnameLabel (name) {
  const labelPattern = /^<.+>$/; // Like “<control>”.
  return labelPattern.test(name);
}
function hasGeneratedIdeographicName (codePoint) {
  return (0x3400 <= codePoint && codePoint <= 0x4db5)
    || (0x4e00 <= codePoint && codePoint <= 0x9fea)
    || (0xf900 <= codePoint && codePoint <= 0xfa6d)
    || (0xfa70 <= codePoint && codePoint <= 0xfad9)
    || (0x17000 <= codePoint && codePoint <= 0x187ec)
    || (0x1b170 <= codePoint && codePoint <= 0x1b2fb)
    || (0x20000 <= codePoint && codePoint <= 0x2a6d6)
    || (0x2a700 <= codePoint && codePoint <= 0x2b734)
    || (0x2b740 <= codePoint && codePoint <= 0x2b81d)
    || (0x2b820 <= codePoint && codePoint <= 0x2cea1)
    || (0x2ceb0 <= codePoint && codePoint <= 0x2ebe0)
    || (0x2f800 <= codePoint && codePoint <= 0x2fa1d);
}
function isSkippedCharacter (name, codePoint) {
return isNonnameLabel(name) || hasGeneratedIdeographicName(codePoint);
}
async function readUnicodeData () {
  return new Promise(async (resolve, reject) => {
    try {
      const input = fs.createReadStream(dataSourceFile)
        .on('error', reject);
      const lineReader = readline.createInterface({input});
      const entries = [];
      for await (const line of lineReader) {
        const [codePointHex, name] = line.split(';');
        const codePoint = parseInt(codePointHex, 16);
        // Keep the [codePoint, name] order that createNameSerializer expects.
        if (!isSkippedCharacter(name, codePoint))
          entries.push([codePoint, name]);
      }
      resolve(entries);
    } catch (error) {
      reject(error);
    }
  });
}
database.gz goes from 32197 to 30787 lines, which reduces its unzipped weight from 1030 kB to 980 kB: a 1.05× reduction. With gzip it becomes 199 kB and with Brotli 135 kB. These improvements are more modest. But we can still squeeze our data further.
Code points in the Basic Multilingual Plane (the BMP: U+0000 to U+FFFF) currently use four hexadecimal digits, and code points in the supplementary planes (U+10000 to U+10FFFF) use five or six. Yet, because the database is ordered by code point, the majority of its entries form contiguous (or at least mutually close) runs of code points. All those poorly compressible hexadecimal codes can therefore be converted into a long, very compressible run of 1 increments (for contiguous code points), with occasional larger increments. This can be done by changing createNameSerializer to:
function createNameSerializer (codePoint_name_pairs, console) {
return new stream.Readable({
read (size) {
let previousCodePoint = 0;
for (const [codePoint, name] of codePoint_name_pairs) {
const codePointDifference = codePoint - previousCodePoint;
const codePointDifferenceHex = codePointDifference.toString(16).toUpperCase();
const entry = `${codePointDifferenceHex}\t${name}\n`;
this.push(entry);
previousCodePoint = codePoint;
}
this.push(null);
}
});
}
…and the inner loop of mainGet
to:
let previousCodePoint = 0;
for await (const line of lineReader) {
  const [ codePointIncrementHex, name ] = line.split('\t');
  const codePointIncrement = parseInt(codePointIncrementHex, 16);
  const codePoint = previousCodePoint + codePointIncrement;
  // Re-derive the full hexadecimal code point from the running total.
  const codePointHex = codePoint.toString(16).toUpperCase();
  const character = String.fromCodePoint(codePoint);
  if (name === normalizedTargetName) {
    logCharacterMessage(console, codePointHex, character, name);
    resolve(character);
    return;
  }
  previousCodePoint = codePoint;
}
This changes database.gz to look like this 32197-line text file when uncompressed:
20 SPACE
1 EXCLAMATION MARK
1 QUOTATION MARK
1 NUMBER SIGN
1 DOLLAR SIGN
1 PERCENT SIGN
1 AMPERSAND
1 APOSTROPHE
1 LEFT PARENTHESIS
1 RIGHT PARENTHESIS
1 ASTERISK
…and later this:
1 CJK COMPATIBILITY IDEOGRAPH-2FA18
1 CJK COMPATIBILITY IDEOGRAPH-2FA19
1 CJK COMPATIBILITY IDEOGRAPH-2FA1A
1 CJK COMPATIBILITY IDEOGRAPH-2FA1B
1 CJK COMPATIBILITY IDEOGRAPH-2FA1C
1 CJK COMPATIBILITY IDEOGRAPH-2FA1D
B05E4 LANGUAGE TAG
1F TAG SPACE
1 TAG EXCLAMATION MARK
1 TAG QUOTATION MARK
1 TAG NUMBER SIGN
1 TAG DOLLAR SIGN
1 TAG PERCENT SIGN
1 TAG AMPERSAND
1 TAG APOSTROPHE
1 TAG LEFT PARENTHESIS
1 TAG RIGHT PARENTHESIS
This dramatically increases compression again. --- The gzipped file now weighs 134 kB, which is an additional 1.49× reduction. (The Brotli version would decrease further to 97 kB: a 1.39× reduction from the previous Brotli version.)
Currently, in the worst cases (U+E01EF Variation Selector-256 or nonexistent characters) our script might have to search the entire database before determining the answer. This is not desirable. We want to be able to look up the characters of given names and the names of given characters in nearly constant time, regardless of how many entries there are in the database.
Further complicating matters is our application of gzip or Brotli to the entire database. Because gzip and Brotli both output linear streams of data, compressing the whole database precludes random access: the only way to reach an arbitrary entry is to decompress everything before it.
We can fix this by splitting our database into several sections, each of which is individually compressed and each of which is individually addressable:
- An array of randomly accessible data blobs, which together form a binary search tree (BST), each blob representing one node in that BST. The BST is perfectly balanced, so a simple array of its nodes can act as an implicit representation of the tree. The array’s zeroth element is the tree’s root; its first and second elements are the root’s lesser and greater children; its third, fourth, fifth, and sixth elements are those children’s respective children; and so on.
  - There are two kinds of nodes: leaves and branches. All nodes start with header data, and the first bit of a node’s header data is a kind bit. The kind bit indicates whether the node is a leaf (0) or a branch (1).
  - A leaf node contains a header (made of a kind bit then an initial code point) followed by a Brotli-compressed sequence of names and code-point increments, interleaved with one another. The first name belongs to the initial code point, the next name belongs to the initial code point plus the following increment, and so on.
  - A branch node contains only a header (made of a kind bit then a threshold code point). The threshold code point determines whether a binary search will next select the node’s lesser child or its greater child. (A sketch of this search follows the list.)
- A names-to-BST-indices lookup table, which maps Unicode names to the indices of the BST nodes that contain those names’ characters.
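To make the implicit indexing concrete, here is a minimal sketch of the descent from the root to a leaf. The helpers readNodeHeader (decode a node’s kind bit and header code point from its blob) and searchLeaf (decompress a leaf and scan its interleaved entries) are hypothetical:
// Sketch of a code-point lookup over the implicit BST array.
// nodeBlobs is the array of compressed node blobs described above.
function findNameByCodePoint (nodeBlobs, codePoint) {
  let index = 0; // Start at the root, which is element 0.
  while (true) {
    const {kindBit, headerCodePoint} = readNodeHeader(nodeBlobs[index]);
    if (kindBit === 0) // A leaf: decompress it and scan its entries.
      return searchLeaf(nodeBlobs[index], codePoint);
    // A branch: descend to the lesser child (2·index + 1) or the greater
    // child (2·index + 2), depending on the threshold code point.
    index = codePoint < headerCodePoint ? 2 * index + 1 : 2 * index + 2;
  }
}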
The BST is perfectly balanced in order to support implicit indexing of its nodes. To construct this tree, we will repeatedly divide the 30787 non-skipped entries of UnicodeData.txt in two.
How many times should we bifurcate our nodes? We expect the compression to be less effective as the number of divisions increases. But we also expect the lookup speed to increase: a trade-off. However, it is difficult to predict precisely by how much the space will increase with each successive division. We will make the number of divisions configurable and then chart how it affects both spatial and temporal efficiency.
In addition, how should we bifurcate our nodes? The simplest way would be into equal halves, but this may not be the best way. The space of Unicode code points is divided into “blocks”, whose sizes are multiples of 16 code points (many are 0x80 (128) points). The Standard attempts to group similar characters into the same block, although for historical reasons this is not true of many characters. Similar characters are more likely to have similar names, and therefore splitting Unicode blocks between BST nodes may worsen their compression. So an alternative approach that we should test (sketched below) would be to bifurcate nodes into roughly equal halves, rounded to the nearest block boundary (that is, such that the point of splitting is a multiple of 0x80).
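A minimal sketch of that rounding, assuming the entries are [codePoint, name] pairs sorted by code point (the helper name is ours):
// Hypothetical helper: find the index at which to split a sorted run of
// [codePoint, name] entries, rounding the split near the middle to a
// block boundary (a multiple of 0x80).
function blockAlignedSplitIndex (entries) {
  const middleCodePoint = entries[Math.floor(entries.length / 2)][0];
  // Round the middle code point to the nearest multiple of 0x80…
  const boundary = Math.round(middleCodePoint / 0x80) * 0x80;
  // …then split before the first entry at or beyond that boundary.
  const index = entries.findIndex(([codePoint]) => codePoint >= boundary);
  return index < 0 ? entries.length : index;
}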
Hash tables are perhaps the most popular lookup structure today. In particular, minimal perfect hash functions might be great for lookup with minimal space. The information-theoretical minimum size of a minimal perfect hash function over n string keys, as n → ∞, is (n log2(e) − O(log2(n))) bits ≈ 1.44 bits · n. In our case, n = 32197, so our hash function would require approximately 6.54 kB of overhead (for a proof, see Botelho’s 2008 thesis from the Federal University of Minas Gerais in Belo Horizonte). That 6.54 kB would be overhead cost, in addition to the cost of storing 32197 · log2(32197 / 2 ** l) bits for the keys’ corresponding BST-node indices, where l is the number of bifurcations. For l = 0, 1, 2, and 3, the total is 66.8 kB, 62.8 kB, 58.8 kB, and 54.8 kB. (Consider if we stored code points directly, instead of BST-node indices. There are 0x110000 (1,114,112) code points, so it takes a minimum of log2(0x110000) ≈ 20.09 bits to represent one code point. Our overhead would then become 6.54 kB + 32197 · log2(0x110000) bits ≈ 83.38 kB, which is significantly larger.)
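As a sanity check, those totals can be reproduced in a few lines (the 6.54 kB overhead figure is taken as given rather than recomputed):
// Reproduce the storage estimates above (kB here means 1000 bytes).
const n = 32197; // Number of keys.
const overheadKB = 6.54; // Minimal-perfect-hash overhead, from the text.
for (const l of [0, 1, 2, 3]) {
  const indexBits = n * Math.log2(n / 2 ** l); // Bits for BST-node indices.
  console.log(`l = ${l}: ${(indexBits / 8000 + overheadKB).toFixed(1)} kB`);
}
// → l = 0: 66.8 kB; l = 1: 62.8 kB; l = 2: 58.8 kB; l = 3: 54.8 kB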
Hash functions come with an important disadvantage: false positives. Hashes are not necessarily unique; any efficient hash function will yield the same value for some pair of keys. (This is why passwords need to be salted with a secret value, and why secure cryptographic hash functions are so difficult to create.) A minimal perfect hash function that rejected invalid keys would somehow have to include copies of all of its keys anyway. Fortunately, we will already be keeping our keys in the BST. Furthermore, we can save much room in the hash table by encoding the indices of the BST leaves that contain the characters, rather than the characters’ code points directly, since we will have to retrieve the characters from the BST in any case.
Tries (also known as radix trees or prefix trees) were also considered as an alternative. However, the consensus was that, even when compressed, tries are necessarily much larger than hash tables. They have an advantage in substring matching, but we do not care about substring matching (see the goals above).
Both gzip and Brotli are sophisticated, complex formats, and in the browser their decompressors are third-party dependencies that would add large overhead to client-side scripts. gzip is built into Node.js, but it is not yet built into any DOM web API; the most popular browser implementation, ---, weighs --- kB when minified. Brotli is neither ---.
Can we compress the data to a level comparable to either compression method but with less overhead? ---
The algorithm for generating Korean/Hangul-syllable names is defined in the Standard’s third chapter, § 3.12, which also conveniently gives some sample Java code:
// These constants are defined in § 3.12 of the Standard,
// separately from its sample code:
static final int SBase = 0xAC00;
static final int LCount = 19, VCount = 21, TCount = 28;
static final int NCount = VCount * TCount; // 588
static final int SCount = LCount * NCount; // 11172

public static String getHangulName(char s) {
int SIndex = s - SBase;
if (0 > SIndex || SIndex >= SCount) {
throw new IllegalArgumentException("Not a Hangul Syllable: "
+ s);
}
int LIndex = SIndex / NCount;
int VIndex = (SIndex % NCount) / TCount;
int TIndex = SIndex % TCount;
return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex]
+ JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex];
}
static private String[] JAMO_L_TABLE = {
"G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
"S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
};
static private String[] JAMO_V_TABLE = {
"A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
"WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
"YU", "EU", "YI", "I"
};
static private String[] JAMO_T_TABLE = {
"", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
"LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
"S", "SS", "NG", "J", "C", "K", "T", "P", "H"
};
From this, a parser of these names in JavaScript can readily be derived, as sketched below. ---
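Here is a minimal sketch of such a parser (the function name is ours; the tables are the Java tables above, transcribed to JavaScript). It exploits the fact that the L and T alphabets contain only consonants, while the V alphabet contains only the letters A, E, I, O, U, W, and Y, so the three jamo segments can be split unambiguously:
const SBase = 0xac00, VCount = 21, TCount = 28;
const NCount = VCount * TCount; // 588
const JAMO_L_TABLE = [
  'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB',
  'S', 'SS', '', 'J', 'JJ', 'C', 'K', 'T', 'P', 'H' ];
const JAMO_V_TABLE = [
  'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O',
  'WA', 'WAE', 'OE', 'YO', 'U', 'WEO', 'WE', 'WI',
  'YU', 'EU', 'YI', 'I' ];
const JAMO_T_TABLE = [
  '', 'G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM',
  'LB', 'LS', 'LT', 'LP', 'LH', 'M', 'B', 'BS',
  'S', 'SS', 'NG', 'J', 'C', 'K', 'T', 'P', 'H' ];
// Hypothetical helper: parse a “HANGUL SYLLABLE …” name back into its code
// point. Returns undefined if the name is not a valid Hangul-syllable name.
function parseHangulSyllableName (name) {
  const prefix = 'HANGUL SYLLABLE ';
  if (!name.startsWith(prefix))
    return undefined;
  // Split the jamo sounds into leading consonants (L), a vowel run (V),
  // and trailing consonants (T).
  const match = name.slice(prefix.length)
    .match(/^([^AEIOUWY]*)([AEIOUWY]+)([^AEIOUWY]*)$/);
  if (!match)
    return undefined;
  const LIndex = JAMO_L_TABLE.indexOf(match[1]);
  const VIndex = JAMO_V_TABLE.indexOf(match[2]);
  const TIndex = JAMO_T_TABLE.indexOf(match[3]);
  if (LIndex < 0 || VIndex < 0 || TIndex < 0)
    return undefined;
  return SBase + (LIndex * NCount) + (VIndex * TCount) + TIndex;
}
// parseHangulSyllableName('HANGUL SYLLABLE GAG') → 0xac01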
Some Unicode names are aliases of characters, separate from their strict character names. Oftentimes this is because the strict character name contains a misspelling, like the infamous U+FE18 Presentation Form for Vertical Right White Lenticular Brakcet (sic); U+FE18 got an alias with the correct spelling “Bracket”. Other characters have other types of aliases: for widely used alternate names and for abbreviations.
A character may have zero, one, or many aliases: although strict character names form an injective function from names to characters, character name aliases form a non-injective one. ---
Special character sequences have also received Unicode names. These sequences are strings of multiple code points. ---
Many characters do not have any strict Unicode name. These include the control characters, private-use characters, surrogate code points, noncharacters, and reserved characters. The Standard therefore gives them labels that look like U+0009 <control-0009> and U+FFFF <noncharacter-FFFF>. Although these labels technically live in a separate namespace from the Unicode names, it would still be useful to refer to them in program source code, and it is very unlikely that any future Unicode name will collide with an already-valid code-point label. Code-point labels are also easily determined by several rules: ---
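A minimal sketch of those rules follows. The function name and the handling of reserved characters are ours: distinguishing “reserved” from other unnamed code points in general requires knowing which characters are unassigned, which itself requires the UCD.
// Sketch: derive the label type for a code point without a strict name.
function codePointLabelType (codePoint) {
  if ((0x0000 <= codePoint && codePoint <= 0x001f)
      || (0x007f <= codePoint && codePoint <= 0x009f))
    return 'control';
  if (0xd800 <= codePoint && codePoint <= 0xdfff)
    return 'surrogate';
  if ((0xe000 <= codePoint && codePoint <= 0xf8ff)
      || (0xf0000 <= codePoint && codePoint <= 0xffffd)
      || (0x100000 <= codePoint && codePoint <= 0x10fffd))
    return 'private-use';
  if ((0xfdd0 <= codePoint && codePoint <= 0xfdef)
      || (codePoint & 0xfffe) === 0xfffe) // U+XFFFE and U+XFFFF in every plane.
    return 'noncharacter';
  return 'reserved'; // Correct only for code points known to be unassigned.
}
// A label is then `<${type}-${hex}>` – for example, “<control-0009>”.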