Succinct Unicode character names

Perhaps the most prominent, if not the most important, property of any Unicode character is its name. These immutable labels are much more descriptive than their hexadecimal codes.

But these names aren’t used often in most programming languages. Why is that? Perhaps the biggest reason why is the sheer size of the list of names, weighing 1.769 MB; embedding such a file in every language runtime would impose a large burden on resource-constrained environments such as mobile devices. But can this burden be mitigated with data compression? How compressible is this list? Can we create a more-succinct data structure from this list?

Precedent for this project comes from GNU Uniname, a GNU Project utility, written by Bill Poser of British Columbia, which can look up the characters of names and the names of characters. Its source may be read in uniname.c and test-uninames.c. According to its documentation, the size of the compiled program is less than 350 kB, which is a good benchmark goal for our project.

Goals

  1. Names-to-points lookup: We want to be able to retrieve the character corresponding to a given Unicode name.
  2. Points-to-names lookup: We want to retrieve the Unicode names corresponding to characters.
  3. Named sequences: We want to be able to retrieve both code points and named character sequences.
  4. No false positives: Looking up invalid names has to return a null value; it can’t return a valid code point.
  5. Spatial efficiency: We want our lookup data structure to be succinct: to occupy as little storage and memory as possible; ideally, only a fraction of the entire structure would need to be loaded from storage into memory during lookup.
  6. Temporal efficiency: We want the speed of lookup to be fast, although we care less about speed than about storage and memory.
  7. Platform universality: We want our data to be usable from as many platforms as possible – which, for this document, means targeting JavaScript.
  8. No dynamic updates: We will never have to do dynamic updates to the data: the data structure may be rebuilt from scratch as rarely as once a year.
  9. No fast compilation: The building and compression of the data structure happen long before the structure is actually used, and so they may be much slower than decompression and lookup.

Name characteristics

The majority of Unicode names are strict Unicode character names; they are formally assigned to characters by the Name/na property, as declared by The Unicode Standard, Chapter 3’s clause D4. There are also other Unicode names, including character name aliases and character-sequence names. No matter their type, all Unicode names are permanent and immutable for ever and ever. Each name also denotes exactly one string (either a single code point or a sequence of them). Together, they form a single shared namespace, the “Unicode namespace for character names”.

All names also follow several simple orthographic rules, as declared in The Unicode Standard, Chapter 4’s clauses R1–R4:

  1. Names will always contain only (case-insensitive) Latin letters, Arabic digits, spaces, and ASCII hyphens: that’s 38 total characters. (We’ll call chunks of names separated by spaces “words”. For instance, “Zero Width Space” consists of three words.)
  2. Words in names will never start with digits.
  3. Whole names will never start or end with ASCII hyphens; also, they will never have double hyphens.
  4. Rule #3 also applies to spaces.

A corollary of these rules is that all whitespace is ignorable. Furthermore, all medial ASCII hyphens (where “medial” means “inside a word, not at its beginning or its end”) are ignorable, with one current exception: the Korean character U+1180 Hangul Jungseong O-E, whose name needs its hyphen to be distinguishable from U+116C Hangul Jungseong OE.
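
For illustration, here is a minimal sketch of a loose-matching normalizer that follows these rules (the function name is hypothetical, and U+1180 is special-cased per the exception above):

function normalizeName (name) {
  const uppercased = name.toUpperCase();
  // The one current exception: U+1180’s name keeps its medial hyphen,
  // so that it stays distinct from U+116C Hangul Jungseong OE.
  if (uppercased.trim().replace(/\s+/g, ' ') === 'HANGUL JUNGSEONG O-E')
    return uppercased.replace(/\s+/g, '');
  return uppercased
    .replace(/([A-Z0-9])-(?=[A-Z0-9])/g, '$1') // Drop medial hyphens.
    .replace(/\s+/g, '');                      // Drop all ignorable whitespace.
}

normalizeName('Zero Width Space'); // → 'ZEROWIDTHSPACE'

Note that hyphens at a word’s beginning or end (which are not medial) survive normalization, so names such as “Tibetan Letter -A” and “Tibetan Letter A” remain distinct.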

Strict Unicode character names (the Name/na property) follow the above restrictions, but they’re actually determined by the same chapter’s clauses NR1–NR4:

  1. Precomposed Korean/Hangul-syllable code points (U+AC00–D7A3) are always named “Hangul Syllable ” followed by a phrase denoting the sounds of their components, which in turn are known as jamo. Korean/Hangul-syllable names are discussed more below.

  2. Most ideographs are named “CJK Unified Ideograph-”, “Tangut Ideograph-”, “Nushu Character-”, or “CJK Compatibility Ideograph-”, followed by the character’s hexadecimal code point. For example, U+4E00 is automatically named “CJK Unified Ideograph-4E00”. (A code sketch of this rule appears below.) This rule applies to:
    U+3400–4DB5 “CJK Unified Ideograph-”,
    U+4E00–9FEA “CJK Unified Ideograph-”,
    U+F900–FA6D “CJK Compatibility Ideograph-”,
    U+FA70–FAD9 “CJK Compatibility Ideograph-”,
    U+17000–187EC “Tangut Ideograph-”,
    U+1B170–1B2FB “Nushu Character-”,
    U+20000–2A6D6 “CJK Unified Ideograph-”,
    U+2A700–2B734 “CJK Unified Ideograph-”,
    U+2B740–2B81D “CJK Unified Ideograph-”,
    U+2B820–2CEA1 “CJK Unified Ideograph-”,
    U+2CEB0–2EBE0 “CJK Unified Ideograph-”, and
    U+2F800–2FA1D “CJK Compatibility Ideograph-”.

  3. If a character has an entry in Unicode Character Database’s UnicodeData.txt with a name in the first field, then that’s its name. UnicodeData.txt makes up the bulk of the data that we need to compress.

  4. If none of the rules above apply, then the character doesn’t have a strict Unicode name. This applies to control characters, private-use characters, surrogate points, noncharacters, and reserved characters. However, these characters do have code-point labels, and some of these may have character name aliases.

Strict Unicode character names therefore form an injective, non-surjective function over all code points.
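
Rule #2 is mechanical enough to sketch directly in code. The following is a minimal illustration; the table and function names are hypothetical, with the ranges and prefixes transcribed from the list above:

const generatedNameRanges = [
  [0x3400, 0x4DB5, 'CJK UNIFIED IDEOGRAPH-'],
  [0x4E00, 0x9FEA, 'CJK UNIFIED IDEOGRAPH-'],
  [0xF900, 0xFA6D, 'CJK COMPATIBILITY IDEOGRAPH-'],
  [0xFA70, 0xFAD9, 'CJK COMPATIBILITY IDEOGRAPH-'],
  [0x17000, 0x187EC, 'TANGUT IDEOGRAPH-'],
  [0x1B170, 0x1B2FB, 'NUSHU CHARACTER-'],
  [0x20000, 0x2A6D6, 'CJK UNIFIED IDEOGRAPH-'],
  [0x2A700, 0x2B734, 'CJK UNIFIED IDEOGRAPH-'],
  [0x2B740, 0x2B81D, 'CJK UNIFIED IDEOGRAPH-'],
  [0x2B820, 0x2CEA1, 'CJK UNIFIED IDEOGRAPH-'],
  [0x2CEB0, 0x2EBE0, 'CJK UNIFIED IDEOGRAPH-'],
  [0x2F800, 0x2FA1D, 'CJK COMPATIBILITY IDEOGRAPH-'],
];

// Returns the generated name for an ideograph in the ranges above,
// or undefined for any other code point.
function getGeneratedIdeographicName (codePoint) {
  for (const [first, last, prefix] of generatedNameRanges)
    if (first <= codePoint && codePoint <= last)
      return prefix + codePoint.toString(16).toUpperCase();
}

getGeneratedIdeographicName(0x4E00); // → 'CJK UNIFIED IDEOGRAPH-4E00'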

The longest name is “Arabic Ligature Uighur Kirghiz Yeh with Hamza Above with Alef Maksura Isolated Form” (83 characters long). The shortest name is “Ox” (two characters).

UnicodeData.txt

The bulk of the names are in the Unicode Character Database’s UnicodeData.txt. It’s a 32292-line text file that looks like this:

0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;
0003;<control>;Cc;0;BN;;;;;N;END OF TEXT;;;;
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
0005;<control>;Cc;0;BN;;;;;N;ENQUIRY;;;;
0006;<control>;Cc;0;BN;;;;;N;ACKNOWLEDGE;;;;
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;

…and then this:

001E;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR TWO;;;;
001F;<control>;Cc;0;S;;;;;N;INFORMATION SEPARATOR ONE;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;

…and then this:

004C;LATIN CAPITAL LETTER L;Lu;0;L;;;;;N;;;;006C;
004D;LATIN CAPITAL LETTER M;Lu;0;L;;;;;N;;;;006D;
004E;LATIN CAPITAL LETTER N;Lu;0;L;;;;;N;;;;006E;
004F;LATIN CAPITAL LETTER O;Lu;0;L;;;;;N;;;;006F;
0050;LATIN CAPITAL LETTER P;Lu;0;L;;;;;N;;;;0070;
0051;LATIN CAPITAL LETTER Q;Lu;0;L;;;;;N;;;;0071;

…and then later this:

090D;DEVANAGARI LETTER CANDRA E;Lo;0;L;;;;;N;;;;;
090E;DEVANAGARI LETTER SHORT E;Lo;0;L;;;;;N;;;;;
090F;DEVANAGARI LETTER E;Lo;0;L;;;;;N;;;;;
0910;DEVANAGARI LETTER AI;Lo;0;L;;;;;N;;;;;
0911;DEVANAGARI LETTER CANDRA O;Lo;0;L;;;;;N;;;;;
0912;DEVANAGARI LETTER SHORT O;Lo;0;L;;;;;N;;;;;

…then this:

2F9FD;CJK COMPATIBILITY IDEOGRAPH-2F9FD;Lo;0;L;29496;;;;N;;;;;
2F9FE;CJK COMPATIBILITY IDEOGRAPH-2F9FE;Lo;0;L;980B;;;;N;;;;;
2F9FF;CJK COMPATIBILITY IDEOGRAPH-2F9FF;Lo;0;L;980B;;;;N;;;;;
2FA00;CJK COMPATIBILITY IDEOGRAPH-2FA00;Lo;0;L;9829;;;;N;;;;;
2FA01;CJK COMPATIBILITY IDEOGRAPH-2FA01;Lo;0;L;295B6;;;;N;;;;;
2FA02;CJK COMPATIBILITY IDEOGRAPH-2FA02;Lo;0;L;98E2;;;;N;;;;;
2FA03;CJK COMPATIBILITY IDEOGRAPH-2FA03;Lo;0;L;4B33;;;;N;;;;;

…until getting to the end:

E01EC;VARIATION SELECTOR-253;Mn;0;NSM;;;;;N;;;;;
E01ED;VARIATION SELECTOR-254;Mn;0;NSM;;;;;N;;;;;
E01EE;VARIATION SELECTOR-255;Mn;0;NSM;;;;;N;;;;;
E01EF;VARIATION SELECTOR-256;Mn;0;NSM;;;;;N;;;;;
F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
FFFFD;<Plane 15 Private Use, Last>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
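
Each line is a semicolon-delimited record of fifteen fields; only the first two – the hexadecimal code point and the character name – matter for our purposes. A quick sketch of extracting them:

const line = '0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;';
const [codePointHex, name] = line.split(';');
console.log(codePointHex, name); // → “0024 DOLLAR SIGN”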

We could write a command-line utility in JavaScript using Node.js that uses this file to look up Unicode names. If we put this code into an executable file called build.js, and if we put it in the same directory as a UnicodeData.txt file, then running ./build.js get [name] will sequentially read UnicodeData.txt, searching each line for the given [name], then print the character and its code point to the console.

#!/usr/bin/env node --no-warnings
const fs = require('fs');
const readline = require('readline');
const { promisify, inspect } = require('util');

// The console is passed as an injected dependency into all functions that need to use it
// in order to make testing the functions easier. It’s set as undefined here to prevent
// accidental direct use of the actual console rather than the injected parameter.
const console = undefined;

const dataSourcePath = 'UnicodeData.txt';
const fieldDelimiter = ';';

const helpText = `
Looks up the character with a given Unicode name.

Commands:
  help          Prints this help message.
  get [name]    Prints the code point(s) of the character with the given name,
                followed by the character itself between two delimiters,
                or “<none>” if no character with that name exists.
                Note that you must surround the name argument with quotes if the
                name contains spaces.
`;

const noDataSourceErrorText = `
🛑 Error: Attempted to build database using UnicodeData.txt, but no
UnicodeData.txt source file was found.`;

function getConsoleNameDelimiters (character) {
  switch (character) {
    case '“':
    case '”':
      return ['[', ']'];
    default:
      return ['“', '”'];
  }
}

function logCharacterMessage (console, codePointHex, character, name) {
  const [openingDelimiter, closingDelimiter] = getConsoleNameDelimiters(character);
  console.log(
    `U+${codePointHex} ${name} ${openingDelimiter}${character}${closingDelimiter}`);
}

async function searchNameBuffer (input, targetName) {
  const normalizedTargetName = targetName.toUpperCase();
  const lineReader = readline.createInterface({input});
  // Let a file-stream error (e.g., a missing UnicodeData.txt) reject the search.
  const errorPromise = new Promise((resolve, reject) => {
    input.on('error', reject);
  });
  const searchPromise = (async () => {
    for await (const line of lineReader) {
      const [ codePointHex, name ] = line.split(fieldDelimiter);
      if (name === normalizedTargetName) {
        const codePoint = parseInt(codePointHex, 16);
        // fromCodePoint, not fromCharCode: many names belong to characters
        // beyond the Basic Multilingual Plane.
        const character = String.fromCodePoint(codePoint);
        return { codePointHex, character, name };
      }
    }
  })();
  return Promise.race([searchPromise, errorPromise]);
}

function isNoDataSourceError (error) {
  return error.code === 'ENOENT' && error.path === dataSourcePath;
}

async function mainGet (console, targetName) {
  try {
    const fileInput = fs.createReadStream(dataSourcePath);
    const getResult = await searchNameBuffer(fileInput, targetName);

    if (getResult) {
      const { codePointHex, character, name } = getResult;
      logCharacterMessage(console, codePointHex, character, name);
    } else {
      const invalidNameOutput = '<none>';
      console.log(invalidNameOutput);
    }
  } catch (error) {
    if (isNoDataSourceError(error))
      console.error(noDataSourceErrorText);
    else
      console.error(`${unexpectedErrorText}${inspect(error)}\n`);
  }
}

const globalConsole = global.console;

const unexpectedErrorText = `
🛑 Error: An unexpected internal error occurred:
`;

async function main (argv, {console = globalConsole} = {}) {
  try {
    const [argument0, argument1] = argv;
    switch (argument0) {
      case 'get':
        await mainGet(console, argument1);
        break;
      case 'help':
      case '--help':
      case '-h':
      default:
        console.error(helpText);
    }
  } catch (error) {
    console.error(`${unexpectedErrorText}${inspect(error)}\n`);
  }
}

if (require.main === module)
  main(process.argv.slice(2));

Putting this script into an executable file called build.js – in the same directory as a copy of UnicodeData.txt – and then entering ./build.js get 'devanagari letter e' into a terminal will output:

U+090F DEVANAGARI LETTER E “ए”

However, UnicodeData.txt weighs 1.769 MB – in other words, 1769 kB. That is far too large for many resource-constrained computing environments such as many mobile devices. But this file obviously has a lot of discardable non-name data and numerous recurring phrases. In addition, finding the character for a name could involve sequentially searching nearly the entirety of UnicodeData.txt in the worst case (for U+E01EF Variation Selector-256). How much can this situation be improved?

Compressing data with gzip and Brotli

Let’s modify the build.js script to generate an index.txt.gz file from UnicodeData.txt, then use index.txt.gz instead.

#!/usr/bin/env node --no-warnings
const fs = require('fs');
const readline = require('readline');
const zlib = require('zlib');
const stream = require('stream');
const { promisify, inspect } = require('util');

// The console is passed as an injected dependency into all functions that need to use it
// in order to make testing the functions easier. It’s set as undefined here to prevent
// accidental direct use of the actual console rather than the injected parameter.
const console = undefined;

const dataSourcePath = 'UnicodeData.txt';
const databaseFile = 'index.txt.gz';
const fieldDelimiter = ';';

const helpText = `
Looks up the character with a given Unicode name.

Commands:
  help          Prints this help message.
  get [name]    Prints the code point(s) of the character with the given name,
                followed by the character itself between two delimiters,
                or “<none>” if no character with that name exists.
                Note that you must surround the name argument with quotes if the
                name contains spaces.
  build         Builds the database of names into the script’s directory.
                This requires a UnicodeData.txt file in the same directory as the script.
                It processes its data and outputs it into a compressed format.
`;

const noDataSourceErrorText = `
🛑 Error: Attempted to build database using UnicodeData.txt, but no
UnicodeData.txt source file was found.`;

const noDatabaseErrorText = `
🛑 Error: No names database was found.
The database needs to be built using the “build” command.
`;

const unexpectedErrorText = `
🛑 Error: An unexpected internal error occurred:
`;

const buildingStartText = `
Building Unicode-names database...`;

const buildingEndText = `Database building complete.
`;

// # Building

function isNoDataSourceError (error) {
  return error.code === 'ENOENT' && error.path === dataSourcePath;
}

async function mainBuild (console) {
  console.log(buildingStartText);
  const output = fs.createWriteStream(databaseFile);
  const level = zlib.constants.Z_BEST_COMPRESSION;
  const gzipTransform = zlib.createGzip({ level });
  const input = fs.createReadStream(dataSourcePath);
  await promisify(stream.pipeline)(input, gzipTransform, output);
  output.close();
  console.log(buildingEndText);
}

// # Getting names’ characters

function getConsoleNameDelimiters (character) {
  switch (character) {
    case '“':
    case '”':
      return ['[', ']'];
    default:
      return ['“', '”'];
  }
}

function logCharacterMessage (console, codePointHex, character, name) {
  const [openingDelimiter, closingDelimiter] = getConsoleNameDelimiters(character);
  console.log(
    `U+${codePointHex} ${name} ${openingDelimiter}${character}${closingDelimiter}`);
}

async function searchNameLines (lineAsyncIterable, targetName) {
  const normalizedTargetName = targetName.toUpperCase();
  for await (const line of lineAsyncIterable) {
    const [ codePointHex, name ] = line.split(fieldDelimiter);
    if (name === normalizedTargetName) {
      const codePoint = parseInt(codePointHex, 16);
      // fromCodePoint, not fromCharCode: many names belong to characters
      // beyond the Basic Multilingual Plane.
      const character = String.fromCodePoint(codePoint);
      return { codePointHex, character, name };
    }
  }
}

async function searchNameBuffer (fileInput, targetName) {
  const unzipTransform = zlib.createUnzip();
  const lineReader = readline.createInterface({ input: unzipTransform });
  // Race the line-by-line search against the decompression pipeline: a match
  // resolves early, and a file or stream error rejects the whole lookup.
  const searchingPromise = searchNameLines(lineReader, targetName);
  const pipelinePromise = promisify(stream.pipeline)(fileInput, unzipTransform);
  return Promise.race([searchingPromise, pipelinePromise]);
}

function isNoDatabaseError (error) {
  return error.code === 'ENOENT' && error.path === databaseFile;
}

async function mainGet (console, targetName) {
  try {
    const fileInput = fs.createReadStream(databaseFile);
    const searchingResult = await searchNameBuffer(fileInput, targetName);
  
    if (searchingResult) {
      const { codePointHex, character, name } = searchingResult;
      logCharacterMessage(console, codePointHex, character, name);
    } else {
      const invalidNameOutput = '<none>';
      console.log(invalidNameOutput);
    }
  } catch (error) {
    if (isNoDatabaseError(error))
      console.error(noDatabaseErrorText);
    else
      throw error;
  }
}

// # Main program

const globalConsole = global.console;

async function main (argv, {console = globalConsole} = {}) {
  try {
    const [argument0, argument1] = argv;
    switch (argument0) {
      case 'get':
        await mainGet(console, argument1);
        break;
      case 'build':
        await mainBuild(console);
        break;
      case 'help':
      case '--help':
      case '-h':
      default:
        console.error(helpText);
    }
  } catch (error) {
    console.error(`${unexpectedErrorText}${inspect(error)}\n`);
  }
}

if (require.main === module)
  main(process.argv.slice(2));

Running ./build.js build will create an index.txt.gz file from any UnicodeData.txt in the same directory. Then, as before, running ./build.js get 'devanagari letter e' will print that character and its code point.

Compressing UnicodeData.txt with gzip’s DEFLATE (LZ77-based) compression at the best-compression level squeezes the file from 1769 kB to 253 kB, making it a dramatic 86% smaller. Brotli would do even better, squeezing it to 174 kB (90% smaller), but we are not using it here – we’re keeping the implementation simple, and Node.js includes a gzip implementation built in.
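
For reference, both figures are easy to reproduce: Node.js’s zlib has also shipped Brotli bindings since version 11.7. A rough measurement sketch follows; the quality settings here are assumptions for illustration, not necessarily what produced the numbers above.

const fs = require('fs');
const zlib = require('zlib');

const data = fs.readFileSync('UnicodeData.txt');
const gzipped = zlib.gzipSync(data, { level: zlib.constants.Z_BEST_COMPRESSION });
const brotlied = zlib.brotliCompressSync(data, {
  params: { [zlib.constants.BROTLI_PARAM_QUALITY]: zlib.constants.BROTLI_MAX_QUALITY },
});
console.log(`raw: ${data.length} B; gzip: ${gzipped.length} B; brotli: ${brotlied.length} B`);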

Discarding non-name fields

Let’s modify build.js to generate a smaller data file: it will read UnicodeData.txt, discard the extraneous information, then put what remains into a smaller index.txt.gz file. At this stage, this means discarding every field other than the code point and the Name property’s field.

We’ll create some new helper functions – in particular, readUnicodeData, which reads UnicodeData.txt, discards everything but code points and character names, then collects their pairs into one big Map:

const labelPattern = /^<.+>$/; // Like “<control>”.

async function readUnicodeDataLines (lineAsyncIterator) {
  const dataArray = [];
  for await (const line of lineAsyncIterator) {
    const [codePointHex, name] = line.split(fieldDelimiter);
    const codePoint = parseInt(codePointHex, 16);
    if (!name.match(labelPattern))
      dataArray.push([codePoint, name]);
  }
  return new Map(dataArray);
}

async function readUnicodeData () {
  const input = fs.createReadStream(dataSourcePath);
  const lineReader = readline.createInterface({input});
  const dataArrayPromise = readUnicodeDataLines(lineReader);
  const inputFinishedPromise = promisify(stream.finished)(input);
  await Promise.all([dataArrayPromise, inputFinishedPromise]);
  return dataArrayPromise;
}

…and a function createNameDataSerializer, which creates a readable stream that serializes readUnicodeData’s pairs:

function createNameDataSerializer (codePoint_name_map, console) {
  return new stream.Readable({
    read (size) {
      for (const [codePoint, name] of codePoint_name_map) {
        const codePointHex = codePoint.toString(16).toUpperCase();
        const entry = `${codePointHex}${fieldDelimiter}${name}\n`;
        this.push(entry);
      }
      this.push(null);
    }
  });
}

Then we change mainBuild to put readUnicodeData’s pairs into the serializer stream, whose output is piped into a gzip transform stream, which in turn pipes into the index.txt.gz file:

async function mainBuild (console) {
  try {
    console.log(buildingStartText);
    const codePoint_name_map = await readUnicodeData();
    const input = createNameDataSerializer(codePoint_name_map, console);
    const level = zlib.constants.Z_BEST_COMPRESSION;
    const gzipTransform = zlib.createGzip({ level });
    const output = fs.createWriteStream(databaseFile);
    await promisify(stream.pipeline)(input, gzipTransform, output);
    output.close();
    console.log(buildingEndText);
  } catch (error) {
    if (isNoDataSourceError(error))
      console.error(noDataSourceErrorText);
    else
      throw error;
  }
}

When decompressed, index.txt.gz is a 32197-line file that looks like:

0020;SPACE
0021;EXCLAMATION MARK
0022;QUOTATION MARK
0023;NUMBER SIGN
0024;DOLLAR SIGN
0025;PERCENT SIGN
0026;AMPERSAND
0027;APOSTROPHE

…which dramatically reduces its decompressed weight from 1769 kB to 1030 kB: 42% smaller. The gzipped weight decreases from 266 kB to 200 kB (25% smaller), and the Brotli weight from 174 kB to 138 kB (21% smaller).

Discarding padding zeros

Most code points in UnicodeData.txt lead with at least one zero digit. For instance, U+0020 Space has two leading zeroes. This is because all code-point hex codes are padded to four digits if they are in the Basic Multilingual Plane (the BMP: U+0000 to U+FFFF) or to six digits if they are in any of the Supplemental Multilingual Planes (the SMPs: U+010000 to U+10FFFF). We can decrease our data slightly further by removing this redundant padding. We change our build functions as follows:

const labelPattern = /^<.+>$/; // Like “<control>”.

async function readUnicodeDataLines (lineAsyncIterator) {
  const dataArray = [];
  for await (const line of lineAsyncIterator) {
    const [codePointHex, name] = line.split(fieldDelimiter);
    const codePoint = parseInt(codePointHex, 16);
    if (!name.match(labelPattern))
      dataArray.push([codePoint, name]);
  }
  return new Map(dataArray);
}

async function readUnicodeData () {
  const input = fs.createReadStream(dataSourcePath);
  const lineReader = readline.createInterface({input});
  const dataArrayPromise = readUnicodeDataLines(lineReader);
  const inputFinishedPromise = promisify(stream.finished)(input);
  await Promise.all([dataArrayPromise, inputFinishedPromise]);
  return dataArrayPromise;
}

function createNameDataSerializer (codePoint_name_map, console) {
  return new stream.Readable({
    read (size) {
      for (const [codePoint, name] of codePoint_name_map) {
        // toString(16) emits unpadded hexadecimal, so the source file’s
        // leading zeroes are dropped here.
        const rawCodePointHex = codePoint.toString(16).toUpperCase();
        const entry = `${rawCodePointHex}${fieldDelimiter}${name}\n`;
        this.push(entry);
      }
      }
      this.push(null);
    }
  });
}

…and create a helper function that our retrieval functions will use:

function codePointToHex (codePoint) {
  // Pad to four hex digits in the BMP and to six in the supplementary
  // planes, restoring the padding that build.js no longer stores.
  const unpaddedHex = codePoint.toString(16).toUpperCase();
  return unpaddedHex.padStart(codePoint > 0xFFFF ? 6 : 4, '0');
}
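
For instance, under this padding convention:

console.log(codePointToHex(0x20));     // → "0020"
console.log(codePointToHex(0x1F600)); // → "01F600"
console.log(codePointToHex(0x10FFFF)); // → "10FFFF"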

function logCharacterMessage (console, codePoint, character, name) {
  const [openingDelimiter, closingDelimiter] = getConsoleNameDelimiters(character);
  const codePointHex = codePointToHex(codePoint);
  console.log(
    `U+${codePointHex} ${name} ${openingDelimiter}${character}${closingDelimiter}`);
}

async function searchNameLines (lineAsyncIterable, targetName) {
  const normalizedTargetName = targetName.toUpperCase();
  for await (const line of lineAsyncIterable) {
    const [ rawCodePointHex, name ] = line.split(fieldDelimiter);
    if (name === normalizedTargetName) {
      const codePoint = parseInt(rawCodePointHex, 16);
      // String.fromCodePoint (not fromCharCode) handles supplementary planes.
      const character = String.fromCodePoint(codePoint);
      return { codePoint, character, name };
    }
  }
}

Unfortunately, removing the padding zeroes barely changes the weight of the file: by less than 1%. However, parsing the hex codes into integers will prove useful for our next optimization.

Discarding generated ideograph names

UnicodeData.txt includes 1410 ideographic characters with generated names, listed only for their other properties. These include Han compatibility ideographs and Nushu characters. We can discard these by changing build.js to use the third rule from name characteristics:

const labelPattern = /^<.+>$/; // Like "<control>".

function hasGeneratedIdeographicName (codePoint) {
  return (0x3400 <= codePoint && codePoint <= 0x4db5)   // CJK Extension A
  || (0x4e00 <= codePoint && codePoint <= 0x9fea)       // CJK Unified Ideographs
  || (0xf900 <= codePoint && codePoint <= 0xfa6d)       // CJK Compatibility
  || (0xfa70 <= codePoint && codePoint <= 0xfad9)       // CJK Compatibility
  || (0x17000 <= codePoint && codePoint <= 0x187ec)     // Tangut
  || (0x1b170 <= codePoint && codePoint <= 0x1b2fb)     // Nushu
  || (0x20000 <= codePoint && codePoint <= 0x2a6d6)     // CJK Extension B
  || (0x2a700 <= codePoint && codePoint <= 0x2b734)     // CJK Extension C
  || (0x2b740 <= codePoint && codePoint <= 0x2b81d)     // CJK Extension D
  || (0x2b820 <= codePoint && codePoint <= 0x2cea1)     // CJK Extension E
  || (0x2ceb0 <= codePoint && codePoint <= 0x2ebe0)     // CJK Extension F
  || (0x2f800 <= codePoint && codePoint <= 0x2fa1d);    // CJK Compat. Supplement
}

function isSkippedCharacter (name, codePoint) {
  return name.match(labelPattern) || hasGeneratedIdeographicName(codePoint);
}

async function readUnicodeDataLines (lineAsyncIterator) {
  const dataArray = [];
  for await (const line of lineAsyncIterator) {
    const [codePointHex, name] = line.split(fieldDelimiter);
    const codePoint = parseInt(codePointHex, 16);
    if (!isSkippedCharacter(name, codePoint))
      dataArray.push([codePoint, name]);
  }
  return new Map(dataArray);
}

index.txt.gz shrinks from 32197 to 30787 lines, which reduces the decompressed text from 1030 kB to 980 kB: a 5% improvement. With gzip this shrinks from 200 kB to 192 kB (4%), and with Brotli from 138 kB to 135 kB (2%). These improvements are modest, but we can still squeeze our data further.

Compressing code points into increments

Code points in the Basic Multilingual Plane (the BMP: U+0000 to U+FFFF) currently use up to four hexadecimal digits, and code points in the supplementary planes (U+010000 to U+10FFFF) use five or six. Yet, because the entries are sorted by code point, the majority of the data in index.txt.gz consists of contiguous (or, at least, mutually close) blocks of code points. All those poorly compressible hexadecimal codes can be converted into a long, highly compressible run of increments of 1 (for contiguous code points), interspersed with occasional larger increments. This is called delta encoding, and it can be done by changing build.js’s createNameDataSerializer to:

function createNameDataSerializer (codePoint_name_map, console) {
  return new stream.Readable({
    read (size) {
      let previousCodePoint = 0;
      for (const [codePoint, name] of codePoint_name_map) {
        const codePointDifference = codePoint - previousCodePoint;
        const codePointDifferenceHex = codePointDifference.toString(16).toUpperCase();
        const entry = `${codePointDifferenceHex}${fieldDelimiter}${name}\n`;
        this.push(entry);
        previousCodePoint = codePoint;
      }
      this.push(null);
    }
  });
}

…and searchNameLines to:

async function searchNameLines (lineAsyncIterable, targetName) {
  const normalizedTargetName = targetName.toUpperCase();
  let previousCodePoint = 0;
  for await (const line of lineAsyncIterable) {
    const [ codePointIncrementHex, name ] = line.split(fieldDelimiter);
    const codePointIncrement = parseInt(codePointIncrementHex, 16);
    const codePoint = previousCodePoint + codePointIncrement;
    if (name === normalizedTargetName) {
      const character = String.fromCodePoint(codePoint);
      return { codePoint, character, name };
    }
    previousCodePoint = codePoint; // Accumulate increments into code points.
  }
}
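
For instance, a retrieval routine might decompress index.txt.gz on the fly and feed its lines to searchNameLines. This driver is a hypothetical sketch (fs, zlib, readline, and databaseFile as already defined in build.js):

async function findCharacterByName (targetName) {
  // Stream-decompress the database and scan its lines sequentially.
  const input = fs.createReadStream(databaseFile).pipe(zlib.createGunzip());
  const lineReader = readline.createInterface({ input });
  return searchNameLines(lineReader, targetName);
}

// findCharacterByName('space').then(console.log);
// // → { codePoint: 32, character: ' ', name: 'SPACE' }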

When decompressed, index.txt.gz now looks like this 30787-line text file:

20;SPACE
1;EXCLAMATION MARK
1;QUOTATION MARK
1;NUMBER SIGN
1;DOLLAR SIGN
1;PERCENT SIGN
1;AMPERSAND
1;APOSTROPHE
1;LEFT PARENTHESIS
1;RIGHT PARENTHESIS
1;ASTERISK

…and later this:

1;XIANGQI BLACK CANNON
1;XIANGQI BLACK SOLDIER
C0594;LANGUAGE TAG
1F;TAG SPACE
1;TAG EXCLAMATION MARK
1;TAG QUOTATION MARK
1;TAG NUMBER SIGN
1;TAG DOLLAR SIGN
1;TAG PERCENT SIGN
1;TAG AMPERSAND
1;TAG APOSTROPHE
1;TAG LEFT PARENTHESIS
1;TAG RIGHT PARENTHESIS

This dramatically improves compression again. The uncompressed text decreases from 980 kB to 877 kB: an 11% reduction. But the gzipped file decreases from 192 kB to 126 kB, a much greater 34% reduction, demonstrating how much compressibility can improve when data become more repetitive. (The Brotli version decreases from 135 kB to 97 kB: a 28% reduction.)

Enabling random access

Currently, in the worst cases (U+E01EF Variation Selector-256 and nonexistent character names), our script might have to search the entire database before determining the answer. Performance is actually not too poor in practice, but it is undesirable. We want to be able to look up the characters of given names and the names of given characters in nearly constant time, regardless of how many entries the database contains.

Further complicating matters is our application of gzip or Brotli to the entire database. This unfortunately precludes random access, because gzip and Brotli both output linear streams of uncompressed data, which would either have to be sequentially searched or loaded completely into memory (which, as seen earlier in this article, would weigh more than 1 MB). But we can improve this situation.

Baseline

In order to ---

Creating a perfectly balanced binary-search tree

We can limit sequential search by splitting our database into several sections, each of which is individually compressed and each of which is individually addressable:

  • An array of randomly accessible data blobs, which together form a binary search tree (BST) and each of which represents one node in that BST. The BST is perfectly balanced, so a simple array of its nodes can act as an implicit representation of the tree. The array’s #0 element is the tree’s root; its #1 and #2 elements are the root’s lesser and greater children; its #3, #4, #5, and #6 elements are those children’s respective children; and so on. That is, perfectly balanced BSTs have the following nodes at each level:

    Level 0: Nothing begets #0.
    Level 1: #0 begets #1 and #2.
    Level 2: #1 begets #3 and #4; #2 begets #5 and #6.
    Level 3: #3 begets #7 and #8; #4 begets #9 and #10; #5 begets #11 and #12; #6 begets #13 and #14.
    And so on.

    For any node at index 𝘪 in any perfectly balanced BST: if the node has any children, then its two children are at indices 2𝘪 + 1 and 2𝘪 + 2 – and, if it has a parent, then its parent is at index ⌊½(𝘪 − 1)⌋. For instance, node #3’s two children are #7 and #8, and its parent is #1. (These relationships are sketched in code after this list.)

    In addition, for any level 𝘭 (where 𝘭 is 0 at the root node) within the perfectly balanced BST, the first node of that level is at index 2^𝘭 − 1; there are 2^𝘭 nodes at level 𝘭, and there are 2^(𝘭 + 1) − 1 total nodes between the root node and level 𝘭. For instance, the first node at level 3 is #7, and there are fifteen total nodes from level 0 to level 3.

    A perfectly balanced BST with 𝘯 nodes therefore has a maximum level of log₂(𝘯 + 1) − 1, and 𝘯 must always be one less than a power of two. For instance, a perfectly balanced BST with fifteen nodes has a maximum level of three.

    • The perfectly balanced BST has two kinds of nodes: leaves and branches. Leaves have no children; branches have two children each. All nodes on the BST’s final level are leaves; all other nodes before the leaves are branches.

    Therefore, given the number of nodes 𝘯 in a perfectly balanced BST, all nodes from indices ½(𝘯 − 1) to 𝘯 − 1 are its leaves (half of all the BST’s nodes, rounded up), and all other nodes (the other half, rounded down) are branches. Equivalently, given its maximum level 𝘭, then all nodes from indices 2^𝘭 − 1 to 2(2^𝘭 − 1) are its leaves. For instance, a perfectly balanced BST with fifteen nodes (that is, a maximum level of three) has eight leaf nodes (from #7 to #14), and it has seven branches (from #0 to #6).

    • A leaf node contains a header (which is an initial code point encoded using the method below) followed by the body: a Brotli-compressed sequence of names and code-point increments, interleaved with one another.

    The initial code point bears the first name in the compressed body.
    The initial code point plus the first increment bears the second name.
    That code point plus the second increment bears the third name.
    And so on.

    • A branch node contains only a header (which is a threshold code point encoded using the method below). The threshold code point determines whether a binary search will next select the node’s lesser child or its greater child. (If we eventually create our own compression method, we can put here dictionaries of phrases that recur across multiple descendant nodes of the branch node.)

    • Code points are encoded as differences from the code point of their parent node (or, for the root node, from 0), divided by 16. Each difference is encoded as one big-endian signed 16-bit integer. This 16-bits-of-overhead-per-node encoding scheme saves much more room than encoding arbitrary code points directly, which would require 32 bits of overhead per node to remain byte aligned. Consequences of this encoding scheme include:

      • All nodes’ threshold/initial code points must be multiples of 10₁₆ (sixteen).
      • No leaf node may start at a code point that is less than 10₁₆ from its neighbors.
      • The root node must have a threshold code point between U+0000 and U+07FFF0.
      • All parent–child nodes’ code points must be no more than 7FFF0₁₆ apart.
  • A names-to-BST-indices lookup table, which is able to map Unicode names to the indices of the BST nodes that contain the names’ characters. We will implement this later in creating a minimal perfect hash function.
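
To make the node arithmetic and the header encoding above concrete, here is a minimal sketch; the helper names are hypothetical, and none of this is the eventual index.js code:

const assert = require('assert');

// Implicit-array navigation for a perfectly balanced BST.
function childIndices (i) { return [2 * i + 1, 2 * i + 2]; }
function parentIndex (i) { return Math.floor((i - 1) / 2); }
function firstIndexOfLevel (l) { return 2 ** l - 1; }
// With n total nodes, indices (n - 1) / 2 through n - 1 are the leaves.
function isLeafIndex (i, numOfNodes) { return i >= (numOfNodes - 1) / 2; }

assert.deepStrictEqual(childIndices(3), [7, 8]); // Node #3 begets #7 and #8…
assert.strictEqual(parentIndex(7), 3);           // …and #7's parent is #3.
assert.strictEqual(firstIndexOfLevel(3), 7);     // Level 3 starts at node #7.
assert.ok(isLeafIndex(7, 15) && !isLeafIndex(6, 15));

// Header codec: each node stores (codePoint - parentCodePoint) / 16
// as one big-endian signed 16-bit integer.
function encodeHeader (codePoint, parentCodePoint = 0) {
  const header = Buffer.alloc(2);
  header.writeInt16BE((codePoint - parentCodePoint) / 16);
  return header;
}

function decodeHeader (header, parentCodePoint = 0) {
  return parentCodePoint + header.readInt16BE() * 16;
}

assert.strictEqual(decodeHeader(encodeHeader(0x20000, 0x10000), 0x10000), 0x20000);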

In order to construct this BST, we will divide the 30787 non-skipped entries of UnicodeData.txt into two nodes, several times. The BST is perfectly balanced in order to support implicit indexing of its nodes. For many devices, the BST will hopefully be small enough to load completely into memory – but, in fact, implementations would need to load only one node at a time from storage into memory.

How many times should we bifurcate our nodes? We expect data compression to become less effective as the number of divisions increases, because all compression algorithms depend on learning from past data to compress incoming data. But we also expect lookup speed to increase as the number of divisions increases, because decreasing leaf sizes will decrease the average number of entries that must be sequentially searched. However, it is difficult to predict precisely by how much total space will increase with each successive division. We will make the number of divisions configurable and then chart how it affects both spatial and temporal efficiency. There is also a maximum number of divisions, since no two leaf nodes may start less than 10₁₆ apart.

In addition, how should we bifurcate our nodes? The simplest way would be into halves with roughly equal numbers of name entries (with each half’s code points rounded to the nearest 10₁₆, as the code-point encoding scheme above requires; see the sketch below). But this may not be the best way: we could instead use Unicode blocks. We discuss this later in bifurcating BST nodes by Unicode blocks.
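
As a rough illustration of that simplest strategy – a hypothetical helper, not the elided build code – bifurcating a code-point-sorted entry array at its median, rounded down to a multiple of 10₁₆, might look like:

function bifurcateEntries (entryArray) {
  // entryArray holds [codePoint, name] pairs sorted by code point.
  const medianEntry = entryArray[Math.floor(entryArray.length / 2)];
  // Round the threshold down to a multiple of 16, per the encoding scheme.
  const thresholdCodePoint = medianEntry[0] & ~0xF;
  return {
    thresholdCodePoint,
    lesserHalf: entryArray.filter(([codePoint]) => codePoint < thresholdCodePoint),
    greaterHalf: entryArray.filter(([codePoint]) => codePoint >= thresholdCodePoint),
  };
}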

Instead of outputting a single index.txt.gz file, our build.js script will now output an index.js file, as well as binary files called leaf-0.txt.gz, leaf-1.txt.gz, etc. for each leaf node in our BST. The index.js file will now contain the code for retrieving values, which we will move from build.js. index.js will also contain our BST’s branch nodes in a JavaScript array.

We start by creating index.js: ---

Bifurcating BST nodes by Unicode blocks

We currently bifurcate BST nodes into halves with roughly equal numbers of name entries, with each half’s code points rounded to the nearest 10₁₆. But this may not be the best way. The space of Unicode code points is formally divided into blocks of contiguous points, and we can use these Unicode blocks to group related contiguous characters together.

The Standard attempts to group related characters in the same Unicode block, although this has not been the case for many characters for historical reasons. Related characters are presumed to be more likely to have similar names, and therefore splitting Unicode blocks between BST nodes may worsen their compression. So an alternative approach that we should eventually test would be to bifurcate nodes into halves with roughly equal numbers of blocks.

As a rule, all Unicode blocks are already aligned on 10₁₆ anyway, as defined in The Unicode Standard, Chapter 3, clause D10b: “Each range for a defined block starts with a value for which code point MOD 16 = 0 and terminates with a larger value for which code point MOD 16 = 15. This specification results in block ranges which always include full code point columns for code chart display. A block never starts or terminates in mid-column.” ---

Creating a minimal perfect hash function

Our perfectly balanced BST has much more efficient points-to-names random access, but it still requires sequential search for names-to-points retrieval. In order to implement the latter, we must add another lookup data structure.

Hash tables are perhaps the most popular lookup structure today. In particular, minimal perfect hash functions might be especially appropriate in this case. However, hash functions come with an important disadvantage: false positives. Hashes are not necessarily unique: any efficient hash function will yield the same value for some two distinct keys. (This is one reason why secure cryptographic hash functions are so difficult to create.) Any minimal perfect hash function that rejects invalid keys would therefore have to include copies of all its keys somewhere anyway.

Fortunately, we will be keeping our keys in the BST anyway. Furthermore, we can save much room in the hash table by encoding the indices of the BST leaves that contain characters, rather than the characters’ code points directly, since we’re going to have to retrieve the characters from the BST anyway.

The information-theoretical minimum size of a minimal perfect hash function over 𝘬 string keys, as 𝘬 → ∞, is (𝘬 · log₂(ℯ) − O(log₂(𝘬))) bits ≈ 1.44 bits · 𝘬, where ℯ is the base of the natural logarithm. In our case, 𝘬 = 32197, so our hash function would require approximately 6.54 kB of overhead (for a proof of this bound, see Botelho’s 2008 thesis). The 6.54 kB would be overhead cost, in addition to the cost of storing 𝘬 · log₂(2^𝘭) bits = 𝘬 · 𝘭 bits for the corresponding BST-leaf-node indices – where 𝘭 is the maximum level of the perfectly balanced BST, which therefore has 2^𝘭 leaf nodes. For 𝘭 = 1, 2, 3, and 4, these costs total approximately 10.6 kB, 14.6 kB, 18.6 kB, and 22.6 kB. (Consider if we instead stored code points directly. There are 110000₁₆ (1,114,112) code points, so it takes a minimum of log₂(1,114,112) ≈ 20.1 bits to represent one code point. Our total would become 6.54 kB + 32197 · 20.1 bits ≈ 87 kB, which is dramatically larger.)
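
A quick back-of-the-envelope check of those totals (a throwaway snippet, reusing the 6.54 kB overhead figure from above):

const numOfKeys = 32197;
const hashOverheadKB = 6.54; // Minimal-perfect-hash overhead, from above.
for (let maxLevel = 1; maxLevel <= 4; maxLevel++) {
  // Each key needs maxLevel bits to address one of 2^maxLevel leaf nodes.
  const leafIndexCostKB = numOfKeys * maxLevel / 8 / 1000;
  console.log(`l = ${maxLevel}: ${(hashOverheadKB + leafIndexCostKB).toFixed(1)} kB`);
}
// → l = 1: 10.6 kB; l = 2: 14.6 kB; l = 3: 18.6 kB; l = 4: 22.6 kB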

We start our hash function by ---

Other data structures

We are using a perfectly balanced BST and a minimal perfect hash function, but these are not the only possibilities. Tries / radix trees / prefix trees / directed acyclic graphs were also considered as alternatives. The Unicode Standard, Chapter 5, § 5.1 Data Structures for Character Conversion suggests that Unicode metadata generally be stored in two-stage tables for lookup of properties by code points. These structures are directed acyclic graphs that first match code points’ most-significant bytes with a single primary table, which yields a pointer to one of many secondary tables; they then match the code points’ least-significant bytes. In particular, the official International Components for Unicode project uses such structures.
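
For concreteness, here is a minimal sketch of such a two-stage lookup for a hypothetical fixed-width (8-bit) property over the BMP; the blocks and values are made up, not our name data:

// Stage 1 maps a code point's high byte to a block index;
// stage 2 blocks hold property values indexed by the low byte.
const defaultBlock = new Uint8Array(256); // All zeroes: the default value.
const sampleBlock = Uint8Array.from({ length: 256 }, (_, i) => i % 4);
const stage2Blocks = [defaultBlock, sampleBlock];
const stage1 = new Uint16Array(256); // High byte → index into stage2Blocks.
stage1[0x1F] = 1; // E.g., code points U+1F00–U+1FFF share sampleBlock.

function lookupProperty (codePoint) {
  return stage2Blocks[stage1[codePoint >> 8]][codePoint & 0xFF];
}

console.log(lookupProperty(0x1F05)); // → 1 (from sampleBlock)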

However, character names are not like these other Unicode properties. The list of names is dense, not sparse, and name values are unique and of variable length – not members of a small enumerated set of fixed-length values. Using tries or similar structures may be advantageous for substring matching, but we do not care about substring matching. Although we have not actually tested these alternatives, it is likely a priori that they would markedly reduce data compression, compared to our perfectly balanced BST and a minimal perfect hash function, with little benefit in access time.

Compressing data with a custom algorithm

Both gzip and Brotli compression are third-party dependencies that, being sophisticated and complex, would add large overhead to client-side scripts. gzip is built into Node.js but it is not yet built into any DOM web API; the most popular browser implementation, ---, weighs --- kB when minified. Brotli is neither ---.

Can we compress the data to a level comparable to either compression method but with less overhead? Perhaps we can, if we take advantage of our perfectly balanced BST structure. ---

Korean/Hangul-syllable names

The algorithm for generating the names and code points of Korean/Hangul syllables is defined in The Unicode Standard, Chapter 3, § 3.12, which also conveniently gives some sample Java code; its jamo name tables derive from the Unicode Character Database file Jamo.txt:

static private int
  SBase = 0xAC00, // Syllable base point.
  LBase = 0x1100, // Leading-jamo base point.
  VBase = 0x1161, // Vowel-jamo base point.
  TBase = 0x11A7, // Trailing-jamo base point.
  LCount = 19, // Number of leading jamo.
  VCount = 21, // Number of vowel jamo.
  TCount = 28, // Number of trailing jamo.
  NCount = VCount * TCount, // = 588: number of syllables per leading jamo.
  SCount = LCount * NCount; // = 11172: number of syllables total.

public static String getHangulName(char s) {
    int SIndex = s - SBase;
    if (0 > SIndex || SIndex >= SCount) {
      throw new IllegalArgumentException("Not a Hangul Syllable: "
        + s);
    }
    int LIndex = SIndex / NCount;
    int VIndex = (SIndex % NCount) / TCount;
    int TIndex = SIndex % TCount;
    return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex]
      + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex];
  }
static private String[] JAMO_L_TABLE = {
    "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
    "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
};
static private String[] JAMO_V_TABLE = {
    "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
    "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
    "YU", "EU", "YI", "I"
};
static private String[] JAMO_T_TABLE = {
    "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
    "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
    "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
};

From this, a generator and parser of their names in JavaScript can be derived:

const
  syllableBasePoint = 0xAC00,
  leadingJamoBasePoint = 0x1100,
  vowelJamoBasePoint = 0x1161,
  trailingJamoBasePoint = 0x11A7,
  leadingJamoNameArray = [
    "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
    "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H",
  ],
  vowelJamoNameArray = [
    "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
    "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
    "YU", "EU", "YI", "I",
  ],
  trailingJamoNameArray = [
    "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
    "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
    "S", "SS", "NG", "J", "C", "K", "T", "P", "H",
  ],
  jamoNameArrayArray = [leadingJamoNameArray, vowelJamoNameArray, trailingJamoNameArray],
  numOfLeadingJamo = leadingJamoNameArray.length,
  numOfVowelJamo = vowelJamoNameArray.length,
  numOfTrailingJamo = trailingJamoNameArray.length,
  numOfSyllablesPerLeadingJamo = numOfVowelJamo * numOfTrailingJamo, // = 588
  numOfSyllables = numOfLeadingJamo * numOfSyllablesPerLeadingJamo; // = 11172

function getHangulSyllableName (codePoint) {
  const syllableIndex = codePoint - syllableBasePoint;
  if (numOfSyllables <= syllableIndex || syllableIndex < 0)
    return null;
  else {
    const leadingJamoIndex = Math.floor(syllableIndex / numOfSyllablesPerLeadingJamo);
    const vowelJamoIndex = Math.floor((syllableIndex % numOfSyllablesPerLeadingJamo) / numOfTrailingJamo);
    const trailingJamoIndex = syllableIndex % numOfTrailingJamo;
    return `HANGUL SYLLABLE ${
      leadingJamoNameArray[leadingJamoIndex]
    }${
      vowelJamoNameArray[vowelJamoIndex]
    }${
      trailingJamoNameArray[trailingJamoIndex]
    }`;
  }
}

function isPrefixOf (substring, bodyString, stringIndex) {
  return substring ===
    bodyString.substring(
      stringIndex,
      stringIndex + substring.length);
}

function isPreviousResultShorterThanNextString (previousResult, nextString) {
  return !previousResult || previousResult.string.length < nextString.length;
}

function createLongestMatchComparator (string, stringIndex) {
  return function longestMatchComparator (previousResult, nextString, nextIndex) {
    return isPrefixOf(nextString, string, stringIndex)
      && isPreviousResultShorterThanNextString(previousResult, nextString)
      ? { index: nextIndex, string: nextString }
      : previousResult;
  };
}

function findLongestMatch (stringArray, string, stringIndex) {
  return stringArray.reduce(createLongestMatchComparator(string, stringIndex), null);
}

function hasLeftoverCharacters (string, stringIndex) {
  return stringIndex < string.length;
}

function applyLongestMatches (string, stringArrayArray) {
  let stringIndex = 0;
  const matchArray = [];
  for (const stringArray of stringArrayArray) {
    const match = findLongestMatch(stringArray, string, stringIndex);
    if (match) {
      matchArray.push(match);
      stringIndex += match.string.length;
    } else
      return null;
  }
  if (hasLeftoverCharacters(string, stringIndex))
    return null;
  return matchArray;
}

const hangulSyllableNamePattern = /^HANGUL SYLLABLE ([A-Z\-]+)$/;

function parseHangulSyllableName (name) {
  const patternMatch = name.match(hangulSyllableNamePattern);
  if (patternMatch) {
    const [, fullJamoNameString] = patternMatch;
    return applyLongestMatches(fullJamoNameString, jamoNameArrayArray);
  } else
    return null;
}
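
If these sketches are right, the two directions round-trip. Here, getHangulSyllableCodePoint is a hypothetical helper (not part of the build) that converts the parsed match indices back into a syllable code point:

function getHangulSyllableCodePoint (name) {
  const matchArray = parseHangulSyllableName(name);
  if (!matchArray)
    return null;
  const [leadingMatch, vowelMatch, trailingMatch] = matchArray;
  // Invert the index arithmetic from getHangulSyllableName.
  return syllableBasePoint
    + leadingMatch.index * numOfSyllablesPerLeadingJamo
    + vowelMatch.index * numOfTrailingJamo
    + trailingMatch.index;
}

console.log(getHangulSyllableName(0xD55C));
// → "HANGUL SYLLABLE HAN"
console.log(getHangulSyllableCodePoint('HANGUL SYLLABLE HAN').toString(16).toUpperCase());
// → "D55C"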

Character name aliases

Some Unicode names are aliases of characters, separate from their strict character names. Oftentimes this is because the strict character name contains a misspelling, like the infamous U+FE18 Presentation Form for Vertical Right White Lenticular Brakcet, which got an alias with the correct spelling “Bracket”. Other characters have other types of aliases: for widely used alternate names and for abbreviations.

A character may have zero, one, or many aliases: whereas the mapping from characters to strict character names is one-to-one, the mapping from characters to aliases is one-to-many. ---

Named character sequences

Special character sequences have also received Unicode names. These sequences are strings of multiple code points. ---

Code-point labels

Many characters do not have any strict Unicode name. These include the control characters, private-use characters, surrogate code points, noncharacters, and reserved characters. The Standard therefore gives them labels that look like U+0009 <control-0009> and U+FFFF <noncharacter-FFFF>. Although these labels technically live in a separate namespace from the Unicode names, it still would be useful to refer to them in program source code, and it is very unlikely that any future Unicode name will collide with an already-valid code-point label. Code-point labels are also easily determined by several rules: ---
