J. S. Choi, 2022
All programmers must manipulate text, and JavaScript programmers are no exception. Text manipulation often involves referring to specific characters, usually by their code points in hexadecimal – or by embedding the characters directly in source code.
Almost every encoded character used by programmers has one or more standard, unique, and immutable Unicode names. Unicode names are the most prominent, if not the most important, properties of any Unicode character.
Unicode names are much more memorable, readable, visually distinctive, and self-documenting than hexadecimal codes. And they will always be visible, printable, and supported by system fonts, unlike most directly embedded characters.
Yet Unicode names are not often used in JavaScript programming, despite JavaScript being perhaps the most ubiquitous programming language in the world. This is perhaps most due to the sheer number of Unicode names (285,275 names for 284,805 named Unicode values, as of Unicode 14.0 in 2021). The canonical source of those 285,275 Unicode names, UnicodeData.txt, weighs 1,897.8 kB. Embedding such large files in every language runtime would impose a large storage and memory burden on resource-constrained environments such as mobile devices.
But that heavy memory can be made light, without compromising retrieval time, by losslessly compressing UnicodeData.txt into a concise data structure.
When manipulating text, developers often need to refer to non-ASCII characters, non-printable characters, precomposed characters (or character sequences that may be precomposed), or RTL characters.
Embedding the characters directly in source code can often be problematic:
- Non-ASCII characters might be supported by no font in the human reader’s system, making the system unable to display them directly to the human reader.
- Non-printable characters are, by their nature, invisible, which makes source code that directly embeds them difficult to read and edit.
- Precomposed characters and character sequences that may be precomposed are visually ambiguous. Whether a character sequence is precomposed (e.g., é) or not (e.g., e then combining ◌́) is often significant, such as when matching text.
- JavaScript syntax is based on English, which uses Latin letters. Many characters resemble these Latin letters yet have very different meanings; for instance, С (`U+0421` “Cyrillic Capital Letter Es”) resembles C (`U+0043` “Latin Capital Letter C”), and so `async () => 'С'` (which would return the Cyrillic letter) is difficult to distinguish from `async () => 'C'` (which would return the Latin letter).
- Directly embedding RTL characters (like Arabic or Hebrew letters) in source code can cause the flow of its text to switch unpredictably. Directly embedded RTL characters can even cause security bugs.
Instead of directly embedding those characters, developers may refer to them with `\u` escape syntax. But these escapes require using the characters’ hex codes, which can be inscrutable to human readers, unless nearby comments redundantly explain which character belongs to each hex code.
With Gzemnid, we can see how people already use these characters in their JavaScript source code.
From [email protected]/src/node.js:
const colorCode = '\u001B[3' + (c < 8 ? c : '8;5;' + c);
From [email protected]/lib/js-yaml/loader.js:
(c === 0x4C/* L */) ? '\u2028' :
(c === 0x50/* P */) ? '\u2029' : '';
From [email protected]/lib/output.js:
str = str.replace(/[\\\b\f\n\r\v\t\x22\x27\u2028\u2029\0\ufeff]/g,
function(s, i) {
switch (s) {
case '"': ++dq; return '"';
case "'": ++sq; return "'";
case "\\": return "\\\\";
case "\n": return "\\n";
case "\r": return "\\r";
case "\t": return "\\t";
case "\b": return "\\b";
case "\f": return "\\f";
case "\x0B": return options.ie8 ? "\\x0B" : "\\v";
case "\u2028": return "\\u2028";
case "\u2029": return "\\u2029";
case "\ufeff": return "\\ufeff";
case "\0":
return /[0-9]/.test(str.charAt(i+1)) ? "\\x00" : "\\0";
}
return s;
});
From [email protected]
:
while (/[\s\uFEFF\u00A0]/.test(str[tail - 1])) {
From [email protected]/lib/languages/fix.js
:
begin: /[^\u2401\u0001]+/,
end: /[\u2401\u0001]/,
From [email protected]/src/diacritics.js
:
{ base:'AA',letters:/[\uA732]/g },
{ base:'AE',letters:/[\u00C6\u01FC\u01E2]/g },
{ base:'AO',letters:/[\uA734]/g },
{ base:'AU',letters:/[\uA736]/g },
{ base:'AV',letters:/[\uA738\uA73A]/g },
{ base:'AY',letters:/[\uA73C]/g },
{ base:'B', letters:/[\u0042\u24B7\uFF22\u1E02\u1E04\u1E06\u0243\u0182\u0181]/g
},
From [email protected]/src/unicodeSymbols.js
:
"\u014f": "\u006f\u0306", // ŏ = \u{o}
From [email protected]/test/test.js
:
'\u{1F937}\u{1F3FB}\u200D',
'\u{1F937}\u{1F3FC}\u200D',
'\u{1F937}\u{1F3FD}\u200D',
'\u{1F937}\u{1F3FE}\u200D',
'\u{1F937}\u{1F3FF}\u200D',
'\u{1F937}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FB}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FC}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FD}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FE}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FF}\u200D\u2640\uFE0F'
From [email protected]/lib/pattern.js
:
['^\\u{20bb7}$', 'u', '𠮷', true, 'test'],
As defined in the Unicode Standard, a code point is an integer between `0` and `0x10_FFFF` – this range of `17 * 16**4` (or 1,114,112) unsigned integers is also called the Unicode codespace. We represent code points with hexes (uppercase hexadecimal strings); two hexes (`16**2`) make up one byte (`2**8`).
Each code point encodes one of three things:
- A scalar character,
- A noncharacter, or
- A UTF-16 surrogate
– or the code point may yet be unassigned.
A UTF-16 surrogate is a code point between `D800` and `DFFF`. There are therefore `2 * 16**3` (or 2,048) UTF-16 surrogates. UTF-16 surrogates are not characters, and they are not even valid code points on their own in well-formed Unicode text. They are used only as special code units to encode large code points in a certain encoding form of Unicode, UTF-16.
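For instance, a hypothetical helper (not one of this document’s modules) can compute the surrogate pair that UTF-16 uses for a code point above the first plane:

```javascript
// A hypothetical helper (not one of this document's modules): split an
// astral code point (U+10000–U+10FFFF) into its UTF-16 surrogate pair.
function getSurrogatePair(codePoint) {
  if (codePoint < 0x1_0000 || codePoint > 0x10_FFFF)
    throw new RangeError('Expected an astral code point.');
  const offset = codePoint - 0x1_0000; // A 20-bit offset.
  return [
    0xD800 + (offset >> 10),   // High surrogate: the upper 10 bits.
    0xDC00 + (offset & 0x3FF), // Low surrogate: the lower 10 bits.
  ];
}

getSurrogatePair(0x1F937); // → [ 0xD83E, 0xDD37 ], i.e. '\u{1F937}'
```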
A code point that is not a UTF-16 surrogate is also called a scalar. A scalar always encodes a scalar character or a noncharacter – or is yet unassigned. We will prefer the term “scalar” where possible.
A character is a string that represents a single unit of writing. Most characters that we care about are encoded by single scalars, which we call scalar characters. All scalar characters have at least one name. In addition, we can refer to scalar characters by `U+` followed by their code points’ hex codes, like `U+0000` or `U+10FFFF`.
Note that not all characters are encoded by only single scalars. Named character sequences – which are encoded by sequences of scalars – are also a type of character, in the sense that they are strings that each represent a single unit of writing. They are characters in the same way that precomposed characters (which are encoded by single scalars but may decompose into character sequences) are also characters. Our algorithms will often split each named character sequence into its single head scalar (the zeroth scalar in the sequence) and its tail scalars (the rest of the scalars in the sequence).
A noncharacter is a code point reserved for internal use within a computer system. Noncharacters are not meant to encode any character; rather, they are meant to be useful as “sentinel values” for internal string manipulation.
Noncharacters occur in a regular pattern: they are the thirty-two code points `FDD0`–`FDEF`, as well as the final two code points of every plane – `FFFE` and `FFFF`, `1FFFE` and `1FFFF`, `2FFFE` and `2FFFF`, and so on, until `10FFFE` and `10FFFF` – for sixty-six noncharacters in total.
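A minimal noncharacter test (a hypothetical helper, not one of this document’s modules) covers both the `U+FDD0`–`FDEF` range and the plane-end pattern:

```javascript
// A hypothetical noncharacter test (not one of this document's modules).
function isNoncharacter(codePoint) {
  return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
    // Matches …FFFE and …FFFF at the end of every plane.
    || (codePoint & 0xFFFE) === 0xFFFE;
}

isNoncharacter(0xFFFE); // → true
isNoncharacter(0x0041); // → false
```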
A private-use character is a character to which the Unicode Standard itself has given no specific meaning; it can be freely used for any purpose. There are three sets of private-use characters: `U+E000`–`F8FF` (totaling 6,400), `U+F0000`–`FFFFD` (totaling 65,534), and `U+100000`–`10FFFD` (totaling 65,534).
When we consider named values of many different types (including UTF-16 surrogates) together, we will refer to their code points as head points. The head point of a named character sequence is its head scalar.
A character may have zero or more character names, which include strict character names, character name aliases, and the names of named character sequences.
Every scalar character has one strict character name that is defined as its `Name`/`na` property, as declared by the Unicode Standard’s clause D4 (in Chapter 3) and § 4.8 Name (in Chapter 4). For example, the strict character name of `U+0020` is “SPACE”, and the strict character name of `U+0021` is “EXCLAMATION MARK”.
Every scalar character also has zero or more character name aliases. For example, `U+0020` has one alias: the abbreviation “SP”. In contrast, `U+0021` has no aliases.
Several characters made of multiple scalars are named character sequences, which have their own unique names.
For each named character sequence, any scalars following its first scalar are its tail scalars. Notably, many named character sequences share the same tail scalars, varying only in their head scalars. For instance, all Keycap named character sequences share the same tail `U+FE0F U+20E3`.
Lastly, many scalar characters (and all noncharacters and UTF-16 surrogates) do not have any strict character name. These include the control characters, private-use characters, UTF-16 surrogates, noncharacters, and reserved characters. The Standard therefore gives them code-point labels such as `CONTROL-0009` for `U+0009`, `SURROGATE-D800` for `U+D800`, and `NONCHARACTER-FFFF` for `U+FFFF`.
Character names and code-point labels are collectively called Unicode names. Other libraries sometimes call these `uname`s.
Lastly, in order to distinguish general Unicode strings from Unicode strings that have names, we call the latter named Unicode values. We avoid simply calling named Unicode values “characters” because not all named Unicode values are characters – i.e., the UTF-16 surrogates and the noncharacters.
The Unicode codespace is made of seventeen equally sized planes, each made of `16**4` (or 65,536) code points. The design is such that, for any code point, the most significant byte of its three-byte (six-hex) representation is the plane number: from `00` (#0) to `10` (#16). For example:

- The code point `A1CC` (`00A1CC` with three bytes or six hexes) would belong to plane `00` (#0).
- The code point `BA1CC` (`0BA1CC`) would belong to plane `0B` (#11).
- The code point `10A1CC` would belong to plane `10` (#16).
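This design makes the plane number trivially computable; a hypothetical helper (not one of this document’s modules) is just a sixteen-bit shift:

```javascript
// The plane of any code point is its most significant byte – i.e., the
// code point shifted right by sixteen bits. (Hypothetical helper, not one
// of this document's modules.)
function getPlane(codePoint) {
  return codePoint >>> 16;
}

getPlane(0xA1CC);   // → 0x00 (the BMP)
getPlane(0xBA1CC);  // → 0x0B (#11)
getPlane(0x10A1CC); // → 0x10 (#16)
```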
There are two noncharacters at the end of each plane.
Some planes have special names:

- Plane `00` is the Basic Multilingual Plane (BMP).
- Plane `01` is the Supplementary Multilingual Plane (SMP).
- Plane `02` is the Supplementary Ideographic Plane (SIP).
- Plane `03` is the Tertiary Ideographic Plane (TIP).
- (Planes `04`–`0D` are completely unassigned except for the two noncharacters at the end of each plane.)
- Plane `0E` is the Supplementary Special-purpose Plane (SSP).
- Planes `0F` and `10` are the Private Use Planes.
The majority of Unicode names are strict character names: names that are assigned to scalar characters by the `Name`/`na` property, as declared by the Unicode Standard’s clause D4 (in Chapter 3) and § 4.8 Name (in Chapter 4). There are also other kinds of Unicode names, including character name aliases, the names of named character sequences, and code-point labels.
No matter their type, all Unicode names are permanent and immutable. Each name also denotes exactly one string (either a single code point or a sequence of them). Together, they form a single shared namespace: the Unicode namespace for character names.
The Unicode Name property values are unique for all non-null values, but not every Unicode code point has a unique Unicode Name property value. Furthermore, because Unicode character names, character name aliases, named character sequences, and code point labels constitute a single, unique namespace, the Name property value uniqueness requirement applies to all three kinds of names and to code point labels.
All Unicode names also follow several simple orthographic rules as declared in § 4.8 Name’s clauses R1–R4, so that they may “easily be transposed into formal identifiers in another context, such as a computer language.”
- Names will always contain only (case-insensitive) Latin letters, Arabic digits, spaces, and ASCII hyphens: 38 total characters. (We’ll call chunks of names separated by spaces “words”. For instance, “Zero Width Space” consists of three words.)
- Words in names will never start with digits.
- Names will never start or end with ASCII hyphens, and they will never have double hyphens.
- Rule #3 applies to spaces too.
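Under these four rules, name syntax can be captured in a single hypothetical regular expression (not part of this document’s modules). Note that only space-delimited words must begin with a letter; hyphen-separated segments within a word may begin with digits (as in “Tangut Component-001”):

```javascript
// A hypothetical validation regular expression for rules R1–R4 (not one
// of this document's modules): words of letters and digits that never
// begin with a digit, joined by single spaces; single medial hyphens
// within words; no leading, trailing, or doubled hyphens or spaces.
const nameSyntaxRegExp =
  /^[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)*(?: [A-Z][A-Z0-9]*(?:-[A-Z0-9]+)*)*$/i;

nameSyntaxRegExp.test('Zero Width Space');     // → true
nameSyntaxRegExp.test('Hangul Jungseong O-E'); // → true
nameSyntaxRegExp.test('-Bad Name');            // → false
```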
A corollary of these rules is that all spaces are ignorable. Furthermore, all
medial ASCII hyphens (where “medial” means “inside a space-delimited word, not
at the word’s beginning or its end”) are ignorable, with one current exception:
the Korean character `U+1180` Hangul Jungseong O-E, whose name needs its hyphen to be distinguishable from `U+116C` Hangul Jungseong OE. Fuzzy name matching is defined in UAX #44, and we implement its rules in the `#core/fuzzy-matcher` module.
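The core folding rule might be sketched as follows. This is a simplified reading of UAX #44’s loose-matching rule UAX44-LM2; the actual `#core/fuzzy-matcher` module may differ, and the function name here is hypothetical:

```javascript
// A simplified sketch of UAX #44's loose matching (rule UAX44-LM2). The
// function name is hypothetical; the actual module may differ.
function getFuzzyKey(name) {
  // Case, spaces, and underscores are always ignorable.
  const folded = name.toUpperCase().replace(/[ _]/g, '');
  // The one name whose medial hyphen is currently significant:
  if (folded === 'HANGULJUNGSEONGO-E')
    return folded;
  // Otherwise, medial hyphens are ignorable too.
  return folded.replace(/(?<=[A-Z0-9])-(?=[A-Z0-9])/g, '');
}

getFuzzyKey('zero-width space'); // → 'ZEROWIDTHSPACE'
```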
Several Unicode names require the formatting of code points into specially formatted 4-, 5-, or 6-digit hexadecimal strings (or hexes), which we will call code hexes. A hex that would be less than four digits long (e.g., `1F`) is padded with `0`s at the start (e.g., `001F`).

We implement these rules in the `#core/util/hex` module:
// This function converts hex strings into their code-point integers.
// If the hex string is malformed (e.g., `'123XYZ'`),
// then it returns `undefined`.
export function getScalarFromHex (scalarHex) { … }

// This function converts code-point integers into their hex strings. The
// string is padded with zeroes as necessary to make it at least four digits
// long.
export function getHexFromScalar (scalar) { … }
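One possible implementation of these two helpers follows; which inputs count as malformed (hex length, the `U+10FFFF` ceiling) is an assumption here, not part of the module’s stated contract:

```javascript
// One possible implementation (a sketch; the actual #core/util/hex module
// may differ). Hexes are parsed case-insensitively; code points beyond
// U+10FFFF and malformed strings yield `undefined`.
function getScalarFromHex (scalarHex) {
  if (typeof scalarHex !== 'string' || !/^[0-9A-F]{1,6}$/i.test(scalarHex))
    return undefined;
  const scalar = parseInt(scalarHex, 16);
  return scalar <= 0x10_FFFF ? scalar : undefined;
}

function getHexFromScalar (scalar) {
  return scalar.toString(16).toUpperCase().padStart(4, '0');
}

getScalarFromHex('001F'); // → 31 (i.e., 0x1F)
getHexFromScalar(0x1F);   // → '001F'
```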
Strict character names (the `Name`/`na` property) follow the above restrictions, but they’re actually determined by the same chapter’s clauses NR1–NR4:

- Precomposed Korean Hangul syllables (`U+AC00`–`D7A3`) are always named “Hangul Syllable” followed by a phrase denoting the sounds of their components, which in turn are known as jamo. Korean Hangul-syllable names are discussed more in Hangul syllables. We implement their rules in the `#core/hangul-syllables` module.
- Most ideographs have dynamically generated hex names: each hex name is a prefix – “CJK Unified Ideograph-”, “Tangut Ideograph-”, “Nushu Character-”, or “CJK Compatibility Ideograph-” – followed by a hexadecimal number indicating its scalar. For example, `U+4E00` is automatically named “CJK Unified Ideograph-4E00”. These hex names are defined in the Unicode Standard’s § 4.8 Name’s Table 4-8. In addition, their ranges are also defined in UnicodeData.txt, some of which are in a compressed format. We implement hex names’ rules in the `#core/name-counter` module.
- If a character has an entry in the Unicode Character Database’s UnicodeData.txt with a value given explicitly in its first field, then that value is its strict character name (unless the value is delimited by `<` and `>`, in which case it is just an informal label). We call these explicitly defined strict character names, in contrast to the strict character names that are dynamically generated by the previous two rules. UnicodeData.txt makes up the vast bulk of the data that we need to compress. We implement these rules in the `#core/name-range/ucd` module.
- If none of the rules above apply, then the code point has no strict character name. This applies to control characters, private-use characters, UTF-16 surrogates, noncharacters, and reserved characters. However, these code points do have code-point labels, and some of them may have character name aliases.
Strict character names therefore form a partial function over code points: one that is injective (no two code points share a strict character name) but not total (many code points have no strict character name at all).
The longest name is “Arabic Ligature Uighur Kirghiz Yeh with Hamza Above with Alef Maksura Isolated Form” (83 characters long). The shortest name is “Ox” (two characters).
The bulk of the names are in the Unicode Character Database’s UnicodeData.txt source. It has 34,627 lines weighing 1,897.8 kB, and it looks like this (line breaks are significant):
0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;
0003;<control>;Cc;0;BN;;;;;N;END OF TEXT;;;;
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
0005;<control>;Cc;0;BN;;;;;N;ENQUIRY;;;;
0006;<control>;Cc;0;BN;;;;;N;ACKNOWLEDGE;;;;
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;
…and then this:
001E;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR TWO;;;;
001F;<control>;Cc;0;S;;;;;N;INFORMATION SEPARATOR ONE;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
…and then this:
004C;LATIN CAPITAL LETTER L;Lu;0;L;;;;;N;;;;006C;
004D;LATIN CAPITAL LETTER M;Lu;0;L;;;;;N;;;;006D;
004E;LATIN CAPITAL LETTER N;Lu;0;L;;;;;N;;;;006E;
004F;LATIN CAPITAL LETTER O;Lu;0;L;;;;;N;;;;006F;
0050;LATIN CAPITAL LETTER P;Lu;0;L;;;;;N;;;;0070;
0051;LATIN CAPITAL LETTER Q;Lu;0;L;;;;;N;;;;0071;
…and then later this:
090D;DEVANAGARI LETTER CANDRA E;Lo;0;L;;;;;N;;;;;
090E;DEVANAGARI LETTER SHORT E;Lo;0;L;;;;;N;;;;;
090F;DEVANAGARI LETTER E;Lo;0;L;;;;;N;;;;;
0910;DEVANAGARI LETTER AI;Lo;0;L;;;;;N;;;;;
0911;DEVANAGARI LETTER CANDRA O;Lo;0;L;;;;;N;;;;;
0912;DEVANAGARI LETTER SHORT O;Lo;0;L;;;;;N;;;;;
…then this:
33FF;SQUARE GAL;So;0;ON;<square> 0067 0061 006C;;;;N;;;;;
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
4DC0;HEXAGRAM FOR THE CREATIVE HEAVEN;So;0;ON;;;;;N;;;;;
4DC1;HEXAGRAM FOR THE RECEPTIVE EARTH;So;0;ON;;;;;N;;;;;
4DC2;HEXAGRAM FOR DIFFICULTY AT THE BEGINNING;So;0;ON;;;;;N;;;;;
…then this:
4DFE;HEXAGRAM FOR AFTER COMPLETION;So;0;ON;;;;;N;;;;;
4DFF;HEXAGRAM FOR BEFORE COMPLETION;So;0;ON;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
A000;YI SYLLABLE IT;Lo;0;L;;;;;N;;;;;
A001;YI SYLLABLE IX;Lo;0;L;;;;;N;;;;;
…then this:
ABF8;MEETEI MAYEK DIGIT EIGHT;Nd;0;L;;8;8;8;N;;;;;
ABF9;MEETEI MAYEK DIGIT NINE;Nd;0;L;;9;9;9;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
D7B0;HANGUL JUNGSEONG O-YEO;Lo;0;L;;;;;N;;;;;
D7B1;HANGUL JUNGSEONG O-O-I;Lo;0;L;;;;;N;;;;;
…then this:
D7FA;HANGUL JONGSEONG PHIEUPH-SIOS;Lo;0;L;;;;;N;;;;;
D7FB;HANGUL JONGSEONG PHIEUPH-THIEUTH;Lo;0;L;;;;;N;;;;;
D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;
F901;CJK COMPATIBILITY IDEOGRAPH-F901;Lo;0;L;66F4;;;;N;;;;;
F902;CJK COMPATIBILITY IDEOGRAPH-F902;Lo;0;L;8ECA;;;;N;;;;;
F903;CJK COMPATIBILITY IDEOGRAPH-F903;Lo;0;L;8CC8;;;;N;;;;;
F904;CJK COMPATIBILITY IDEOGRAPH-F904;Lo;0;L;6ED1;;;;N;;;;;
…then this:
16FF0;VIETNAMESE ALTERNATE READING MARK CA;Mc;6;L;;;;;N;;;;;
16FF1;VIETNAMESE ALTERNATE READING MARK NHAY;Mc;6;L;;;;;N;;;;;
17000;<Tangut Ideograph, First>;Lo;0;L;;;;;N;;;;;
187F7;<Tangut Ideograph, Last>;Lo;0;L;;;;;N;;;;;
18800;TANGUT COMPONENT-001;Lo;0;L;;;;;N;;;;;
18801;TANGUT COMPONENT-002;Lo;0;L;;;;;N;;;;;
18802;TANGUT COMPONENT-003;Lo;0;L;;;;;N;;;;;
18803;TANGUT COMPONENT-004;Lo;0;L;;;;;N;;;;;
…then this:
1FBF8;SEGMENTED DIGIT EIGHT;Nd;0;EN;<font> 0038;8;8;8;N;;;;;
1FBF9;SEGMENTED DIGIT NINE;Nd;0;EN;<font> 0039;9;9;9;N;;;;;
20000;<CJK Ideograph Extension B, First>;Lo;0;L;;;;;N;;;;;
2A6DF;<CJK Ideograph Extension B, Last>;Lo;0;L;;;;;N;;;;;
2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
2B738;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
2B820;<CJK Ideograph Extension E, First>;Lo;0;L;;;;;N;;;;;
2CEA1;<CJK Ideograph Extension E, Last>;Lo;0;L;;;;;N;;;;;
2CEB0;<CJK Ideograph Extension F, First>;Lo;0;L;;;;;N;;;;;
2EBE0;<CJK Ideograph Extension F, Last>;Lo;0;L;;;;;N;;;;;
2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;
…then this:
2FA1B;CJK COMPATIBILITY IDEOGRAPH-2FA1B;Lo;0;L;9F16;;;;N;;;;;
2FA1C;CJK COMPATIBILITY IDEOGRAPH-2FA1C;Lo;0;L;9F3B;;;;N;;;;;
2FA1D;CJK COMPATIBILITY IDEOGRAPH-2FA1D;Lo;0;L;2A600;;;;N;;;;;
30000;<CJK Ideograph Extension G, First>;Lo;0;L;;;;;N;;;;;
3134A;<CJK Ideograph Extension G, Last>;Lo;0;L;;;;;N;;;;;
E0001;LANGUAGE TAG;Cf;0;BN;;;;;N;;;;;
E0020;TAG SPACE;Cf;0;BN;;;;;N;;;;;
E0021;TAG EXCLAMATION MARK;Cf;0;BN;;;;;N;;;;;
E0022;TAG QUOTATION MARK;Cf;0;BN;;;;;N;;;;;
…until getting to the end:
E01EC;VARIATION SELECTOR-253;Mn;0;NSM;;;;;N;;;;;
E01ED;VARIATION SELECTOR-254;Mn;0;NSM;;;;;N;;;;;
E01EE;VARIATION SELECTOR-255;Mn;0;NSM;;;;;N;;;;;
E01EF;VARIATION SELECTOR-256;Mn;0;NSM;;;;;N;;;;;
F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
FFFFD;<Plane 15 Private Use, Last>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
1,410 lines (about 4.4% as of Unicode 14.0) in UnicodeData.txt are ideographic characters with dynamically generated hex names, like CJK Compatibility Ideograph-2F9FD. They are included in UnicodeData.txt in order to declare those characters’ values for other Unicode properties. We will be able to discard these characters, since their names are already generated by the `#core/hex-names` module.
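Hex-name generation itself can be sketched as a lookup over scalar ranges. Only two illustrative ranges, taken from the UnicodeData.txt excerpts above, are listed here; the real table (from § 4.8 Name’s Table 4-8) has more entries, and the actual module may differ:

```javascript
// A sketch of dynamic hex-name generation. The range table is truncated
// to two illustrative entries from the UnicodeData.txt excerpts above.
const hexNameRanges = [
  { initial: 0x4E00, final: 0x9FFF, prefix: 'CJK UNIFIED IDEOGRAPH-' },
  { initial: 0x17000, final: 0x187F7, prefix: 'TANGUT IDEOGRAPH-' },
];

function getHexName (scalar) {
  for (const { initial, final, prefix } of hexNameRanges)
    if (initial <= scalar && scalar <= final)
      // The name ends with the scalar's code hex, zero-padded to at least
      // four digits.
      return prefix + scalar.toString(16).toUpperCase().padStart(4, '0');
  return undefined;
}

getHexName(0x4E00); // → 'CJK UNIFIED IDEOGRAPH-4E00'
```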
The Korean Hangul script is made up of jamo characters, which may be composed into syllable characters.
The jamo’s names are encoded in UnicodeData.txt, as usual. However, many composed syllables have been encoded in `U+AC00`–`D7A3`, and their names are generated from their jamo’s sounds.
The algorithm for generating the names and scalars of these composed syllables is defined in the Unicode Standard’s § 3.12. For example, `U+D4DB` Hangul Syllable Pwilh 퓛 is made of three jamo:

- `U+1111` Hangul Choseong Phieuph (“P”),
- `U+1171` Hangul Jungseong Wi (“WI”), and
- `U+11B6` Hangul Jongseong Rieul-Hieuh (“LH”).

Those three jamo’s sounds compose into a single name: P + WI + LH → PWILH – hence, the name “Hangul Syllable Pwilh”.
§ 3.12 also conveniently gives some sample Java code. From this, a producer and parser of their names in JavaScript can be derived. We implement these in the `#core/hangul-syllables` module.
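Following § 3.12’s constants and sound tables, the name generator can be sketched in JavaScript. The function name and API shape here are assumptions; the actual `#core/hangul-syllables` module may differ:

```javascript
// A sketch of Hangul-syllable name generation from § 3.12's constants.
const syllableBase = 0xAC00;
const jamoLSounds = [ // Choseong (leading consonant) sounds.
  'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
  'C', 'K', 'T', 'P', 'H',
];
const jamoVSounds = [ // Jungseong (vowel) sounds.
  'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
  'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I',
];
const jamoTSounds = [ // Jongseong (trailing consonant) sounds.
  '', 'G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
  'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
  'P', 'H',
];
const numOfTSounds = jamoTSounds.length;                  // 28
const numOfVTSounds = jamoVSounds.length * numOfTSounds;  // 588

function getHangulSyllableName(scalar) {
  const syllableIndex = scalar - syllableBase;
  const lSound = jamoLSounds[Math.floor(syllableIndex / numOfVTSounds)];
  const vSound =
    jamoVSounds[Math.floor(syllableIndex % numOfVTSounds / numOfTSounds)];
  const tSound = jamoTSounds[syllableIndex % numOfTSounds];
  return `HANGUL SYLLABLE ${lSound}${vSound}${tSound}`;
}

getHangulSyllableName(0xD4DB); // → 'HANGUL SYLLABLE PWILH'
```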
Several characters are sequences made of multiple scalars. They are stored in the Unicode Character Database’s NamedSequences.txt source. It has 612 lines weighing 20.8 kB, and it looks like this (line breaks are significant):
# NamedSequences-13.0.0.txt
# Date: 2020-01-22, 19:12:00 GMT [KW, LI]
# © 2020 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see http://www.unicode.org/reports/tr44/
#
# Unicode Named Character Sequences
#
# This file is a normative contributory data file in the Unicode
# Character Database.
…then this:
# Named keycap sequences for telephone keypad (used for emoji)
# Provisional, 2015-05-05
# FE0F added to the sequences, 2016-05-11
# Approved 2017-05-12
KEYCAP NUMBER SIGN;0023 FE0F 20E3
KEYCAP ASTERISK;002A FE0F 20E3
KEYCAP DIGIT ZERO;0030 FE0F 20E3
KEYCAP DIGIT ONE;0031 FE0F 20E3
KEYCAP DIGIT TWO;0032 FE0F 20E3
KEYCAP DIGIT THREE;0033 FE0F 20E3
KEYCAP DIGIT FOUR;0034 FE0F 20E3
KEYCAP DIGIT FIVE;0035 FE0F 20E3
KEYCAP DIGIT SIX;0036 FE0F 20E3
KEYCAP DIGIT SEVEN;0037 FE0F 20E3
KEYCAP DIGIT EIGHT;0038 FE0F 20E3
KEYCAP DIGIT NINE;0039 FE0F 20E3
…and eventually ends with this:
KHMER VOWEL SIGN OM;17BB 17C6
KHMER VOWEL SIGN AAM;17B6 17C6
# Entries for JIS X 0213 compatibility mapping.
# Provisional 2008-11-07, Approved 2010-05-14
#
# Two of these were part of the original set of approved named sequences
# for Unicode 4.1. 2005.
HIRAGANA LETTER BIDAKUON NGA;304B 309A
HIRAGANA LETTER BIDAKUON NGI;304D 309A
HIRAGANA LETTER BIDAKUON NGU;304F 309A
HIRAGANA LETTER BIDAKUON NGE;3051 309A
HIRAGANA LETTER BIDAKUON NGO;3053 309A
KATAKANA LETTER BIDAKUON NGA;30AB 309A
KATAKANA LETTER BIDAKUON NGI;30AD 309A
KATAKANA LETTER BIDAKUON NGU;30AF 309A
KATAKANA LETTER BIDAKUON NGE;30B1 309A
KATAKANA LETTER BIDAKUON NGO;30B3 309A
KATAKANA LETTER AINU CE;30BB 309A
KATAKANA LETTER AINU TU;30C4 309A
KATAKANA LETTER AINU TO;30C8 309A
KATAKANA LETTER AINU P;31F7 309A
MODIFIER LETTER EXTRA-HIGH EXTRA-LOW CONTOUR TONE BAR;02E5 02E9
MODIFIER LETTER EXTRA-LOW EXTRA-HIGH CONTOUR TONE BAR;02E9 02E5
# EOF
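Parsing this format is straightforward: comments and blank lines are skipped, and each remaining line is a name, a `;`, then space-separated scalar hexes. This hypothetical line parser is not necessarily how the actual modules are structured:

```javascript
// A hypothetical parser for one NamedSequences.txt line (not necessarily
// how the actual modules are structured). Returns `undefined` for
// comments and blank lines.
function parseNamedSequencesLine(line) {
  if (!line || line.startsWith('#'))
    return undefined;
  const [name, hexes] = line.split(';');
  const scalars = hexes.trim().split(' ').map(hex => parseInt(hex, 16));
  // Split the sequence into its head scalar and its tail scalars.
  return { name, headScalar: scalars[0], tailScalars: scalars.slice(1) };
}

parseNamedSequencesLine('KEYCAP ASTERISK;002A FE0F 20E3');
// → { name: 'KEYCAP ASTERISK', headScalar: 0x2A,
//     tailScalars: [ 0xFE0F, 0x20E3 ] }
```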
Some Unicode names are aliases of scalar characters. These character name aliases are declared separately from strict character names, which are immutable. Sometimes an alias is added because the strict character name has a misspelling, like the infamous `U+FE18` Presentation Form for Vertical Right White Lenticular Brakcet; `U+FE18` got an alias with the correct spelling “Bracket”. Other characters have other types of aliases: for widely used alternate names and for abbreviations.
A character may have zero, one, or many aliases: although strict character names form an injective function, character name aliases do not.
Aliases are stored in the Unicode Character Database’s NameAliases.txt source. It has 567 lines weighing 16.1 kB, and it looks like this (line breaks are significant):
# NameAliases-13.0.0.txt
# Date: 2019-09-09, 19:47:00 GMT [KW, LI]
# © 2019 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see http://www.unicode.org/reports/tr44/
#
# This file is a normative contributory data file in the
# Unicode Character Database.
#
# This file defines the formal name aliases for Unicode characters.
…then like this:
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
0003;END OF TEXT;control
0003;ETX;abbreviation
0004;END OF TRANSMISSION;control
0004;EOT;abbreviation
0005;ENQUIRY;control
0005;ENQ;abbreviation
0006;ACKNOWLEDGE;control
0006;ACK;abbreviation
…and eventually ending with:
E01EA;VS251;abbreviation
E01EB;VS252;abbreviation
E01EC;VS253;abbreviation
E01ED;VS254;abbreviation
E01EE;VS255;abbreviation
E01EF;VS256;abbreviation
# EOF
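Each data line has three semicolon-separated fields: the scalar’s hex, the alias, and the alias type. A hypothetical line parser (not necessarily how the actual modules are structured) might look like:

```javascript
// A hypothetical parser for one NameAliases.txt line (not necessarily how
// the actual modules are structured). Returns `undefined` for comments
// and blank lines.
function parseNameAliasesLine(line) {
  if (!line || line.startsWith('#'))
    return undefined;
  const [scalarHex, alias, aliasType] = line.split(';');
  return { scalar: parseInt(scalarHex, 16), alias, aliasType };
}

parseNameAliasesLine('0000;NUL;abbreviation');
// → { scalar: 0, alias: 'NUL', aliasType: 'abbreviation' }
```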
Many scalar characters (and all noncharacters and UTF-16 surrogates) do not have any strict character name. These include the control characters, private-use characters, UTF-16 surrogates, noncharacters, and reserved characters. The Standard therefore gives them labels such as “control-0009” (`U+0009`) and “noncharacter-FFFF” (`U+FFFF`). These code-point labels are also determined by several rules:
- A code point has a “special code-point type” if it fulfills its respective condition:
  - `control`: The code point is within `U+0000`–`001F`, or it is `U+007F` Delete. (The code point is always a scalar.)
  - `surrogate`: The code point is within `U+D800`–`DFFF`. (The code point does not count as a scalar.)
  - `noncharacter`: The code point is within `U+FDD0`–`FDEF`, or its hex ends with `FFFE` or `FFFF`. This code point does not encode a character.
  - `private-use`: The code point is within `U+E000`–`F8FF`, or it is in one of the two Private Use Planes (i.e., `U+F0000`–`FFFFD` or `U+100000`–`10FFFD`). (The code point is always a scalar.)
  - `reserved`: The code point matches none of the conditions above. (Although the Standard defines reserved code points as those with a General Category of `Cn` (“not assigned”), as declared in the third field of UnicodeData.txt, that definition causes formerly reserved characters to stop being reserved characters over time. In order to keep application code backwards compatible, we use this broader definition of reserved character.)
- When a code point has a “special code-point type”, then it has a code-point label: its special code-point type, concatenated with a `-`, then its code hex.
We implement these rules in the `#core/name-counter` module, as well as the `#core/name-range/ucd` and `#core/name-range/noncharacter` modules.
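A sketch of code-point labeling under these rules follows. It assumes the given code point has already been found to lack a character name (otherwise, for instance, SPACE would be mislabeled as reserved); the actual modules may differ:

```javascript
// A sketch of code-point labeling. Assumes the code point has no
// character name; the actual #core/name-counter module may differ.
function getCodePointLabel (codePoint) {
  const hex = codePoint.toString(16).toUpperCase().padStart(4, '0');
  if (codePoint <= 0x1F || codePoint === 0x7F)
    return `control-${hex}`;
  else if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
    return `surrogate-${hex}`;
  else if ((codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
      || (codePoint & 0xFFFE) === 0xFFFE) // …FFFE or …FFFF in any plane.
    return `noncharacter-${hex}`;
  else if ((codePoint >= 0xE000 && codePoint <= 0xF8FF)
      || codePoint >= 0xF0000) // Noncharacters were already excluded.
    return `private-use-${hex}`;
  else
    return `reserved-${hex}`;
}

getCodePointLabel(0xD800); // → 'surrogate-D800'
```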
The vast majority of scalars are currently unassigned. Nearly every scalar in planes `04`–`0D` is currently unassigned (the only exceptions are the two noncharacters at the end of each plane).

Of the relatively few scalars that have been assigned, many have only dynamically generated names. These include all noncharacters, all scalars for private-use characters, and nearly all scalars for ideographic and Hangul-syllable characters. Additionally, every single scalar in planes `02` and `03` (the SIP and TIP) has an ideograph with a dynamically generated name.
When we exclude scalars with dynamically generated names, every remaining scalar – i.e., every scalar with an explicitly defined name – is in the first two planes `00` and `01` (the BMP and SMP) or in the antepenultimate plane `0E` (the SSP). Of these, the large majority of characters belong to the BMP or SMP; the SSP has relatively few characters.
In other words, virtually every scalar character is in the BMP, SMP, SIP, and TIP, and virtually every scalar character with an explicitly defined name is in the BMP and SMP – with a large gap of scalars separating them from the few characters with explicitly defined names in the SSP.
We want a Node.js API that looks like this:
import UninameLibrary from 'uniname';
// Create a Uniname library object from a database file.
const response = await fetch('/uniname/database');
const database = await response.arrayBuffer();
const u = new UninameLibrary(database);
{
// Get the character that has a certain Unicode name, with fuzzy matching.
const character0 = u.get('Devanagari Letter E');
// Get multiple characters in a single string.
const character1 = u.get('Latin Small Letter O', 'Combining Breve');
}
{
const character = '\u0000';
// Get the preferred Unicode name: 'NULL'.
const name = u.getPreferredName(character);
// Get the strict value of the character Name property: undefined.
const strictName = u.getStrictName(character);
// Get all names of a certain character with their types:
// `[ [ 'NULL', 'control' ], [ 'NUL', 'abbreviation' ],
// [ 'control-0000', 'label' ] ]`.
const nameEntries = u.getNameEntries(character);
}
We can see what this API would look like with the previous real-world code.
From [email protected]/src/node.js:
const colorCode = '\u001B[3' + (c < 8 ? c : '8;5;' + c);
const colorCode = `${ u.get('escape') }[3${ c < 8 ? c : '8;5;' + c }`;
const colorCode = `\u[escape][3${ c < 8 ? c : '8;5;' + c }`;
From [email protected]/lib/js-yaml/loader.js:
(c === 0x4C/* L */) ? '\u2028' :
(c === 0x50/* P */) ? '\u2029' : '';
(c === 0x4C) ? u.get('line separator') :
(c === 0x50) ? u.get('paragraph separator') :
'';
From [email protected]/lib/output.js:
str = str.replace(/[\\\b\f\n\r\v\t\x22\x27\u2028\u2029\0\ufeff]/g,
function(s, i) {
switch (s) {
case '"': ++dq; return '"';
case "'": ++sq; return "'";
case "\\": return "\\\\";
case "\n": return "\\n";
case "\r": return "\\r";
case "\t": return "\\t";
case "\b": return "\\b";
case "\f": return "\\f";
case u.get('line tabulation'):
return options.ie8 ? "\\x0B" : "\\v";
case u.get('line separator'):
return u.get('backslash', 'line separator');
case u.get('paragraph separator'):
return u.get('backslash', 'paragraph separator');
case u.get('zero width no-break space'):
return u.get('backslash', 'zero width no-break space');
case "\0":
return /[0-9]/.test(str.charAt(i+1)) ? "\\x00" : "\\0";
}
return s;
});
str = str.replace(
  new RegExp(
    `[${
      u.get(
        "backslash", "backspace", "form feed", "line feed", "carriage return",
        "character tabulation", "vertical tabulation",
        "quotation mark", "apostrophe",
        "line separator", "paragraph separator", "null", "byte order mark",
      )
    }]`,
    "g",
  ),
function(s, i) {
switch (s) {
case u.get("quotation mark"): ++dq; return s;
case u.get("apostrophe"): ++sq; return s;
case u.get("backslash"): return s;
case u.get("line feed"): return s;
case u.get("carriage return"): return s;
case u.get("tab"): return s;
case u.get("backspace"): return s;
case u.get("backslash"): return s;
case u.get("form feed"): return s;
case u.get("vertical tabulation"): return options.ie8 ? "\\x0B" : s;
case u.get("line separator"): return "\\u2028";
case u.get("paragraph separator"): return "\\u2029";
case u.get("byte order mark"): return "\\ufeff";
case u.get("null"):
return /[0-9]/.test(str.charAt(i+1)) ? "\\x00" : "\\0";
}
return s;
});
From [email protected]:
while (/[\s\uFEFF\u00A0]/.test(str[tail - 1])) {
const rightTrimmableRegExp = new RegExp(`[\\s${
  u.get('zero width no-break space', 'no-break space')
}]`);
while (rightTrimmableRegExp.test(str[tail - 1])) {
From [email protected]/lib/languages/fix.js:
begin: /[^\u2401\u0001]+/,
end: /[\u2401\u0001]/,
begin: new RegExp(`[^${
  u.get('Symbol for Start of Heading')
}${
  u.get('Start of Heading')
}]+`),
end: new RegExp(`[${
  u.get('Symbol for Start of Heading')
}${
  u.get('Start of Heading')
}]`),
From [email protected]/src/diacritics.js:
{ base:'AA',letters:/[\uA732]/g },
{ base:'AE',letters:/[\u00C6\u01FC\u01E2]/g },
{ base:'AO',letters:/[\uA734]/g },
{ base:'AU',letters:/[\uA736]/g },
{ base:'AV',letters:/[\uA738\uA73A]/g },
{ base:'AY',letters:/[\uA73C]/g },
{ base:'B', letters:/[\u0042\u24B7\uFF22\u1E02\u1E04\u1E06\u0243\u0182\u0181]/g
},
{
base: 'AA',
letters: new RegExp(`[${
u.get(
'Latin Capital Letter AA',
)
}]`, 'g')
},
{
base: 'AE',
letters: new RegExp(`[${
u.get(
'Latin Capital Letter AE',
'Latin Capital Letter AE with Acute',
'Latin Capital Letter AE with Macron',
)
}]`, 'g')
},
{
base: 'AO',
letters: new RegExp(`[${
u.get(
'Latin Capital Letter AO',
)
}]`, 'g')
},
{
base: 'AU',
letters: new RegExp(`[${
u.get(
'Latin Capital Letter AU',
)
}]`, 'g')
},
{
base: 'AV',
letters: new RegExp(`[${
u.get(
'Latin Capital Letter AV',
'Latin Capital Letter AV with Horizontal Bar',
)
}]`, 'g')
},
{
base: 'AY',
letters: new RegExp(`[${
u.get(
'Latin Capital Letter AY',
)
}]`, 'g')
},
{
base: 'B',
letters: new RegExp(`[${
u.get(
'Latin Capital Letter B',
'Circled Latin Capital Letter B',
'Fullwidth Latin Capital Letter B',
'Latin Capital Letter B with Dot Above',
'Latin Capital Letter B with Dot Below',
'Latin Capital Letter B with Line Below',
'Latin Capital Letter B with Stroke',
'Latin Capital Letter B with Topbar',
'Latin Capital Letter B with Hook',
)
}]`, 'g')
},
From [email protected]/src/unicodeSymbols.js:
"\u014f": "\u006f\u0306", // ŏ = \u{o}
[u.get('Latin Small Letter O with Breve')]:
  u.get('Latin Small Letter O', 'Combining Breve'),
From [email protected]/test/test.js:
'\u{1F937}\u{1F3FB}\u200D',
'\u{1F937}\u{1F3FC}\u200D',
'\u{1F937}\u{1F3FD}\u200D',
'\u{1F937}\u{1F3FE}\u200D',
'\u{1F937}\u{1F3FF}\u200D',
'\u{1F937}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FB}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FC}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FD}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FE}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FF}\u200D\u2640\uFE0F'
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-1-2', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-3', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-4', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-5', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-6', 'Zero Width Joiner'),
u.get('Shrug', 'Zero Width Joiner', 'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-1-2', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-3', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-4', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-5', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-6', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
From [email protected]/lib/pattern.js:
['^\\u{20bb7}$', 'u', '𠮷', true, 'test'],
[`^${ u.get('CJK Unified Ideograph-20BB7') }$`, 'u', '𠮷', true, 'test'],
Our API (specifically the getNameEntries method) uses eight name types:
- `null`: Strict character names – i.e., property values of the Name property.
- `'sequence'`: Names of named character sequences.
- `'label'`: Code-point labels.
- `'correction'`: Corrections for serious problems in strict character names.
- `'control'`: Commonly occurring names for control codes.
- `'alternate'`: Widely used alternate names for format characters.
- `'figment'`: Names of control codes that were documented but never actually approved in any standard.
- `'abbreviation'`: Common abbreviations for many characters.
Our goals include (ordered by priority):
- Name→value lookup: We want to retrieve the character (or UTF-16 surrogate) corresponding to a given Unicode name.
- Value→names access: We want to retrieve all of the Unicode names that correspond to a given character (or UTF-16 surrogate).
- Named sequences, name aliases, and code-point labels: We want to be able to retrieve data about not only scalar characters’ strict character names but also named character sequences, name aliases, and code-point labels.
- Fuzzy matching: We want our name→value lookup to support fuzzy name matching as defined in UAX #44: ignoring spaces, underscores, and most medial hyphens.
- Memory efficiency: The data structure needs to occupy as little memory as possible while it is being used. This is especially important for mobile devices that have constrained memory. We do not want a compressed data structure that needs to be entirely decompressed from storage into memory. Ideally, the data structure would be a memory-mapped file.
- Storage efficiency: The data structure needs to occupy as little storage as possible. This is especially important for mobile devices that have constrained network bandwidth.
- Simplicity: We want to keep our code and data simple. In particular, we want to avoid dependencies on large third-party libraries.
- Time efficiency: This is least important. We want retrieval and random access to be fast (in both the average and worst cases), although we care less about speed than about storage and memory.

We also care more about the speed of name→value lookup than about the speed of value→names access. Although both are goals, the former is more frequently useful to developers. The latter is useful for end-user-facing internationalized applications and for program transpilation.
Non-goals include:
- No dynamic updates: The data structure needs to be static. It would need to change only every several months, whenever the Unicode Consortium updates the Standard.
- No fast compilation: The building and compression of the data structure happen long before the structure is actually used, and so we will allow these to occur more slowly than data decompression and retrieval.
- No substring matching: We do not need to match names partially, including with name prefixes or with name suffixes.
- No file-format compatibility: The data structure’s file format does not need to be backwards or forwards compatible. It can have one version number that must exactly match the version number expected by the application.
In order to ensure that our code’s behavior remains the same as we refactor it, we will use test suites. One of the test suites, the test/node/complete.mjs module, tests every single Unicode name and named Unicode value in the database.
We also measure the temporal performance of our database, using a benchmark suite in a src/node/benchmark.mjs module. This module first creates two query sets: a specified number of randomly chosen Unicode names and a specified number of randomly chosen characters. It then calls uniname.get on all of those names and uniname.getNameEntries on all of those characters. It also applies uniname.get to a nonexistent name and uniname.getNameEntries to a nonexistent character – also 1,000 times each.
The first version of our database is a JSON file that weighs 2,310.4 kB (154.4 kB when compressed with Brotli v1.0.9).
The database file structure looks like this (whitespace added):
[{"scalar":0,"name":"NULL","nameType":"control"},
{"scalar":0,"name":"NUL","nameType":"abbreviation"},
{"scalar":1,"name":"START OF HEADING","nameType":"control"},
{"scalar":1,"name":"SOH","nameType":"abbreviation"},
{"scalar":2,"name":"START OF TEXT","nameType":"control"},
{"scalar":2,"name":"STX","nameType":"abbreviation"},
{"scalar":3,"name":"END OF TEXT","nameType":"control"},
{"scalar":3,"name":"ETX","nameType":"abbreviation"},
{"scalar":4,"name":"END OF TRANSMISSION","nameType":"control"},
{"scalar":4,"name":"EOT","nameType":"abbreviation"},
{"scalar":5,"name":"ENQUIRY","nameType":"control"},
{"scalar":5,"name":"ENQ","nameType":"abbreviation"},
{"scalar":6,"name":"ACKNOWLEDGE","nameType":"control"},
{"scalar":6,"name":"ACK","nameType":"abbreviation"},
{"scalar":7,"name":"ALERT","nameType":"control"},
{"scalar":7,"name":"BEL","nameType":"abbreviation"},
{"scalar":8,"name":"BACKSPACE","nameType":"control"},
{"scalar":8,"name":"BS","nameType":"abbreviation"},
{"scalar":9,"name":"CHARACTER TABULATION","nameType":"control"},
…and ending with this:
{"initialHeadPoint":917998,"nameStem":"VS255","nameType":"abbreviation"},
{"initialHeadPoint":917998,"nameStem":"VARIATION SELECTOR-255"},
{"initialHeadPoint":917999,"nameStem":"VS256","nameType":"abbreviation"},
{"initialHeadPoint":917999,"nameStem":"VARIATION SELECTOR-256"},
{"initialHeadPoint":983038,"length":2,"nameStem":"NONCHARACTER",
"nameCounterType":"hyphenHex","nameType":"label"},
{"initialHeadPoint":983040,"length":65534,
"nameCounterType":"hyphenHex","nameStem":"PRIVATE-USE","nameType":"label"},
{"initialHeadPoint":1048574,"length":2,"nameStem":"NONCHARACTER",
"nameCounterType":"hyphenHex","nameType":"label"},
{"initialHeadPoint":1048576,"length":65534,
"nameCounterType":"hyphenHex","nameStem":"PRIVATE-USE","nameType":"label"},
{"initialHeadPoint":1114110,"length":2,"nameStem":"NONCHARACTER",
"nameCounterType":"hyphenHex","nameType":"label"}
]
As of Unicode 14.0, these JSON objects correspond to 35,558 name ranges, which cover 285,275 Unicode names and 284,805 named Unicode values.
Our time benchmark (10,000 operations each) gives these results for Unicode 14.0, by Node v18.3.0, on a MacBook Air (M1, 2020) with 16 GB of memory:
Operation | 2% | 25% | 50% | 75% | 98% |
---|---|---|---|---|---|
Valid N → V | 10,243 µs | 20,020 µs | 25,122 µs | 26,467 µs | 32,408 µs |
Invalid N → no V | 25,973 µs | 28,988 µs | 29,211 µs | 31,090 µs | 33,781 µs |
Named V → N | 9,849 µs | 10,918 µs | 11,010 µs | 12,164 µs | 14,123 µs |
Unnamed V → no N | 10,839 µs | 12,194 µs | 12,576 µs | 13,688 µs | 16,892 µs |
When retrieving data, the JSON file must be loaded into memory, either in its entirety at once or sequentially as a stream. When Node v18.3.0 loads the entire JSON file, its memory heap usage increases by approximately 2,166 kB, which is about the same size as the file itself.
We already are using the concept of name ranges, but there are many singleton name ranges in UnicodeData.txt that may still be combined with their neighbors. For example:
- There are more than 1,000 CJK Compatibility characters named “CJK COMPATIBILITY IDEOGRAPH-F900”, “CJK COMPATIBILITY IDEOGRAPH-F901”, etc. that are individually named in UnicodeData.txt.
- There are 26 characters from U+1F130 Squared Latin Capital Letter A to U+1F149 Squared Latin Capital Letter Z, one for each English letter A–Z.
- There are 49 characters from U+1F031 Domino Tile Horizontal-00-00 to U+1F061 Domino Tile Horizontal-06-06, and they all start with “Domino Tile Horizontal-” followed by “00-00”, “00-01”, …, “00-06”, “01-00”, …, “06-06”. There are another 49 similar characters from U+1F063 Domino Tile Vertical-00-00 to U+1F093 Domino Tile Vertical-06-06.
- There are 767 characters from U+18801 Tangut Component-002 to U+18AFF Tangut Component-768, and they all start with “Tangut Component-” followed by a three-digit number between 002 and 768.
We save further space by consolidating these names’ singleton name ranges into multiplex name ranges.
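As a sketch of how a multiplex name range can be expanded back into individual names at lookup time – the range shape here (`initialHeadPoint`, `length`, `nameStem`, `nameCounterType`) mirrors the database entries shown nearby, but the generator itself is a hypothetical illustration:

```javascript
// Hypothetical sketch: expanding a multiplex name range into its
// individual names. A 'hyphenHex' counter appends each head point's
// uppercase hex to the shared name stem.
function* namesInRange(range) {
  for (let i = 0; i < range.length; i++) {
    const headPoint = range.initialHeadPoint + i;
    if (range.nameCounterType === 'hyphenHex') {
      // E.g., stem 'CJK COMPATIBILITY IDEOGRAPH' at 0xF900 → '…-F900'.
      yield `${range.nameStem}-${headPoint.toString(16).toUpperCase()}`;
    } else {
      yield range.nameStem;
    }
  }
}

// One multiplex range stands in for many singleton ranges:
const cjkRange = {
  initialHeadPoint: 0xF900,
  length: 3, // Truncated for illustration.
  nameStem: 'CJK COMPATIBILITY IDEOGRAPH',
  nameCounterType: 'hyphenHex',
};
console.log([...namesInRange(cjkRange)]);
// [ 'CJK COMPATIBILITY IDEOGRAPH-F900',
//   'CJK COMPATIBILITY IDEOGRAPH-F901',
//   'CJK COMPATIBILITY IDEOGRAPH-F902' ]
```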
Our next database is a sequence of lines. In this case, both name→value lookup and value→names access require linearly checking each line for matches.
The database file structure now looks like this (line breaks are significant):
0:NULL:CONTROL
0:NUL:ABBREVIATION
1:START OF HEADING:CONTROL
1:SOH:ABBREVIATION
2:START OF TEXT:CONTROL
2:STX:ABBREVIATION
3:END OF TEXT:CONTROL
3:ETX:ABBREVIATION
4:END OF TRANSMISSION:CONTROL
4:EOT:ABBREVIATION
5:ENQUIRY:CONTROL
…then later:
22:QUOTATION MARK
23:NUMBER SIGN
23:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3
24:DOLLAR SIGN
25:PERCENT SIGN
26:AMPERSAND
27:APOSTROPHE
…and so on.
Each line in the file looks like: `‹scalarHex›:‹name›‹nameInfo›`.

- The ‹scalarHex› is a single scalar hex, stripped of leading `0`s. (For named character sequences, this is the hex of the first scalar: the character’s head point.)
- The ‹name› is a Unicode name, in all caps, using spaces and `-`.
- The ‹nameInfo› is one of the following:
  - For strict Name property values: an empty string.
  - For name aliases: `:CORRECTION` for correction, `:CONTROL` for control, `:ALTERNATE` for alternate, `:FIGMENT` for figment, or `:ABBREVIATION` for abbreviation.
  - For named character sequences: `:SEQUENCE` followed by the remaining scalar hexes that follow the first scalar. (These are called the character’s tail scalars.) Each hex is also preceded by `:` and is stripped of leading `0`s.

For example:

- The line for U+0021 Exclamation Mark is `21:EXCLAMATION MARK`.
- The two lines for U+0000’s two aliases NULL (a control alias) and NUL (an abbreviation alias) are `0:NULL:CONTROL` and `0:NUL:ABBREVIATION`.
- The line for the named character sequence U+0023 U+FE0F U+20E3 Keycap Number Sign is `23:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3`.
Scalars in the Basic Multilingual Plane (the BMP: `0000`–`FFFF`) use one, two, three, or four hexadecimal digits, and scalars in the other planes (`10000`–`10FFFF`) use five or six hexadecimal digits. Yet the majority of entries in the database deal with contiguous (or at least mutually close) blocks of scalars.
All of those poorly compressible hexadecimal codes can be converted into very
compressible runs of 1
increments (for contiguous scalars), with occasional
larger increments. This is an example of delta encoding.
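A minimal sketch of this delta encoding (the function names here are illustrative, not from the actual codebase):

```javascript
// Sketch of delta encoding: absolute scalars become deltas from each
// previous scalar, so that contiguous scalars become highly
// compressible runs of 1s.
function toDeltas(scalars) {
  let previous = 0;
  return scalars.map(scalar => {
    const delta = scalar - previous;
    previous = scalar;
    return delta;
  });
}

// Decoding reverses the process with a running sum.
function fromDeltas(deltas) {
  let scalar = 0;
  return deltas.map(delta => (scalar += delta));
}

const scalars = [0x20, 0x21, 0x22, 0x23, 0x23, 0x24];
console.log(toDeltas(scalars)); // [ 32, 1, 1, 1, 0, 1 ]
console.log(fromDeltas(toDeltas(scalars))); // Round-trips to the original.
```

Note how a repeated scalar (for a second name of the same character) becomes a `0`, just as in the database excerpt above.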
The database is now a plain-text string weighing 962.9 kB (104.6 kB when compressed with Brotli v1.0.9). It looks like:
0:NULL:CONTROL
0:NUL:ABBREVIATION
1:START OF HEADING:CONTROL
0:SOH:ABBREVIATION
1:START OF TEXT:CONTROL
0:STX:ABBREVIATION
1:END OF TEXT:CONTROL
0:ETX:ABBREVIATION
1:END OF TRANSMISSION:CONTROL
0:EOT:ABBREVIATION
1:ENQUIRY:CONTROL
…then later:
1:QUOTATION MARK
1:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3
0:NUMBER SIGN
1:DOLLAR SIGN
1:PERCENT SIGN
1:AMPERSAND
1:APOSTROPHE
…and so on.
Each line in the file looks like: `‹scalarDeltaHex›:‹name›‹nameInfo›`.

- The ‹scalarDeltaHex› is a single hex of the scalar delta from the previous line’s scalar. This will usually be `1` (one scalar after the previous line’s) or `0` (the same scalar as the previous line’s). (For named character sequences, this is the delta of the first scalar’s hex. Similarly, if the previous line was for a named character sequence, then the delta is relative to that sequence’s first scalar.)
- The ‹name› is a Unicode name, in all caps, using spaces and `-`.
- The ‹nameInfo› is one of the following:
  - For strict Name property values: an empty string.
  - For name aliases: `:CORRECTION` for correction, `:CONTROL` for control, `:ALTERNATE` for alternate, `:FIGMENT` for figment, or `:ABBREVIATION` for abbreviation.
  - For named character sequences: `:SEQUENCE` followed by the remaining scalar hexes that follow the first scalar. Each hex is also preceded by `:` and is stripped of leading `0`s.

For example:

- The line for U+0021 Exclamation Mark is `1:EXCLAMATION MARK`, because its preceding line is for U+0020, and 0021 − 0020 = 1.
- The two lines for U+0004’s two aliases END OF TRANSMISSION (a control alias) and EOT (an abbreviation alias) are `1:END OF TRANSMISSION:CONTROL` and `0:EOT:ABBREVIATION`. The first line’s ‹scalarDeltaHex› is `1` because its previous line’s scalar is 0003, and 0004 − 0003 = 1. The second line’s ‹scalarDeltaHex› is `0` because both it and its previous line have the scalar U+0004, and 0004 − 0004 = 0.
- The line for the named character sequence U+0023 U+FE0F U+20E3 Keycap Number Sign is `1:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3`, because its previous line’s scalar is 0022, and 0023 − 0022 = 1.
Unfortunately, a sequence of text lines is not very efficient in time. Name→value lookup and value→names access must linearly scan the database string, checking each of its lines for a match. In the worst cases (`VARIATION SELECTOR-256` and nonexistent Unicode names), our code may have to scan the entire database before determining the answer. We want to be able to perform random access on the file. Brotli does not solve this problem; it can only decompress the file linearly, and it is therefore useful here mostly for transferring the database across networks.
We therefore will adopt techniques from succinct data structures that allow random access directly into compressed data. We will eventually encode a custom binary file format that allows for such random access.
As of this writing, the best such data structure for our purposes (bidirectional string dictionaries) seems to be IBiSRP+DAC-L-TT, which is described in Brisaboa et al. (2019). IBiSRP+DAC-L-TT combines binary search, front coding, Re-Pair compression, and several other compression techniques. (In actuality, we will be modifying IBiSRP+DAC-L-TT to accommodate the fact that we need to map string keys to scalars, rather than to integer positions in the dictionary.)
We want to eventually get the database’s size close to the 97 kB of version 1.0’s Brotli-compressed database – or at least close to GNU Uniname’s 350 kB – while speeding up name→value lookup and (to a lesser extent) value→names access.
Our first step toward compressing the database into a memory-mappable, randomly accessible, concise file is to convert our data structure into an indexed text table (TT), sorted lexicographically by fuzzily folded name. The text table will allow us to randomly look up characters without having to linearly read the database.
For clarity, we will continue to use UTF-8/ASCII plain text to format the database (until later, when we will switch it to a binary format). Now the database is a one-line plain-text file weighing 1,198.2 kB (232.5 kB when compressed with Brotli v1.0.9). This size is significantly worse than the previous version’s 962.9 kB but, as we will see, we have gained much in speed.
Block | Type | Per Range | Total |
---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.3 kB |
TT texts (values) | ASCII str. | 25.7 B | 862.1 kB |
TT texts (separations) | Hex array | 5.0 B | 167.9 kB |
TT scalar array | Hex array | 5.0 B | 167.9 kB |
Total | 35.7 B | 1,198.2 kB |
The directory is an object that maps string keys to integer values, stringified with JSON and then terminated by the control character U+0002 Start of Text (which is forbidden inside the JSON format itself).
Most of the integer values are pointers: each indicates the position of a database block. The pointers are relative to the end of the directory, just after the Start of Text control character.
The directory looks like the following. (The actual directory text has no extra whitespace and ends with the control character U+0002 Start of Text. Here, for illustration purposes, extra whitespace is shown, and the Start of Text control character is displayed as ␂.)
{
"nameTableDirectory": {
"numOfEntries": 33578,
"textSequencePointer": 0,
"textSequenceDirectory": {
"valuesPointer": 0,
"separationArrayPointer": 862095,
"separationArrayDirectory": {
"numOfHexesPerEntry": 5
},
"length": 1029987
},
"scalarArrayPointer": 1029987,
"scalarArrayDirectory": {
"numOfHexesPerEntry": 5
}
}
}
␂
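A sketch of how an application might split such a file into its directory and its data blocks (the Start of Text control character U+0002 is used as the terminator; `parseDatabase` is a hypothetical name, not the library’s actual API):

```javascript
// Hypothetical sketch: splitting the database file into its JSON
// directory and the data blocks that follow it. All block pointers are
// relative to just after the directory's terminating control character.
function parseDatabase(fileText) {
  const boundary = fileText.indexOf('\u0002'); // Start of Text terminator.
  const directory = JSON.parse(fileText.slice(0, boundary));
  const blocks = fileText.slice(boundary + 1);
  return { directory, blocks };
}

const exampleFile = '{"nameTableDirectory":{"numOfEntries":33578}}\u0002…';
const { directory } = parseDatabase(exampleFile);
console.log(directory.nameTableDirectory.numOfEntries); // 33578
```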
The text table has 33,578 text entries (32,647 of which are for strict
character names and 931 of which are for other Unicode names). There is one
text entry for each explicitly declared Unicode name, which all correspond to
characters made of scalars (as opposed to UTF-16 surrogates or to
noncharacters). This means that some scalars will have multiple text entries
somewhere in the text table, such as two entries for U+0000
’s NULL
and its
alias NUL
.
The text entries are lexicographically ordered by fuzzily folded name, from the
entry for ABACUS
(at index 0) to the entry for ZWSP
(at index 33,577). At
the middle of the table is entry #16,789. (Because text entries are sorted by
fuzzily folded name, the entry for NORTHEAST-POINTING AIRPLANE
lexicographically precedes NORTH WEST ARROW
, because the former name’s
fuzzily folded version is NORTHEASTPOINTINGAIRPLANE
, which precedes
NORTHWESTARROW
.)
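This fuzzy folding can be sketched roughly as follows – a simplified reading of UAX #44’s loose-matching rule, not a full implementation (for instance, it omits the special case for the hyphen in HANGUL JUNGSEONG O-E):

```javascript
// Simplified sketch of UAX #44 loose name matching (rule UAX44-LM2):
// uppercase the name, remove medial hyphens (hyphens between two
// alphanumeric characters), then remove spaces and underscores.
function fuzzilyFold(name) {
  return name
    .toUpperCase()
    .replace(/(?<=[0-9A-Z])-(?=[0-9A-Z])/g, '') // Medial hyphens only.
    .replace(/[ _]/g, '');
}

console.log(fuzzilyFold('Northeast-Pointing Airplane'));
// 'NORTHEASTPOINTINGAIRPLANE'
console.log(fuzzilyFold('North West Arrow'));
// 'NORTHWESTARROW'
```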
The text table is made of two column data blocks: a text sequence and a scalar array. The TT text sequence and the TT scalar array each represent a “column” of the text table, each in the same shared name-lexicographic order.
An array here is a series of fixed-length values; the length of each value is declared in a directory object. (These “arrays” are flat blocks of data, not JavaScript arrays.)
A sequence, in turn, is made of a values block (the concatenated variable-length string values) and a separations array (fixed-length unsigned integers that denote the positions of the separations between the concatenated value strings).
The text sequence is a block of variable-length texts. The text values are concatenated together in a values block, without any separator (separations between texts are separately encoded in a following block). There is one text for each Unicode name, and the texts are lexicographically ordered by their Unicode names. Each text looks like: `‹name›‹nameInfo›`.

- The ‹name› is a Unicode name, in all caps, using spaces and `-`.
- The ‹nameInfo› is one of the following:
  - For strict Name property values: an empty string.
  - For name aliases: `:CORRECTION` for correction, `:CONTROL` for control, `:ALTERNATE` for alternate, `:FIGMENT` for figment, or `:ABBREVIATION` for abbreviation.
  - For named character sequences: `:SEQUENCE` followed by the remaining scalar hexes that follow the first scalar. Each hex is also preceded by `:` and is stripped of leading `0`s.

For example:

- The text for U+0021 Exclamation Mark is `EXCLAMATION MARK`.
- The two texts for U+0000’s two aliases NULL (a control alias) and NUL (an abbreviation alias) are `NULL:CONTROL` and `NUL:ABBREVIATION`.
- The text for the named character sequence U+0023 U+FE0F U+20E3 Keycap Number Sign is `KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3`.
The text value sequence (with •
inserted between each text for illustration
purposes; line breaks are also insignificant) starts with:
ABACUS•ACCORDION•ACCOUNT OF•AC CURRENT•ACK:ABBREVIATION•ACKNOWLEDGE:CONTROL•
ACTIVATE ARABIC FORM SHAPING•ACTIVATE SYMMETRIC SWAPPING•ACUTE ACCENT•
ACUTE ANGLE•ADDRESSED TO THE SUBJECT•ADHESIVE BANDAGE•ADI SHAKTI•
ADLAM ALIF LENGTHENER•ADLAM CAPITAL LETTER ALIF•ADLAM CAPITAL LETTER BA•
ADLAM CAPITAL LETTER BHE•ADLAM CAPITAL LETTER CHI•ADLAM CAPITAL LETTER DAALI•
ADLAM CAPITAL LETTER DHA•ADLAM CAPITAL LETTER E•ADLAM CAPITAL LETTER FA•
ADLAM CAPITAL LETTER GA•ADLAM CAPITAL LETTER GBE•ADLAM CAPITAL LETTER HA•
ADLAM CAPITAL LETTER I•ADLAM CAPITAL LETTER JIIM•ADLAM CAPITAL LETTER KAF•
ADLAM CAPITAL LETTER KHA•ADLAM CAPITAL LETTER KPO•ADLAM CAPITAL LETTER LAAM•
ADLAM CAPITAL LETTER MIIM
…and ends with:
ZANABAZAR SQUARE VOWEL SIGN REVERSED I•ZANABAZAR SQUARE VOWEL SIGN U•
ZANABAZAR SQUARE VOWEL SIGN UE•ZEBRA FACE•ZERO WIDTH JOINER•
ZERO WIDTH NO-BREAK SPACE•ZERO WIDTH NON-JOINER•ZERO WIDTH SPACE•ZEUS•
ZIPPER-MOUTH FACE•Z NOTATION BAG MEMBERSHIP•Z NOTATION DOMAIN ANTIRESTRICTION•
Z NOTATION LEFT BINDING BRACKET•Z NOTATION LEFT IMAGE BRACKET•
Z NOTATION RANGE ANTIRESTRICTION•Z NOTATION RELATIONAL COMPOSITION•
Z NOTATION RIGHT BINDING BRACKET•Z NOTATION RIGHT IMAGE BRACKET•
Z NOTATION SCHEMA COMPOSITION•Z NOTATION SCHEMA PIPING•
Z NOTATION SCHEMA PROJECTION•Z NOTATION SPOT•Z NOTATION TYPE COLON•
ZOMBIE•ZWJ:ABBREVIATION•ZWNBSP:ABBREVIATION•ZWNJ:ABBREVIATION•
ZWSP:ABBREVIATION
The TT separation array is a series of fixed-length hexes, which encodes an array of integers. There is one integer for each Unicode name, in the same lexicographic order as the texts. The integers are pointers: each indicates the position of the separation that follows its text within the text-sequence values block. The pointers are relative to the beginning of the values block.
Each integer is five hex digits wide: hence
"separationArrayDirectory":{"numOfHexesPerEntry":5}
in the directory.
The text separations (with •
inserted between each text for illustration
purposes – line breaks are also insignificant) start with:
00006•0000F•00019•00023•00033•00046•00062•0007D•00089•00094•000AC•000BC•
000C6•000DB•000F4•0010B•00123•0013B•00155•0016D•00183•0019A•001B1•001C9
…and end with:
D05A5•D05BE•D05D3•D05E3•D05E7•D05F8•D0611•D0632•D0651•D066E•D068E•D06AF•
D06CF•D06ED•D070A•D0722•D073E•D074D•D0762•D0768•D0778•D078B•D079C•D07AD
Lastly, the TT scalar array is also a series of fixed-length hexes, which encodes an array of scalars. There is one scalar for each character name, in the same lexicographic order as the texts. (For named character sequences, the scalars are the sequences’ first scalars.)
Each integer is five hex digits wide: hence
"scalarArrayDirectory":{"numOfHexesPerEntry":5}
in the directory.
The scalars (with •
inserted between each text for illustration purposes –
line breaks are also insignificant) start with:
1F9EE•1FA97•02100•023E6•00006•00006•0206D•0206B•000B4•0299F•02101•1FA79•
0262C•1E944•1E900•1E904•1E907•1E915•1E901•1E90D•1E909•1E90A•1E918•1E91E
…and end with:
0200D•0FEFF•0200C•0200B•02BE2•1F910•022FF•02A64•02989•02987•02A65•02A3E•
0298A•02988•02A1F•02A20•02A21•02981•02982•1F9DF•0200D•0FEFF•0200C•0200B
The procedure for name→value lookup with an input name is:
- Perform a binary search in the TT text sequence for a text with the input name. This results in a text entry index (the index of the matching text) and an array of tail scalars (which are embedded in the text). (If a matching text is not present in the text sequence, then there is no character with the given name.)
- Given the text entry index, get its value in the TT scalar array. This results in the character’s head point.
- Combine the head point and tail scalars into a single character. This is the final result.
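The lookup procedure above can be sketched as follows. The `table` object and its helpers (`getTextEntry`, `getHeadPoint`) are hypothetical stand-ins for reads against the TT blocks, and the fold function is deliberately simplified:

```javascript
// Hypothetical sketch of name→value lookup: a binary search over
// text-table entries sorted by fuzzily folded name.
const fuzzilyFold = name =>
  name.toUpperCase().replace(/[ _-]/g, ''); // Simplified; see UAX #44.

function lookupByName(table, inputName) {
  const foldedInput = fuzzilyFold(inputName);
  let low = 0;
  let high = table.numOfEntries - 1;
  while (low <= high) {
    const middle = Math.floor((low + high) / 2);
    const { foldedName, tailScalars } = table.getTextEntry(middle);
    if (foldedInput < foldedName) high = middle - 1;
    else if (foldedInput > foldedName) low = middle + 1;
    else {
      // Combine the head point with any tail scalars into one string.
      const headPoint = table.getHeadPoint(middle);
      return String.fromCodePoint(headPoint, ...tailScalars);
    }
  }
  return undefined; // No character has the given name.
}

// A tiny three-entry table, already sorted by folded name:
const entries = [
  { foldedName: 'ABACUS', tailScalars: [], headPoint: 0x1F9EE },
  { foldedName: 'KEYCAPNUMBERSIGN', tailScalars: [0xFE0F, 0x20E3],
    headPoint: 0x23 },
  { foldedName: 'ZWSP', tailScalars: [], headPoint: 0x200B },
];
const table = {
  numOfEntries: entries.length,
  getTextEntry: index => entries[index],
  getHeadPoint: index => entries[index].headPoint,
};
console.log(lookupByName(table, 'Abacus')); // '🧮'
```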
The procedure for value→names access with an input value is:
- Get the head point and the tail scalars of the input character.
- Perform a linear search for the head point in the TT scalar array. This results in a set of text entry indexes.
- For each text-entry text, remove the text from the set if the text’s tail scalars do not match the input character’s tail scalars.
- Use the text-entry texts to create a sorted array of text entries. This is the final result.
The average time of name→value lookup with valid names has improved by 1100 times compared to the previous version’s, and there is much less variation in time. This reflects how much more efficient binary search is compared to sequential scanning. Name→value lookup with invalid names has similarly improved.
Value→names access still requires a linear search over the entire database. However, even value→names access has greatly improved: by 86 times. This might be because scalars are now grouped together in a fixed-length array. Iterating through such a fixed-length array is probably much faster than scanning through the long, variable-length entry lines of the previous version.
We could improve value→names access’s performance even more, but because we are deprioritizing the time performance of value→names access (see § API and Goals), we will leave its performance alone for now; it is acceptable to search the entire database for names matching a character. In § ???, we will create a separate point table that may be efficiently randomly accessed.
Next, we will adopt a technique similar to IBiSRP+DAC-L-TT: we compress the TT text sequence’s values with front coding.
The text table (because it is sorted lexicographically by fuzzily folded name) may be considered a binary search tree, defined by the same binary-search algorithm that we have been using to look up characters in the text table. There are ⌈log2(33,578 + 1)⌉ (or 16) levels in this binary tree.
The text table’s root entry is the first entry in the text table that we check during the binary search: i.e., the median of the binary tree, which is entry number ⌊33,578 ∕ 2⌋ (or #16,789). The two child entries of that root entry are entries number ⌊(16,789 − 1) ∕ 2⌋ (or #8,394) and number ⌊(16,789 − 1) × 3 ∕ 2⌋ (or #25,182). This recursively goes on: each parent entry has two child entries. For instance, entry #1 (ACCORDION) is the parent of two child entries: entry #0 (ABACUS) and entry #2 (ACCOUNT OF).
(Note: We are indexing table entries starting from #0.)
Whenever a child entry in the text table has a text that shares a prefix with its parent entry’s text, we could front code that shared prefix.
For instance, entry #0’s text (ABACUS) shares an A with its parent entry’s text (ACCORDION), so it can be front coded as BACUS (along with the length of the shared prefix: 1). The shared prefix is omitted from the child entry’s text; it is implicit in its parent’s text (and the length of the omitted prefix).
Likewise, entry #2’s text (ACCOUNT OF) shares an ACCO with its parent entry’s text (ACCORDION), so it can be front coded as UNT OF (along with the length of the shared prefix: 4).
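Front coding itself is mechanical; a minimal sketch (the function names are illustrative, not from the actual codebase):

```javascript
// Sketch of front coding a child entry's text against its parent's
// text: store only the shared-prefix length plus the child's remaining
// suffix.
function frontCode(parentText, childText) {
  let prefixLength = 0;
  while (
    prefixLength < parentText.length &&
    prefixLength < childText.length &&
    parentText[prefixLength] === childText[prefixLength]
  ) prefixLength++;
  return { prefixLength, suffix: childText.slice(prefixLength) };
}

// Decoding restores the child's full text from its parent's text.
function frontDecode(parentText, { prefixLength, suffix }) {
  return parentText.slice(0, prefixLength) + suffix;
}

console.log(frontCode('ACCORDION', 'ABACUS'));
// { prefixLength: 1, suffix: 'BACUS' }
console.log(frontDecode('ACCORDION', { prefixLength: 1, suffix: 'BACUS' }));
// 'ABACUS'
```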
This is similar to but different from IBiSRP+DAC-L-TT. The latter algorithm inverts the front-coding relationship: at each step of the binary search, the minimum entry of the binary search’s current interval is the “parent”, and the interval’s median entry is its “child”. We are changing this from the original to accommodate the fact that we need to map string keys to scalars, rather than to integer positions in the dictionary; this change will allow us to compress scalars more.
Now the database weighs 651.3 kB (167.1 kB when compressed with Brotli v1.0.9). This size is a dramatic improvement (being 45.6% smaller) over the previous version’s 1,198.2 kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.3 kB | 0.0 kB |
TT suffix texts (val.s) | ASCII str. | 7.4 B | 247.9 kB | −614.2 kB |
TT suffix texts (sep.s) | Hex array | 5.0 B | 167.9 kB | 0.0 kB |
TT prefix length array | Hex array | 2.0 B | 67.2 kB | +67.2 kB |
TT scalar array | Hex array | 5.0 B | 167.9 kB | 0.0 kB |
Total | | 19.4 B | 651.3 kB | −546.9 kB |
The directory now looks like this:
{
"nameTableDirectory": {
"numOfEntries": 33578,
"textSequencePointer": 0,
"textSequenceDirectory": {
"valuesPointer": 0,
"separationArrayPointer": 247937,
"separationArrayDirectory": {
"numOfHexesPerEntry": 5
},
"length": 415829
},
"namePrefixLengthArrayPointer": 415829,
"namePrefixLengthArrayDirectory": {
"numOfHexesPerEntry": 2
},
"scalarArrayPointer": 482987,
"scalarArrayDirectory": {
"numOfHexesPerEntry": 5
}
}
}
␂
The TT text sequence (with • inserted between each text for illustration purposes – line breaks are also insignificant) formerly started with this:
ABACUS•ACCORDION•ACCOUNT OF•AC CURRENT•ACK:ABBREVIATION•ACKNOWLEDGE:CONTROL•
ACTIVATE ARABIC FORM SHAPING•ACTIVATE SYMMETRIC SWAPPING•ACUTE ACCENT•
ACUTE ANGLE•ADDRESSED TO THE SUBJECT•ADHESIVE BANDAGE•ADI SHAKTI•
ADLAM ALIF LENGTHENER•ADLAM CAPITAL LETTER ALIF•ADLAM CAPITAL LETTER BA•
ADLAM CAPITAL LETTER BHE•ADLAM CAPITAL LETTER CHI•ADLAM CAPITAL LETTER DAALI•
ADLAM CAPITAL LETTER DHA•ADLAM CAPITAL LETTER E•ADLAM CAPITAL LETTER FA•
ADLAM CAPITAL LETTER GA•ADLAM CAPITAL LETTER GBE•ADLAM CAPITAL LETTER HA•
ADLAM CAPITAL LETTER I•ADLAM CAPITAL LETTER JIIM•ADLAM CAPITAL LETTER KAF•
ADLAM CAPITAL LETTER KHA•ADLAM CAPITAL LETTER KPO•ADLAM CAPITAL LETTER LAAM•
ADLAM CAPITAL LETTER MIIM
…but now it is a TT suffix sequence that starts with this (for instance, BACUS corresponds to ABACUS and CCORDION to ACCORDION):
BACUS•CCORDION•COUNT OF• CURRENT•K:ABBREVIATION•NOWLEDGE:CONTROL•
CTIVATE ARABIC FORM SHAPING•TIVATE SYMMETRIC SWAPPING•CUTE ACCENT•
NGLE•DDRESSED TO THE SUBJECT•HESIVE BANDAGE•I SHAKTI•
LAM ALIF LENGTHENER•ALIF•A•
BHE•CHI•DAALI•
DHA•E•FA•
GA•GBE•HA•
I•JIIM•AF•
KHA•PO•CAPITAL LETTER LAAM•
MIIM
The sequence also formerly ended with this:
ZANABAZAR SQUARE VOWEL SIGN REVERSED I•ZANABAZAR SQUARE VOWEL SIGN U•
ZANABAZAR SQUARE VOWEL SIGN UE•ZEBRA FACE•ZERO WIDTH JOINER•
ZERO WIDTH NO-BREAK SPACE•ZERO WIDTH NON-JOINER•ZERO WIDTH SPACE•ZEUS•
ZIPPER-MOUTH FACE•Z NOTATION BAG MEMBERSHIP•Z NOTATION DOMAIN ANTIRESTRICTION•
Z NOTATION LEFT BINDING BRACKET•Z NOTATION LEFT IMAGE BRACKET•
Z NOTATION RANGE ANTIRESTRICTION•Z NOTATION RELATIONAL COMPOSITION•
Z NOTATION RIGHT BINDING BRACKET•Z NOTATION RIGHT IMAGE BRACKET•
Z NOTATION SCHEMA COMPOSITION•Z NOTATION SCHEMA PIPING•
Z NOTATION SCHEMA PROJECTION•Z NOTATION SPOT•Z NOTATION TYPE COLON•
ZOMBIE•ZWJ:ABBREVIATION•ZWNBSP:ABBREVIATION•ZWNJ:ABBREVIATION•
ZWSP:ABBREVIATION
…but now ends like this:
ANABAZAR SQUARE VOWEL SIGN REVERSED I••
UE•EBRA FACE•ERO WIDTH JOINER•
-BREAK SPACE•RO WIDTH NON-JOINER•SPACE•US•
IPPER-MOUTH FACE• NOTATION BAG MEMBERSHIP•DOMAIN ANTIRESTRICTION•
NOTATION LEFT BINDING BRACKET•LEFT IMAGE BRACKET••
ANGE ANTIRESTRICTION•ELATIONAL COMPOSITION•
RIGHT BINDING BRACKET•RIGHT IMAGE BRACKET•
SCHEMA COMPOSITION•PIPING•
SCHEMA PROJECTION•SPOT• NOTATION TYPE COLON•
OMBIE•WJ:ABBREVIATION•BSP:ABBREVIATION•NJ:ABBREVIATION•
SP:ABBREVIATION
We store the lengths of the shared prefixes in a new block of the text table: the TT prefix length array. This is another array of fixed-length hexes, which encodes an array of unsigned integers. Each integer in the TT prefix length array is two hex digits wide: hence "namePrefixLengthArrayDirectory": {"numOfHexesPerEntry": 2} in the directory. The block starts with:
01•02•04•02•03•02•02•01•07•01•01•01•02•02•06•15•15•15•15•15•
15•15•15•15•15•15•15•15•16•16•15•06•16•15•16•15•15•15•15•15•
15•15•15•15•15•15•16•15•06•0C•0C•0C•0C•0C•0C•06•0D•06•0C•06
…and ends with:
18•18•19•18•19•19•19•18•1A•19•18•11•19•11•11•11•16•16•11•11•
1C•16•16•16•11•11•17•11•1C•1C•1C•01•1C•1C•1D•01•0B•0B•0D•02•
01•01•01•01•10•0B•0B•0B•11•0C•0B•0B•0C•01•0B•01•02•01•03•02
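All of the fixed-length hex blocks can be read with the same slice-and-parse routine. A sketch, with illustrative names:

```javascript
// Hypothetical sketch: read the unsigned integer at entryIndex from a
// fixed-width hex array embedded in the database string.
function getHexArrayEntry(database, blockPointer, numOfHexesPerEntry, entryIndex) {
  const start = blockPointer + entryIndex * numOfHexesPerEntry;
  return parseInt(database.slice(start, start + numOfHexesPerEntry), 16);
}

// With a two-hexes-per-entry block "01020402", entry #2 is 0x04:
getHexArrayEntry("01020402", 0, 2, 2); // → 4
```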
Fortunately, the time performance of name→value lookup has remained the same, despite the dramatic 45.6% reduction in storage size.
Unfortunately, the time performance of value→names access has markedly worsened by about 460%. This is attributable to the fact that we are still searching the entire database for names matching the input character. Any small difference in name extraction is multiplied by 33,578. As before, we are deprioritizing the time performance of value→names access (see § API and Goals), so we will leave it alone for now. In § ???, we will create a separate point table that may be efficiently randomly accessed.
We can save even more space by determining, for each entry in the text table, which of its ancestor entries (parent entry, the parent of the parent, etc.) has a text that shares the longest prefix with the given entry’s text.
This means that, in addition to the prefix length array, we also have to store an ancestor index array that, for each entry in the text table, stores the position (in the binary-search path) of the ancestor entry from whose text we are extracting the given entry’s implicit prefix.
Now the database weighs 637.7 kB (164.3 kB when compressed with Brotli v1.0.9). This size is a mild improvement over the previous version’s 651.3 kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | +0.1 kB |
TT suffix texts (val.s) | ASCII str. | 5.6 B | 862.1 kB | −47.3 kB |
TT suffix texts (sep.s) | Hex array | 5.0 B | 167.9 kB | 0.0 kB |
TT prefix length array | Hex array | 2.0 B | 67.2 kB | 0.0 kB |
TT ancestor index array | Hex array | 1.0 B | 33.6 kB | +33.6 kB |
TT scalar array | Hex array | 5.0 B | 167.9 kB | 0.0 kB |
Total | 18.6 B | 637.7 kB | −13.7 kB |
The directory now looks like this:
{
  "nameTableDirectory": {
    "numOfEntries": 33578,
    "textSequencePointer": 0,
    "textSequenceDirectory": {
      "valuesPointer": 0,
      "separationArrayPointer": 200692,
      "separationArrayDirectory": {
        "numOfHexesPerEntry": 5
      },
      "length": 368584
    },
    "namePrefixLengthArrayPointer": 368584,
    "namePrefixLengthArrayDirectory": {
      "numOfHexesPerEntry": 2
    },
    "ancestorPathIndexArrayPointer": 435742,
    "ancestorPathIndexArrayDirectory": {
      "numOfHexesPerEntry": 1
    },
    "scalarArrayPointer": 469322,
    "scalarArrayDirectory": {
      "numOfHexesPerEntry": 5
    }
  }
}
␂
Name→value lookup now uses a binary search for efficient random access. However, value→names access still requires linear search over the entire text table, which is very expensive in time.
We can spend a little more space to make value→names access more efficient. Our next step is to add an indexed point table (PT), sorted numerically by scalar value. The point table will allow us to efficiently perform value→names random access. (This also will allow us to spend a little more name→value access time to make the TT scalar array smaller.)
Now the database weighs 969.2 kB (???.? kB when compressed with Brotli v1.0.9). This size is a significant regression from the previous version’s 637.7 kB: we are spending storage space in order to buy back access time.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | ASCII str. | 5.6 B | 862.1 kB | 0.0 kB |
TT suffix texts (sep.s) | Hex array | 5.0 B | 167.9 kB | 0.0 kB |
TT prefix length array | Hex array | 2.0 B | 67.2 kB | 0.0 kB |
TT ancestor index array | Hex array | 1.0 B | 33.6 kB | 0.0 kB |
TT scalar array | Removed | 0.0 B | 0.0 kB | −167.9 kB |
TT crosslink array | Hex array | 4.0 B | 134.3 kB | +134.3 kB |
PT scalar array | Hex array | 5.0 B | 167.9 kB | +167.9 kB |
PT crosslink array | Hex array | 5.0 B | 167.9 kB | +167.9 kB |
Total | 28.6 B | 935.6 kB | +302.2 kB |
The new point table has 33,578 head-point entries, one for each name entry. However, unlike the text table’s entries, these entries are numerically ordered by scalar, from the scalar entry for U+0000 (entry #0) to the scalar entry for U+E01EF (entry #33,577).
For now, the point table has two column data blocks: a scalar array and a crosslink array.
The PT scalar array is a series of fixed-length hexes, which encodes an array of unsigned integers. The integers are scalar values: each indicates the head point of a text entry in the text table. The same scalar value may occur multiple times in this array if multiple names share that same head point.
The PT crosslink array is a series of fixed-length hexes, which encodes an array of unsigned integers. The integers are crosslinks: text entry indexes in the text table. Each crosslink is the index number of the text entry whose name has the head-point entry’s scalar as its head point.
The procedure for value→names access has changed to support efficient random access.
- Get the head point and the tail scalars of the input character.
- Perform a binary search for the head point in the PT scalar array. If there is a matching scalar value, then also check for more matching scalar values that are contiguous to the first match. This results in a matching head-point entry index range. (If there is no matching head point value in the scalar array, then there is no name for the input character, and the algorithm terminates.)
- Given the head-point entry index range, get its values in the PT crosslink sequence. This results in a set of crosslinks (i.e., text entry indexes) to the text table.
- For each crosslink to the text table, get its text entry’s text value. (In order to get a value in the TT text sequence, first use the TT ancestor index array to determine all ancestor text entry indexes, recursively bottom up. Then use these text entry indexes to get an array of TT prefix lengths. If any prefix length is 0, then discard all preceding prefix lengths – and their text entry indexes – because those preceding data are useless. Finally, use the remaining text entry indexes to get their corresponding text prefixes, then use the prefixes and their prefix lengths to reconstruct the entire text-entry text.) This results in a set of text-entry texts.
- For each text-entry text, remove the text from the set if the text’s tail scalars do not match the input character’s tail scalars.
- Use the text-entry texts to create a sorted array of text entries. This is the final result.
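Step 2 can be sketched as follows, where getScalar(i) is a hypothetical accessor that reads entry #i of the PT scalar array:

```javascript
// Sketch: binary search the PT scalar array for the head point, then widen
// the match to the full contiguous run of equal scalars. getScalar is a
// hypothetical accessor over the PT scalar array.
function findHeadPointEntryRange(getScalar, numOfEntries, headPoint) {
  let low = 0, high = numOfEntries - 1, match = -1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const scalar = getScalar(mid);
    if (scalar < headPoint) low = mid + 1;
    else if (scalar > headPoint) high = mid - 1;
    else { match = mid; break; }
  }
  if (match < 0) return null; // the input character has no name
  let start = match, end = match;
  while (start > 0 && getScalar(start - 1) === headPoint) start--;
  while (end + 1 < numOfEntries && getScalar(end + 1) === headPoint) end++;
  return [start, end]; // inclusive range of matching head-point entries
}
```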
As a bonus, we can now make an improvement to the text table as well. We have replaced the TT scalar array with a TT crosslink array: instead of storing each text entry’s head point directly in the text table (which may theoretically range between 0 and 17,825,792), we store a scalar entry index into the point table (which ranges between 0 and only 33,577).
The procedure for name→value lookup is largely the same as before: a binary search over the text table with name-prefix front coding. However, when the binary search finds a matching text-entry text, instead of directly getting its text entry’s head point, we get a crosslink to the point table. We then get the head point value from the point table.
If we divide the point table into two sub-tables (PT0 and PTΔ), then we can greatly improve their storage space, without compromising the time performance of value→names access. Two-stage tables are commonly used to store Unicode property data, and here we will combine one with delta encoding and linear search (now limited to a small number of entries at most).
Now the database weighs ???.? kB (164.3 kB when compressed with Brotli v1.0.9). This size is a mild improvement over the previous version’s 969.2 kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | ASCII str. | 5.6 B | 862.1 kB | 0.0 kB |
TT suffix texts (sep.s) | Hex array | 5.0 B | 167.9 kB | 0.0 kB |
TT prefix length array | Hex array | 2.0 B | 67.2 kB | 0.0 kB |
TT ancestor index array | Hex array | 1.0 B | 33.6 kB | 0.0 kB |
TT crosslink array | Hex array | 4.0 B | 134.3 kB | 0.0 kB |
PT scalar array | Removed | 0.0 B | 0.0 kB | −167.9 kB |
PT crosslink array | Removed | 0.0 B | 0.0 kB | −167.9 kB |
PT0 base-point array | Hex array | ?.0 B | ???.? kB | +???.? kB |
PT0 pointer array | Hex array | ?.0 B | ???.? kB | +???.? kB |
PTΔ point-delta array | Hex array | ?.? B | ???.? kB | +???.? kB |
PTΔ crosslink array | Hex array | ?.? B | ???.? kB | +???.? kB |
Total | ??.? B | ???.? kB | −???.? kB |
Both scalar sub-tables are still sorted numerically by scalar value. However, they differ in their cardinality. The base-point table (PT0) has relatively few, sparse entries – their number is partially determined by an adjustable scalar-block size value. In contrast, the point-delta table (PTΔ) still has 33,578 entries, one for each text entry – from an entry corresponding to U+0000 (point-delta entry #0) to an entry corresponding to U+E01EF (point-delta entry #33,577).
The base-point table (PT0) starts with two column data blocks: a base-point array and a pointer array.
The PT0 base-point array is a series of fixed-length hexes, which encodes an array of unsigned integers. The integers are scalar values: each indicates the head point of a text entry in the text table. There are never any duplicate base scalars in this array. It is generated through the following algorithm:
- Sort all text entries’ head points numerically.
- For each head point, if the difference between this head point and the previous head point exceeds the adjustable scalar-block size value, then this head point is added to the base-point array.
The PT0 pointer array is a series of fixed-length hexes, which encodes an array of unsigned integers. The integers are pointers to PTΔ entries: each indicates the index number of a point-delta entry in the PTΔ table associated with the PT0 entry’s base scalar.
The PTΔ point-delta array is a series of fixed-length hexes, which encodes an array of unsigned integers. The integers are point-delta values: each indicates the numeric difference between the head point of its text entry in the text table and the previous PTΔ entry’s scalar (or, if the previous PTΔ entry is associated with a PT0 entry, that PT0 entry’s base scalar). Because of this, the PTΔ point-delta array is divided into scalar blocks by the base-point table (PT0) – one block for each base-point entry.
The PTΔ crosslink array is a series of fixed-length hexes, which encodes an array of unsigned integers. The integers are crosslinks: text entry indexes in the text table. Each crosslink is the index number of the text entry whose name has the head-point entry’s scalar as its head point.
The procedure for value→names access has changed further:
- Get the head point and the tail scalars of the input character.
- Perform a binary search in the PT0 base-point array for the largest base point that does not exceed the head point. This results in a base scalar and a PT0 entry index.
- Use the PT0 entry index (and the PT0 entry index plus 1) to get start (and end) PTΔ entry indexes from the PT0 pointer array.
- Starting at the start PTΔ entry index, up until the end PTΔ entry index, perform a linear search for the head point in the PTΔ point-delta array, incrementing the PT0 base scalar with each scalar delta. Because the PTΔ point-delta array is sorted by scalar value, any multiple matching PTΔ entries will be contiguous, and the linear search results in a matching head-point entry index range. (If the linear search reaches a scalar value that is greater than the head point – or if the search reaches the end PTΔ entry index – without finding a matching head point value in the scalar array, then there is no name for the input character, and the algorithm terminates.)
- Given the head-point entry index range, get its values in the PTΔ crosslink sequence. This results in a set of crosslinks (i.e., text entry indexes) to the text table.
- For each crosslink, get its text entry’s text value (using the algorithm defined in the previous version). As before, this results in a set of name entry texts.
- For each text-entry text, remove the text from the set if the text’s tail scalars do not match the input character’s tail scalars.
- Use the text-entry texts to create a sorted array of text entries. This is the final result.
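Step 4’s linear search can be sketched as follows, where getDelta(i) is a hypothetical accessor that reads entry #i of the PTΔ point-delta array:

```javascript
// Sketch: reconstruct scalars inside one PTΔ block by summing point deltas
// onto the PT0 base scalar, linearly searching for the head point.
// getDelta is a hypothetical accessor over the PTΔ point-delta array.
function findInDeltaBlock(getDelta, startIndex, endIndex, baseScalar, headPoint) {
  let scalar = baseScalar;
  let start = -1, end = -1;
  for (let i = startIndex; i < endIndex; i++) {
    scalar += getDelta(i);
    if (scalar > headPoint) break; // the block is sorted: no later match
    if (scalar === headPoint) {
      if (start < 0) start = i;
      end = i; // duplicate head points are contiguous
    }
  }
  return start < 0 ? null : [start, end]; // inclusive PTΔ entry index range
}
```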
As for the text table, the TT crosslink array now stores a PT0 entry index in PT0 (which now ranges between 0 and only ???). Each text entry’s crosslink to PT0 indicates that its head point is somewhere within the corresponding PTΔ range, as defined by the base-point table (PT0)’s crosslinks to the point-delta table (PTΔ).
The procedure for name→value lookup is similar to the previous version, except that it has become more indirect:
- Perform a binary search in the TT text sequence for a text with the input name. This results in a text entry index (the index of the matching text) and an array of tail scalars (which are embedded in the text).
- Given the text entry index, get its value in the TT crosslink array. This results in the character’s PT0 entry index.
- Use the PT0 entry index (and the PT0 entry index plus 1) to get start (and end) PTΔ entry indexes from the PT0 pointer array.
- Linearly search in the PTΔ crosslink array, within the start and end PTΔ entry indexes, for the matching text entry index. This is guaranteed to find a matching PTΔ entry index.
- Given the matching PTΔ entry index, get all the PTΔ scalar deltas from the start PTΔ entry index up to the matching PTΔ entry index. This results in an array of PTΔ scalar deltas.
- The base scalar from the PT0 base-point array with the PT0 entry index, plus the summed array of PTΔ scalar deltas, is the head point.
- Combine the head point and tail scalars into a single character. This is the final result.
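Steps 4 through 6 can be sketched together as follows, with hypothetical getDelta and getCrosslink accessors over the PTΔ arrays:

```javascript
// Sketch: within one PTΔ block, find the entry whose crosslink matches the
// given text entry, summing point deltas along the way so that the head
// point is already reconstructed when the match is found.
function getHeadPoint(getDelta, getCrosslink, startIndex, endIndex, baseScalar, textEntryIndex) {
  let headPoint = baseScalar;
  for (let i = startIndex; i < endIndex; i++) {
    headPoint += getDelta(i);
    if (getCrosslink(i) === textEntryIndex) return headPoint;
  }
  throw new Error("crosslink not found"); // guaranteed not to happen
}
```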
By switching to using a binary buffer instead of a string as our database, we will be able to encode the database’s integers as 8-bit integers instead of hexes. This will generally cut the space of every integer array by somewhere between one fourth and one half. JSON and text values will continue to be encoded in ASCII/UTF-8.
After this switch, we will also eventually be able to encode bit arrays.
Now the database weighs ???.? kB (???.? kB when compressed with Brotli v1.0.9). This size is a ??? improvement over the previous version’s ???.? kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | ASCII str. | 5.6 B | 862.1 kB | 0.0 kB |
TT suffix texts (sep.s) | Uint32 arr. | 4.0 B | 134.3 kB | −33.6 kB |
TT prefix length array | Uint8 arr. | 1.0 B | 33.6 kB | −33.6 kB |
TT ancestor index array | Uint8 arr. | 1.0 B | 33.6 kB | 0.0 kB |
TT crosslink array | Uint16 arr. | 2.0 B | 67.2 kB | −67.2 kB |
PT0 base-point array | Uint32 arr. | ?.0 B | ???.? kB | −???.? kB |
PT0 pointer array | Uint32 arr. | ?.0 B | ???.? kB | −???.? kB |
PTΔ point-delta array | Uint8 arr. | ?.? B | ???.? kB | −???.? kB |
PTΔ crosslink array | Uint32 arr. | ?.? B | ???.? kB | −???.? kB |
Total | ??.? B | ???.? kB | −???.? kB |
Our database now has a variable-length sequence: the TT suffix sequence. It contains a values block and a separations block; the latter stores the locations between the values in the values block. The separations block has been encoded as a fixed-width integer array, where each integer is a position within the values block.
We can change the separations block to be a bit array instead. Each bit in the array represents the presence or absence of a separation at each consecutive position within the values block. In other words, each 1-bit in the separations block’s sea of 0-bits indicates the start of a new value at that position in the values block.
Now the database weighs ???.? kB (???.? kB when compressed with Brotli v1.0.9). This size is a ??? improvement over the previous version’s ???.? kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | ASCII str. | 5.6 B | 862.1 kB | 0.0 kB |
TT suffix texts (sep.s) | Bit arr. | 0.2 B | 7.5 kB | −124.8 kB |
TT prefix length array | Uint8 arr. | 1.0 B | 33.6 kB | 0.0 kB |
TT ancestor index array | Uint8 arr. | 1.0 B | 33.6 kB | 0.0 kB |
TT crosslink array | Uint16 arr. | 2.0 B | 67.2 kB | 0.0 kB |
PT0 base-point array | Uint32 arr. | ?.0 B | ???.? kB | 0.0 kB |
PT0 pointer array | Uint32 arr. | ?.0 B | ???.? kB | 0.0 kB |
PTΔ point-delta array | Uint8 arr. | ?.? B | ???.? kB | 0.0 kB |
PTΔ crosslink array | Uint32 arr. | ?.? B | ???.? kB | 0.0 kB |
Total | ??.? B | ???.? kB | −124.8 kB |
In order to retrieve the value corresponding to text entry #n from the TT suffix sequence:
- When n = 0, we must retrieve the position of the 0th separation in the sequence’s separations block. In other words, we must find the position of the 0th 1-bit in the sequence’s separations block. The value of entry #n is the data in the values block between 0 and the position of the 0th 1-bit.
- When n > 0, we must retrieve the positions of the (n−1)th and nth separations in the sequence’s separations block. In other words, we must find the positions of the (n−1)th 1-bit and the nth 1-bit. The value of entry #n is the data in the values block between the positions of the (n−1)th and nth 1-bits.
The operation of “the nth 1-bit” in a bit array is known as the select operation on bit arrays. We start with a naïve implementation of the select operation, which totally recalculates every time, without using any cached results.
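A naïve select can be sketched as follows, here over bits packed into a Uint8Array with the most significant bit first (an illustrative layout choice):

```javascript
// Sketch of a naïve select: scan the bit array from its start, counting
// 1-bits until the nth is found. bits is a Uint8Array of packed bits,
// most significant bit first within each byte.
function naiveSelect(bits, n) {
  let seen = 0;
  for (let byteIndex = 0; byteIndex < bits.length; byteIndex++) {
    for (let bitIndex = 7; bitIndex >= 0; bitIndex--) {
      if ((bits[byteIndex] >> bitIndex) & 1) {
        if (seen === n) return byteIndex * 8 + (7 - bitIndex);
        seen++;
      }
    }
  }
  return -1; // there are fewer than n + 1 one-bits
}
```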
When a suffix is empty (i.e., the given entry’s text is completely a prefix of its parent entry’s text), it is replaced by an arbitrary single-character string, in order to prevent positions returned by select from overlapping.
The TT suffix sequence now uses a bit array to store its separations, using the select operation to locate the position of each separation.
The select operation currently uses a naïve implementation that totally recalculates the result from scratch, starting from the very beginning of the bit array, every time. This would become much faster if we could precalculate and cache samples of select results.
We use an approach similar to those discussed in Zhang et al. (2013) and Zhang et al. (2018). For each variable-length sequence, we create a select cache: a fixed-width integer array. We precompute select results for sample queries at a certain frequency. If the frequency is 3, then the 0th, 3rd, 6th, etc. select results are stored in the select cache.
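A sketch of this sampling scheme, here over an unpacked array of 0s and 1s for clarity:

```javascript
// Sketch: precompute every `frequency`th select result, so that a query
// only scans forward from the nearest cached sample instead of from 0.
function buildSelectCache(bits, frequency) {
  const cache = [];
  let seen = 0;
  for (let position = 0; position < bits.length; position++) {
    if (bits[position] === 1) {
      if (seen % frequency === 0) cache.push(position); // 0th, 3rd, 6th, …
      seen++;
    }
  }
  return cache;
}

function cachedSelect(bits, cache, frequency, n) {
  let position = cache[Math.floor(n / frequency)]; // nearest cached sample
  let remaining = n % frequency; // 1-bits still to skip past the sample
  while (remaining > 0) {
    position++;
    if (bits[position] === 1) remaining--;
  }
  return position;
}
```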
Now the database weighs ???.? kB (???.? kB when compressed with Brotli v1.0.9). The size has slightly increased compared to the previous version’s ???.? kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | ASCII str. | 5.6 B | 862.1 kB | 0.0 kB |
TT suffix texts (sep.s) | Bit arr. | 0.2 B | 7.5 kB | 0.0 kB |
TT suffix texts (sel.s) | Uint32 arr. | ?.0 B | ??.? kB | +??.? kB |
TT prefix length array | Uint8 arr. | 1.0 B | 33.6 kB | 0.0 kB |
TT ancestor index array | Uint8 arr. | 1.0 B | 33.6 kB | 0.0 kB |
TT crosslink array | Uint16 arr. | 2.0 B | 67.2 kB | 0.0 kB |
PT0 base-point array | Uint32 arr. | ?.0 B | ???.? kB | 0.0 kB |
PT0 pointer array | Uint32 arr. | ?.0 B | ???.? kB | 0.0 kB |
PTΔ point-delta array | Uint8 arr. | ?.? B | ???.? kB | 0.0 kB |
PTΔ crosslink array | Uint32 arr. | ?.? B | ???.? kB | 0.0 kB |
Total | ??.? B | ???.? kB | +??.? kB |
Two database blocks now consist of sequences of integers that we must access only sequentially, rather than directly and randomly. These are the TT suffix texts’ values (i.e., runs of ASCII strings) and the PTΔ point-delta array. We can compress these integer sequences with VByte, which Klein and Shapira (2016) describe as follows:
[In VByte coding], the codewords represent integers. The VByte code splits the floor(log2(x_i)) + 1 bits needed to represent an integer x_i in its standard binary form into blocks of b bits and prepends each block with a flag-bit as follows. The highest bit is 0 in the extended block holding the most significant bits of x_i, and 1 in the others. Thus, the 0-bit acts as a comma between codewords. For example, if b = 3, and x_i = 25, the standard binary representation of x_i, 11001, is split into two blocks, and after adding the flags to each block, the codeword is 0011 1001. In the worst case, the VByte code loses one bit per b bits of x_i plus b bits for an almost empty leading block, which is worse than delta-Elias encoding.
We considered the variation Stream VByte, as described in Lemire, Kurz, and Rupp (2017). Stream VByte is faster to decode than VByte. However, Stream VByte uses slightly more storage than VByte proper; in addition, Stream VByte’s time performance depends on hardware-accelerated SIMD, which is not available in JavaScript.
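A sketch of byte-oriented VByte with b = 7 (one flag bit plus seven payload bits per byte); the function names are illustrative:

```javascript
// Sketch of byte-oriented VByte (b = 7). Blocks are emitted least
// significant first; the flag bit is 1 on every block except the final
// (most significant) one, mirroring the description quoted above.
function vbyteEncode(value) {
  const blocks = [];
  do {
    blocks.push(value & 0x7f);
    value >>>= 7;
  } while (value > 0);
  return blocks.map((block, i) =>
    i + 1 < blocks.length ? block | 0x80 : block);
}

// Decode one integer starting at `start`, returning it with the index of
// the next codeword.
function vbyteDecodeNext(bytes, start) {
  let value = 0, shift = 0, i = start;
  while (bytes[i] & 0x80) {
    value |= (bytes[i] & 0x7f) << shift;
    shift += 7;
    i++;
  }
  value |= bytes[i] << shift;
  return { value, nextIndex: i + 1 };
}
```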
Now the database weighs ???.? kB (???.? kB when compressed with Brotli v1.0.9). This size is a large improvement over the previous version’s ???.? kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | VByte str. | ?.? B | ???.? kB | −??.? kB |
TT suffix texts (sep.s) | Bit arr. | 0.2 B | 7.5 kB | 0.0 kB |
TT suffix texts (sel.s) | Uint32 arr. | ?.0 B | ??.? kB | 0.0 kB |
TT prefix length array | Uint8 arr. | 1.0 B | 33.6 kB | 0.0 kB |
TT ancestor index array | Uint8 arr. | 1.0 B | 33.6 kB | 0.0 kB |
TT crosslink array | Uint16 arr. | 2.0 B | 67.2 kB | 0.0 kB |
PT0 base-point array | Uint32 arr. | ?.0 B | ???.? kB | 0.0 kB |
PT0 pointer array | Uint32 arr. | ?.0 B | ???.? kB | 0.0 kB |
PTΔ point-delta array | VByte str. | ?.? B | ???.? kB | 0.0 kB |
PTΔ crosslink array | Uint32 arr. | ?.? B | ???.? kB | 0.0 kB |
Total | ??.? B | ???.? kB | −??.? kB |
We will next compress our integer arrays with directly accessible codes (DACs). DACs were first described in Brisaboa, Ladra, and Navarro (2012) as a way to compress integer arrays, while maintaining random access to their integer values by index numbers. The integer array is encoded using a modified version of VByte, along with a “bit rank cache”.
In another work, Brisaboa et al. introduced directly accessible codes (DACs) by integrating rank dictionaries into byte aligned codes.
DACs can be regarded as a reorganization of the bits of VByte, plus extra space for the rank structures, that enables direct access to it. First, all the least significant blocks of all codewords are concatenated, then the second least significant blocks of all codewords having at least two blocks, and so on. Then the rank data structure is applied on the comma bits for attaining log2(M) / b processing time, where M is the maximum integer to be encoded.
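A sketch of the reorganization, with b = 8 and with plain prefix-sum arrays standing in for a real rank structure:

```javascript
// Sketch of DACs with b = 8: codeword blocks are regrouped by level, and a
// flag per level marks which values continue into the next level. A real
// implementation would back the flags with a constant-time rank structure;
// here a precomputed prefix-sum array plays that role.
function buildDac(values, b = 8) {
  const levels = [];
  let current = values;
  while (current.length > 0) {
    const blocks = [], flags = [], next = [];
    for (const value of current) {
      blocks.push(value & ((1 << b) - 1)); // this level's b-bit block
      const rest = value >> b;
      flags.push(rest > 0 ? 1 : 0);
      if (rest > 0) next.push(rest); // remaining bits go to the next level
    }
    // ranks[i] = how many of blocks 0…i−1 continue into the next level
    const ranks = [0];
    for (const flag of flags) ranks.push(ranks[ranks.length - 1] + flag);
    levels.push({ blocks, flags, ranks });
    current = next;
  }
  return levels;
}

function dacAccess(levels, index, b = 8) {
  let value = 0, shift = 0;
  for (const level of levels) {
    value |= level.blocks[index] << shift;
    if (!level.flags[index]) return value; // no higher-order blocks remain
    index = level.ranks[index]; // rank gives the index in the next level
    shift += b;
  }
  return value;
}
```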
Now the database weighs ???.? kB (???.? kB when compressed with Brotli v1.0.9). This size is a large improvement over the previous version’s ???.? kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | VByte str. | ?.? B | ???.? kB | 0.0 kB |
TT suffix texts (sep.s) | Bit arr. | 0.2 B | 7.5 kB | 0.0 kB |
TT suffix texts (sel.s) | Uint32 arr. | ?.0 B | ??.? kB | 0.0 kB |
TT prefix length array | DAC int.s | ?.0 B | ??.? kB | −??.? kB |
TT ancestor index array | DAC int.s | ?.0 B | ??.? kB | −??.? kB |
TT crosslink array | DAC int.s | ?.0 B | ??.? kB | −??.? kB |
PT0 base-point array | DAC int.s | ?.0 B | ???.? kB | −??.? kB |
PT0 pointer array | DAC int.s | ?.0 B | ???.? kB | −??.? kB |
PTΔ point-delta array | VByte str. | ?.? B | ???.? kB | 0.0 kB |
PTΔ crosslink array | DAC int.s | ?.? B | ???.? kB | −??.? kB |
Total | ??.? B | ???.? kB | −??.? kB |
Characters whose head points are near each other tend to also have similar names. There are large blocks of characters that not only have contiguous head points but are also all named with identical prefixes. Because of this, more than 80% of crosslinks between TT and PTΔ are nearly adjacent to their neighbors’ crosslinks.
In other words, when text entries are sorted by name, their head points are usually within only 500 away from their neighboring text entries’ head points. Likewise, when head-point entries are sorted by head point, their crosslinks to the text table are usually within only 500 away from their neighboring head-point entries’ crosslinks. (In fact, more than 40% of crosslinks are within only 3 away from their neighbors.)
This means that, if we use delta encoding on both TT crosslinks and PT crosslinks, the large majority of crosslink data will become much smaller (in absolute value, since roughly half of the deltas would be negative). The worst-case range between the minimum and maximum crosslink-delta values (for both TT and PT) would span twice the number of entries in the TT and PT (67,156). However, because DACs are biased to compress small integers in smaller space, and because most crosslink deltas will be very small (mostly between 0 and 3 in absolute value), much space will still be saved.
DACs can only store unsigned integers, so we store each delta’s positive/negative sign in separate bit arrays.
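A sketch of this delta-plus-sign encoding (names illustrative):

```javascript
// Sketch: split signed crosslink deltas into an unsigned-magnitude array
// (suitable for a DAC) and a separate sign bit array, as described above.
function encodeDeltas(crosslinks) {
  const magnitudes = [], signs = [];
  let previous = 0;
  for (const crosslink of crosslinks) {
    const delta = crosslink - previous;
    magnitudes.push(Math.abs(delta));
    signs.push(delta < 0 ? 1 : 0); // 1-bit marks a negative delta
    previous = crosslink;
  }
  return { magnitudes, signs };
}

function decodeDeltas({ magnitudes, signs }) {
  const crosslinks = [];
  let previous = 0;
  for (let i = 0; i < magnitudes.length; i++) {
    previous += signs[i] ? -magnitudes[i] : magnitudes[i];
    crosslinks.push(previous);
  }
  return crosslinks;
}
```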
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT suffix texts (val.s) | VByte str. | ?.? B | ???.? kB | 0.0 kB |
TT suffix texts (sep.s) | Bit arr. | 0.2 B | 7.5 kB | 0.0 kB |
TT suffix texts (sel.s) | Uint32 arr. | ?.0 B | ??.? kB | 0.0 kB |
TT prefix length array | DAC int.s | ?.0 B | ??.? kB | 0.0 kB |
TT ancestor index array | DAC int.s | ?.0 B | ??.? kB | 0.0 kB |
TT crosslink Δ array | DAC int.s | ?.0 B | ??.? kB | −??.? kB |
TT crosslink Δ-sign array | Bit arr. | ?.0 B | ??.? kB | +??.? kB |
PT0 base-point array | DAC int.s | ?.0 B | ???.? kB | 0.0 kB |
PT0 pointer array | DAC int.s | ?.0 B | ???.? kB | 0.0 kB |
PTΔ point-delta array | VByte str. | ?.? B | ???.? kB | 0.0 kB |
PTΔ crosslink Δ array | DAC int.s | ?.? B | ???.? kB | −??.? kB |
PTΔ crosslink Δ-sign array | Bit arr. | ?.? B | ???.? kB | +??.? kB |
Total | ??.? B | ???.? kB | −??.? kB |
We will next squeeze the texts further using Re-Pair compression. Afterwards, the database weighs ???.? kB (???.? kB when compressed with Brotli v1.0.9). This size is a large improvement over the previous version’s ???.? kB.
Block | Type | Per Range | Total | Change |
---|---|---|---|---|
Directory | JSON obj. | 0.0 B | 0.4 kB | 0.0 kB |
TT text grammar | Uint16 arr. | ?.? B | 35.6 kB | +35.6 kB |
TT suffix texts (val.s) | VByte str. | ?.? B | ???.? kB | −???.? kB |
TT suffix texts (sep.s) | Bit arr. | ?.? B | ?.? kB | −??.? kB |
TT suffix texts (sel.s) | Uint32 arr. | ?.? B | ??.? kB | −??.? kB |
TT prefix length array | DAC int.s | ?.0 B | ??.? kB | 0.0 kB |
TT ancestor index array | DAC int.s | ?.0 B | ??.? kB | 0.0 kB |
TT crosslink Δ array | DAC int.s | ?.0 B | ??.? kB | 0.0 kB |
TT crosslink Δ-sign array | Bit arr. | ?.0 B | ??.? kB | 0.0 kB |
PT0 base-point array | DAC int.s | ?.0 B | ???.? kB | 0.0 kB |
PT0 pointer array | DAC int.s | ?.0 B | ???.? kB | 0.0 kB |
PTΔ point-delta array | VByte str. | ?.? B | ???.? kB | 0.0 kB |
PTΔ crosslink Δ array | DAC int.s | ?.? B | ???.? kB | −??.? kB |
PTΔ crosslink Δ-sign array | Bit arr. | ?.? B | ???.? kB | 0.0 kB |
Total | ??.? B | ???.? kB | 0.0 kB |
Re-Pair compresses strings by repeatedly (1) finding their most frequent character pair, (2) replacing all instances of that pair with a new, previously unused character, and (3) recording that substitution as a rule in a separate grammar data structure. In other words:
- Let there be a collection of input texts and an empty grammar (an array of bigrams).
- Also let ‹getSubstitutionCharacter(𝑘)› be a function that, given an integer 𝑘, returns a corresponding unique character that is not already in any of the input texts.
- Repeat the following steps:
  - Let numOfSubstitutions be the number of prior substitutions (which would start at 0).
  - Let ‹character0›‹character1› be the most frequent bigram in all of the texts. (If there are multiple bigrams that are tied for being most frequent, then the particular choice among these bigrams is arbitrary. Our implementation happens to choose the lexicographically first one.)
  - Let ‹substitutionCharacter› be ‹getSubstitutionCharacter(numOfSubstitutions)›. This is a new character that does not already occur in any of the texts.
  - Repeatedly replace instances of the bigram ‹character0›‹character1› with the single ‹substitutionCharacter›, in each of the texts, from their starts to finishes.
  - Add the bigram ‹character0›‹character1› to the grammar.
  - If all character pairs in the texts now each occur only once, then exit this loop.
- The end result is (1) the compressed texts after their accumulated substitutions and (2) the grammar with its accumulated rules.
To decode the compressed texts, the substitution characters are repeatedly replaced by their respective character pairs in the grammar, until there are no remaining substitutions to perform.
For example, let us suppose that ‹getSubstitutionCharacter(𝑘)› for 𝑘 = 0, 1, 2, 3, and 4 is ①, ②, ③, ④, and ⑤ respectively. Then applying Re-Pair to the following collection of texts:
XYZZZ
XYZZ
XXXZ
YYYZ
ZZZZ
…would result in these texts:
③⑤
③Z
②X
④Y
⑤
…and this grammar:
ZZ
(for①
)XX
(for②
)XY
(for③
)YY
(for④
)①Z
(for⑤
)
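The loop above can be sketched in plain JavaScript. This is only an illustrative sketch, not the optimized implementation used for the database: the function names ‹rePair› and ‹decodeRePair› are ours, ‹getSubstitutionCharacter› here maps 0, 1, 2, … to the circled digits ① ② ③ … as in the example, and, because this sketch counts overlapping pairs naïvely, the exact grammar it produces can differ from the one shown above while still round-tripping losslessly.

```javascript
// Map substitution indices 0, 1, 2, … to ①, ②, ③, … (U+2460 is ①).
function getSubstitutionCharacter (k) {
  return String.fromCodePoint(0x2460 + k);
}

function rePair (texts) {
  // grammar[k] is the bigram replaced by getSubstitutionCharacter(k).
  const grammar = [];
  for (;;) {
    // Count every adjacent character pair in every text.
    const counts = new Map();
    for (const text of texts)
      for (let i = 0; i + 1 < text.length; i++) {
        const bigram = text[i] + text[i + 1];
        counts.set(bigram, (counts.get(bigram) ?? 0) + 1);
      }
    // Find the most frequent pair that occurs more than once,
    // breaking ties by choosing the lexicographically first bigram.
    let bestBigram = null, bestCount = 1;
    for (const [ bigram, count ] of counts)
      if (count > bestCount
          || (count === bestCount && bestBigram !== null && bigram < bestBigram)) {
        bestBigram = bigram;
        bestCount = count;
      }
    if (bestBigram === null)
      break; // Every pair now occurs only once.
    // Replace the pair throughout the texts and record the rule.
    const substitutionCharacter = getSubstitutionCharacter(grammar.length);
    texts = texts.map(text => text.replaceAll(bestBigram, substitutionCharacter));
    grammar.push(bestBigram);
  }
  return { texts, grammar };
}

// Decoding: expand substitution characters back into their bigrams,
// from the most recent rule to the oldest.
function decodeRePair (text, grammar) {
  for (let k = grammar.length - 1; k >= 0; k--)
    text = text.replaceAll(getSubstitutionCharacter(k), grammar[k]);
  return text;
}
```

Decoding in reverse rule order is what makes the scheme lossless: rule 𝑘’s bigram may contain substitution characters only from earlier rules, so expanding the latest rule first always restores the original texts.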
There are 917 character pairs that occur more than once among the texts in the previous version. These pairs occur 209,041 times in total. The top ten character pairs are:
Bigram | Occurrences |
---|---|
ER | 2,813 |
AR | 2,762 |
AL | 2,725 |
IN | 2,725 |
LE | 2,646 |
ON | 2,624 |
TE | 2,342 |
E | 2,264 |
S | 2,250 |
TH | 2,250 |
Running Re-Pair takes approximately seven minutes. It compresses the texts down to 48,561 characters, which weigh ???.? kB (at ?.? bytes per scalar) when encoded with VByte. The grammar has 8,907 substituted bigrams, each made of two scalars, which altogether weigh ??.? kB (at ?.? bytes per scalar).
We could have adopted either of the modified Re-Pair algorithms described by Bille, Gørtz, and Prezza (2017) and by Sakai et al. (2018); their versions are more memory efficient during compression. However, the input Unicode Character Database data are static and relatively small compared to the large genomes and other big-data sets that those articles study. We care about decompression time and memory, not compression time and memory, so we have no need to save memory during compression and eschew these more complicated compression algorithms.
We also could have adopted the RL-MR-RePair algorithm described by Furuya (2019), but we are trying to stay simple for now.
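For reference, VByte is a standard variable-byte integer encoding, which is how fractional average byte counts per scalar arise. The following is a minimal codec sketch under one common convention (a set high bit marks a continuation byte); the database’s actual byte layout may differ.

```javascript
// Encode an array of non-negative integers as VByte: each integer is
// split into 7-bit groups, emitted low to high; every byte except the
// last of an integer has its high bit set, marking continuation.
function vbyteEncode (integers) {
  const bytes = [];
  for (let n of integers) {
    while (n >= 0x80) {
      bytes.push((n & 0x7F) | 0x80);
      n >>>= 7;
    }
    bytes.push(n);
  }
  return Uint8Array.from(bytes);
}

// Decode a VByte byte sequence back into its integers.
function vbyteDecode (bytes) {
  const integers = [];
  let n = 0, shift = 0;
  for (const byte of bytes) {
    n |= (byte & 0x7F) << shift;
    if (byte & 0x80)
      shift += 7; // Continuation: more 7-bit groups follow.
    else {
      integers.push(n);
      n = 0;
      shift = 0;
    }
  }
  return integers;
}
```

Under this convention, integers below 128 cost one byte each, integers below 16,384 cost two bytes, and so on; the average bytes-per-scalar figures in the tables reflect this mix.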
We have greatly optimized the database’s memory use, coming near its information-theoretic lower bound, while also improving bidirectional random-access time. Lastly, we will add a signature block and a checksum block to the database file.
The beginning of the file identifies it as a UniRangebase. It consists of the byte 98 (in hexadecimal) to avoid confusion with UTF-8/ASCII text files, followed by UNINAME ? ?.? in UTF-8/ASCII, ending in a space. The ?.?.? indicates the version of the file format; the middle digit indicates the major Unicode version.
The checksum block is a 128-bit MD5 hash of all preceding bytes in the file.
The “N → V” and “V → Ns” columns show median (50th-percentile) times over 10,000 repetitions, run on Unicode 14.0 data by Node v18.3.0 on a MacBook Air (M1, 2020) with 16 GB of memory. “N → V” indicates looking up characters by “good” Unicode names, and “V → Ns” indicates accessing names by “good” named Unicode values.
Version | Brotli | Heap | N → V | V → Ns |
---|---|---|---|---|
1. JSON array | 160.6 kB | ≈2,100 kB | 25,122 µs | 11,010 µs |
2. Extended name ranges | ‒‒‒.‒ kB | ‒,‒‒‒.‒ kB | ‒‒,‒‒‒ µs | ‒‒,‒‒‒ µs |
3. Text lines | 147.8 kB | 1,075.6 kB | 36,236 µs | 35,818 µs |
4. Scalar deltas | 104.6 kB | 962.9 kB | 34,033 µs | 22,757 µs |
5. Text table | 189.9 kB | 1,198.2 kB | 21 µs | 2,969 µs |
6. Parent prefixes | 167.1 kB | 651.3 kB | 22 µs | 13,529 µs |
7. Ancestor prefixes | 164.3 kB | 637.7 kB | 22 µs | 13,006 µs |
8. One-stage point table | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
9. Two-stage point table | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
10. Binary buffer | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
11. Sep. bit arrays | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
12. Bit select caches | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
13. VByte integers | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
14. DAC integers | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
15. Crosslink deltas | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
16. Text Re-Pair | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |
17. Signature & checksum | ‒‒.‒ kB | ‒‒‒.‒ kB | ‒‒ µs | ‒‒ µs |