Notes on Unicode Normalization Forms

These are notes from my (relatively) brief skim of Unicode Standard Annex #15 (http://unicode.org/reports/tr15/). All graphics and tables are from there.

So basically, Unicode lets you encode the same character in more than one way, and it recognizes two broad types of character equivalence:

Canonical Equivalence, which handles, amongst other cases:
- compositions like Å ≡ A+ ̊ (or \u00c5 ≡ \u0041\u030a)
- redundant definitions: both \u2126 and \u03a9 display as the ohm symbol (Ω)

Compatibility Equivalence, which handles, amongst other cases:
- characters which are rendered differently but can be treated as pretty much the same (non-breaking space ≡ regular space, i⁹ ≡ i9, ℌ ≡ H, etc). Note that Å is not compatibility equivalent to A.
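
To make the problem concrete, here is a small sketch that prints the code points behind strings that render identically. The `codePoints` helper is a name I made up for illustration; it is not part of any library.

```javascript
// Hypothetical helper: list a string's code points as "U+XXXX" labels.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

const precomposed = "\u00c5";      // Å as a single code point
const decomposed = "\u0041\u030a"; // A followed by a combining ring above
const angstrom = "\u212b";         // the angstrom sign

console.log(codePoints(precomposed)); // U+00C5
console.log(codePoints(decomposed));  // U+0041 U+030A
console.log(codePoints(angstrom));    // U+212B

// All three render as Å, yet they compare as different strings.
console.log(precomposed === decomposed); // false
console.log(precomposed === angstrom);   // false
console.log(decomposed.length);          // 2
```

A plain `===` compares the underlying code units, so it knows nothing about either kind of equivalence.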

(Fun fact: you can try this in your browser console! Hit Ctrl+Shift+J or F12 and type this: console.log("\u00c5", "\u0041\u030a", "\u212b"). Weird, eh?)

So if I search for \u212b (Å), it should match a document containing \u00c5 (Å) because the two are equivalent, even though the code points themselves are clearly different. So should Solr be indexing a transformed version of the text? Should my query be transformed too, so that it can find the document? Yes and yes! That's where Unicode Normalization Forms come in.
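
The "transform both sides" idea can be sketched with a toy in-memory index. This is not how Solr is configured or implemented; `buildIndex`, `search`, and the choice of NFC are all assumptions made purely for illustration.

```javascript
// Assumption: any single form works, as long as indexing and querying agree.
const FORM = "NFC";

// Hypothetical toy index: maps each normalized term to the ids of documents containing it.
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((text, id) => {
    for (const term of text.normalize(FORM).split(/\s+/)) {
      if (!term) continue;
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(id);
    }
  });
  return index;
}

// Normalize the query the same way before looking it up.
function search(index, query) {
  const hits = index.get(query.normalize(FORM));
  return hits ? [...hits] : [];
}

const docs = ["10 \u00c5 thick", "5 \u03a9 resistor"];
const index = buildIndex(docs);

console.log(search(index, "\u212b"));  // finds document 0
console.log(search(index, "A\u030a")); // finds document 0
console.log(search(index, "\u2126"));  // finds document 1
```

Because both the documents and the query go through the same form, it does not matter which of the equivalent spellings either side started with.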

Unicode Normalization Forms define a few ways of transforming a Unicode string so that two "equivalent" strings, when transformed by the same normalization form, become the exact same sequence of code points. Different normalization forms exist to handle the different types of equivalence.
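
As a rough check of that claim, the sketch below uses JavaScript's built-in `String.prototype.normalize` to compare strings after normalization. The `equalUnder` helper is a hypothetical name of mine.

```javascript
// Hypothetical helper: are two strings equal once both are normalized to `form`?
function equalUnder(form, a, b) {
  return a.normalize(form) === b.normalize(form);
}

// Canonical equivalence: any of the four forms makes these equal.
console.log(equalUnder("NFC", "\u00c5", "\u0041\u030a")); // true
console.log(equalUnder("NFD", "\u2126", "\u03a9"));       // true

// Compatibility equivalence: only the K forms make these equal.
console.log(equalUnder("NFC", "\u00a0", " "));    // false
console.log(equalUnder("NFKC", "\u00a0", " "));   // true
console.log(equalUnder("NFKC", "i\u2079", "i9")); // true
console.log(equalUnder("NFKD", "\u210c", "H"));   // true

// Å is not compatibility equivalent to a bare A.
console.log(equalUnder("NFKD", "\u00c5", "A"));   // false
```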

Here are the normalization forms:

| Form | Description |
| --- | --- |
| Normalization Form D (NFD) | Canonical Decomposition |
| Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
| Normalization Form KD (NFKD) | Compatibility Decomposition |
| Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |
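
To see how the four forms differ in practice, this sketch runs the same input through each one and prints the resulting code points. `hex` and `normalizeAll` are illustrative helper names, not library functions.

```javascript
const FORMS = ["NFD", "NFC", "NFKD", "NFKC"];

// Hypothetical helper: render a string as space-separated hex code points.
function hex(s) {
  return Array.from(s, (ch) =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

// Hypothetical helper: apply every normalization form to the same input.
function normalizeAll(s) {
  const out = {};
  for (const form of FORMS) out[form] = hex(s.normalize(form));
  return out;
}

// Angstrom sign: only canonical mappings are involved, so the K forms add nothing.
console.log(normalizeAll("\u212b"));
// NFD: 0041 030A, NFC: 00C5, NFKD: 0041 030A, NFKC: 00C5

// "fi" ligature: it only has a compatibility mapping, so only the K forms change it.
console.log(normalizeAll("\ufb01"));
// NFD: FB01, NFC: FB01, NFKD: 0066 0069, NFKC: 0066 0069
```

The D forms leave text fully decomposed, while the C forms recompose it afterwards; the K forms additionally apply the compatibility mappings before that.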

Confused yet? Here's an image that should help:
