These are notes from my (relatively) brief skim of http://unicode.org/reports/tr15/ . All graphics and tables are from there.
So basically, Unicode lets you define the same character in multiple ways, but recognizes that there are two broad types of character equivalence:
- Canonical Equivalence, which handles, amongst other cases:
  - compositions like Å ≡ A + ̊ (or `\u00c5` ≡ `\u0041\u030a`)
  - redundant definitions: both `\u2126` and `\u03a9` display as the ohm symbol (Ω)
- Compatibility Equivalence, which handles, amongst other cases:
  - characters which are rendered differently, but can be seen as pretty much the same (non-breaking space ≡ regular space, i⁹ ≡ i9, ℌ ≡ H, etc). Note that Å is not compatibility equivalent to A.
(Fun fact: you can try this in your browser console! Hit `Ctrl+Shift+J` or `F12` and type this: `console.log("\u00c5", "\u0041\u030a", "\u212b")`. Weird, eh?)
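To see why this matters, here is a small sketch you can paste into the same console. The variable names are mine, purely for illustration. It shows that the three strings render identically but are not equal as far as plain string comparison is concerned:

```javascript
// Three ways of writing what looks like the same character.
const precomposed = "\u00c5";      // LATIN CAPITAL LETTER A WITH RING ABOVE
const decomposed = "\u0041\u030a"; // A followed by COMBINING RING ABOVE
const angstrom = "\u212b";         // ANGSTROM SIGN

// Plain comparison works on code units, so none of these match.
console.log(precomposed === decomposed); // false
console.log(precomposed === angstrom);   // false

// The decomposed form is even a different length.
console.log(precomposed.length, decomposed.length, angstrom.length); // 1 2 1
```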
So if I search for `\u212b` (Å), it should match a document containing `\u00c5` (Å) because they are equivalent; but those two code points certainly aren't equal! So what should Solr be indexing? Should my query be transformed so that it can find the document? Yes and yes! That's where Unicode Normalization Forms come in.
Unicode Normalization Forms define a few ways of transforming a Unicode string so that two "equivalent" strings, when transformed by the same normalization form, become identical. Different normalization forms exist to handle the different types of equivalence.
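As a rough sketch of that idea, JavaScript's built-in `String.prototype.normalize` can be used to compare strings after normalizing both sides the same way. The helper name `equalUnderForm` is made up for this example, and this is not a claim about how Solr does it internally:

```javascript
// Hypothetical helper: normalize both strings with the same form, then compare.
function equalUnderForm(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

// Canonically equivalent strings match under a canonical form.
console.log(equalUnderForm("\u212b", "\u00c5"));       // true
console.log(equalUnderForm("\u0041\u030a", "\u00c5")); // true

// Compatibility equivalents only match under a "K" form.
console.log(equalUnderForm("\u210c", "H"));         // false (NFC leaves ℌ alone)
console.log(equalUnderForm("\u210c", "H", "NFKC")); // true
```

The same reasoning applies to search: if the indexed text and the query are both passed through the same form, equivalent spellings end up as identical strings.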
Here are the normalization forms:
Form | Description |
---|---|
Normalization Form D (NFD) | Canonical Decomposition |
Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
Normalization Form KD (NFKD) | Compatibility Decomposition |
Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |
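To make the table a bit more concrete, here is a small sketch that runs all four forms over one sample string and prints the resulting code points. The `codePoints` helper and the choice of sample are my own, just for illustration:

```javascript
// Hypothetical helper: render a string as a list of U+XXXX code points.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

// ANGSTROM SIGN (canonical case) followed by BLACK-LETTER CAPITAL H (compatibility case).
const sample = "\u212b\u210c";

for (const form of ["NFD", "NFC", "NFKD", "NFKC"]) {
  console.log(form, codePoints(sample.normalize(form)));
}
// NFD U+0041 U+030A U+210C
// NFC U+00C5 U+210C
// NFKD U+0041 U+030A U+0048
// NFKC U+00C5 U+0048
```

Note how the canonical forms (NFD, NFC) leave ℌ untouched, while the compatibility forms (NFKD, NFKC) fold it into a plain H.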
Confused yet? Here's an image that should help: