Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But it is not universal. For instance, the default case for Cherokee is uppercase instead of lowercase.

>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
>>> scf = s.casefold()
>>> sl == scf
False
>>> su == scf
True

It can be seen that casefolding Cherokee yields a string that matches the string produced by the uppercase function, rather than the lowercase function.

Case conversion (str.lower(), str.upper() and str.title() methods) are used to ensure a string is in the specified case. Casefolding on the other hand, is to allow caseless matching of strings. A casefolded string is often lowercase for bicameral scripts, although Cherokee is casefolded to uppercase.

The string operation str.casefold() is Python's implementation of Unicode’s toCasefold().

The formal definition of toCasefold() is:

toCasefold(X): Map each character C in X to Case_Folding(C).

Casefolding can be either invariant (language insensitive) or it can use locale specific tailorings. The python implementation of str.casefold() is invariant. In order to use full casefolding that responds to locale tailorings, use icu.UnicodeString.foldCase().

Full casefolding, can make irreversable changes to the text, and is most useful when comparing or matching text. If you wish to normalise text, then it is better to use a specific Unicode normalisation form, and an appropriate case conversion or mapping.

Casefolding for identifiers: `toNFKC_Casefold()`

When the string is an identifier, a modified form of caseless matching is used: toNFKC_Casefold(). The Unicode defintion of toNFKC_Casefold() is:

toNFKC_Casefold( X ): Map each character C in X to NFKC_Casefold(C) and then normalize the resulting string to NFC.

Python does not implement this version of casefolding, an alternative would be:

import unicodedata as ud
def NFKC_Casefold(s):
  return ud.normalize("NFC", ud.normalize('NFKC', s).casefold())

Default caseless matching

Default caseless matching is defined as:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

In python default caseless matching would be:

X.casefold() == Y.casefold()

Canonical caseless matching

Ideally strings should also be normalised before comparing the strings. Canonical caseless matching is defined as:

A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

This could be expressed in Python as:

ud.normalize("NFD", ud.normalize("NFD", X).casefold()) == ud.normalize("NFD", ud.normalize("NFD", Y).casefold())

Compatibility caseless match

If you wish to ignore compatability differences between characters when comparing strings, you would use compatibility caseless matching, which is defined as:

A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))

This could be expressed in python as:

ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", X).casefold()).casefold()) == ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", Y).casefold()).casefold())

Identifier caseless matching

Caseless matching for identifiers uses the NFKC_Casefold mapping discussed above, and is defined as:

A string X is an identifier caseless match for a string Y if and only if:
toNFKC_Casefold(NFD(X)) = toNFKC_Casefold(NFD(Y))

This can be expressed in python as:

NFKC_Casefold(ud.normalize("NFD", x)) == NFKC_Casefold(ud.normalize("NFD", y))

One useful application of casefolding is cleaning up scraped data that has presentation forms in the text.

I will start by defining a couple of helper functions:

import regex as re

def has_presentation_forms(text):
    pattern = r'([\p{InAlphabetic_Presentation_Forms}\p{InArabic_Presentation_Forms-A}\p{InArabic_Presentation_Forms-B}]+)'
    return bool(re.findall(pattern, text))

def clean_presentation_forms(text, folding=False):
    def clean_pf(match):
        return match.group(1).casefold()
    def clean_nfkc(match):
        return ud.normalize("NFKC", match.group(1))
    pattern = r'([\p{InAlphabetic_Presentation_Forms}\p{InArabic_Presentation_Forms-A}\p{InArabic_Presentation_Forms-B}]+)'
    result = re.sub(pattern, clean_pf, text) if folding else re.sub(pattern, clean_nfkc, text)
    return result

Alternatively, the clean_presentation_forms function could be rewritten as:

def clean_presentation_forms(text, folding=False):
    def clean_pf(match, folding):
        return  match.group(1).casefold() if folding else ud.normalize("NFKC", match.group(1))
    pattern = r'([\p{InAlphabetic_Presentation_Forms}\p{InArabic_Presentation_Forms-A}\p{InArabic_Presentation_Forms-B}]+)'
    return re.sub(pattern, lambda match, folding=folding: clean_pf(match, folding), text)

We then run the functions on a string "ﬁnd an oﬃce":

>>> s = "ﬁnd an oﬃce"
>>> elu.cp(s)
'U+FB01 (ﬁ) U+006E (n) U+0064 (d) U+0020 ( ) U+0061 (a) U+006E (n) U+0020 ( ) U+006F (o) U+FB03 (ﬃ) U+0063 (c) U+0065 (e)'
>>> print(has_presentation_forms(s))
True
>>> if has_presentation_forms(s):
    r = clean_presentation_forms(s)
... 
>>> print(elu.cp(r))
U+0066 (f) U+0069 (i) U+006E (n) U+0064 (d) U+0020 ( ) U+0061 (a) U+006E (n) U+0020 ( ) U+006F (o) U+0066 (f) U+0066 (f) U+0069 (i) U+0063 (c) U+0065 (e)
>>>

andjc/casefolding_matching.md

Casefolding and matching

Default Case Folding

Casefolding for identifiers: `toNFKC_Casefold()`

Default caseless matching

Canonical caseless matching

Compatibility caseless match

Identifier caseless matching

andjc commented Mar 15, 2022 •

edited

Loading

andjc commented Mar 16, 2022 •

edited

Loading

andjc/casefolding_matching.md

Casefolding and matching

Default Case Folding

Casefolding for identifiers: toNFKC_Casefold()

Default caseless matching

Canonical caseless matching

Compatibility caseless match

Identifier caseless matching

andjc commented Mar 15, 2022 • edited Loading

andjc commented Mar 16, 2022 • edited Loading

Casefolding for identifiers: `toNFKC_Casefold()`

andjc commented Mar 15, 2022 •

edited

Loading

andjc commented Mar 16, 2022 •

edited

Loading