Exemplars for confusable characters (normalising confusable data)

When preprocessing text, we normally want to normalise our data. Unicode Normalisation Forms KC and KD (NFKC and NFKD) convert compatibility characters to their base equivalents during normalisation. This handles some confusable characters, but not all. The function below attempts to normalise confusable characters as well.
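
To see where compatibility normalisation stops helping, here is a minimal sketch using only Python's standard unicodedata module. The two sample characters are the ones used in the examples further down. NFKC folds the mathematical sans-serif letter to a plain e, but leaves the Cyrillic letter untouched because it has no compatibility mapping.

```python
import unicodedata

# Two characters that look like Latin "e" but are different code points.
samples = {
    "MATHEMATICAL SANS-SERIF SMALL E": "\U0001d5be",
    "CYRILLIC SMALL LETTER IE": "\u0435",
}

# Apply NFKC to each sample and report the resulting code point.
nfkc = {name: unicodedata.normalize("NFKC", ch) for name, ch in samples.items()}
for name, ch in nfkc.items():
    print(f"{name}: U+{ord(ch):04X}")
```

Only the first character comes out as U+0065; the second stays U+0435, which is the gap confusable handling has to close.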

In is_confusable() we process a string using icu.SpoofChecker (from PyICU), which is based on Unicode Technical Report #36 (Unicode Security Considerations) and Unicode Technical Standard #39 (Unicode Security Mechanisms).

UTS 39 defines two strings to be confusable if they map to the same skeleton. A skeleton is a sequence of families of confusable characters, where each family has a single exemplar character.
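
The skeleton idea can be sketched in plain Python. This is an approximation under stated assumptions, not ICU's implementation: PROTOTYPES is a small hand-picked excerpt of confusable-to-exemplar mappings, and toy_skeleton and toy_confusable are hypothetical helper names. The real data set is far larger and is what icu.SpoofChecker consults.

```python
import unicodedata

# Hand-picked excerpt of confusable -> exemplar mappings, for illustration only.
# A real implementation would use the full UTS 39 data (as icu.SpoofChecker does).
PROTOTYPES = {
    "\u0430": "a",      # CYRILLIC SMALL LETTER A
    "\u0435": "e",      # CYRILLIC SMALL LETTER IE
    "\u043e": "o",      # CYRILLIC SMALL LETTER O
    "\u0440": "p",      # CYRILLIC SMALL LETTER ER
    "\U0001d5be": "e",  # MATHEMATICAL SANS-SERIF SMALL E
}

def toy_skeleton(text):
    """Approximate a UTS 39 skeleton: NFD, map to exemplars, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def toy_confusable(a, b):
    """Two strings count as confusable if their skeletons are identical."""
    return toy_skeleton(a) == toy_skeleton(b)

print(toy_confusable("pear", "\u0440\u0435\u0430r"))  # True: Cyrillic р, е, а plus Latin r
print(toy_confusable("pear", "peer"))                 # False: different skeletons
```

Comparing skeletons rather than raw strings is what lets two visually similar strings in different scripts be treated as equivalent.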

The function returns a tuple containing a status ("ASCII", True or False) and the exemplar (or skeleton) representation of the string. Strictly speaking, a Basic Latin character is either confusable or not, so it would normally yield True or False. However, Basic Latin characters that are confusable are also the exemplars for the families they belong to, so pure-ASCII input is flagged separately as "ASCII" and returned unchanged.

Exemplar sequences are decomposed, so the exemplar of á (U+00E1) is the sequence <U+0061, U+0301>.
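
The decomposition itself can be checked with the standard library. This sketch only shows the canonical decomposition (NFD) step, not ICU's skeleton lookup.

```python
import unicodedata

# Decompose the precomposed character á (U+00E1) with NFD.
decomposed = unicodedata.normalize("NFD", "\u00e1")
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['U+0061', 'U+0301']
```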

import icu

def is_confusable(text):
    """Return (status, skeleton), where status is "ASCII", True or False."""
    # Pure ASCII input is flagged separately and returned unchanged.
    if text.isascii():
        return ("ASCII", text)
    checker = icu.SpoofChecker()
    checker.setRestrictionLevel(icu.URestrictionLevel.HIGHLY_RESTRICTIVE)
    # Compute the skeleton once; the text is confusable if it differs from it.
    skeleton = checker.getSkeleton(icu.USpoofChecks.ALL_CHECKS, text)
    return (skeleton != text, skeleton)

The characters e, \U0001d5be, and \u0435 all return the exemplar e (U+0065):

is_confusable('e')
# ('ASCII', 'e')
is_confusable('\U0001d5be')
# (True, 'e')
is_confusable('\u0435')
# (True, 'e')

