When preprocessing text, we normally want to normalise the data. Unicode Normalisation Forms KC and KD (NFKC and NFKD) convert compatibility characters during normalisation, which handles some confusable characters, but not all. The is_confusable() function in this section attempts to normalise confusable characters.
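Before turning to that function, the limits of compatibility normalisation can be seen with the standard library alone. The following is a minimal sketch; the helper name nfkc() is introduced here for illustration and is not part of the original code.

```python
import unicodedata

def nfkc(text):
    """Apply Unicode Normalisation Form KC to text."""
    return unicodedata.normalize("NFKC", text)

# U+1D5BE MATHEMATICAL SANS-SERIF SMALL E has a compatibility mapping to U+0065.
math_e = "\U0001d5be"
# U+0435 CYRILLIC SMALL LETTER IE has no compatibility mapping.
cyrillic_ie = "\u0435"

print(nfkc(math_e) == "e")       # the compatibility character is folded to Basic Latin e
print(nfkc(cyrillic_ie) == "e")  # the Cyrillic letter is left untouched
```

The Cyrillic letter looks identical to Latin e in most fonts, yet NFKC leaves it alone, which is the gap that skeleton-based checks aim to close.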
In is_confusable() we parse a string using icu.SpoofChecker, which is based on Unicode Technical Report #36 and Unicode Technical Standard #39.
UTS 39 defines two strings to be confusable if they map to the same skeleton. A skeleton is a sequence of families of confusable characters, where each family has a single exemplar character.
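The function returns a tuple of a status (ASCII, True, or False) and the exemplar (or skeleton) representation of the string. ASCII-only input is returned unchanged with the status ASCII: Basic Latin characters would otherwise report True or False like any others, but those that are confusable typically serve as the exemplars for the families they belong to. For all other input, the status is True when the skeleton differs from the original text and False when it does not.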
Exemplar sequences are decomposed, so the exemplar of á <U+00E1> is <U+0061, U+0301>.
import icu

def is_confusable(text):
    # ASCII-only text is returned as-is with the "ASCII" status.
    if text.isascii():
        return ("ASCII", text)
    checker = icu.SpoofChecker()
    checker.setRestrictionLevel(icu.URestrictionLevel.HIGHLY_RESTRICTIVE)
    # Compute the skeleton once; the text is flagged if it differs from its skeleton.
    skeleton = checker.getSkeleton(icu.USpoofChecks.ALL_CHECKS, text)
    status = text != skeleton
    return (status, skeleton)
The characters e, \U0001d5be, and \u0435 all return the exemplar e (U+0065):
is_confusable('e')
# ('ASCII', 'e')
is_confusable('\U0001d5be')
# (True, 'e')
is_confusable('\u0435')
# (True, 'e')
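The UTS 39 definition, that two strings are confusable when they map to the same skeleton, can be illustrated without ICU. The sketch below is a simplification under stated assumptions: SAMPLE_PROTOTYPES is a hypothetical, hand-picked table of three mappings rather than the full UTS 39 confusables data, and toy_skeleton() and toy_confusable() are names invented for this example. It only shows the decompose, map, decompose shape of the computation and is not a substitute for icu.SpoofChecker.

```python
import unicodedata

# Hypothetical, hand-picked sample of confusable mappings for illustration only;
# the real UTS 39 confusables data is far larger.
SAMPLE_PROTOTYPES = {
    "\u0435": "e",      # CYRILLIC SMALL LETTER IE
    "\u0430": "a",      # CYRILLIC SMALL LETTER A
    "\U0001d5be": "e",  # MATHEMATICAL SANS-SERIF SMALL E
}

def toy_skeleton(text):
    """Approximate a skeleton: NFD, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(SAMPLE_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def toy_confusable(a, b):
    """Treat two strings as confusable if their toy skeletons are equal."""
    return toy_skeleton(a) == toy_skeleton(b)

# Latin "café" against a look-alike that uses Cyrillic а and е plus a combining acute.
print(toy_confusable("caf\u00e9", "c\u0430f\u0435\u0301"))
# The skeleton of á is the decomposed sequence <U+0061, U+0301>.
print(toy_skeleton("\u00e1") == "a\u0301")
```

Comparing skeletons rather than the raw strings is what lets a precomposed é and a Cyrillic е followed by a combining accent be treated as the same visual form.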