@hoelzro
Last active February 22, 2017 17:48

Detecting lookalike characters for content filtration

Problem

People exploit the wide range of Unicode characters to evade content filters, whether the filter targets swear words or spam. Examples include:

  • Using ∪ (U+222A UNION) as a substitute for u/U
  • Using 0 as a substitute for O
  • Using Cyrillic characters (e.g. а/о/у) as a substitute for Latin characters (see also Greek, Cherokee, etc.)
  • Using Latin characters with diacritics, whether in precomposed or combining forms (e.g. spám)

Ideas for solutions

Please forgive any US-centric naïveté any of these solutions implies - I am neither a language nor Unicode expert!

Build a database of locality-sensitive hashes of known graphemes, and try to reduce graphemes to a normalized visual form

For example, render each grapheme (but using which font?) and generate an image hash or descriptor (like pHash or SURF). Compare each hash to a database of hashes to find clusters of characters with a similar visual appearance, then nominate one member of each cluster as the "normalized" version, which is what you use when checking against the disallowed word list.

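Rendering and hashing at filter time sounds pretty slow, but the clustering only has to be done once; if you store the results in a lookup table from each character to its normalized form, the filter itself could probably be made pretty fast.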
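As a rough sketch of the lookup-table half of this idea, here is what the filter-time step might look like in Python. The table is a hand-written stand-in covering only the examples from the Problem section; in the approach described above it would be generated offline from the image-hash clusters. The names LOOKALIKES and visual_normalize are made up for illustration.

```python
# Hypothetical, hand-written cluster table. In the approach described above,
# this mapping would be generated offline from image-hash clusters.
LOOKALIKES = {
    "\u222a": "u",  # UNION
    "0": "o",       # DIGIT ZERO
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0443": "y",  # CYRILLIC SMALL LETTER U (resembles Latin y)
}

# str.maketrans builds a per-character translation table from the mapping.
TABLE = str.maketrans(LOOKALIKES)


def visual_normalize(text):
    """Lowercase the text, then map each known lookalike to its cluster's nominee."""
    return text.lower().translate(TABLE)


print(visual_normalize("sp\u0430m"))  # Cyrillic "а" in place of Latin "a"
```

The normalized string, not the original, is what would be checked against the disallowed word list. Mapping 0 to o will also rewrite legitimate numbers, so a real table would need some thought about which clusters are safe to collapse.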

Convert to NFKD normalized form and remove combining characters

This only helps with the last example above (Latin characters with diacritics), but it could probably be combined with the other approaches to improve accuracy.
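A minimal sketch of this step using Python's standard unicodedata module (the function name strip_diacritics is made up for illustration):

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFKD, then drop every combining character."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero combining class for marks
    # such as U+0301 COMBINING ACUTE ACCENT, and 0 for base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Precomposed and combining forms reduce to the same string.
print(strip_diacritics("sp\u00e1m"))
print(strip_diacritics("spa\u0301m"))
```

Because NFKD also applies compatibility decompositions, this folds some other disguises as a side effect, such as fullwidth Latin letters. It does not touch cross-script lookalikes like Cyrillic а, which have no decomposition to Latin.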

Tokenize the content into words, and look for words that contain characters from several different Unicode blocks

In normal text, chances are that an individual word will contain characters from one (maybe two) Unicode blocks. Some blocks (e.g. mathematical symbols) don't really make sense when mixed in with, say, Latin characters, so you could flag those blocks as "unlikely to mingle".
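Python's standard library does not expose Unicode block or script properties, so the sketch below makes a loud assumption: it uses the first word of each letter's Unicode character name (LATIN, CYRILLIC, GREEK, ...) as a crude proxy for its block, and lumps every character in the math-symbol category (Sm) under a made-up MATH label. The function names are hypothetical too. A real implementation would probably want proper block or script data.

```python
import unicodedata


def scripts_in_word(word):
    """Return a set of rough script labels for the characters in a word."""
    labels = set()
    for ch in word:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            # Assumption: the first word of the character name approximates
            # the script, e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
        elif category == "Sm":
            # Math symbols are treated as one "unlikely to mingle" group.
            labels.add("MATH")
    return labels


def suspicious_words(text):
    """Split on whitespace and return the words that mix more than one label."""
    return [w for w in text.split() if len(scripts_in_word(w)) > 1]


print(suspicious_words("buy sp\u0430m now"))  # the middle word mixes Latin and Cyrillic
```

Digits and punctuation are ignored, so this does not catch the 0-for-O substitution. It also produces false positives: an expression like x+y mixes Latin letters with a math symbol, and Japanese words routinely mix kana with CJK ideographs, so the threshold probably needs to vary by language.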

Prior Work
