People will use the wide range of Unicode characters to evade content filters, whether those filters target things like swear words or spam. Examples include:
- Using ∪ (U+222A UNION) as a substitute for u/U
- Using 0 as a substitute for O
- Using Cyrillic characters (eg. а/о/у) as substitutes for Latin characters (see also Greek, Cherokee, etc.)
- Using Latin characters with diacritics, whether in precomposed or combining forms (eg. spám)
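As a small aside on that last example, here's a minimal sketch (using Python's `unicodedata`, though any Unicode library would do) of how the precomposed and combining encodings of "spám" differ at the code-point level, and how normalization folds them together:

```python
import unicodedata

# "spám" written two ways: precomposed á (U+00E1) vs. "a" + combining acute (U+0301)
precomposed = "sp\u00e1m"
combining = "spa\u0301m"

print(precomposed == combining)   # False: same rendering, different code points
print(unicodedata.normalize("NFC", combining) == precomposed)   # True

# NFKD plus stripping combining marks reduces either form back to plain "spam"
decomposed = unicodedata.normalize("NFKD", precomposed)
print("".join(c for c in decomposed if not unicodedata.combining(c)))   # spam
```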
Please forgive any US-centric naïveté any of these solutions implies - I am neither a language nor Unicode expert!
Build a database of locality-sensitive hashes of known graphemes, and try to reduce graphemes to a normalized visual form
For example: render each grapheme (but using which font?) and generate a perceptual image hash (like pHash; feature descriptors such as SURF could also work). Compare each hash against a database of hashes to find clusters of characters with similar visual appearance, then nominate one member of each cluster as the "normalized" version, which is what you check against the disallowed word list.
This sounds slow, but could probably be made fast enough by precomputing the results into lookup tables (codepoint → normalized representative).
This only helps with the last example from above, but it could probably be combined with other approaches to improve accuracy.
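Here's a rough sketch of the render-and-hash step, assuming Pillow and the ImageHash package, with DejaVuSans as an arbitrary font choice (the "which font?" question above is real, since confusable characters are drawn differently by different fonts):

```python
from PIL import Image, ImageDraw, ImageFont   # pip install pillow
import imagehash                              # pip install ImageHash

# Assumption: DejaVuSans.ttf is installed and findable by Pillow.
FONT = ImageFont.truetype("DejaVuSans.ttf", 64)

def render_hash(grapheme: str) -> imagehash.ImageHash:
    """Render a single grapheme onto a small grayscale canvas and return its pHash."""
    img = Image.new("L", (96, 96), color=255)                         # white background
    ImageDraw.Draw(img).text((16, 8), grapheme, font=FONT, fill=0)    # black glyph
    return imagehash.phash(img)

# Visually similar characters should produce hashes with a small Hamming distance,
# so "U" and "∪" ought to land in the same cluster, while "O" sits further away.
latin_u = render_hash("U")
union_op = render_hash("\u222a")   # ∪ U+222A UNION
latin_o = render_hash("O")
print(latin_u - union_op, latin_u - latin_o)   # ImageHash subtraction = Hamming distance
```

In practice you'd run this once over the graphemes you care about and store the codepoint → cluster-representative mapping in a lookup table, as suggested above.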
Tokenize the content into words, and look for words that contain characters from several different Unicode blocks
In normal text, chances are that individual words will contain characters from one (maybe two) Unicode blocks. Some blocks (eg. mathematical symbols) don't really make sense when mixed in with, say, Latin characters, so you could flag those blocks as "unlikely to mingle".
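A rough sketch of that heuristic follows; the block ranges here are a hand-picked subset standing in for the full Blocks.txt data from the Unicode Character Database:

```python
# Hand-picked subset of Unicode block ranges; real data lives in the UCD's Blocks.txt.
BLOCKS = [
    ("Basic Latin",            0x0000, 0x007F),
    ("Latin-1 Supplement",     0x0080, 0x00FF),
    ("Greek and Coptic",       0x0370, 0x03FF),
    ("Cyrillic",               0x0400, 0x04FF),
    ("Cherokee",               0x13A0, 0x13FF),
    ("Mathematical Operators", 0x2200, 0x22FF),
]
UNLIKELY_TO_MINGLE = {"Mathematical Operators"}

def block_of(ch: str) -> str:
    cp = ord(ch)
    for name, lo, hi in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "Other"

def suspicious(word: str) -> bool:
    """Flag words spanning more than two blocks, or mixing an 'unlikely to
    mingle' block (e.g. math symbols) with anything else."""
    blocks = {block_of(ch) for ch in word}
    return len(blocks) > 2 or (len(blocks) > 1 and not blocks.isdisjoint(UNLIKELY_TO_MINGLE))

for word in "please b∪y my naïve spam".split():
    print(word, sorted({block_of(ch) for ch in word}), suspicious(word))
```

Note that a legitimate mix like Basic Latin + Latin-1 Supplement ("naïve") passes, while a math operator dropped into a Latin word ("b∪y") gets flagged.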