Created August 30, 2010 12:44
Normalise (normalize) unicode data in Python to remove umlauts, accents etc.
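# -*- coding: utf-8 -*-
""" Normalise (normalize) unicode data in Python to remove umlauts, accents etc. """
import unicodedata

data = u'naïve café'
# NFKD splits each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops the marks.
# The final decode turns the bytes back into text so this also prints cleanly on Python 3.
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(normal)
# prints "naive cafe"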
Nifty, but note that it doesn't convert Unicode punctuation such as left and right quotation marks and en, em, figure, and horizontal dashes (‘ ’, “ ”, – — ‒ ―) to their ASCII equivalents; it just strips them. I tried fiddling with the unicodedata.normalize options without success. FWIW, these punctuation characters are missing from the table in @erm3nda's link to ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt.
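One possible workaround, sketched here rather than taken from the gist: these quotation marks and dashes have no Unicode decomposition, so no normalization form can turn them into ASCII. Translating them explicitly before normalizing keeps them in the output. The `PUNCTUATION` table and the `to_ascii` helper are hypothetical names, and the choice of ASCII replacements is an assumption; extend or change the table to suit your text.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical mapping (not part of the gist): code point -> ASCII replacement.
PUNCTUATION = {
    0x2018: u"'",  # left single quotation mark
    0x2019: u"'",  # right single quotation mark
    0x201C: u'"',  # left double quotation mark
    0x201D: u'"',  # right double quotation mark
    0x2012: u"-",  # figure dash
    0x2013: u"-",  # en dash
    0x2014: u"-",  # em dash
    0x2015: u"-",  # horizontal bar
}


def to_ascii(text):
    """Replace the mapped punctuation, then strip accents as the gist does."""
    text = text.translate(PUNCTUATION)
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii(u'“naïve” café – ‘déjà vu’'))
# prints: "naive" cafe - 'deja vu'
```

Any character that is neither in the table nor decomposable to ASCII is still dropped silently, so this only covers the punctuation listed above.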
@frangeris A quick and probably non-Pythonic solution is as follows:
Replace -&- with some other random character combination that doesn't appear in your text.
This is also case-sensitive and character-specific. You can always add more replace calls (not ideal).