Created August 30, 2010 12:44
Normalise (normalize) unicode data in Python to remove umlauts, accents etc.
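# -*- coding: utf-8 -*-
"""Normalise (normalize) unicode data in Python to remove umlauts, accents etc."""
import unicodedata

data = u'naïve café'
# NFKD splits each accented letter into a base letter plus a combining mark;
# encoding to ASCII with 'ignore' then drops the combining marks.
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(normal)
# prints "naive cafe"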
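
One caveat worth illustrating: the ASCII-encode step drops every code point that has no ASCII decomposition, not just the accents. The sketch below compares it with a variant that removes only combining marks. The helper names `ascii_fold` and `strip_marks` are made up for this example and are not part of the gist.

```python
import unicodedata

def ascii_fold(text):
    """NFKD-decompose, then drop every non-ASCII code point."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def strip_marks(text):
    """NFKD-decompose, then drop only the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold('naïve café'))   # naive cafe
print(ascii_fold('Straße'))       # Strae  (ß has no decomposition, so it is dropped)
print(strip_marks('Straße'))      # Straße (ß is kept because it is not a combining mark)
```

Which behaviour is preferable depends on whether the output must be pure ASCII or merely accent-free.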

erm3nda commented Apr 9, 2018

@frangeris, that's a great question. I ended up with the function below. It works perfectly, but it leaves anything combined with the ~ tilde ('COMBINING TILDE') untouched. To be exact, it ONLY strips the acute (´) and grave (`) accents and nothing else:

def strip_accents_spain(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

The docs don't say much about which combining characters normalize() can produce, but you can get the whole idea here: (search for "COMBINING" near the bottom of the document to see all the options).
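
Another way to find out which combining characters matter for a given text is to decompose it and look up the name of each mark. This is a small sketch; the helper name `combining_marks_in` is hypothetical, and the names it returns are the same strings that `unicodedata.lookup` accepts in the function above.

```python
import unicodedata

def combining_marks_in(text):
    """Return the sorted names of all combining marks found after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    return sorted({unicodedata.name(c) for c in decomposed if unicodedata.combining(c)})

print(combining_marks_in('El niño comió pingüino'))
# ['COMBINING ACUTE ACCENT', 'COMBINING DIAERESIS', 'COMBINING TILDE']
```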


ghost commented Aug 22, 2018

@frangeris a quick and probably non-pythonic solution is as follows:

import unicodedata

line = "EL NIÑO"
line = line.replace('Ñ', '-&-')  # protect Ñ behind an ASCII placeholder
line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore').decode('ascii')
line = line.replace('-&-', 'Ñ')  # restore Ñ

If -&- can appear in your text, replace it with some other ASCII character combination that doesn't.
This is also case-sensitive and character-specific. You can always add more replace calls (not ideal).
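
A placeholder-free variant of the same idea is to walk the text character by character and skip the ones you want to keep. This is only a sketch; the function name `to_ascii_keeping` and its `keep` parameter are made up for illustration.

```python
import unicodedata

def to_ascii_keeping(text, keep='Ññ'):
    """Fold text to ASCII, but pass the characters listed in `keep` through unchanged."""
    out = []
    # Compose first so that a decomposed N + COMBINING TILDE is seen as a single Ñ.
    for ch in unicodedata.normalize('NFC', text):
        if ch in keep:
            out.append(ch)
        else:
            folded = unicodedata.normalize('NFKD', ch).encode('ascii', 'ignore')
            out.append(folded.decode('ascii'))
    return ''.join(out)

print(to_ascii_keeping('EL NIÑO'))        # EL NIÑO
print(to_ascii_keeping('El niño comió'))  # El niño comio
```

Covering more characters is then a matter of extending `keep` rather than adding replace calls.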


Nifty, but note that it doesn't convert Unicode punctuation such as left and right quotation marks and en, em, figure, and horizontal dashes (‘ ’, “ ”, – — ‒ ―) to their ASCII equivalents; it just strips them. I tried fiddling with the unicodedata.normalize options without success. FWIW, these punctuation characters are missing from the table in @erm3nda's link to
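
Since these punctuation characters have no compatibility decomposition, one possible workaround is to translate them explicitly before normalizing. This is a minimal sketch: the `PUNCT_MAP` table and the `to_ascii` name are assumptions for this example, and the table covers only the characters listed above.

```python
import unicodedata

# Hand-written mapping for the quotation marks and dashes mentioned above.
PUNCT_MAP = str.maketrans({
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark
    '\u201c': '"',  # left double quotation mark
    '\u201d': '"',  # right double quotation mark
    '\u2012': '-',  # figure dash
    '\u2013': '-',  # en dash
    '\u2014': '-',  # em dash
    '\u2015': '-',  # horizontal bar
})

def to_ascii(text):
    """Map known punctuation to ASCII, then strip accents as in the gist."""
    text = text.translate(PUNCT_MAP)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii('“naïve” café – it’s'))  # "naive" cafe - it's
```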
