Created July 6, 2012 10:26
How to filter out common unwanted characters in Python
character_replacements = [
    (u'\u2018', u"'"),    # LEFT SINGLE QUOTATION MARK
    (u'\u2019', u"'"),    # RIGHT SINGLE QUOTATION MARK
    (u'\u201c', u'"'),    # LEFT DOUBLE QUOTATION MARK
    (u'\u201d', u'"'),    # RIGHT DOUBLE QUOTATION MARK
    (u'\u201e', u'"'),    # DOUBLE LOW-9 QUOTATION MARK
    (u'\u2013', u'-'),    # EN DASH
    (u'\u2026', u'...'),  # HORIZONTAL ELLIPSIS
    (u'\u0152', u'OE'),   # LATIN CAPITAL LIGATURE OE
    (u'\u0153', u'oe')    # LATIN SMALL LIGATURE OE
]

# Apply each replacement in turn; `text` is the unicode string to clean.
for (undesired_character, safe_character) in character_replacements:
    text = text.replace(undesired_character, safe_character)
I know 'unwanted characters' can be controversial and the use case may be unclear. But this code was useful to me, in game development and when working with relatively primitive font systems.
See also my blog post: http://www.intelligent-artifice.com/2010/02/how-to-filter-out-common-unwanted-characters-in-python.html
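
For anyone on Python 3, here is a minimal sketch of the same idea wrapped in a reusable function. It is not the code from the gist: it swaps the loop of `replace` calls for `str.maketrans` and `str.translate`, which handle all replacements in one pass. The names `CHARACTER_REPLACEMENTS`, `_TRANSLATION_TABLE`, and `filter_unwanted_characters` are made up for illustration.

```python
# Sketch for Python 3 (an assumption; the gist above targets Python 2 unicode strings).
# Same character mapping as the gist, expressed as a dict.
CHARACTER_REPLACEMENTS = {
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u201e": '"',    # DOUBLE LOW-9 QUOTATION MARK
    "\u2013": "-",    # EN DASH
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
    "\u0152": "OE",   # LATIN CAPITAL LIGATURE OE
    "\u0153": "oe",   # LATIN SMALL LIGATURE OE
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values may be strings of any length.
_TRANSLATION_TABLE = str.maketrans(CHARACTER_REPLACEMENTS)


def filter_unwanted_characters(text):
    """Return a copy of `text` with the characters above replaced."""
    return text.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    print(filter_unwanted_characters("\u201cHello\u201d \u2013 it\u2019s fine\u2026"))
```

Because every key is a single character, the translation table behaves the same as the sequential replacements here. If you ever need to replace multi-character sequences, the original loop is the simpler tool.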