Skip to content

Instantly share code, notes, and snippets.

@seozed
Last active May 11, 2020 09:52
Show Gist options
  • Save seozed/2e2fde28f23e8123e85e6629a7b04b14 to your computer and use it in GitHub Desktop.
Save seozed/2e2fde28f23e8123e85e6629a7b04b14 to your computer and use it in GitHub Desktop.
[转换unicode字符串为正常字符串] 如\xa0... #unicode
import unicodedata
s = 'T-shirt\xa0\xa0短袖圆领衫,\u3000体恤衫\xa0买一件\t吧'
unicodedata.normalize('NFKC', s)
# T-shirt 短袖圆领衫, 体恤衫 买一件 吧
# 日常爬虫抓取数据中常遇到此类问题,使用率较高
def unicode_normalize(unistr, form='NFKC'):
"""
转换unicode字符串为正常字符串
:param unistr: 包含unicode字符串的文本
:param form: NFC'|'NFKC'|'NFD'|'NFKD',一般用NFKC
:return:
"""
import unicodedata
return unicodedata.normalize(form, unistr)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment