@tanbro
Last active March 14, 2019 06:05
Remove whitespace in a UTF-8 CJK string, using a regex
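import re

# Han ideograph blocks: CJK Unified Ideographs, Extensions A-E, Compatibility Ideographs and their Supplement.
# Not used by remove_cjk_whitespace() below.
HANZI = r'([\u4E00-\u9FFF]|[\u3400-\u4DBF]|[\U00020000-\U0002A6DF]|[\U0002A700-\U0002B73F]|[\U0002B740-\U0002B81F]|[\U0002B820-\U0002CEAF]|[\uF900-\uFAFF]|[\U0002F800-\U0002FA1F])'
# One character in U+2E80..U+9FFF followed by a run of whitespace; group "c" captures the character.
CJK_WHITESPACE_REGEX = re.compile(r'(?P<c>[\u2E80-\u9FFF])(\s+)')


def remove_cjk_whitespace(s):  # type: (str)->str
    """Remove whitespace between CJK characters in a string.

    :param s: the string to process

        .. important:: **must** be `UTF-8` text, otherwise this will not work correctly

    :return: the string with the whitespace removed

    see: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    """
    # Strip both ends, then drop every whitespace run that directly follows a character in the range.
    return re.sub(CJK_WHITESPACE_REGEX, r'\g<c>', s.strip())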
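
The function above removes whitespace after a CJK character whatever comes next, so `中文 abc` becomes `中文abc`, and the `HANZI` pattern is defined but never used. A minimal sketch of a stricter variant follows, assuming the goal is to drop whitespace only when both neighbours are Han ideographs. The names `HAN_CLASS`, `BETWEEN_HAN_REGEX` and `remove_whitespace_between_han` are hypothetical and not part of the gist; the ranges are copied from `HANZI`.

```python
import re

# Same ranges as HANZI, written as a single character class (assumption:
# only Han ideographs count as "CJK" for this variant).
HAN_CLASS = (
    r'[\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF'
    r'\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F'
    r'\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F]'
)

# Whitespace run with a Han ideograph immediately before and after it.
# Lookbehind and lookahead do not consume the ideographs, so chains such
# as "A B C" are handled in a single pass.
BETWEEN_HAN_REGEX = re.compile(
    r'(?<=' + HAN_CLASS + r')\s+(?=' + HAN_CLASS + r')'
)


def remove_whitespace_between_han(s):
    # type: (str) -> str
    """Remove whitespace only where it sits between two Han ideographs."""
    return BETWEEN_HAN_REGEX.sub('', s)


if __name__ == '__main__':
    print(remove_whitespace_between_han('中 文 abc'))
```

Unlike the original, this sketch does not call `strip()`, so leading and trailing whitespace is left for the caller to handle.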