@ttempe
Created December 17, 2012 13:23
Separate pinyin syllables with whitespace. E.g. "nǐhāo" becomes "nǐ hāo", and "Yīlù píng'ān" becomes "Yī lù píng ān".
# -*- coding: utf-8 -*-
#
# Copyright © 2012 Thomas TEMPÉ, <[email protected]>
#
# License: GNU GPL, version 3 or later; http://www.gnu.org/copyleft/gpl.html
import re
inits = u"zh|sh|ch|[bpmfdtnlgkhjqxrzscwy]"
finals = u"i[ōóǒòo]ng|[ūúǔùu]ng|[āáǎàa]ng|[ēéěèe]ng|i[āɑ̄áɑ́ɑ́ǎɑ̌àɑ̀aāáǎàa]ng|[īíǐìi]ng|i[āáǎàa]n|u[āáǎàa]n|[ōóǒòo]ng|[ēéěèe]r|i[āáǎàa]|i[ēéěèe]|i[āáǎàa]o|i[ūúǔùu]|[īíǐìi]n|u[āáǎàa]|u[ōóǒòo]|u[āáǎàa]i|u[īíǐìi]|[ūúǔùu]n|u[ēéěèe]|ü[ēéěèe]|v[ēéěèe]|i[ōóǒòo]|[āáǎàa]i|[ēéěèe]i|[āáǎàa]o|[ōóǒòo]u|[āáǎàa]n|[ēéěèe]n|[āáǎàa]|[ēéěèe]|[ōóǒòo]|[īíǐìi]|[ūúǔùu]|[ǖǘǚǜüv]"
standalones = u"'[āáǎàa]ng|'[ēéěèe]ng|'[ēéěèe]r|'[āáǎàa]i|'[ēéěèe]i|'[āáǎàa]o|'[ōóǒòo]u|'[āáǎàa]n|'[ēéěèe]n|'[āáǎàa]|'[ēéěèe]|'[ōóǒòo]"
pinyin_re = "(("+inits+")("+finals+")|("+standalones+"))"
pinyin_two_re = re.compile("(?P<one>"+pinyin_re+")(?P<two>"+pinyin_re+")", flags=re.I)
def separate_pinyin(text):
    u"""
    Separate pinyin syllables with whitespace.
    Eg: "nǐhāo" becomes "nǐ hāo"
    Eg: "Yīlù píng'ān" becomes "Yī lù píng ān"

    Assumes the pinyin follows the placement rules for apostrophes and tone marks:
    http://www.pinyin.info/romanization/hanyu/apostrophes.html
    http://www.pinyin.info/rules/where.html
    Following these rules guarantees that the separation is unambiguous.
    Does not work with tone numbers written as suffixed numerals (eg: ni3hao3).
    Useful as pre-processing for Google Translate's output.
    """
    def clean(t):
        'remove leading apostrophe'
        if "'" == t[0]:
            return t[1:]
        return t
    def separate_pinyin_sub(p):
        # Join the two matched syllables with a single space
        return clean(p.group("one"))+" "+clean(p.group("two"))
    # Run twice: sub() only replaces non-overlapping matches, so a syllable
    # consumed as "two" in one match cannot also be "one" of the next match.
    text = pinyin_two_re.sub(separate_pinyin_sub, text)
    text = pinyin_two_re.sub(separate_pinyin_sub, text)
    return text
#Demo
#print(separate_pinyin(u"Yīlù píng'ān"))
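
The "run twice" step can be seen in isolation with a toy example. The sketch below is not part of the gist: it assumes a simplified syllable pattern (one consonant from "bdmn" plus one plain vowel) instead of the real pinyin tables, and the names TOY_SYLLABLE, split_two_pass and split_lookahead are hypothetical. It shows that a single pass of a two-syllable pattern leaves every other boundary unsplit, and that a lookahead is one possible way to get the same result in a single pass.

```python
import re

# Hypothetical toy syllable: one consonant followed by one vowel.
# This stands in for the real inits/finals tables and is an assumption.
TOY_SYLLABLE = r"[bdmn][aeiou]"

# Same shape as pinyin_two_re: two adjacent syllables, both consumed.
toy_two_re = re.compile("(?P<one>" + TOY_SYLLABLE + ")(?P<two>" + TOY_SYLLABLE + ")")

# Alternative: consume one syllable, only peek at the next with a lookahead.
toy_lookahead_re = re.compile("(" + TOY_SYLLABLE + ")(?=" + TOY_SYLLABLE + ")")


def split_two_pass(text, passes=2):
    """Apply the two-syllable substitution `passes` times."""
    for _ in range(passes):
        text = toy_two_re.sub(lambda m: m.group("one") + " " + m.group("two"), text)
    return text


def split_lookahead(text):
    """Insert a space after every syllable that is followed by another one."""
    return toy_lookahead_re.sub(lambda m: m.group(1) + " ", text)


if __name__ == "__main__":
    word = "babebibo"
    print(split_two_pass(word, passes=1))  # ba bebi bo
    print(split_two_pass(word, passes=2))  # ba be bi bo
    print(split_lookahead(word))           # ba be bi bo
```

With one pass, "be" is consumed as the second half of "babe", so the boundary between "be" and "bi" is never examined; the second pass picks it up. The lookahead variant avoids this because the following syllable is checked but not consumed. Whether it behaves identically on the full pinyin pattern, including the apostrophe handling, has not been verified here.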