Created December 17, 2012 13:23
Separate pinyin syllables with whitespace.
E.g. "nǐhāo" becomes "nǐ hāo", and "Yīlù píng'ān" becomes "Yī lù píng ān".
# -*- coding: utf-8 -*-
#
# Copyright © 2012 Thomas TEMPÉ, <[email protected]>
#
# License: GNU GPL, version 3 or later; http://www.gnu.org/copyleft/gpl.html

import re

# Syllable initials (consonant onsets).
inits = u"zh|sh|ch|[bpmfdtnlgkhjqxrzscwy]"
# Syllable finals, with or without a tone mark on the vowel.
finals = u"i[ōóǒòo]ng|[ūúǔùu]ng|[āáǎàa]ng|[ēéěèe]ng|i[āɑ̄áɑ́ɑ́ǎɑ̌àɑ̀aāáǎàa]ng|[īíǐìi]ng|i[āáǎàa]n|u[āáǎàa]n|[ōóǒòo]ng|[ēéěèe]r|i[āáǎàa]|i[ēéěèe]|i[āáǎàa]o|i[ūúǔùu]|[īíǐìi]n|u[āáǎàa]|u[ōóǒòo]|u[āáǎàa]i|u[īíǐìi]|[ūúǔùu]n|u[ēéěèe]|ü[ēéěèe]|v[ēéěèe]|i[ōóǒòo]|[āáǎàa]i|[ēéěèe]i|[āáǎàa]o|[ōóǒòo]u|[āáǎàa]n|[ēéěèe]n|[āáǎàa]|[ēéěèe]|[ōóǒòo]|[īíǐìi]|[ūúǔùu]|[ǖǘǚǜüv]"
# Syllables with no initial; inside a word they are introduced by an apostrophe.
standalones = u"'[āáǎàa]ng|'[ēéěèe]ng|'[ēéěèe]r|'[āáǎàa]i|'[ēéěèe]i|'[āáǎàa]o|'[ōóǒòo]u|'[āáǎàa]n|'[ēéěèe]n|'[āáǎàa]|'[ēéěèe]|'[ōóǒòo]"
# One syllable: either initial + final, or an apostrophe-prefixed standalone.
pinyin_re = "(("+inits+")("+finals+")|("+standalones+"))"
# Two adjacent syllables, captured as the named groups "one" and "two".
pinyin_two_re = re.compile("(?P<one>"+pinyin_re+")(?P<two>"+pinyin_re+")", flags=re.I)

def separate_pinyin(text):
    u"""
    Separate pinyin syllables with whitespace.
    E.g. "nǐhāo" becomes "nǐ hāo"
    E.g. "Yīlù píng'ān" becomes "Yī lù píng ān"

    Assumes the pinyin follows the placement rules for apostrophes and
    tone marks:
    http://www.pinyin.info/romanization/hanyu/apostrophes.html
    http://www.pinyin.info/rules/where.html
    Text that follows these rules can be split without ambiguity.

    Does not work with tone numbers written as suffixed numerals
    (e.g. ni3hao3).
    Useful as a pre-processing step on Google Translate's output.
    """
    def clean(t):
        'Remove a leading apostrophe.'
        if "'" == t[0]:
            return t[1:]
        return t

    def separate_pinyin_sub(p):
        return clean(p.group("one"))+" "+clean(p.group("two"))

    # Run twice: sub() only replaces non-overlapping pairs, so a syllable
    # that directly follows a matched pair is split off on the second pass.
    text = pinyin_two_re.sub(separate_pinyin_sub, text)
    text = pinyin_two_re.sub(separate_pinyin_sub, text)
    return text

# Demo
# print(separate_pinyin(u"Yīlù píng'ān"))
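
The "run twice" step in separate_pinyin can be illustrated with a toy pattern. The following is a minimal sketch, assuming a simplified syllable made of one consonant plus one vowel rather than the real pinyin tables; `toy_pair_re` and `split_once` are hypothetical names used only for this illustration.

```python
import re

# Toy syllable: one consonant followed by one vowel. This is an illustrative
# simplification, not the pinyin grammar used by separate_pinyin.
toy_syllable = "[bcdfghjklmnpqrstvwxyz][aeiou]"
toy_pair_re = re.compile("(" + toy_syllable + ")(" + toy_syllable + ")")


def split_once(text):
    """Insert a space inside every non-overlapping pair of toy syllables."""
    return toy_pair_re.sub(r"\1 \2", text)


# First pass: the pair "mala" is split, but "ka" has no partner left,
# so it stays attached to "la".
first_pass = split_once("malaka")
# Second pass: the leftover pair "laka" is now matched and split.
second_pass = split_once(first_pass)

print(first_pass)
print(second_pass)
```

Because each pass consumes syllables two at a time, the boundary between one matched pair and the next syllable is never examined in that pass. The second pass catches exactly those boundaries.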
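
The docstring notes that pinyin with suffixed tone numbers (e.g. ni3hao3) is not handled. One possible way to cover that case is sketched below, assuming every syllable ends in its own tone digit from 1 to 5 and that no apostrophes are present; `tone_boundary_re` and `separate_numbered_pinyin` are hypothetical names and are not part of the original script.

```python
import re

# A tone digit (1-5) immediately followed by a letter marks a syllable
# boundary. Assumption: every syllable carries its own tone digit.
tone_boundary_re = re.compile(u"([1-5])(?=[A-Za-z\u00fc\u00dc])")


def separate_numbered_pinyin(text):
    """Insert a space after each tone digit that is followed by a letter."""
    return tone_boundary_re.sub(r"\1 ", text)


print(separate_numbered_pinyin(u"ni3hao3"))
print(separate_numbered_pinyin(u"Yi1lu4 ping2an1"))
```

With numbered tones the digit itself is the delimiter, so a single pass with a lookahead is enough and no table of initials and finals is needed. Syllables written without a digit would be left joined to their neighbor.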