Last active January 7, 2019 16:32
seperate_sentence #python #NLP
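import re


def cut_sent(para):
    """Split a paragraph of Chinese text into a list of sentences."""
    para = para.replace('\u3000', '')  # drop ideographic (full-width) spaces
    # Break after a single-character terminator (full-width or ASCII) unless a closing quote follows.
    para = re.sub(r'([。！？!?])([^”’])', r'\1\n\2', para)
    para = re.sub(r'(\.{6})([^”’])', r'\1\n\2', para)  # English ellipsis (six dots)
    para = re.sub(r'(…{2})([^”’])', r'\1\n\2', para)  # Chinese ellipsis (two … characters)
    # When a terminator is followed by a closing quote, break after the quote instead.
    para = re.sub(r'([。！？!?][”’])([^，,。！？!?])', r'\1\n\2', para)
    para = para.rstrip()  # remove trailing whitespace, including a trailing newline
    return [sentence for sentence in para.split('\n') if sentence]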
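
The substitute-then-split idea behind the function can be seen in isolation with a minimal sketch. The helper name `split_on_terminators` and the sample strings are hypothetical and not part of the gist. The sketch applies only the single-character rule and assumes the terminators are meant to cover both full-width and ASCII forms.

```python
import re


# Hypothetical helper for illustration only; it applies just the
# single-character terminator rule, not the ellipsis or quote rules.
def split_on_terminators(text):
    # Group 1 is the terminator, group 2 is the following character, which
    # must not be a closing quote. A newline is inserted between the two.
    marked = re.sub(r'([。！？!?])([^”’])', r'\1\n\2', text)
    # Splitting on the inserted newlines yields the sentences.
    return [s for s in marked.split('\n') if s]


print(split_on_terminators('今天天气很好。我们去公园吧！你觉得呢？'))
# ['今天天气很好。', '我们去公园吧！', '你觉得呢？']

# A terminator directly before a closing quote is left alone by this rule,
# so the quotation stays in one piece.
print(split_on_terminators('他说：“你好！”'))
# ['他说：“你好！”']
```

Because the pattern requires a character after the terminator, a terminator at the very end of the text never triggers a break, which is why no empty trailing sentence appears.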