@khuangaf
Created February 4, 2018 13:15
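import re

# input_text is assumed to be a str holding the raw text, defined earlier
# remove parenthesized text
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
# store as list of sentences
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    # drop an optional prefix of at most 20 non-colon characters followed by ':'
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    # split the remainder on periods and skip empty pieces
    sentences_strings_ted.extend(sent for sent in m.group('postcolon').split('.') if sent)
# store as list of lists of words
sentences_ted = []
for sent_str in sentences_strings_ted:
    # lowercase, replace every run of non-alphanumeric characters with a space, then split
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)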
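
To see what each step does, the snippet can be wrapped in a function and run on a small input. This is only a sketch: the name `preprocess` and the sample text are hypothetical and not part of the original gist, and the regular expressions are copied from the code above unchanged.

```python
import re


def preprocess(input_text):
    """Hypothetical wrapper around the gist's three steps (not in the original)."""
    # Step 1: remove parenthesized text
    no_parens = re.sub(r'\([^)]*\)', '', input_text)

    # Step 2: per line, drop an optional short "prefix:" and split on periods
    sentence_strings = []
    for line in no_parens.split('\n'):
        m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
        sentence_strings.extend(s for s in m.group('postcolon').split('.') if s)

    # Step 3: lowercase and keep only runs of a-z and 0-9 as tokens
    return [re.sub(r"[^a-z0-9]+", " ", s.lower()).split() for s in sentence_strings]


# Made-up sample input, for illustration only
sample = "Chris Anderson: Welcome back (Applause). Thank you all.\nIt's great to be here."
tokens = preprocess(sample)
print(tokens)
# [['welcome', 'back'], ['thank', 'you', 'all'], ['it', 's', 'great', 'to', 'be', 'here']]
```

Two behaviors are worth noting. Apostrophes count as non-alphanumeric, so "It's" becomes the two tokens `it` and `s`. A piece that contains only whitespace passes the `if s` filter and produces an empty token list, so downstream code may want to drop empty lists.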