Created December 20, 2017 16:58
Segment a spaCy document into "paragraphs", treating any whitespace token that contains more than one newline as a paragraph delimiter.
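def paragraphs(document):
    # Index of the first token of the paragraph being built.
    start = 0
    for token in document:
        # A whitespace token with two or more newlines ends the paragraph.
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            # The delimiter token itself opens the next paragraph.
            start = token.i
    yield document[start:]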
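A minimal usage sketch, assuming spaCy is installed. It uses a blank English pipeline (`spacy.blank("en")`) so no trained model has to be downloaded, and the sample text is made up for illustration. The function is repeated here so the block runs on its own.

```python
import spacy


def paragraphs(document):
    start = 0
    for token in document:
        if token.is_space and token.text.count("\n") > 1:
            yield document[start:token.i]
            start = token.i
    yield document[start:]


# A blank pipeline only tokenizes, which is all paragraphs() needs.
nlp = spacy.blank("en")

# Hypothetical sample text: blank lines separate three paragraphs,
# while the single newline inside the second one does not split it.
text = "First paragraph.\n\nSecond paragraph.\nStill second.\n\n\nThird."
doc = nlp(text)

# Every paragraph after the first starts with its delimiter token,
# so strip the surrounding whitespace before using the text.
texts = [span.text.strip() for span in paragraphs(doc)]

for paragraph_text in texts:
    print(repr(paragraph_text))
```

Each yielded item is a spaCy `Span`, so token attributes remain available per paragraph. Because the delimiter token stays at the start of the following span, a document that begins with blank lines would yield an empty first span; filtering with `if len(span)` is one way to skip it.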