somerandomnerd/0_sentence_splitting_regex.txt

Last active December 28, 2015 02:39

Regular expression for splitting sentences

0_sentence_splitting_regex.txt

(?<=[a-z])[\.?!]\s(?=([A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\.

1_What it does.md

Looks for 'end of sentence' punctuation, with certain conditions to avoid false matches (e.g. initials and abbreviations).

(?<=[a-z])

Only matches if the punctuation is preceded by a lower-case letter (to avoid acronyms being misidentified as sentence breaks).

[\.?!]\s

Matches end of sentence punctuation, followed by white space

(?=([A-Z][a-z]|A |I |[A-Z]\.))

Only when followed by a capitalised word, the words "A" or "I", or an initial (a capital letter followed by a full stop).

|…|\.\.\.

Ellipses (either typographical or 'fake') are always taken to be sentence breaks.
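
As a rough illustration of how the pieces above combine, here is a minimal Python sketch that cuts a string after each match. The `split_sentences` helper and the sample text are illustrative assumptions, not part of the original gist. It uses `finditer` rather than `re.split` because the capturing group inside the lookahead would otherwise be returned as part of the split output.

```python
import re

# The pattern from 0_sentence_splitting_regex.txt, unchanged.
SENTENCE_BREAK = re.compile(
    r"(?<=[a-z])[\.?!]\s(?=([A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\."
)


def split_sentences(text):
    """Cut text after each match, keeping the punctuation with its sentence."""
    sentences = []
    start = 0
    for match in SENTENCE_BREAK.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Whatever follows the last break is treated as a final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = (
    "The meeting ran late. He met J. Jones at the door. "
    "They talked... and then left."
)
for sentence in split_sentences(sample):
    print(sentence)
```

Note that the initial in "J. Jones" is not treated as a break, because the full stop follows a capital letter, while the 'fake' ellipsis is.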

NOTES

A pre-parser to eliminate double spaces/whitespace or whitespace at the beginning of a line would be a good idea.
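
The reason a pre-parser helps is that the pattern expects exactly one whitespace character between the punctuation and the following capital letter, so a double space after a full stop hides the break. A minimal sketch of such a clean-up step might look like the following; the `normalise_whitespace` name and its behaviour (collapse runs of spaces and tabs, trim each line) are assumptions about what the pre-parser should do.

```python
import re


def normalise_whitespace(text):
    """Collapse runs of spaces/tabs to one space and trim every line."""
    cleaned = []
    for line in text.splitlines():
        # Collapse first, then strip leading and trailing whitespace.
        cleaned.append(re.sub(r"[ \t]+", " ", line).strip())
    return "\n".join(cleaned)


raw = "  It was late.  He went home.\n\tThen it rained."
print(normalise_whitespace(raw))
```

Line breaks are kept as they are here; whether they should also be collapsed depends on the input text.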

ISSUES

Fails to match when a sentence ends with an acronym - e.g.

"It's the first group action of its kind in Britain and one of only a handful of lawsuits against tobacco companies outside the U.S. A Paris lawyer last year sued France's Seita SA on behalf of two cancer-stricken smokers."

Could replace (?<=[a-z]) with (?<=([a-z])|[A-Z].) - but this would also incorrectly match initialled names (e.g. "P.G. Wodehouse"), and I think I'd prefer to avoid false positives.

Fails to deal with brackets - e.g.

"Punctuation goes outside brackets (when enclosed in the sentence). (Not when the whole sentence is within brackets though.)"

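Both issues can be reproduced with a short Python sketch. The `RELAXED` pattern is a hypothetical variant, not the author's exact suggestion: it assumes the intent is to also allow a break after the tail of an acronym (a full stop followed by a capital letter, as in "U.S"). Python's `re` module only accepts fixed-width lookbehinds, so the alternative is written as two separate lookbehinds. The `count_breaks` helper is also illustrative.

```python
import re

# The pattern from the gist, unchanged.
ORIGINAL = re.compile(
    r"(?<=[a-z])[\.?!]\s(?=([A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\."
)

# Hypothetical relaxed variant (an assumption, not the author's pattern):
# the punctuation may also follow a full stop plus a capital letter.
RELAXED = re.compile(
    r"(?:(?<=[a-z])|(?<=\.[A-Z]))[\.?!]\s(?=([A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\."
)


def count_breaks(pattern, text):
    """Number of sentence breaks the pattern finds in text."""
    return sum(1 for _ in pattern.finditer(text))


acronym = (
    "a handful of lawsuits against tobacco companies outside the U.S. "
    "A Paris lawyer last year sued"
)
initials = "P.G. Wodehouse"
brackets = (
    "Punctuation goes outside brackets (when enclosed in the sentence). "
    "(Not when the whole sentence is within brackets though.)"
)

for label, text in (("acronym", acronym), ("initials", initials), ("brackets", brackets)):
    print(label, count_breaks(ORIGINAL, text), count_breaks(RELAXED, text))
```

The relaxed variant finds the break after "U.S." but also reports one inside "P.G. Wodehouse", which is the false positive described above. Neither pattern finds the break in the bracket example, since the full stop there follows a closing bracket rather than a lower-case letter.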