@somerandomnerd
Last active December 28, 2015 02:39
Regular expression for splitting sentences
(?<=[a-z])[\.?!]\s(?=([A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\.

Looks for end-of-sentence punctuation, with certain conditions to avoid false matches (e.g. initials, abbreviations).

(?<=[a-z])

Only matches if the punctuation is preceded by a lower-case letter (to avoid acronyms being misidentified as sentence breaks).

[.?!]\s

Matches end of sentence punctuation, followed by white space

(?=([A-Z][a-z]|A |I |[A-Z]\.))

Only when followed by a capitalised word, the words "A" or "I", or an initial (a capital letter followed by a full stop).

|…|\.\.\.

Ellipses (either typographical or 'fake') are always taken to be sentence breaks.
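The pieces above can be tried out together in Python. This is only a sketch under a few assumptions: the function name `split_sentences` is made up for illustration, the lookahead uses a non-capturing group `(?:...)` in place of the original capturing group (the matches are the same), and each piece of text is cut just after the matched break so the punctuation stays with its sentence.

```python
import re

# The expression from above, with a non-capturing group inside the lookahead.
SENTENCE_BREAK = re.compile(
    r"(?<=[a-z])[.?!]\s(?=(?:[A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\."
)


def split_sentences(text):
    """Cut text after every match of SENTENCE_BREAK (hypothetical helper)."""
    sentences = []
    start = 0
    for match in SENTENCE_BREAK.finditer(text):
        # Keep the punctuation with the sentence it closes.
        chunk = text[start:match.end()].strip()
        if chunk:
            sentences.append(chunk)
        start = match.end()
    # Whatever follows the last break is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("The U.S. economy grew. It was a good year! Was it? I think so."))
# ['The U.S. economy grew.', 'It was a good year!', 'Was it?', 'I think so.']
```

The full stops inside "U.S." are left alone because each one follows a capital letter, and the final full stop is not a break because no white space follows it.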

NOTES

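A pre-parser to collapse double spaces/whitespace and strip whitespace at the beginning of a line would be a good idea, since the expression expects exactly one white space character between the punctuation and the start of the next sentence.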
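One possible shape for such a pre-parser, as a rough sketch only: the name `normalise_whitespace` is hypothetical, and treating each line separately (rather than joining lines together) is an assumption about what the input looks like.

```python
import re


def normalise_whitespace(text):
    """Strip each line and collapse runs of spaces/tabs to a single space."""
    cleaned = []
    for line in text.splitlines():
        # Remove white space at the beginning (and end) of the line.
        line = line.strip()
        # Collapse any run of spaces or tabs into one space.
        cleaned.append(re.sub(r"[ \t]+", " ", line))
    return "\n".join(cleaned)


print(repr(normalise_whitespace("  Hello   world.  \n\t Next  line.")))
# 'Hello world.\nNext line.'
```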

ISSUES

  • Fails to match when a sentence ends with an acronym, e.g.

"It's the first group action of its kind in Britain and one of only a handful of lawsuits against tobacco companies outside the U.S. A Paris lawyer last year sued France's Seita SA on behalf of two cancer-stricken smokers."

  • Could replace (?<=[a-z]) with (?<=([a-z])|[A-Z].), but this would also incorrectly match initialled names (e.g. "P.G. Wodehouse"), and I think I'd prefer to avoid false positives.

  • Fails to deal with brackets, e.g.

"Punctuation goes outside brackets (when enclosed in the sentence). (Not when the whole sentence is within brackets though.)"
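The issues above can be reproduced with a short Python check. Treat this as a sketch: `STRICT` is the expression from this note (with a non-capturing group in the lookahead), while `RELAXED` is a hypothetical variant invented here purely to show the trade-off. It allows any letter before the punctuation and is not the alternative lookbehind suggested in the list. `count_breaks` is likewise an illustrative helper.

```python
import re

# The expression from this note.
STRICT = re.compile(
    r"(?<=[a-z])[.?!]\s(?=(?:[A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\."
)
# Hypothetical variant (an assumption, for illustration only): also
# accepts a capital letter before the punctuation.
RELAXED = re.compile(
    r"(?<=[A-Za-z])[.?!]\s(?=(?:[A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\."
)


def count_breaks(pattern, text):
    """Number of sentence breaks the pattern finds in text."""
    return sum(1 for _ in pattern.finditer(text))


acronym = (
    "It's the first group action of its kind in Britain and one of only "
    "a handful of lawsuits against tobacco companies outside the U.S. "
    "A Paris lawyer last year sued France's Seita SA on behalf of two "
    "cancer-stricken smokers."
)
brackets = (
    "Punctuation goes outside brackets (when enclosed in the sentence). "
    "(Not when the whole sentence is within brackets though.)"
)
initials = "P.G. Wodehouse wrote it."

print(count_breaks(STRICT, acronym))    # 0: the break after "U.S." is missed
print(count_breaks(STRICT, brackets))   # 0: ")" sits next to the full stop
print(count_breaks(RELAXED, acronym))   # 1: the break after "U.S." is found
print(count_breaks(RELAXED, initials))  # 1: false positive after "P.G."
```

Loosening the lookbehind recovers the missed break but splits the initialled name, which is the false positive the note would rather avoid.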
