Looks for 'end of sentence' punctuation, with certain conditions to avoid false matches (e.g. initials, abbreviations).
(?<=[a-z])
Only matches if the punctuation is preceded by a lowercase letter (to avoid acronyms being misidentified as sentence breaks)
[.?!]\s
Matches end-of-sentence punctuation followed by a single whitespace character
(?=([A-Z][a-z]|A |I |[A-Z]\.))
Only when followed by a capitalised word, the words "A" or "I", or an initial (a capital letter followed by a full stop).
|…|\.\.\.
Ellipses (either typographical or 'fake') are always taken to be sentence breaks.
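As a minimal sketch in Python (not the original author's code), the pieces above can be assembled into one pattern and used to cut text into sentences. It assumes the dots in the initial and 'fake' ellipsis parts are meant as literal full stops, so they are escaped, and it uses a non-capturing group in the lookahead. The names SENTENCE_BREAK and split_sentences are hypothetical.

```python
import re

# Assumption: dots are escaped so they match literal full stops only.
SENTENCE_BREAK = re.compile(
    r"(?<=[a-z])[.?!]\s(?=(?:[A-Z][a-z]|A |I |[A-Z]\.))"  # guarded punctuation
    r"|…|\.\.\."                                           # either kind of ellipsis
)

def split_sentences(text):
    """Cut text after each match, keeping the punctuation with its sentence."""
    sentences, start = [], 0
    for m in SENTENCE_BREAK.finditer(text):
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("He left. She stayed with J. Smith. I went home."))
# ['He left.', 'She stayed with J. Smith.', 'I went home.']
```

The initial in "J. Smith" is not treated as a break because the full stop follows a capital letter, so the lookbehind fails there.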
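A pre-parser that collapses repeated whitespace and strips whitespace at the beginning of a line would be a good idea, since the pattern expects exactly one whitespace character between the punctuation and the next word.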
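One possible shape for such a pre-parser, sketched in Python. The helper name normalise_whitespace is hypothetical, and trimming trailing whitespace as well as leading whitespace is an extra assumption of this sketch.

```python
import re

def normalise_whitespace(text):
    """Collapse runs of whitespace to one space and trim each line."""
    cleaned = []
    for line in text.splitlines():
        # \s+ covers repeated spaces and tabs within the line
        cleaned.append(re.sub(r"\s+", " ", line).strip())
    return "\n".join(cleaned)

print(repr(normalise_whitespace("  Hello.  World.\n\tNext   line. ")))
# 'Hello. World.\nNext line.'
```

Without this step, a double space after a full stop leaves a second space where the lookahead expects a capital letter, so the break is missed.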
ISSUES
- Fails to match when a sentence ends with an acronym, e.g.
"It's the first group action of its kind in Britain and one of only a handful of lawsuits against tobacco companies outside the U.S. A Paris lawyer last year sued France's Seita SA on behalf of two cancer-stricken smokers."
- Could replace (?<=[a-z]) with (?<=([a-z])|[A-Z].) - but this would also incorrectly match initialled names (e.g. "P.G. Wodehouse"), and I think I'd prefer to avoid false positives.
- Fails to deal with brackets, e.g.
"Punctuation goes outside brackets (when enclosed in the sentence). (Not when the whole sentence is within brackets though.)"
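Both failures listed above can be reproduced with a short Python check. This is a sketch under the same assumption as before (dots escaped to mean literal full stops), and the variable names are hypothetical. The acronym text is shortened from the example quoted above.

```python
import re

# Assumption: same pattern as described above, with the dots escaped.
SENTENCE_BREAK = re.compile(
    r"(?<=[a-z])[.?!]\s(?=(?:[A-Z][a-z]|A |I |[A-Z]\.))|…|\.\.\."
)

acronym = "lawsuits against tobacco companies outside the U.S. A Paris lawyer last year sued."
brackets = ("Punctuation goes outside brackets (when enclosed in the sentence). "
            "(Not when the whole sentence is within brackets though.)")

# Offsets of every sentence break found in each sample
acronym_hits = [m.start() for m in SENTENCE_BREAK.finditer(acronym)]
bracket_hits = [m.start() for m in SENTENCE_BREAK.finditer(brackets)]

print(acronym_hits)  # [] : the full stop after "S" follows a capital letter
print(bracket_hits)  # [] : the full stop follows ")" rather than a lowercase letter
```

In both samples the lookbehind for a lowercase letter is what blocks the match, which is the trade-off accepted above in order to avoid false positives.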