(([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+)|([A-Z]{2,})|([a-z][A-Z])[a-z]*[A-Z][a-z]*
Looks for sets of capitalised words.
The main part is two halves separated by an 'or' (the pipe, "|"); a third alternative, for capitals in the middle of a word, is covered further down.
(([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+)
|
([A-Z]{2,})
Looking at the second part first: just 2 or more consecutive capital letters:
([A-Z]{2,})
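A quick way to check this part on its own is a minimal Python sketch. Python's `re` module is an assumption here (the notes don't say which regex engine is in use), and the sample sentence is made up.

```python
import re

# Second alternative from the pattern: two or more consecutive capitals.
ACRONYM = re.compile(r"[A-Z]{2,}")

text = "The BBC and NASA met Bob at IBM."

# "The" and "Bob" have only one capital each, so they are skipped.
acronyms = ACRONYM.findall(text)
print(acronyms)  # ['BBC', 'NASA', 'IBM']
```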
The first part breaks into two once the brackets that surround it are stripped:
([A-Z]([a-z]+|\.+))+
Looks for one or more strings that start with a capital letter, followed by either a string of lower case letters (1 or more), or one or more full stops.
(\s[A-Z][a-z]+)+
...followed by one or more groups, each made of a single whitespace character (\s, so a space, tab or newline), another capital letter, and one or more lower case letters.
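The two pieces above can be tried together in a short Python sketch. Python's `re` is an assumption, and the helper `full_matches` and the sample sentence are made up for illustration.

```python
import re

# First alternative only: capitalised word(s) or initials,
# then at least one more whitespace-separated capitalised word.
NAME = re.compile(r"([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+")

def full_matches(pattern, text):
    # finditer + group(0) returns whole matches; findall would return
    # tuples of the capture groups instead.
    return [m.group(0) for m in pattern.finditer(text)]

names = full_matches(NAME, "We met J. Smith in New York last May.")
print(names)  # ['J. Smith', 'New York']
```

"We" and "May" are skipped because neither is followed by whitespace and another capitalised word.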
|([a-z][A-Z])[a-z]*[A-Z][a-z]*
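Or strings with a lower case letter immediately followed by a capital, and then a second capital later on. As written, this part needs two capitals after that first lower case letter, so words with a single capital in the middle (eg. iPhone, ComCast) aren't matched by it on their own; the looser pattern in the notes below does catch them. The match itself always starts at a lower case letter, though with no word boundary that letter can sit in the middle of a word that begins with a capital.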
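To see the alternatives working together, here is a small Python sketch that runs the complete pattern from the top of these notes. Python's `re` is an assumption about the engine, and the sentence and variable names are made up.

```python
import re

# The complete pattern, copied from the top of these notes.
PATTERN = re.compile(
    r"(([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+)"
    r"|([A-Z]{2,})"
    r"|([a-z][A-Z])[a-z]*[A-Z][a-z]*"
)

text = "Tim Cook of Apple spoke to the BBC in New York."

# group(0) is the whole match, whichever alternative produced it.
found = [m.group(0) for m in PATTERN.finditer(text)]
print(found)  # ['Tim Cook', 'BBC', 'New York']
```

"Apple" is left out because it is a single capitalised word with nothing capitalised after it.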
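Single capitalised words aren't matched. This could be done with a negative look-behind, so that it only matches a capitalised word that isn't at the start of a sentence: (?<!\.\s) (the full stop needs escaping, otherwise . matches any character and nearly every word is rejected).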
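A rough Python sketch of the look-behind idea, assuming the dot is meant as a literal full stop (escaped as `\.`). The extra `(?<!^)` and `\b` are my additions, not part of the original pattern: the first skips the very first word of the text, the second stops the match starting in the middle of a word.

```python
import re

# Hypothetical single-word matcher: a capitalised word that is not at the
# start of the text and not directly after "full stop + whitespace".
SINGLE = re.compile(r"(?<!^)(?<!\.\s)\b[A-Z][a-z]+")

text = "We flew to Paris. Then Anna drove to Rome."

singles = SINGLE.findall(text)
print(singles)  # ['Paris', 'Anna', 'Rome']
```

This only treats a full stop as the end of a sentence; "!" and "?" would need handling separately, and a proper noun that opens a sentence is still missed.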
Should I be using word boundaries? (ie. \b)
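One way to weigh up the word-boundary question is to compare the all-capitals part with and without `\b`. This is only a Python sketch with a made-up sentence; the names `UNBOUNDED` and `BOUNDED` are hypothetical.

```python
import re

UNBOUNDED = re.compile(r"[A-Z]{2,}")      # as in the original pattern
BOUNDED = re.compile(r"\b[A-Z]{2,}\b")    # same, wrapped in word boundaries

text = "The USAir flight was tracked by NASA."

without_b = UNBOUNDED.findall(text)
with_b = BOUNDED.findall(text)

print(without_b)  # ['USA', 'NASA']  (picks "USA" out of "USAir")
print(with_b)     # ['NASA']
```

Without boundaries the pattern can return a fragment of a longer word; with them it only returns whole all-capitals words.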
Words with capitals in the middle? (iPhone, CompuServe, ComCast) - which may or may not begin with a capital? [A-Za-z]+[A-Z]+[A-Za-z]*
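A quick Python check of this looser pattern against the third alternative of the main regex. Python's `re` is assumed; `THIRD_ALT`, `LOOSE`, `first_match` and the test string "aBcD" are made up for the comparison.

```python
import re

THIRD_ALT = re.compile(r"([a-z][A-Z])[a-z]*[A-Z][a-z]*")  # from the main regex
LOOSE = re.compile(r"[A-Za-z]+[A-Z]+[A-Za-z]*")           # the pattern above

def first_match(pattern, word):
    m = pattern.search(word)
    return m.group(0) if m else None

words = ["iPhone", "CompuServe", "ComCast", "London", "aBcD"]
results = {w: (first_match(THIRD_ALT, w), first_match(LOOSE, w)) for w in words}

for word, (third, loose) in results.items():
    print(word, third, loose)
```

In this check the third alternative only matches "aBcD", since it needs a second capital after the lower case letter, while the looser pattern matches every word except "London". The looser pattern also matches all-capitals words such as "NASA", so it overlaps with the `[A-Z]{2,}` part.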
Won't catch anything with numbers (eg. O2, F1.)
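One possible extra alternative for names like these, sketched in Python. This is a guess at a fix rather than part of the original pattern; `ALNUM` and the sample sentence are hypothetical.

```python
import re

# Hypothetical extra alternative: one or more capitals followed by
# one or more digits, as a whole word.
ALNUM = re.compile(r"\b[A-Z]+[0-9]+\b")

text = "She watched F1 on her O2 phone."

codes = ALNUM.findall(text)
print(codes)  # ['F1', 'O2']
```

It would also pick up things like paper sizes ("A4"), so it may be too greedy for some texts.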
Falls over when proper nouns are at the start of a sentence - eg. "H.J. Heinz" - doesn't match the initial capital/full stop.