@somerandomnerd
Last active December 28, 2015 16:29
Regular expression for identifying proper nouns
(([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+)|([A-Z]{2,})|([a-z][A-Z])[a-z]*[A-Z][a-z]*

Looks for sets of capitalised words.
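As a rough way to try the expression out, here is a minimal sketch assuming Python's `re` module. The helper name `find_candidates` and the sample sentence are made up for illustration.

```python
import re

# The full expression from above, unchanged.
PROPER_NOUN = re.compile(
    r"(([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+)|([A-Z]{2,})|([a-z][A-Z])[a-z]*[A-Z][a-z]*"
)

def find_candidates(text):
    """Return every substring the expression matches, in order."""
    # group(0) is the whole match, whichever alternative produced it.
    return [m.group(0) for m in PROPER_NOUN.finditer(text)]

print(find_candidates("Yesterday the BBC interviewed Tim Berners about London."))
# → ['BBC', 'Tim Berners']
```

Note that "London" is not picked up: a capitalised word standing on its own is not covered by any of the alternatives.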

Basically, a set of alternatives separated by an 'or' (the pipe - "|"). The first two are:

(([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+)
|
([A-Z]{2,})

Looking at the second part first: just two or more consecutive capital letters.

([A-Z]{2,})

The first part breaks into two once the brackets that surround it are stripped:

([A-Z]([a-z]+|\.+))+

Looks for one or more strings that start with a capital letter, followed by either a string of lower case letters (1 or more), or one or more full stops.

(\s[A-Z][a-z]+)+

...followed by one or more further words, each made of a whitespace character (\s, so not only a space), a capital letter, and one or more lower case letters.
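To see the two pieces working together, here is a small sketch of the first alternative on its own, again assuming Python's `re`. `find_name_runs` and the sample text are hypothetical.

```python
import re

# First alternative only: capitalised word(s) or initials,
# then one or more whitespace-separated capitalised words.
NAME_RUN = re.compile(r"(([A-Z]([a-z]+|\.+))+(\s[A-Z][a-z]+)+)")

def find_name_runs(text):
    """Return the full text of each run of capitalised words."""
    return [m.group(0) for m in NAME_RUN.finditer(text)]

print(find_name_runs("We met John Smith and J. Edgar Hoover in New York City."))
# → ['John Smith', 'J. Edgar Hoover', 'New York City']
```

"We" is skipped because the word after it is not capitalised, so the second piece never gets a chance to match.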

|([a-z][A-Z])[a-z]*[A-Z][a-z]*

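Or words that have a capital in the middle. As written, this branch needs a lower case letter immediately followed by a capital, and then a second capital later in the word, so it is stricter than intended: on its own it does not match iPhone or ComCast (see Issues).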
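Checking this branch in isolation shows what it does and does not accept. This is a sketch assuming Python's `re`; `camel_match` and the test words are illustrative only.

```python
import re

# Third alternative only.
CAMEL = re.compile(r"([a-z][A-Z])[a-z]*[A-Z][a-z]*")

def camel_match(word):
    """Return the matched substring, or None if the branch does not match."""
    m = CAMEL.search(word)
    return m.group(0) if m else None

for word in ["iPhone", "ComCast", "iPhoneX", "MacBookPro"]:
    print(word, "->", camel_match(word))
```

The branch needs a lower case letter directly before a capital and then another capital further on, so "iPhone" and "ComCast" come back as None, "iPhoneX" matches in full, and "MacBookPro" only matches from the "c" onwards ("cBookPro").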

Issues

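Single capitalised words aren't matched. This could be done with a negative look-behind, so that a capitalised word only matches when it isn't at the start of a sentence: (?<!\.\s). The full stop needs escaping here; an unescaped . matches any character, so (?<!.\s) would reject every word that follows a space.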
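One possible shape for the look-behind idea, sketched in Python. The pattern below is an assumption rather than a tested addition to the main expression: it also treats "!" and "?" as sentence endings, and adds `(?<!^)` to skip the very first word of the text.

```python
import re

# A capitalised word that is neither at the start of the text
# nor directly after sentence-ending punctuation plus whitespace.
SINGLE_CAP = re.compile(r"(?<!^)(?<![.!?]\s)\b[A-Z][a-z]+\b")

def find_single_caps(text):
    """Return capitalised words that do not open a sentence."""
    return SINGLE_CAP.findall(text)

print(find_single_caps("Yesterday we flew to Paris. Then we met Anna."))
# → ['Paris', 'Anna']
```

The trade-off is that a proper noun which does open a sentence is thrown away along with the ordinary sentence-initial words.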

Should I be using word boundaries? (ie. \b)
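A quick comparison suggests word boundaries would help at least with the all-capitals branch. This is a sketch in Python; the function names and the sample text are hypothetical.

```python
import re

LOOSE = re.compile(r"[A-Z]{2,}")        # as in the expression above
BOUNDED = re.compile(r"\b[A-Z]{2,}\b")  # same, wrapped in word boundaries

def acronyms_loose(text):
    """Runs of capitals anywhere, including inside longer words."""
    return LOOSE.findall(text)

def acronyms_bounded(text):
    """Runs of capitals that form a whole word."""
    return BOUNDED.findall(text)

sample = "The XMLHttpRequest object and the BBC"
print(acronyms_loose(sample))    # → ['XMLH', 'BBC']
print(acronyms_bounded(sample))  # → ['BBC']
```

Without \b the branch grabs the run of capitals at the front of "XMLHttpRequest"; with it, only whole-word acronyms survive.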

Words with capitals in the middle? (iPhone, CompuServe, ComCast) - which may or may not begin with a capital? [A-Za-z]+[A-Z]+[A-Za-z]*
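Trying the candidate pattern above on its own, as a Python sketch (`find_mid_capitals` and the sentence are made up for the test):

```python
import re

# Candidate from the note above: at least one letter, then a capital,
# then any further letters.
MID_CAP = re.compile(r"[A-Za-z]+[A-Z]+[A-Za-z]*")

def find_mid_capitals(text):
    """Return words containing a capital anywhere after the first letter."""
    return MID_CAP.findall(text)

print(find_mid_capitals("My iPhone uses CompuServe in London via BBC"))
# → ['iPhone', 'CompuServe', 'BBC']
```

It picks up iPhone and CompuServe and leaves "London" alone, but it also matches all-capitals words such as "BBC", so it overlaps with the ([A-Z]{2,}) branch.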

Won't catch anything with numbers (eg. O2, F1.)
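One way the all-capitals branch might be widened to allow digits, sketched in Python. The pattern is an assumption, not part of the expression above: a capital letter followed by one or more capitals or digits, as a whole word.

```python
import re

# Hypothetical variant of ([A-Z]{2,}) that also accepts digits
# after the first capital letter.
CAPS_OR_DIGITS = re.compile(r"\b[A-Z][A-Z0-9]+\b")

def find_codes(text):
    """Return whole words made of a capital followed by capitals/digits."""
    return CAPS_OR_DIGITS.findall(text)

print(find_codes("O2 sponsors an F1 team and the BBC"))
# → ['O2', 'F1', 'BBC']
```

This would also catch things that are not proper nouns, such as paper sizes (A4), so it probably needs filtering afterwards.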

Falls over when proper nouns are at the start of a sentence - eg. "H.J. Heinz" - doesn't match the initial capital/full stop.
