[A-Za-z]+(?:(?:-[A-Za-z]+)*)?(?:'[A-Za-z]*)?
For each item, the first regular expression matches only the part being described, and the second is the cumulative regular expression built up so far.
-
A sequence of one or more letters:
[A-Za-z]+
[A-Za-z]+
-
Optionally followed by a dash and a sequence of one or more letters:
(?:-[A-Za-z]+)?
[A-Za-z]+(?:-[A-Za-z]+)?
-
Where the optional following part is repeated zero or more times:
(?:(?:-[A-Za-z]+)*)?
[A-Za-z]+(?:(?:-[A-Za-z]+)*)?
Note that expressing "the optional following part is repeated zero or more times" as
(?:-[A-Za-z]+)*?
won't work, because the trailing
*?
is a lazy quantifier. It means "match the preceding token (a non-capturing group in this case) as few times as possible", not "the token that is repeated zero or more times is optional". To express the latter literally, surround the part that matches zero or more times with a non-capturing group, and follow that group with a question mark to mark it as optional. (Because * already allows zero repetitions, the group followed by a plain * matches the same strings; the outer optional group only makes the intent explicit.)
-
Which is, in turn, optionally followed by a quote and zero or more letters:
(?:'[A-Za-z]*)?
[A-Za-z]+(?:(?:-[A-Za-z]+)*)?(?:'[A-Za-z]*)?
- This pattern won't work for words that contain letters outside the basic Latin alphabet, such as letters with diacritical marks. However, this is easy to fix: replace every occurrence of
[A-Za-z]
with a character set that covers all characters considered to be valid letters.
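
The points above can be tried out with a short sketch. It assumes Python's re module (the answer does not name a regex engine), uses made-up sample strings, and picks [^\W\d_] as one possible Unicode-aware replacement for [A-Za-z]; that class is an illustration, not part of the original pattern, and it only approximates "any letter".

```python
import re

# Final cumulative pattern from the walkthrough.
WORD = re.compile(r"[A-Za-z]+(?:(?:-[A-Za-z]+)*)?(?:'[A-Za-z]*)?")

# Variant with the lazy quantifier *? that the note warns about.
LAZY = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*?(?:'[A-Za-z]*)?")

# Assumed replacement class (not from the original answer):
# any word character that is neither a decimal digit nor an underscore.
LETTER = r"[^\W\d_]"
UNICODE_WORD = re.compile(rf"{LETTER}+(?:-{LETTER}+)*(?:'{LETTER}*)?")

sample = "The well-known mother-in-law doesn't mind"  # made-up sample text

print(WORD.findall(sample))
# ['The', 'well-known', 'mother-in-law', "doesn't", 'mind']

# The lazy group matches zero times whenever the rest can still succeed,
# so hyphenated words are split at each dash.
print(LAZY.findall(sample))
# ['The', 'well', 'known', 'mother', 'in', 'law', "doesn't", 'mind']

accented = "na\u00efve caf\u00e9"  # "naïve café"
print(WORD.findall(accented))          # ['na', 've', 'caf']
print(UNICODE_WORD.findall(accented))  # ['naïve', 'café']
```

The Unicode variant drops the redundant outer optional group, since * already permits zero repetitions. If an exact definition of "letter" matters, a stricter character set may be needed than the one assumed here.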