[A-Za-z]+(?:(?:-[A-Za-z]+)*)?(?:'[A-Za-z]*)?

Explanation

For each item, the first line of code is the regular expression for that item's description, and the second line is the cumulative regular expression up to that point.

  • A sequence of one or more letters:

    [A-Za-z]+
    [A-Za-z]+
    
  • Optionally followed by a dash and a sequence of one or more letters:

    (?:-[A-Za-z]+)?
    [A-Za-z]+(?:-[A-Za-z]+)?
    
  • Where the optional following part is repeated zero or more times:

    (?:(?:-[A-Za-z]+)*)?
    [A-Za-z]+(?:(?:-[A-Za-z]+)*)?
    

    Note that writing "the optional following part is repeated zero or more times" as (?:-[A-Za-z]+)*? won't work. The trailing *? is a lazy quantifier: it means "match the preceding token (here, a non-capturing group) as few times as possible", not "the token that is repeated zero or more times is optional". To express the latter, the repeated part is wrapped in its own non-capturing group, which is then followed by a question mark. Since * already permits zero repetitions, the outer group and question mark are strictly redundant: (?:-[A-Za-z]+)* matches exactly the same strings.

  • Which is, in turn, optionally followed by a quote and zero or more letters:

    (?:'[A-Za-z]*)?
    [A-Za-z]+(?:(?:-[A-Za-z]+)*)?(?:'[A-Za-z]*)?
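
The steps above can be sketched in Python as follows. This is a minimal illustration, assuming Python's standard `re` module; the names `WORD`, `LAZY`, `GREEDY`, and `find_words` are made up for this example and are not part of the original gist. The block assembles the cumulative pattern from its pieces and also compares the lazy `*?` variant with the greedy `*` form discussed in the note.

```python
import re

# Pieces of the pattern, one per step in the explanation.
LETTERS = r"[A-Za-z]+"                  # one or more letters
HYPHEN_PARTS = r"(?:(?:-[A-Za-z]+)*)?"  # zero or more "-letters" parts
QUOTE_PART = r"(?:'[A-Za-z]*)?"         # optional quote, then zero or more letters

# The cumulative regular expression from the last step.
WORD = re.compile(LETTERS + HYPHEN_PARTS + QUOTE_PART)

# Variants of the hyphen step, used only to compare lazy and greedy matching.
LAZY = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*?")
GREEDY = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def find_words(text: str) -> list[str]:
    """Return every non-overlapping match of WORD in text."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return WORD.findall(text)


if __name__ == "__main__":
    print(find_words("The mother-in-law's well-known recipe isn't here."))
    # The lazy quantifier stops as early as it can; the greedy one does not.
    print(LAZY.match("well-known").group())
    print(GREEDY.match("well-known").group())
```

With `match`, the lazy variant stops after the first run of letters because nothing after `*?` forces it to continue, whereas the greedy variant consumes the hyphenated parts as well.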
    

Shortcomings

  • Won't work for words containing letters outside the basic Latin alphabet, such as letters with diacritical marks. This is easy to fix, however: replace every occurrence of [A-Za-z] with a character set that covers all characters considered to be valid letters.
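
One possible version of that replacement, sketched in Python. It assumes that the character class `[^\W\d_]` (word characters minus digits and underscore) is an acceptable stand-in for "valid letters"; in Python's `re` this is only an approximation, since it also admits a few numeric characters such as superscript digits and excludes standalone combining marks. The names `ASCII_PATTERN`, `UNICODE_LETTER`, `UNICODE_WORD`, and `find_unicode_words` are hypothetical.

```python
import re

# The original pattern, limited to the basic Latin alphabet.
ASCII_PATTERN = r"[A-Za-z]+(?:(?:-[A-Za-z]+)*)?(?:'[A-Za-z]*)?"

# Assumption: "word character that is neither a digit nor an underscore"
# approximates "any letter" for str patterns in Python 3.
UNICODE_LETTER = r"[^\W\d_]"

# Replace every occurrence of [A-Za-z] with the broader character set.
UNICODE_WORD = re.compile(ASCII_PATTERN.replace("[A-Za-z]", UNICODE_LETTER))


def find_unicode_words(text: str) -> list[str]:
    """Return every non-overlapping match of UNICODE_WORD in text."""
    return UNICODE_WORD.findall(text)
```

Regex engines that support Unicode property escapes can use a class such as `\p{L}` instead; Python's built-in `re` module does not support that syntax.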