@DV8FromTheWorld
Last active December 22, 2023 17:16
// Background:
// Word boundaries are zero-width assertions that match between a \w (word character)
// and a \W (non-word character), or between a word character and the start/end of the string.
// So, in a scenario like "-abc" there are 2 word boundaries:
// 1. between the '-' and the 'a'
// 2. between the 'c' and the end of the string
//
// Regex explanation:
// ^\\b_ -> Anchor the match to the beginning of the input, and ensure that the
// underscore we are using as the sentinel to start the italics boundary is preceded
// by a word boundary to ensure we aren't consuming an underscore from a previous word.
// (...) -> A capture group to capture the fully matched content that should be made italicized
// (?: ...)+? -> A non-capturing group. This match is what captures all the content between
// the 2 underscores that denote the italics boundaries.
//
// Given the trailing +?, this non-capturing group must match
// AT LEAST one time, but it can match multiple times. This allows us to properly
// consume as many underscores as we can. The trailing ? means that we will only
// consume as much as we need while still allowing the closing underscore to be found
// by the later _ match.
//
// Within the non-capturing group, we have multiple matchable patterns that
// can match content.
// -----------------------[ (START) non-capturing group matching pattern branches ]------------------------
// _[_(] -> Pattern which allows matching of __ and _(
// The __ is needed so that the underline rule can work (__content__)
// The _( is needed because we are using word boundaries (\b) to know when we should
// and shouldn't capture the last _. The ( is a non-word character, so a word boundary sits
// between the _ and the (, but ( is valid in URLs, which is where _ is also used, so we don't
// want to treat _( as a valid break case. Example: https://en.wikipedia.org/wiki/Endemic_(epidemiology)
// \\\\[\\s\\S] -> Pattern which allows matching of '\{anyCharacter}'.
// This allows for the markdown escape rule to work.
// This also powers '\_' acting as an escaped underscore.
// (?<!_)\\B_\\B -> Pattern which ensures that we do not prematurely break _within_ a word that contains
// underscores while the word is in italics.
// For example, without this rule, this does not render correctly: _my_cool_thing_
//
// So, to break this pattern down:
// (?<!_) -> A negative look behind pattern to ensure that the PREVIOUS character
// was NOT an underscore. This check exists to make sure that the
// underline rule will work.
// Without it, the following examples would fail:
// - __yo__
// - ~~_**__Google__**_~~
// \\B -> Assert that the following underscore is part of an existing word
// by ensuring that the previous character and the next underscore do
// not share a word boundary. (i.e, they are part of one continuous word)
// _ -> Match the _ character literally
// \\B -> Assert that the item after the underscore is not a word boundary. This ensures that
// the character that follows the underscore is a word character and thus
// we are continuing a word
// [^\\\\_] -> This pattern captures ALL CHARACTERS that are not '\' or '_'
// This is the pattern that captures all of the content between the 2 underscore
// sentinels that act as boundaries for the italics.
//
// This pattern is designed to NOT capture '\' nor '_' characters because those
// are important characters that could define the end of the boundary or be used
// in an escape.
//
// As such, the patterns that come BEFORE this pattern are specializations
// that include '\' or '_' in specialized ways to ensure those characters can be
// captured in the italics when necessary as this pattern purposely doesn't capture them.
// -----------------------[ (END) non-capturing group matching pattern branches ]------------------------
// _ -> Match the _ character literally. This is the ending boundary for the italics.
// (?! ...) -> A negative lookahead. Make sure that the content immediately following the
// literal _ character (that is acting as the ending boundary) does not match
// the provided pattern
// [(] -> A pattern to check if the immediate next character is the '(' character.
// In conjunction with the negative lookahead, we are making sure that the
// character immediately following the ending sentinel _ is NOT a '('.
//
// We need this to ensure that the '_[_(]' pattern from the above non-capturing group
// can properly detect the _(. Without this, the outer pattern will match on
// the _( before the non-capturing group can because the non-capturing group is
// defined using the non-greedy '?' (which it needs to be to avoid capturing too much).
// \\b -> Lastly, similarly to the starting sentinel, make sure that the underscore we
// captured as our ending sentinel is divided from other following content by
// a word boundary so that we are not consuming part of the way into a word that
// contains an underscore.
"^\\b_((?:_[_(]|\\\\[\\s\\S]|(?<!_)\\B_\\B|[^\\\\_])+?)_(?![(])\\b" +
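The behavior described in the comments can be checked with a small, self-contained sketch. It assumes the pattern is compiled with java.util.regex and uses only the portion of the string shown above (the original expression continues after the trailing `+`, and that continuation is not reproduced here). The class name `ItalicsRegexDemo` and the helpers `italicContent` and `wordBoundaries` are hypothetical names introduced purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class ItalicsRegexDemo {
    // Only the visible part of the documented pattern; the original string
    // is concatenated with further content that is not shown in the source.
    static final Pattern ITALICS = Pattern.compile(
        "^\\b_((?:_[_(]|\\\\[\\s\\S]|(?<!_)\\B_\\B|[^\\\\_])+?)_(?![(])\\b");

    // Returns capture group 1 (the italicized content), or null when the
    // input does not begin with an italics run.
    static String italicContent(String input) {
        Matcher m = ITALICS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Lists every index at which the zero-width \b assertion matches.
    static List<Integer> wordBoundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Background example: "-abc" has two word boundaries.
        System.out.println(wordBoundaries("-abc"));                    // [1, 4]

        // Plain italics: the closing _ is followed by a word boundary.
        System.out.println(italicContent("_hello_ world"));            // hello

        // (?<!_)\B_\B keeps underscores inside a word from ending the run.
        System.out.println(italicContent("_my_cool_thing_"));          // my_cool_thing

        // _[_(] plus (?![(]) lets "_(" pass through, as in Wikipedia URLs.
        System.out.println(italicContent("_Endemic_(epidemiology)_")); // Endemic_(epidemiology)

        // \\[\s\S] consumes an escaped underscore as content.
        System.out.println(italicContent("_a\\_b_"));                  // a\_b

        // Left unmatched so that an underline rule can handle it.
        System.out.println(italicContent("__yo__"));                   // null

        // No closing sentinel followed by a word boundary.
        System.out.println(italicContent("_hello_world"));             // null
    }
}
```

Each call exercises one branch of the non-capturing group or one of the guards around the closing underscore, so a change to the pattern can be checked against the cases the comments describe.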