Last active
December 22, 2023 17:16
// Background:
//   Word boundaries are zero-width assertions that sit between \w (word characters)
//   and \W (non-word characters), or between a word character and the start/end of the string.
//   Note that '_' is itself a word character.
//   So, in a scenario like "-abc" there are 2 word boundaries:
//     1. between the '-' and the 'a'
//     2. between the 'c' and the end of the string
//
// Regex explanation:
//   ^\\b_       -> Anchor the match at the start of the input, and require that the
//                  underscore we are using as the sentinel to open the italics is preceded
//                  by a word boundary, to ensure we aren't consuming an underscore from a previous word.
//   (...)       -> A capture group that captures the content that should be italicized.
//   (?: ...)+?  -> A non-capturing group. This group is what matches all the content between
//                  the 2 underscores that denote the italics boundaries.
//
//                  Given the trailing +?, this non-capturing group must match
//                  AT LEAST one time, but it can match multiple times. This allows us to properly
//                  consume underscores that belong to the content. The trailing ? makes the
//                  quantifier lazy: we only consume as much as we need while still allowing
//                  the closing underscore to be found by the later _ match.
//
//                  Within the non-capturing group, there are multiple alternative patterns
//                  (branches) that can match content.
// -----------------------[ (START) non-capturing group matching pattern branches ]------------------------
//   _[_(]          -> Pattern which allows matching of __ and _(
//                     The __ is needed so that the underline rule can work (__content__)
//                     The _( is needed because we are using word boundaries (\b) to know when we should
//                     and shouldn't capture the last _. The ( character is a non-word character, so a
//                     word boundary exists between _ and (, but ( is valid in URLs, which is where _ is
//                     also used, so we don't want to treat _( as a valid break case.
//                     Example: https://en.wikipedia.org/wiki/Endemic_(epidemiology)
//   \\\\[\\s\\S]   -> Pattern which allows matching of '\{anyCharacter}'.
//                     This allows for the markdown escape rule to work.
//                     This also powers '\_' acting as an escaped underscore.
//   (?<!_)\\B_\\B  -> Pattern which ensures that we do not prematurely break _within_ a word that contains
//                     underscores while the word is in italics.
//                     For example, without this rule, this does not render correctly: _my_cool_thing_
//
//                     So, to break this pattern down:
//                       (?<!_) -> A negative lookbehind to ensure that the PREVIOUS character
//                                 was NOT an underscore. This check exists to make sure that the
//                                 underline rule will work.
//                                 Without it, the following examples would fail:
//                                   - __yo__
//                                   - ~~_**__Google__**_~~
//                       \\B    -> Assert that the following underscore is part of an existing word
//                                 by ensuring that the previous character and the next underscore do
//                                 not share a word boundary (i.e., they are part of one continuous word).
//                       _      -> Match the _ character literally.
//                       \\B    -> Assert that there is no word boundary after the underscore. This ensures that
//                                 the character that follows the underscore is a word character and thus
//                                 we are continuing a word.
//   [^\\\\_]       -> This pattern matches ALL CHARACTERS that are not '\' or '_'.
//                     This is the pattern that matches all of the content between the 2 underscore
//                     sentinels that act as boundaries for the italics.
//
//                     This pattern is designed to NOT match '\' or '_' characters because those
//                     are important characters that could define the end of the boundary or be
//                     part of an escape.
//
//                     As such, the patterns that come BEFORE this pattern are specializations
//                     that include '\' or '_' in specialized ways to ensure those characters can be
//                     captured in the italics when necessary, as this pattern purposely doesn't match them.
// -----------------------[ (END) non-capturing group matching pattern branches ]------------------------
//   _           -> Match the _ character literally. This is the ending boundary for the italics.
//   (?! ...)    -> A negative lookahead. Make sure that the content immediately following the
//                  literal _ character (that is acting as the ending boundary) does not match
//                  the provided pattern.
//     [(]       -> A pattern to check if the immediate next character is the '(' character.
//                  In conjunction with the negative lookahead, we are making sure that the
//                  character immediately following the ending sentinel _ is NOT a '('.
//
//                  We need this to ensure that the '_[_(]' pattern from the above non-capturing group
//                  can properly detect the _(. Without this, the outer pattern will match on
//                  the _( before the non-capturing group can, because the non-capturing group is
//                  defined using the lazy '?' (which it needs to be to avoid capturing too much).
//   \\b         -> Lastly, similarly to the starting sentinel, make sure that the underscore we
//                  captured as our ending sentinel is divided from other following content by
//                  a word boundary so that we are not consuming part of the way into a word that
//                  contains an underscore.
"^\\b_((?:_[_(]|\\\\[\\s\\S]|(?<!_)\\B_\\B|[^\\\\_])+?)_(?![(])\\b" +
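The behaviour described in the comments can be checked against the visible part of the pattern. The sketch below is illustrative only: it assumes a JavaScript engine with lookbehind support, it uses just the fragment shown above (the rest of the concatenation after the trailing `+` is not reproduced), and the names `ITALICS_UNDERSCORE` and `italicContent` are hypothetical helpers, not part of the original code.

```javascript
// Only the fragment of the pattern shown above; whatever followed the trailing
// "+" in the original source is intentionally left out.
const ITALICS_UNDERSCORE = new RegExp(
  "^\\b_((?:_[_(]|\\\\[\\s\\S]|(?<!_)\\B_\\B|[^\\\\_])+?)_(?![(])\\b"
);

// Hypothetical helper: returns the captured italic content, or null when the
// fragment does not match at the start of `source`.
function italicContent(source) {
  const match = ITALICS_UNDERSCORE.exec(source);
  return match === null ? null : match[1];
}

// Underscores inside a word do not end the italics early ((?<!_)\B_\B branch).
console.log(italicContent("_my_cool_thing_")); // my_cool_thing

// "_(" inside a URL is consumed as content (_[_(] branch plus the (?![(]) lookahead).
console.log(
  italicContent("_see https://en.wikipedia.org/wiki/Endemic_(epidemiology)_")
); // see https://en.wikipedia.org/wiki/Endemic_(epidemiology)

// An escaped underscore stays part of the content (\\[\s\S] branch).
console.log(italicContent("_a\\_b_")); // a\_b

// A double underscore is left alone so an underline rule can claim it.
console.log(italicContent("__yo__")); // null

// No closing sentinel followed by a word boundary, so nothing is italicized.
console.log(italicContent("_foo_bar")); // null
```

Each call exercises one branch from the explanation, so removing a branch from the pattern string and re-running is a quick way to see which example depends on it.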