seth10/quote_regex.md

Last active October 8, 2022 03:30


A record of all the regular expressions I tried to match single and double quotes for LaTeX in FicBot


Single quotes

/ '([A-Za-z ]+)'(?=[^A-Za-z])/g -> `$1'

Find a quote preceded by a space. Capture one or more letters or spaces. Find another quote. Look ahead for a character that is not a letter: if the character following the last quote were a letter, it would be a contraction. The replacement string should have a leading space because the match includes the space before the first quote.
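As a rough sketch of how this first pattern behaves in JavaScript (the function name `convertSimpleSingleQuotes` and the sample sentences are made up for illustration, and the replacement is written with the leading space described above):

```javascript
// Hypothetical helper wrapping the first attempt.
function convertSimpleSingleQuotes(text) {
  // The space before the opening quote is part of the match,
  // so the replacement string puts it back.
  return text.replace(/ '([A-Za-z ]+)'(?=[^A-Za-z])/g, " `$1'");
}

console.log(convertSimpleSingleQuotes("He said 'hello there' to me."));
// He said `hello there' to me.

// A lone apostrophe in a contraction is never preceded by a space, so nothing changes.
console.log(convertSimpleSingleQuotes("I don't know."));
// I don't know.
```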
/(\s|{)(?:')(.+?)(?:')(\s|})/g -> $1`$2'$3

First big new attempt, see detailed explanation at 675e6a1. Capture a whitespace character or opening brace (from an italics or bold LaTeX command). Match a quote. Capture one or more of any character, but as few as possible (a lazy match). Match a quote. Capture a whitespace character or closing brace. Put it back together with the captured whitespace or braces at the ends and the captured text between new quotes.
/(\s|{)'(.+?)'(\s|})/g -> $1`$2'$3

Remove the unnecessary non-capture groups around the quotes. I must have been excited about finding out they exist, because there's no reason to use them here.
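A small sketch, assuming a JavaScript engine, of why the two forms are interchangeable. The constant names, the helper `convertLazySingleQuotes`, and the sample sentence are all hypothetical:

```javascript
// Both patterns are taken from the attempts above; only the names are invented.
const withNonCapturingGroups = /(\s|{)(?:')(.+?)(?:')(\s|})/g;
const simplified = /(\s|{)'(.+?)'(\s|})/g;
const lazyReplacement = "$1`$2'$3";

function convertLazySingleQuotes(text, pattern = simplified) {
  return text.replace(pattern, lazyReplacement);
}

const lazySample = "He said 'I don't know' and \\textit{'left'} quietly.";

// The lazy .+? keeps growing past the apostrophe in "don't" because
// that apostrophe is not followed by whitespace or a closing brace.
console.log(convertLazySingleQuotes(lazySample));
// He said `I don't know' and \textit{`left'} quietly.

// The non-capturing groups change nothing about the result.
console.log(convertLazySingleQuotes(lazySample, withNonCapturingGroups) ===
  convertLazySingleQuotes(lazySample)); // true
```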
/(\s|{)'([^']+?)'(\s|})/g -> $1`$2'$3

Instead of capturing one or more of any character between the quotes, capture one or more of any non-quote character. This was an effort to avoid problems with informal contractions like 'sup or runnin'. However, it will not work when the quoted text itself contains a contraction, because the apostrophe is also a quote character.
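To make the trade-off concrete, here is a hedged comparison of the two character classes. The names `anyCharacter`, `nonQuoteOnly`, `convertWith`, and both sample sentences are invented for this sketch:

```javascript
const anyCharacter = /(\s|{)'(.+?)'(\s|})/g;
const nonQuoteOnly = /(\s|{)'([^']+?)'(\s|})/g;

// Hypothetical helper: apply one of the patterns with the shared replacement.
function convertWith(pattern, text) {
  return text.replace(pattern, "$1`$2'$3");
}

// An informal contraction with a leading apostrophe, followed by a real quotation.
const slang = "He yelled 'sup and left, 'later' maybe";
console.log(convertWith(anyCharacter, slang));
// He yelled `sup and left, 'later' maybe   (the match starts at 'sup)
console.log(convertWith(nonQuoteOnly, slang));
// He yelled 'sup and left, `later' maybe   (only the real quotation is converted)

// A quotation that contains a contraction.
const contraction = "He said 'I don't know' today";
console.log(convertWith(anyCharacter, contraction));
// He said `I don't know' today
console.log(convertWith(nonQuoteOnly, contraction));
// He said 'I don't know' today   (no match at all)
```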
/\W'(.+)'\W/g -> `$1'

Simplifying/generalizing from \s|{, now match any non-word character (which could be whitespace, though that's not what the 'w' stands for) instead of specifically a whitespace character or brace. Unfortunately, removing the extra capture groups loses the non-word character on either end, because those characters are part of the match and the replacement no longer puts them back.
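A minimal demonstration of that loss, assuming JavaScript's `String.prototype.replace` (the function name and input are hypothetical):

```javascript
// Hypothetical helper showing the bug in this attempt.
function convertDroppingNeighbors(text) {
  // \W on each side is consumed by the match but never restored.
  return text.replace(/\W'(.+)'\W/g, "`$1'");
}

console.log(convertDroppingNeighbors("say 'hi' now"));
// say`hi'now   (both surrounding spaces are gone)
```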
/(?<=\W)'(.+)'(?=\W)/g -> `$1'

To keep the non-word characters out of the match, use regex lookaround. This will look for a non-word character before the first quote and after the last quote, without including them in the match. Unfortunately, this did not work in JavaScript's regex engine at the time, as it did not support lookbehind (lookbehind assertions were only standardized later, in ES2018).
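For reference, engines that implement ES2018 lookbehind accept this pattern as written. A minimal sketch, assuming such an engine (for example a recent Node.js); the function name and input are hypothetical:

```javascript
// Requires lookbehind support (ES2018); older engines reject the regex literal.
function convertWithLookbehind(text) {
  // Both neighbors are only inspected, not consumed, so the
  // replacement does not need to restore them.
  return text.replace(/(?<=\W)'(.+)'(?=\W)/g, "`$1'");
}

console.log(convertWithLookbehind("say 'hi' now"));
// say `hi' now
```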
/(\W)'(.+)'(?=\W)/g -> $1`$2'

Working around JavaScript's lack of lookbehind, capture the leading non-word character and put it back in the replacement.
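Sketched in JavaScript, with a hypothetical function name and made-up inputs, the workaround looks like this:

```javascript
// Hypothetical helper for the capture-and-restore workaround.
function convertKeepingLeadingChar(text) {
  // $1 restores the leading non-word character; the trailing one is
  // only a lookahead, so it was never removed in the first place.
  return text.replace(/(\W)'(.+)'(?=\W)/g, "$1`$2'");
}

console.log(convertKeepingLeadingChar("say 'hi' now"));
// say `hi' now
console.log(convertKeepingLeadingChar("\\emph{'hi'}"));
// \emph{`hi'}
```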
/\B'(.+)'\B/g -> `$1'

\B matches any position that is not a word boundary. This anchor matches a position, which is zero-width, meaning we don't need to use lookaround. This seemed a bit backwards at first, so let's take the example of don't, which we wouldn't want to match. The position between n and ' is a word boundary, as it is preceded by a word character and followed by a non-word character. A position between two non-word characters, such as a space and ', or { and ', is not a word boundary, so it is matched by \B.
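A short sketch of the \B version in JavaScript; `convertWithNonBoundary` and the sample strings are assumptions for illustration only:

```javascript
// Hypothetical helper for the \B-anchored attempt.
function convertWithNonBoundary(text) {
  return text.replace(/\B'(.+)'\B/g, "`$1'");
}

// The outer quotes sit next to spaces, so \B matches on both sides;
// the apostrophe in "don't" is just part of the captured text.
console.log(convertWithNonBoundary("He said 'I don't know' today."));
// He said `I don't know' today.

// The start of the string followed by a quote is not a word boundary either.
console.log(convertWithNonBoundary("'hi' there"));
// `hi' there

// Apostrophes that directly follow a letter are at a word boundary, so \B rejects them.
console.log(convertWithNonBoundary("I don't think they can't."));
// I don't think they can't.
```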

Double quotes

/(?:")([^"]+)(?:")/g -> ``$1''

Match a double-quote, capture one or more of any characters that are not a double quote, and match a closing double-quote.
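As a quick illustration in JavaScript (the function name and input sentence are hypothetical):

```javascript
// Hypothetical helper for the first double-quote attempt.
function convertDoubleQuotes(text) {
  // [^"]+ cannot cross a double quote, so each quoted phrase is matched separately.
  return text.replace(/(?:")([^"]+)(?:")/g, "``$1''");
}

console.log(convertDoubleQuotes('She said "hello" and "bye".'));
// She said ``hello'' and ``bye''.
```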
/"([^"]+)"/g -> ``$1''

As with the third single-quote attempt, remove the unnecessary non-capture groups.
/\B"(.+)"\B/g -> ``$1''

Building off the conclusion of the single quotes section (the final \B attempt), do the same for double quotes.
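A hedged sketch of the \B variant for double quotes; the function name and inputs are invented for this example:

```javascript
// Hypothetical helper for the \B-anchored double-quote attempt.
function convertDoubleQuotesNonBoundary(text) {
  return text.replace(/\B"(.+)"\B/g, "``$1''");
}

console.log(convertDoubleQuotesNonBoundary('She said "hello" to me.'));
// She said ``hello'' to me.

// Because .+ is greedy, two quoted phrases on one line are treated as a single span.
console.log(convertDoubleQuotesNonBoundary('She said "hello" and "bye".'));
// She said ``hello" and "bye''.
```

If several quotations can share a line, the lazy `.+?` or the `[^"]+` class from the earlier attempts may be a safer choice for the captured text.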

Axlefublr commented Sep 2, 2022

Used to maintain my version of an extension and had to implement regex for both types of quotes

And yeah, they are quite annoying to deal with
