First and most important my friend: don't get frustrated, RegEx is an entity on its own and complex enough to justify head scratching.
Let's split your issue into pieces:
- The HTML.
- The expression.
- The replacement.
- How to loop.
You are trying to match any img
tag, the tag is an inline element (meaning that it doesn't have other tags in between), it also is in the XHTML form (<tag />
, which is not recommended BTW).
The upside is that is generated, and the generator did a pretty good job at being uniform, always the same template, even the double space between the source and the width attributes.
<img src="SRC" width="WIDTH" alt="ALT" title="TITLE" />
The HTML spec is very clear on how attributes for tags should be delimited: always in the same line and (optionally) enclosed in quotes. I have almost never encounter HTML tags with attributes without quotes, so the delimiter are the quotes.
In other words, you'll be matching every character up to the quotes, if you wanted to grab each of the attributes in a named group:
<img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
Please note the difference between src
and the others, src
has a +
while the other use *
; that denotes a required attribute and the optional ones.
Here is the regex101 for that particular example.
Glad you use regex101, gives a lot of perspective on how an expression will work.
This one is the most straightforward, as is just swiping one thing with the other.
This particular regex is heavy on the quotes and as you know in AHK a quote is escaped with another quote. That is in the expression syntax, while a literal assignment doesn't need it.
I'm a fierce advocate of the expression syntax, but in cases like this, one could argue makes sense:
; Literal
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
; Expression
regex := "<img src=""(?<SRC>[^""]+)"" width=""(?<WIDTH>[^""]*)"" alt=""(?<ALT>[^""]*)"" title=""(?<TITLE>[^""]*)"" \/>"
It is up to you which one to chose, as literal assignment can make a quick test/edit in regex101.
txt =
(LTrim %
Hello
<img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
World
)
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
RegExMatch(txt, "iO)" regex, match)
align := ""
alt := match.alt
src := match.src
title := match.title
width := match.width
tpl =
(LTrim
{r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
knitr::include_graphics("%src%")
)
OutputDebug % txt
OutputDebug -----
txt := StrReplace(txt, match[0], tpl)
OutputDebug % txt
The above will output the desired result:
Hello
<img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
World
-----
Hello
{r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
World
Please note that the template in the OP has a placeholder for %align%
but I have no idea what to put there, so I left it blank.
AutoHotkey unlike other PCRE implementations doesn't have a "match all" mode that captures all the matches in a single go, you need to iterate over the original text keeping track of the position where you start looking for the next match (to avoid infinite loops).
For performance reasons, you should always keep track of the position, but for some replacements is not actually needed. The logic behind this is that if the replacement modifies the original text to the extent the match is not found again, you can opt out the whole position tracking.
However, if you have several hundred thousand replacements, you really need to track where to start to avoid overhead. In this case, is small enough to get away with it, but let's see first how to do it while tracking it and then we'll simplify.
Let's get to how the loop works:
RegExMatch()
returns the position where the match was found, so a while
will loop until it find no matches:
while (RegExMatch(...)) {
; do stuff
}
If you were not to modify the contents of the original text it will get stuck looping, because it will always return the first match, that is (aside from performance) why it is recommended to keep track of the position.
This is the complete function call:
foundPosition := RegExMatch(haystack, RegExNeedle, outputVariable, startingPosition)
And the loop changes its form:
p := 1
while (p := RegExMatch(txt, regex, match, p)) {
; do stuff
}
p := 1
is to declare the initial position where to start, virtually:
; Initial position -----------------------↓
while (p := RegExMatch(txt, regex, match, 1))
In the first iteration, the RegExMatch()
will start at the beginning and returns the position where it finds the match (say 7
), so by the second iteration you would be:
; Initial position -----------------------↓
while (p := RegExMatch(txt, regex, match, 7))
That's how the position is tracked, you feed as an input argument the p
variable and the returned position is assigned to the same variable when the function returns.
Now when inside the loop you are going to change the structure of the text, so you need to adjust the value of p
accordingly.
You are going to change an HTML tag into two lines (making the text grow), so you need to include that character count into the starting position for the next match. The same would apply if instead you shrink the text, you need to set p
to the position where to start looking up again.
Given the text:
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
The first loop will have p := 1
, do the replacement and convert the text into:
**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
RegExMatch()
will return 7
where it finds the first tag, then you will replace the tag with the 2 lines of text, now you need to add p + lengthOfTheReplacement
, making p := 175
; which is the position next to the end of the replacement (the first asterisk for **2**
).
Putting all together:
txt =
(LTrim %
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
3 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
4 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
)
OutputDebug % txt
OutputDebug -----
p := 1
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
while (p := RegExMatch(txt, "iO)" regex, match, p)) {
align := ""
alt := match.alt
src := match.src
title := match.title
width := match.width
tpl =
(LTrim
{r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
knitr::include_graphics("%src%")
)
txt := StrReplace(txt, match[0], tpl)
p += StrLen(tpl)
}
OutputDebug % txt
The results:
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
3 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
4 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
-----
**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** {r, out.width='100', out.extra='', fig.align='', fig.cap='', fig.title =''}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
3 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
4 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
Now since this replacement falls into the scenario where you don't need to keep track of the position you can remove those parts:
p := 1 ; remove
while (p := RegExMatch(txt, "iO)" regex, match, p)) { ; change...
while (RegExMatch(txt, "iO)" regex, match)) { ; ...for
p += StrLen(tpl) ; remove
The result will be the same.
Hope this clears things, I know it is a lot to read but if you have any questions just shoot.
extra
andalign
are never defined.Generally, instead of using this:
You add everything that is needed and at the end of the loop you remove leftovers
Can I have a link with a copy of
indeex.md
? So I can do a test myself, because I don't really understand what you're trying to achieve.Also, this:
And this:
Is invalid, the fenced blocks should be in its own line, different parser will produce a different result... most of them will fail.
https://www.markdownguide.org/extended-syntax/#fenced-code-blocks
The only word that optionally can follow the starting backticks is the language, so syntax highlighting can make sense of which highlighter to use.
Examples:
Will render:
And:
Will become: