First and most important, my friend: don't get frustrated. RegEx is an entity of its own and complex enough to justify some head scratching.
Let's split your issue into pieces:
- The HTML.
- The expression.
- The replacement.
- How to loop.
You are trying to match any `img` tag. The tag is a void element (meaning that it has no closing tag and nothing in between), and it is written in the XHTML form (`<tag />`, which is not recommended BTW).
The upside is that the HTML is generated, and the generator did a pretty good job at being uniform: always the same template, even the double space between the `src` and the `width` attributes.
<img src="SRC" width="WIDTH" alt="ALT" title="TITLE" />
The HTML spec allows attribute values to go unquoted in some cases, but in practice they are nearly always enclosed in quotes (I have almost never encountered HTML tags with unquoted attributes), and this generator keeps each tag on a single line. So the delimiters are the quotes.
In other words, you'll be matching every character up to the closing quote. If you want to grab each of the attributes in a named group:
<img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
Please note the difference between `src` and the others: `src` has a `+` while the others use `*`. That distinguishes the required attribute (at least one character) from the optional ones (which may be empty).
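To see the required/optional distinction in action, here is a minimal sketch. The two sample tags and the variable names `emptyAlt`, `emptySrc` and `m` are made up for illustration, and it assumes AutoHotkey v1 like the rest of this answer:

```autohotkey
; Same expression as above (literal assignment, so no quote escaping).
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>

; Hypothetical samples: one with empty optional attributes, one with an empty src.
emptyAlt = <img src="a.png" width="100" alt="" title="" />
emptySrc = <img src="" width="100" alt="x" title="x" />

; RegExMatch() returns the position of the match, or 0 when there is none.
OutputDebug % RegExMatch(emptyAlt, "iO)" regex, m) ; 1, the * groups accept empty values
OutputDebug % RegExMatch(emptySrc, "iO)" regex, m) ; 0, the + group needs at least one character
```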
Here is the regex101 for that particular example.
Glad you use regex101; it gives a lot of perspective on how an expression will work.
The replacement is the most straightforward piece, as it is just swapping one thing for the other.
This particular regex is heavy on quotes and, as you know, in AHK a quote is escaped with another quote. That applies to the expression syntax, while a literal assignment doesn't need it.
I'm a fierce advocate of the expression syntax, but in cases like this, one could argue the literal form makes sense:
; Literal
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
; Expression
regex := "<img src=""(?<SRC>[^""]+)"" width=""(?<WIDTH>[^""]*)"" alt=""(?<ALT>[^""]*)"" title=""(?<TITLE>[^""]*)"" \/>"
It is up to you which one to choose; the literal assignment can be pasted as-is for a quick test/edit in regex101.
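If you want to convince yourself that both forms hold the same string, a quick check along these lines should do (a sketch, assuming AutoHotkey v1; the variable names `literal` and `expr` are just for illustration):

```autohotkey
; Literal assignment: quotes are written as-is.
literal = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>

; Expression assignment: each quote inside the string is doubled.
expr := "<img src=""(?<SRC>[^""]+)"" width=""(?<WIDTH>[^""]*)"" alt=""(?<ALT>[^""]*)"" title=""(?<TITLE>[^""]*)"" \/>"

; == is a case-sensitive comparison; it yields 1 when both strings are identical.
OutputDebug % (literal == expr) ; 1
```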
txt =
(LTrim %
Hello
<img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
World
)
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
RegExMatch(txt, "iO)" regex, match)
align := ""
alt := match.alt
src := match.src
title := match.title
width := match.width
tpl =
(LTrim
{r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
knitr::include_graphics("%src%")
)
OutputDebug % txt
OutputDebug -----
txt := StrReplace(txt, match[0], tpl)
OutputDebug % txt
The above will output the desired result:
Hello
<img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
World
-----
Hello
{r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
World
Please note that the template in the OP has a placeholder for `%align%`, but I have no idea what to put there, so I left it blank.
AutoHotkey, unlike other PCRE implementations, doesn't have a "match all" mode that captures all the matches in a single go. You need to iterate over the original text, keeping track of the position where you start looking for the next match (to avoid infinite loops).
For performance reasons you should always keep track of the position, but for some replacements it is not actually needed. The logic is that if the replacement modifies the original text to the extent that the match is not found again, you can opt out of the position tracking altogether.
However, if you have several hundred thousand replacements, you really need to track where to start to avoid overhead. This case is small enough to get away without it, but let's first see how to do it while tracking the position, and then we'll simplify.
Let's get to how the loop works: `RegExMatch()` returns the position where the match was found (or `0` if there is none), so a `while` will loop until it finds no more matches:
while (RegExMatch(...)) {
; do stuff
}
If you don't modify the contents of the original text, it will get stuck looping because it will always return the first match. That is (aside from performance) why it is recommended to keep track of the position.
This is the complete function call:
foundPosition := RegExMatch(haystack, RegExNeedle, outputVariable, startingPosition)
And the loop changes its form:
p := 1
while (p := RegExMatch(txt, regex, match, p)) {
; do stuff
}
`p := 1` declares the initial position where to start, virtually:
; Initial position -----------------------↓
while (p := RegExMatch(txt, regex, match, 1))
In the first iteration, `RegExMatch()` will start at the beginning and return the position where it finds the match (say `7`), so on the second iteration you would have:
; Initial position -----------------------↓
while (p := RegExMatch(txt, regex, match, 7))
That's how the position is tracked: you feed the `p` variable as an input argument, and the returned position is assigned to the same variable when the function returns.
Now, inside the loop you are going to change the structure of the text, so you need to adjust the value of `p` accordingly.
You are going to change an HTML tag into two lines (making the text grow), so you need to add that character count to the starting position for the next match. The same applies if you shrink the text instead: you need to set `p` to the position where to start looking again.
Given the text:
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
The first iteration will start with `p := 1`, do the replacement, and convert the text into:
**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
`RegExMatch()` will return `7`, where it finds the first tag. Then you replace the tag with the 2 lines of text, and now you need to add the length of the replacement to `p` (`p + lengthOfTheReplacement`), making `p := 175`, which is the position right after the end of the replacement (the line break just before `**2**`).
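The arithmetic can be checked with a small sketch. It assumes AutoHotkey v1 and the same template and values as in the example; `p` is set by hand here instead of coming from `RegExMatch()`:

```autohotkey
; The replacement for the first tag, already filled in.
; The % option keeps the percent sign in %20 literal.
tpl =
(LTrim %
{r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
)

p := 7                     ; where the first tag was found
OutputDebug % StrLen(tpl)  ; 168 (107 + 1 line break + 60)
p += StrLen(tpl)
OutputDebug % p            ; 175
```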
Putting it all together:
txt =
(LTrim %
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
3 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
4 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
)
OutputDebug % txt
OutputDebug -----
p := 1
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
while (p := RegExMatch(txt, "iO)" regex, match, p)) {
    ; Copy the named groups into plain variables for the template.
    align := ""
    alt := match.alt
    src := match.src
    title := match.title
    width := match.width
    tpl =
    (LTrim
    {r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
    knitr::include_graphics("%src%")
    )
    ; Swap the matched tag for the filled-in template.
    txt := StrReplace(txt, match[0], tpl)
    ; Continue searching right after the replacement.
    p += StrLen(tpl)
}
OutputDebug % txt
The results:
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
3 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
4 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
-----
**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** {r, out.width='100', out.extra='', fig.align='', fig.cap='', fig.title =''}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
3 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
4 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
Now, since this replacement falls into the scenario where you don't need to keep track of the position, you can remove those parts:
p := 1 ; remove
while (p := RegExMatch(txt, "iO)" regex, match, p)) { ; change...
while (RegExMatch(txt, "iO)" regex, match)) { ; ...for
p += StrLen(tpl) ; remove
The result will be the same.
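Put together, the simplified loop could look like this. It is a sketch of the same script with the position tracking removed (AutoHotkey v1 assumed), and the sample text is shortened to two lines for illustration:

```autohotkey
txt =
(LTrim %
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
)
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>

; No starting position: every search begins at position 1.
; The loop still ends because each replacement removes the tag it matched.
while (RegExMatch(txt, "iO)" regex, match)) {
    align := ""
    alt := match.alt
    src := match.src
    title := match.title
    width := match.width
    tpl =
    (LTrim
    {r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
    knitr::include_graphics("%src%")
    )
    txt := StrReplace(txt, match[0], tpl)
}
OutputDebug % txt
```

This only works because the replacement text no longer matches the expression; if it did, the loop would never end.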
Hope this clears things up. I know it is a lot to read, but if you have any questions just shoot.
I know that you need to escape a backtick with a backtick. What I meant is that you need to replace this:
For this:
I.e., you need to keep the backticks on their own line. Read the wording carefully:
Backticks, lines, before and after the code block.
https://www.markdownguide.org/extended-syntax/#fenced-code-blocks
The next topic covers the highlighting:
https://www.markdownguide.org/extended-syntax/#syntax-highlighting
And before jumping to another topic, that looks plain wrong... shouldn't there be a comma to separate the `echo` property from all the other properties the `%options%` var adds? For example:
If that is R, then it needs a comma.
And after seeing the file, I think I know what you want to do: test with different characters how the AHK script will fare.
One of the issues is that the markdown itself is invalid. The proper image format is as follows:
For an image with a link:
The other issue is that you need proper encoding/decoding of the characters. The `img` tag must adhere to the conventions of HTML markup (encode all that is needed). Here is the whole enchilada on the why, how and when: https://wikiless.org/wiki/Percent-encoding
So after fixing the input data (markdown file) and tweaking the script, I think it is working as you wanted.
However, the RegEx works for `img` tags, not for Markdown image syntax (brackets). You need to create a new function to deal with just that.
I included `WinHttpRequest`, but again you can make a stand-alone function. BTW, since many characters need to be encoded in each attribute (`src`, `alt`, `title`), and whatever you are doing doesn't support percent-encoding, I used the decode method on each.
Anyway, to construct the options, I went with the order in the `img` tag: `src`, `width`, `alt` and `title`. `align` and `extra` are never defined, so I commented them out because I don't know what's what with those.
Files:
- `fixed.md` is the fixed version of `index.md`.
- `obsidian.ahk` is the script with the function.