First and most important my friend: don't get frustrated, RegEx is an entity on its own and complex enough to justify head scratching.
Let's split your issue into pieces:
- The HTML.
- The expression.
- The replacement.
- How to loop.
You are trying to match any img
tag, the tag is an inline element (meaning that it doesn't have other tags in between), it also is in the XHTML form (<tag />
, which is not recommended BTW).
The upside is that is generated, and the generator did a pretty good job at being uniform, always the same template, even the double space between the source and the width attributes.
<img src="SRC" width="WIDTH" alt="ALT" title="TITLE" />
The HTML spec is very clear on how attributes for tags should be delimited: always in the same line and (optionally) enclosed in quotes. I have almost never encounter HTML tags with attributes without quotes, so the delimiter are the quotes.
In other words, you'll be matching every character up to the quotes, if you wanted to grab each of the attributes in a named group:
<img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
Please note the difference between src
and the others, src
has a +
while the other use *
; that denotes a required attribute and the optional ones.
Here is the regex101 for that particular example.
Glad you use regex101, gives a lot of perspective on how an expression will work.
This one is the most straightforward, as is just swiping one thing with the other.
This particular regex is heavy on the quotes and as you know in AHK a quote is escaped with another quote. That is in the expression syntax, while a literal assignment doesn't need it.
I'm a fierce advocate of the expression syntax, but in cases like this, one could argue makes sense:
; Literal
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
; Expression
regex := "<img src=""(?<SRC>[^""]+)"" width=""(?<WIDTH>[^""]*)"" alt=""(?<ALT>[^""]*)"" title=""(?<TITLE>[^""]*)"" \/>"
It is up to you which one to chose, as literal assignment can make a quick test/edit in regex101.
txt =
(LTrim %
Hello
<img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
World
)
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
RegExMatch(txt, "iO)" regex, match)
align := ""
alt := match.alt
src := match.src
title := match.title
width := match.width
tpl =
(LTrim
{r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
knitr::include_graphics("%src%")
)
OutputDebug % txt
OutputDebug -----
txt := StrReplace(txt, match[0], tpl)
OutputDebug % txt
The above will output the desired result:
Hello
<img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
World
-----
Hello
{r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
World
Please note that the template in the OP has a placeholder for %align%
but I have no idea what to put there, so I left it blank.
AutoHotkey unlike other PCRE implementations doesn't have a "match all" mode that captures all the matches in a single go, you need to iterate over the original text keeping track of the position where you start looking for the next match (to avoid infinite loops).
For performance reasons, you should always keep track of the position, but for some replacements is not actually needed. The logic behind this is that if the replacement modifies the original text to the extent the match is not found again, you can opt out the whole position tracking.
However, if you have several hundred thousand replacements, you really need to track where to start to avoid overhead. In this case, is small enough to get away with it, but let's see first how to do it while tracking it and then we'll simplify.
Let's get to how the loop works:
RegExMatch()
returns the position where the match was found, so a while
will loop until it find no matches:
while (RegExMatch(...)) {
; do stuff
}
If you were not to modify the contents of the original text it will get stuck looping, because it will always return the first match, that is (aside from performance) why it is recommended to keep track of the position.
This is the complete function call:
foundPosition := RegExMatch(haystack, RegExNeedle, outputVariable, startingPosition)
And the loop changes its form:
p := 1
while (p := RegExMatch(txt, regex, match, p)) {
; do stuff
}
p := 1
is to declare the initial position where to start, virtually:
; Initial position -----------------------↓
while (p := RegExMatch(txt, regex, match, 1))
In the first iteration, the RegExMatch()
will start at the beginning and returns the position where it finds the match (say 7
), so by the second iteration you would be:
; Initial position -----------------------↓
while (p := RegExMatch(txt, regex, match, 7))
That's how the position is tracked, you feed as an input argument the p
variable and the returned position is assigned to the same variable when the function returns.
Now when inside the loop you are going to change the structure of the text, so you need to adjust the value of p
accordingly.
You are going to change an HTML tag into two lines (making the text grow), so you need to include that character count into the starting position for the next match. The same would apply if instead you shrink the text, you need to set p
to the position where to start looking up again.
Given the text:
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
The first loop will have p := 1
, do the replacement and convert the text into:
**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
RegExMatch()
will return 7
where it finds the first tag, then you will replace the tag with the 2 lines of text, now you need to add p + lengthOfTheReplacement
, making p := 175
; which is the position next to the end of the replacement (the first asterisk for **2**
).
Putting all together:
txt =
(LTrim %
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
3 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
4 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
)
OutputDebug % txt
OutputDebug -----
p := 1
regex = <img src="(?<SRC>[^"]+)" width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
while (p := RegExMatch(txt, "iO)" regex, match, p)) {
align := ""
alt := match.alt
src := match.src
title := match.title
width := match.width
tpl =
(LTrim
{r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
knitr::include_graphics("%src%")
)
txt := StrReplace(txt, match[0], tpl)
p += StrLen(tpl)
}
OutputDebug % txt
The results:
**1** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png" width="100" alt="" title="" />
3 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
4 <img src="001%20assets/AsexualPrideFlag.svg" width="100" alt="The Asexual Flag" title="The Asexual Flag" />
-----
**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** {r, out.width='100', out.extra='', fig.align='', fig.cap='', fig.title =''}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
3 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
4 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
Now since this replacement falls into the scenario where you don't need to keep track of the position you can remove those parts:
p := 1 ; remove
while (p := RegExMatch(txt, "iO)" regex, match, p)) { ; change...
while (RegExMatch(txt, "iO)" regex, match)) { ; ...for
p += StrLen(tpl) ; remove
The result will be the same.
Hope this clears things, I know it is a lot to read but if you have any questions just shoot.
Ah crap, I'm sorry for not mentioning this, that was my bad. I thought I did, but I also learned that I often cannot trust my own memory on such things - sadly. I'm sorry.
The clear overview:
Below you can find the raw output from ObsidianHTML. This is what we need to work with when parsing and converting. So we either need to preprocess and parse this and manually convert
"
to"
(and so on for other potential characters that will bring issues) in order to get properly encoded attributes (as I understand your explanation, correct me if I'm wrong), or... I am not sure. That might be the only option, but sounds like a pain to do as well - but maybe it's not and I just don't know that.I could ask ObsidianHTML for a setting to do a complete encoding so we start out with properly formatted html in our src à la your
fixed.md
. In that case, this current solution should be all we need, (if that makes sense to you. I got a mild bit lost if I am completely honest.)However, given that they are actually careful about scope creep, and this is a specific question for a niche solution that's actually way outside the projects' boundaries, I am not sure that would be the sensible path to take.
So. Just looking at the relevant section of ObsidianHTML's output (
"
and'
):Running through the script results in
This has two problems. Double quotes are not detected and converter, and single quotes are not escaped and cause mayhem.
But we are looking to get
To summarise the required steps
'
must be escaped via backslash when building the fig.title- and fig.caption-attributes."
are not currently converted because I cannot generate thefixed.md
syntax you are basing the third iteration of ConvertSRC_SYNTAX upon from the ObsidianHTML-output, and thus the codeblock is not built. Check the first codeblock in this reply for the format we start with.(We should start naming the function :P)
P.S.: How do I make these codeblocks collapsible like your
fixed.md
?ObsidianHTML's output:
ObsidianHTML's input: