@anonymous1184
Last active December 28, 2023 12:08

First and most important, my friend: don't get frustrated. RegEx is an entity of its own and complex enough to justify some head scratching.

Let's split your issue into pieces:

  • The HTML.
  • The expression.
  • The replacement.
  • How to loop.

The HTML

You are trying to match any img tag. The tag is a void element (meaning it has no content and no closing tag), and it is written in the XHTML self-closing form (<tag />, which is not recommended BTW).

The upside is that it is generated, and the generator did a pretty good job at being uniform: always the same template, even the double space between the src and the width attributes.

<img src="SRC"  width="WIDTH" alt="ALT" title="TITLE" />

The expression

The generator always writes the whole tag on a single line, with every attribute value enclosed in double quotes. (The HTML spec also allows unquoted values and line breaks between attributes, but I have almost never encountered HTML tags with unquoted attributes.) So the quotes are the delimiter.

In other words, you'll be matching every character up to the closing quote. If you want to grab each of the attributes in a named group:

<img src="(?<SRC>[^"]+)"  width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>

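Please note the difference between src and the others: src uses +, while the others use *. The + requires at least one character, so src must have a value, whereas * also accepts an empty value (alt="" for instance). All four attributes still have to be present, in this order.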
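To see the + versus * difference in action, here is a minimal sketch in AutoHotkey v1. The two sample tags are invented for illustration; only the regex comes from the text above.

```autohotkey
; Sketch (AutoHotkey v1). The sample tags below are invented for illustration.
regex = <img src="(?<SRC>[^"]+)"  width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
emptyAlt = <img src="a.png"  width="100" alt="" title="" />
emptySrc = <img src=""  width="100" alt="x" title="y" />

; [^"]* accepts an empty value, so this tag still matches at position 1.
OutputDebug % RegExMatch(emptyAlt, "iO)" regex, match)
; [^"]+ needs at least one character, so an empty src gives 0 (no match).
OutputDebug % RegExMatch(emptySrc, "iO)" regex, match)
```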

Here is the regex101 for that particular example.

Glad you use regex101; it gives a lot of perspective on how an expression will work.

The replacement

This one is the most straightforward, as it is just swapping one thing for the other.

This particular regex is heavy on quotes and, as you know, in AHK a quote is escaped with another quote. That applies to the expression syntax, while a literal assignment doesn't need it.

I'm a fierce advocate of the expression syntax, but in cases like this, one could argue that the literal form makes sense:

; Literal
regex = <img src="(?<SRC>[^"]+)"  width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>

; Expression
regex := "<img src=""(?<SRC>[^""]+)""  width=""(?<WIDTH>[^""]*)"" alt=""(?<ALT>[^""]*)"" title=""(?<TITLE>[^""]*)"" \/>"

It is up to you which one to choose; the literal assignment makes it easy to copy the pattern to and from regex101 for a quick test/edit.

txt =
    (LTrim %
    Hello
    <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
    World
    )
regex = <img src="(?<SRC>[^"]+)"  width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
RegExMatch(txt, "iO)" regex, match)
align := ""
alt := match.alt
src := match.src
title := match.title
width := match.width
tpl =
    (LTrim
    {r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
    knitr::include_graphics("%src%")
    )
OutputDebug % txt
OutputDebug -----
txt := StrReplace(txt, match[0], tpl)
OutputDebug % txt

The above will output the desired result:

Hello
<img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
World
-----
Hello
{r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
World

Please note that the template in the OP has a placeholder for %align% but I have no idea what to put there, so I left it blank.

How to loop

AutoHotkey, unlike other PCRE implementations, doesn't have a "match all" mode that captures all the matches in a single go. You need to iterate over the original text, keeping track of the position where you start looking for the next match (to avoid infinite loops).

For performance reasons, you should always keep track of the position, but for some replacements it is not actually needed. The logic is that if the replacement changes the original text so that the same match can't be found again, you can skip the position tracking altogether.

However, if you have several hundred thousand replacements, you really need to track where to start, to avoid the overhead of rescanning from the beginning. This case is small enough to get away without it, but let's first see how to do it with tracking and then simplify.

Let's get to how the loop works:

RegExMatch() returns the position where the match was found (0 if there is none), so a while will loop until it finds no more matches:

while (RegExMatch(...)) {
    ; do stuff
}

If you don't modify the contents of the original text, the loop will get stuck, because RegExMatch() will always return the first match. That (aside from performance) is why it is recommended to keep track of the position.

This is the complete function call:

foundPosition := RegExMatch(haystack, RegExNeedle, outputVariable, startingPosition)

And the loop changes its form:

p := 1
while (p := RegExMatch(txt, regex, match, p)) {
    ; do stuff
}

p := 1 is to declare the initial position where to start, virtually:

; Initial position -----------------------↓
while (p := RegExMatch(txt, regex, match, 1))

In the first iteration, RegExMatch() starts at the beginning and returns the position where it finds the match (say 7), so by the second iteration you would be at:

; Starting position ----------------------↓
while (p := RegExMatch(txt, regex, match, 7))

That's how the position is tracked: you feed the p variable as an input argument, and the returned position is assigned to the same variable when the function returns.

Now, inside the loop you are going to change the structure of the text, so you need to adjust the value of p accordingly.

You are going to change an HTML tag into two lines (making the text grow), so you need to add that character count to the starting position for the next match. The same applies if you shrink the text instead: you need to set p to the position where the next lookup should start.
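As a rough sketch of that idea without any replacement involved, the loop below collects every match and advances p past each one. MatchAll() is a hypothetical helper name (not a built-in), and it assumes the needle carries no options of its own, since "O)" is prepended.

```autohotkey
; Sketch (AutoHotkey v1): collect every match without changing the text.
; MatchAll() is a hypothetical helper name, not a built-in function.
; Assumes the needle has no options of its own ("O)" is prepended here).
MatchAll(haystack, needle) {
    matches := []
    p := 1
    while (p := RegExMatch(haystack, "O)" needle, match, p)) {
        matches.Push(match[0])
        ; Step past the current match; without this, the next call would
        ; find the same match again and the loop would never end.
        len := StrLen(match[0])
        p += len ? len : 1
    }
    return matches
}

for index, value in MatchAll("a1 b22 c333", "\d+")
    OutputDebug % index ": " value
```

Since the text is never modified here, moving p forward is the only thing that lets the loop finish.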

Given the text:

    **1** <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
    **2** <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="" title="" />

The first iteration starts with p := 1, does the replacement and converts the text into:

**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="" title="" />

RegExMatch() will return 7, where it finds the first tag. You then replace the tag with the two lines of text (168 characters in this example), so you need to add the length of the replacement to p, making p := 175; that is the position right after the end of the replacement (the line break just before **2**).

Putting it all together:

txt =
    (LTrim %
    **1** <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
    **2** <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="" title="" />
    3 <img src="001%20assets/AsexualPrideFlag.svg"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
    4 <img src="001%20assets/AsexualPrideFlag.svg"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
    )
OutputDebug % txt
OutputDebug -----
p := 1
regex = <img src="(?<SRC>[^"]+)"  width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
while (p := RegExMatch(txt, "iO)" regex, match, p)) {
    align := ""
    alt := match.alt
    src := match.src
    title := match.title
    width := match.width
    tpl =
        (LTrim
        {r, out.width='%width%', out.extra='', fig.align='%align%', fig.cap='%alt%', fig.title ='%title%'}
        knitr::include_graphics("%src%")
        )
    txt := StrReplace(txt, match[0], tpl)
    p += StrLen(tpl)
}
OutputDebug % txt

The results:

**1** <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
**2** <img src="001%20assets/AsexualPrideFlag.png"  width="100" alt="" title="" />
3 <img src="001%20assets/AsexualPrideFlag.svg"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
4 <img src="001%20assets/AsexualPrideFlag.svg"  width="100" alt="The Asexual Flag" title="The Asexual Flag" />
-----
**1** {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
**2** {r, out.width='100', out.extra='', fig.align='', fig.cap='', fig.title =''}
knitr::include_graphics("001%20assets/AsexualPrideFlag.png")
3 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
4 {r, out.width='100', out.extra='', fig.align='', fig.cap='The Asexual Flag', fig.title ='The Asexual Flag'}
knitr::include_graphics("001%20assets/AsexualPrideFlag.svg")
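One thing to be aware of: StrReplace() replaces every occurrence of match[0] by default, so the identical tags on lines 3 and 4 are both replaced during the same iteration. If you would rather touch only the match found at p, a possible sketch is to rebuild the text with SubStr(). The shortened regex, text and template here are illustrative, not the ones from the OP.

```autohotkey
; Sketch (AutoHotkey v1): replace only the match found at position p.
; The shortened regex, text and template are illustrative placeholders.
txt := "1 <img src=""a.png"" />`n2 <img src=""a.png"" />"
regex = <img src="(?<SRC>[^"]+)" \/>
p := 1
while (p := RegExMatch(txt, "iO)" regex, match, p)) {
    tpl := "knitr::include_graphics(""" match.src """)"
    ; Text before the match + replacement + text after the match.
    txt := SubStr(txt, 1, p - 1) . tpl . SubStr(txt, p + StrLen(match[0]))
    p += StrLen(tpl)
}
OutputDebug % txt
```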

Now since this replacement falls into the scenario where you don't need to keep track of the position you can remove those parts:

p := 1                                                ; remove
while (p := RegExMatch(txt, "iO)" regex, match, p)) { ; change...
while (RegExMatch(txt, "iO)" regex, match)) {         ; ...for
p += StrLen(tpl)                                      ; remove

The result will be the same.
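For completeness, a swapped-in alternative to the loop: RegExReplace() replaces all matches in one call and can refer to named groups as ${Name} in the replacement. This is only a sketch under the assumption that align is always empty; the sample text is shortened for illustration.

```autohotkey
; Sketch (AutoHotkey v1): the whole loop as a single RegExReplace() call.
; ${SRC}, ${WIDTH}, ${ALT} and ${TITLE} refer to the named groups.
; The sample text is shortened and fig.align is assumed to stay empty.
txt = 1 <img src="assets/Flag.png"  width="100" alt="A flag" title="A flag" />
regex = <img src="(?<SRC>[^"]+)"  width="(?<WIDTH>[^"]*)" alt="(?<ALT>[^"]*)" title="(?<TITLE>[^"]*)" \/>
tpl := "{r, out.width='${WIDTH}', out.extra='', fig.align='', fig.cap='${ALT}', fig.title ='${TITLE}'}`nknitr::include_graphics(""${SRC}"")"

; count receives the number of replacements that were made.
txt := RegExReplace(txt, "i)" regex, tpl, count)
OutputDebug % count " replacement(s):`n" txt
```

The loop is still the better fit when each replacement needs logic that a plain template can't express.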


Hope this clears things up. I know it is a lot to read, but if you have any questions just shoot.

@Gewerd-Strauss

That seems to work flawlessly. I'm sure I will find some stupid way to break this that I didn't think of yet (I seem very good at this, sadly), but so far: Thank you very much for all the help once again.

(And now back to studying :P)

@Gewerd-Strauss

Hey, random question:
I am looking into building a config-GUI in a not-stupid way. For reference,

I am now trying to get a GUI build to give said arguments if I want to. Initially, I am just doing it for word (as it's the output I need working the fastest), and did so hardcoded. Obviously, that's... a nightmare.
Obviously I am not looking at implementing all of them (just... no), but a couple are relevant. For Word f.e., that would be "reference_docx" and "number_sections".

So the structure would be something along the lines of

varName=Value
<two spaces>Type: bool/string/number
<two spaces>Default:Value
<two spaces>Range:Min-Max
<two spaces>Description:Description String
...
NextVar=FALSE
....

And then you could parse this block by block and generate the GUI from there. Stuff like Range would only be valid for types which actually contain it - a Range doesn't make much sense on a string or bool, after all.
However, I am not sure how I'd do so without making it a 700-line large editor à la rajat's IniSettingsEditor, and even that's not quite what I am looking for - it's too unwieldy for this task :P
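A minimal sketch of how such a block format could be parsed into nested objects, assuming exactly the layout described above (an unindented varName=Value line followed by lines indented with two spaces). ParseArguments() and the sample text are hypothetical.

```autohotkey
; Sketch (AutoHotkey v1): parse the block format above into nested objects.
; ParseArguments() and the sample text are hypothetical, for illustration only.
ParseArguments(text) {
    args := {}
    current := ""
    Loop, Parse, text, `n, `r
    {
        line := A_LoopField
        if (Trim(line) = "")
            continue
        if (SubStr(line, 1, 2) = "  ") {
            ; Indented "Key:Value" line: a property of the current variable.
            if (IsObject(current) && RegExMatch(line, "O)^\s+(\w+):\s*(.*)$", m))
                current[m[1]] := m[2]
        } else if (RegExMatch(line, "O)^(\w+)=(.*)$", m)) {
            ; Unindented "varName=Value" line: start a new block.
            current := {Value: m[2]}
            args[m[1]] := current
        }
    }
    return args
}

sample := "number_sections=FALSE`n  Type: bool`n  Default:FALSE`n  Description:Number the section headings"
for name, arg in ParseArguments(sample)
    OutputDebug % name " (" arg.Type ") default: " arg.Default
```

Each parsed entry then carries everything a GUI loop would need (Type, Default, Range, Description).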

@anonymous1184
Author

A couple of years ago, when I was teaching my kid AHK, he wanted to do something similar for GTA mods (or some other game mods).

We used file hashing as an example: https://redd.it/m0kzdy

There you'll find how to dynamically create a GUI, fully auto (no x/y coordinates for anything).

Perhaps that can be of help: you create an object with properties for each of the elements, and then a loop takes care of everything. Rather than, you know... adding an awful lot of elements to a GUI and positioning them manually.
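As a rough sketch of that approach (not the code from the linked post), a list of control descriptions can drive the whole GUI. The control list, the variable names and the SubmitArgs label are made-up examples.

```autohotkey
; Sketch (AutoHotkey v1): build a GUI from a list of control descriptions.
; The control list, variable names and label are hypothetical examples.
controls := []
controls.Push({Control: "Checkbox", Name: "NumberSections", String: "Number sections?"})
controls.Push({Control: "Text",     Name: "",               String: "Reference docx:"})
controls.Push({Control: "Edit",     Name: "ReferenceDocx",  String: ""})

Gui, New, , Arguments
for index, c in controls
{
    ; Only controls with a name get an output variable (the v option).
    options := (c.Name != "") ? ("v" c.Name " w250") : "w250"
    ; No x/y coordinates: each control is placed below the previous one.
    Gui, Add, % c.Control, % options, % c.String
}
Gui, Add, Button, gSubmitArgs Default, OK
Gui, Show
return

SubmitArgs:
    Gui, Submit
    OutputDebug % "NumberSections=" NumberSections " ReferenceDocx=" ReferenceDocx
return
```

Adding another argument then only means adding one more entry to the list.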

@Gewerd-Strauss

So... had a bit of brainstorming, since I am too busy to do this anyways right now and don't plan to start on it before my exams are over anyways. The current idea would be the following - sorry if I am jumbling jargon, it has been a while since I had to think about classes at all :P


Create a template class for outputformat of.
of.Attribute

where of:=new of(output_format) initiates with defaults (sets of.Template, of.Toc...)

of.Init(Type) sets defaults for all attributes.

An attribute is defined as

Attribute:={Control:"", Type:"", Default:"", String:"", Value:""}

Where f.e.
of.Init("word_document") sets the equivalent of

of.TOC:={Control:"Checkbox",Type:"boolean",Default:false,String:"Add TOC?",Value:""}
of.Template:={Control:"DDL",Type:"String",Default:"Template_docx",String:"fPopulateDDL(Source)",Value:""}
of.NumSecs:=....

Chosen Attributes, their types,explanation strings and defaults can be read back from a csv, à la

document_type,TOC?,TOC_depth,Template?,....

It gets pretty wide admittedly. The alternative would be a pseudo-array à la

document_type
    Key:|Control|Type|Default|String|Value
    Key:|Control|Type|Default|String|Value
    Key:|Control|Type|Default|String|Value
    Key:|Control|Type|Default|String|Value


document_type
    Key:|Control|Type|Default|String|Value
    Key:|Control|Type|Default|String|Value
    Key:|Control|Type|Default|String|Value
    Key:|Control|Type|Default|String|Value
    
document_type
...

and from there you would do something like

gui, 1: submit ; get the chosen `output_type`

of:=new of(output_type)
;; → of.TOC:={Control:"Checkbox",Type:"bool",Default:false,String:"Do you want to add a TOC?",Value:""}
gui, 2: new ....
for Argument, Obj in of
    gui, 2: add, % Obj.Control, v%Argument% , % Obj.String
gui, 2: show 

Not sure if/how I'd implement shit like limits in there, or if I just post-userinput go

if (of.toc_depth.value<of.toc_depth.min)
    of.toc_depth.value:=of.toc_depth.min
    ...
;; and on submit, when building the RScript-String, you'd do

BuildRScriptContent(Path,of,...)
{

    SubmitString:="rmarkdown::" %of% "("  ;; pretty sure this is illegal cuz I'd be trying to get the name of the object, so I guess another attribute must be added where of:=new("word_document") adds key of.Output_Format:="word_document"
    for attribute, obj in of
    {
        value:=obj.Value
        if (value="") ;; need to find a valid string to signify that an element is not assigned
            value:=obj.Default
        if (obj.Type="String")
            value:=Quote(value)
        SubmitString.="`n" attribute " = " value
    }
     ;; and on with BuildRScript we go   
}


Does that make any sense? I'm just theorising right now, not sure if I am completely off the hooks or not.

@Gewerd-Strauss

Because I am not quite sure if I understood your example properly, this is how I understood the generality of it.

@Gewerd-Strauss

Gewerd-Strauss commented Jan 24, 2023

Just made a first draft:
Gewerd-Strauss/ObsidianKnittr@b5eca38

DynamicArguments.ahk/.ini is what you need.
I feel like I can do it this way, but I can't help but feel like I am still overcomplicating it.... My encoding format is probably not the best to make this easy, nor is it likely to be expandable without major modifications. I am inclined to say fuck it and put every value into an edit field to trust the user to sort it out, but by that logic I can just as well write the R-Script completely from scratch by hand as well - that's not really the point.

I am aware this is all kinds of jumbled, probably both over- and underengineered, but I both wanna experiment and... get it done in a sensible way, if that makes sense?
I also haven't figured out how I want to emulate a FileSelectDialogue for template-selection properly, but that's required as well - or you do it via a Listview that is created & populated if a File-Variable exists...

@Gewerd-Strauss

Gewerd-Strauss commented Jan 25, 2023

Just made a first draft: Gewerd-Strauss/ObsidianScripts@b5eca38

DynamicArguments.ahk/.ini is what you need. I feel like I can do it this way, but I can't help but feel like I am still overcomplicating it.... My encoding format is probably not the best to make this easy, nor is it likely to be expandable without major modifications. I am inclined to say fuck it and put every value into an edit field to trust the user to sort it out, but by that logic I can just as well write the R-Script completely from scratch by hand as well - that's not really the point.

I am aware this is all kinds of jumbled, probably both over- and underengineered, but I both wanna experiment and... get it done in a sensible way, if that makes sense?

Third draft, sorted out the fileselect shenanigans, still janky imo:
Gewerd-Strauss/ObsidianKnittr@de88407
Gewerd-Strauss/ObsidianKnittr@c22bbcb

(Yes I theoretically should move this all over to another branch that's not "main", but... yea no. too late, maybe I pick that habit up on my next project :P)

@Gewerd-Strauss

Quick Ping, please check the reddit chat.

@Gewerd-Strauss

Sorry to revive this thread. It occurred to me today that I never tested for German Umlaute äöü in the src, which are apparently not converted correctly.

Assuming the test file

---
creation date: "2023-02-26 23:10"
modification date: "2023-02-26 23:11"
tags: programming/bug 
---

!{{Umlaute in images_SRC_Bug.png|This subtitle contains "äöü".}}

Will result in the following input via FileRead buffer, % PathOrContent:

---
creation date: 2023-02-26 23:10
modification date: 2023-02-26 23:11
tags:
- programming/bug
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```




```{r, echo=FALSE, fig.cap='This subtitle contains "äöü".', fig.title='This subtitle contains "äöü".'}
knitr::include_graphics("001 assets/Umlaute in images_SRC_Bug.png")
```

by the converter. This blocks umlaute from being part of image captions. This has never been discovered so far cuz 99.5% of my work is in English, and it never occurred to me. The hotfix I employed is to modify the buffer-FileRead to

    Current_FileEncoding:=A_FileEncoding
    FileEncoding, UTF-8
    FileRead buffer, % PathOrContent
    FileEncoding, % Current_FileEncoding
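An alternative sketch that avoids switching the script-wide default back and forth: FileOpen() takes the encoding per call. ReadUtf8() is a hypothetical helper name, and this assumes PathOrContent holds a file path at this point.

```autohotkey
; Sketch (AutoHotkey v1): read one file as UTF-8 without touching the
; script-wide default. ReadUtf8() is a hypothetical helper name, and
; PathOrContent is assumed to hold a file path here.
ReadUtf8(path) {
    file := FileOpen(path, "r", "UTF-8")
    if (!IsObject(file))
        return ""   ; the file could not be opened
    content := file.Read()
    file.Close()
    return content
}

buffer := ReadUtf8(PathOrContent)
```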

Sidequestion: Why is A_FileEncoding empty if no encoding is specified explicitly?
The Docs don't suggest to me that it should be empty - unless I am completely misunderstanding something.


This initially surfaced while examining this janky hell, and is topically related.

This makes sense so far, and the above fix seems to resolve that part of the greater issue.

However, the wild goose chase I am on due to the issue I outlined in the reddit post above is still ongoing, and I am not sure how to resolve it :P
And I could not figure out how I'd resolve it properly - aside from banning umlaute in filenames, I guess. But that obviously should not be the actual solution.
