@coderofsalvation
Last active April 24, 2026 08:07
tiny naive but extendable html/rss/xml parser in AWK
#!/usr/bin/env -S gawk -f
# Requires GNU awk: patsplit() is a gawk extension.
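BEGIN { RS = "<" }   # Split input on "<" so each record begins with a tag body
NR > 1 {
    # Find the end of the opening/closing tag
    idx = index($0, ">")
    if (idx > 0) {
        v   = substr($0, 1, idx-1)   # Everything inside < ... >
        txt = substr($0, idx+1)      # Everything after >
        if (v ~ /^\//) {
            # CLOSING TAG: emit the innermost open element, then pop it
            if (p > 0) { process(s_n[p], stack[p], s_v[p]); p--; }
        } else if (v !~ /(\/$|^\?|^!)/) {
            # OPENING TAG
            # Skipped: self-closing <x/>, <?...?> and <!...>. Only a trailing "/"
            # counts, so a "/" inside an attribute (e.g. a URL) no longer drops the tag.
            # Separate tag name from attribute string using first whitespace
            match(v, /[ \t\n\r]/)
            if (RSTART) {
                s_n[++p] = substr(v, 1, RSTART-1)
                stack[p] = substr(v, RSTART)   # Keep the whole attribute string
            } else {
                s_n[++p] = v
                stack[p] = ""
            }
            s_v[p] = txt   # Only the text directly after the opening tag is kept
        }
    }
}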
function process(tag, attr_str, val,    attr_map, n, i, b, eq, key, k) {
    # 1. Extract attributes: key="value", key='value' or key=value.
    #    Quoted values may contain spaces; ":" allows names such as xml:lang.
    n = patsplit(attr_str, b, /[a-zA-Z0-9_:-]+=("[^"]*"|'[^']*'|[^ \t\n\r>]+)/)
    for (i = 1; i <= n; i++) {
        eq  = index(b[i], "=")   # Split on the first "=" only, so values may contain "="
        key = substr(b[i], 1, eq - 1)
        attr_map[key] = substr(b[i], eq + 1)
        gsub(/^["']|["']$/, "", attr_map[key])   # Strip surrounding quotes
    }
    # 2. Clean up value text
    gsub(/^[ \t\n\r]+|[ \t\n\r]+$/, "", val)
    # 3. Output
    printf "Tag: %s", tag
    if (val != "") printf " | Text: %s", val
    for (k in attr_map) printf " | %s: %s", k, attr_map[k]
    print ""
}
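
The parser rests on one trick: setting RS to "<" makes every record start with a tag body, followed by ">" and whatever text comes after the tag. A minimal sketch of just that splitting step, using only POSIX awk features, might look like the following. The function name `split_tags` and the sample input are made up for illustration and are not part of the gist.

```shell
#!/bin/sh
# Sketch of the record-splitting idea only (no stack, no attribute parsing).
# split_tags is a hypothetical helper name, not part of the original script.
split_tags() {
  awk 'BEGIN { RS = "<" }              # each record now starts right after a "<"
  NR > 1 {                             # record 1 is whatever precedes the first "<"
    idx = index($0, ">")
    if (idx == 0) next                 # no ">" in this record: not a tag, skip it
    tag = substr($0, 1, idx - 1)       # text between "<" and ">"
    txt = substr($0, idx + 1)          # text after ">" up to the next "<"
    gsub(/^[ \t\n\r]+|[ \t\n\r]+$/, "", txt)   # trim surrounding whitespace
    printf "[%s] [%s]\n", tag, txt
  }'
}

# Sample input, made up for this sketch
printf '<item><title>Hello</title></item>\n' | split_tags
```

Each output line shows one tag body and the text that directly follows it, so a closing tag such as `/title` appears as its own record. The full script uses that fact to pop its stack and call process().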