UPDATE: this was solved, thanks to the Cloudflare Community Discord. See comments for the solution.
I've been wanting to kick the tires with Cloudflare's HTMLRewriter to see if it could be used as an HTML parser. As a simple example, can Cloudflare Workers + HTMLRewriter be used to build an API to parse OpenGraph metadata and return the properties as a JSON document? Based on a cursory review of the documentation, it appears as if this should be quite simple.
However, I have observed a race condition where HTMLRewriter always finds fewer matching elements than are actually present unless I simulate a 1 millisecond "sleep":
await new Promise( function(resolve) { setTimeout(resolve, 1); })
This can be reproduced with Miniflare (via wrangler dev) using the following example code (index.js). Run the worker as-is to observe the race condition, then uncomment line 47 (the simulated "sleep") to see it work as expected.
I'm either doing something wrong (very likely!), or there's a bug in HTMLRewriter...
index.js
let worker = {};
// HTMLRewriter Element Handler to get <meta> element key:value pair(s)
// Reference: https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/#element-handlers
class MetaElementHandler {
  // public class fields
  properties;
  // constructor
  constructor(metadata={}) {
    this.properties = new Object(metadata);
  };
  // element handler method
  element(e) {
    console.debug("<meta property=\"%s\" content=\"%s\">", e.getAttribute("property"), e.getAttribute("content"));
    let key = e.getAttribute("property").replace("og:", "");
    let value = e.getAttribute("content");
    this.properties[key] = value;
    return;
  };
};
// Fetch Event Handler
// Reference: https://developers.cloudflare.com/workers/runtime-apis/fetch-event/
worker.fetch = async function(request, env, context) {
  // Initialize the request & fetch the target
  let params = Object.fromEntries(new URL(request.url).searchParams);
  let url = new URL(params.target);
  let target = new Request(url, {
    method: "GET",
    redirect: "follow",
    headers: {
      "User-Agent": env.USER_AGENT || "HTMLRewriter/1.0",
    }
  });
  let source = await fetch(target);
  // Use HTMLRewriter to extract target metadata
  // Reference: https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/
  console.debug("Executing HTMLRewriter...");
  var metadata = new MetaElementHandler({ url: params.target, hostname: url.hostname });
  await new HTMLRewriter().on('head meta[property^="og:"]', metadata).transform(source);
  // await new Promise( function(resolve) { setTimeout(resolve, 1); }); // Sleep for 1ms... because Y U NO async?
  console.debug("Executed HTMLRewriter...");
  console.debug(metadata.properties);
  // Return the response
  return new Response(JSON.stringify(metadata.properties).concat("\n"), {
    status: 200,
    headers: {
      "Content-Type": "application/json",
    }
  });
};
export default worker;
The expected output for GET /?target=https://theverge.com should include debug output from my ElementHandler element(e) method between the Executing HTMLRewriter... and Executed HTMLRewriter... debug output.
Executing HTMLRewriter...
<meta property="og:description" content="The Verge is about technology and how it makes us feel...">
# [ more meta element debug output... ]
Executed HTMLRewriter...
{
url: "https://theverge.com",
hostname: "theverge.com",
description: "The Verge is about technology and how it makes us feel...",
type: "website",
image: "https://cdn.vox-cdn.com/.../the_verge_social_share.png",
site_name: "The Verge"
}
The actual output for GET /?target=https://theverge.com reveals that HTMLRewriter is finding elements matching my selector, but the ElementHandler element(e) method debug output comes after the Executed HTMLRewriter... debug output.
Executing HTMLRewriter...
Executed HTMLRewriter...
{
url: "https://theverge.com",
hostname: "theverge.com"
}
<meta property="og:description" content="The Verge is about technology and how it makes us feel...">
NOTE: in some cases there will be no debug output from my ElementHandler element(e) method; this was what helped me realize there was a race condition.
Problem solved, thanks to @kian in the #workers-discussions channel of the Cloudflare Community Discord. In the end, all I needed to do was append a call to .text() and everything started working as expected! 🙌
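For anyone landing here later, this is a minimal sketch of what the fix might look like, assuming the .text() call goes on the Response returned by transform(). As far as I understand it, transform() hands back a Response right away and the element handlers only run while that Response's body is being read, so awaiting .text() drains the body and gives every handler a chance to fire. The names OpenGraphHandler and extractOpenGraph, and the Rewriter parameter (only there so the helper can be exercised outside the Workers runtime), are my own illustrations and not part of the original index.js.

```javascript
// Collects og:* properties, in the same spirit as MetaElementHandler above.
class OpenGraphHandler {
  constructor(seed = {}) {
    this.properties = { ...seed };
  }

  element(e) {
    const property = e.getAttribute("property");
    const content = e.getAttribute("content");
    // getAttribute returns null when the attribute is missing; skip those.
    if (property === null || content === null) return;
    this.properties[property.replace(/^og:/, "")] = content;
  }
}

// Hypothetical helper: runs the rewriter and waits until the whole body
// has been consumed before returning the collected properties.
// Rewriter defaults to the global HTMLRewriter of the Workers runtime.
async function extractOpenGraph(response, seed = {}, Rewriter = globalThis.HTMLRewriter) {
  const handler = new OpenGraphHandler(seed);
  const transformed = new Rewriter()
    .on('head meta[property^="og:"]', handler)
    .transform(response);
  // Reading the body is what drives the handlers; the text itself is discarded.
  await transformed.text();
  return handler.properties;
}

// Sketch of the fetch handler using the helper.
// In a module Worker this object would be the default export.
const worker = {
  async fetch(request, env) {
    const target = new URL(new URL(request.url).searchParams.get("target"));
    const source = await fetch(target, {
      headers: { "User-Agent": env.USER_AGENT || "HTMLRewriter/1.0" },
    });
    const properties = await extractOpenGraph(source, {
      url: target.href,
      hostname: target.hostname,
    });
    return new Response(JSON.stringify(properties) + "\n", {
      status: 200,
      headers: { "Content-Type": "application/json" },
    });
  },
};
```

With this shape there is no need for the 1 ms "sleep": by the time extractOpenGraph resolves, the body has been read to the end. If only the side effects matter, reading the body in another way (for example arrayBuffer()) should work just as well, since any full read drains the stream.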