@andenacitelli
Last active April 19, 2023 16:25
Himalaya JS - Extract All Text From Webpage
// May require some slight tweaking.
import { parse } from "himalaya";

// Nested arrays of strings that mirror the HTML hierarchy.
type TextTree = string | TextTree[];

// `logger` and `pageText` are assumed to be defined elsewhere.
const transformTree = (tree: any): TextTree => {
  if (tree.type === "text") {
    const cleaned = tree.content
      .replaceAll(/[\n\r]/g, " ")
      .replaceAll(/\s{2,}/g, " ") // collapse whitespace runs to one space so words don't fuse
      .trim();
    if (
      !/[\dA-Za-z]/.test(cleaned) || // no letters or digits at all
      cleaned.includes("function()") ||
      cleaned.includes("__footer__") ||
      /if ?\(/.test(cleaned) // if( or if ( indicates this is a JS script
    )
      return "";
    logger.info(`Valid Text: ${cleaned}`);
    return cleaned;
  }
  if (tree.children)
    return tree.children
      .map((child: any) => transformTree(child))
      .filter((x: TextTree) => x !== "" && (!Array.isArray(x) || x.length > 0));
  return "";
};

// Index 2 assumes the root element is the third top-level node; adjust for your page.
const originalTree = parse(pageText)[2];
const tree = transformTree(originalTree);

Put this together because I needed to keep the rough HTML hierarchy intact while still reducing the page to only its text. My use case is to plug the result into OpenAI's GPT models and offer that "structure" as additional context. The only iffy area is that this eliminates inline JavaScript where possible, but the filter isn't perfect: function() and if( (or if () are used as indicators that a text node is probably JavaScript.
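One way to make the JavaScript filtering depend less on text heuristics, offered as a sketch rather than as part of the gist: himalaya's JSON output gives each element node a `tagName`, so whole `script` and `style` subtrees can be dropped before their text is ever inspected. `HNode`, `SKIPPED_TAGS`, and `extractText` are hypothetical names, and the node type below covers only the fields the sketch uses.

```typescript
// Assumed subset of the node shapes in himalaya's JSON output.
type HNode =
  | { type: "text"; content: string }
  | { type: "comment"; content: string }
  | { type: "element"; tagName: string; children: HNode[] };

type TextTree = string | TextTree[];

// Hypothetical list of tags whose contents are not visible page text.
const SKIPPED_TAGS = new Set(["script", "style"]);

// Hypothetical helper: same idea as transformTree, but it drops whole
// script/style subtrees instead of guessing from the text itself.
function extractText(node: HNode): TextTree {
  if (node.type === "text") {
    const cleaned = node.content.replace(/\s+/g, " ").trim();
    return /[\dA-Za-z]/.test(cleaned) ? cleaned : "";
  }
  if (node.type === "comment") return "";
  if (SKIPPED_TAGS.has(node.tagName.toLowerCase())) return "";
  return node.children
    .map((child) => extractText(child))
    .filter((x) => x !== "" && (!Array.isArray(x) || x.length > 0));
}

// Hand-built sample tree (not real page data).
const sample: HNode = {
  type: "element",
  tagName: "body",
  children: [
    {
      type: "element",
      tagName: "script",
      children: [{ type: "text", content: "if (a) { b(); }" }],
    },
    {
      type: "element",
      tagName: "p",
      children: [{ type: "text", content: "  Apply\n  Now " }],
    },
  ],
};

const result = extractText(sample);
```

This does not catch JavaScript that ends up in ordinary text nodes, so the `function()` / `if(` checks may still be worth keeping as a second pass.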

Example Output:

[12:20:03.571] INFO (13376): Valid Text: Merchant Payment Solutions
[12:20:03.571] INFO (13376): Valid Text: Get Support
[12:20:03.571] INFO (13376): Valid Text: United States
[12:20:03.571] INFO (13376): Valid Text: Change Country
[12:20:03.571] INFO (13376): Valid Text: Elevate your lifestyle & earn
[12:20:03.571] INFO (13376): Valid Text: 100,000 Membership Rewards
[12:20:03.571] INFO (13376): Valid Text: points with Platinum.
[12:20:03.571] INFO (13376): Valid Text: Plus, enjoy exclusive benefits for Morgan Stanley Card Members only.
[12:20:03.571] INFO (13376): Valid Text: Annual Fee $695
[12:20:03.571] INFO (13376): Valid Text: Learn More
[12:20:03.571] INFO (13376): Valid Text: Apply Now
[12:20:03.571] INFO (13376): Valid Text: Earn more Cash Back – starting with
[12:20:03.571] INFO (13376): Valid Text: a
[12:20:03.571] INFO (13376): Valid Text: $250
[12:20:03.571] INFO (13376): Valid Text: statement credit.
[12:20:03.571] INFO (13376): Valid Text: Reward your daily routine with our premier
[12:20:03.571] INFO (13376): Valid Text: Cash Back Card – enjoy 6%
[12:20:03.571] INFO (13376): Valid Text: Cash Back at
[12:20:03.571] INFO (13376): Valid Text: U.S. supermarkets & more.
[12:20:03.571] INFO (13376): Valid Text: $0 intro annual fee for the first year, then $95
[12:20:03.571] INFO (13376): Valid Text: Learn More
[12:20:03.571] INFO (13376): Valid Text: Apply Now
[12:20:03.571] INFO (13376): Valid Text: The perfect companions for your portfolio
[12:20:03.571] INFO (13376): Valid Text: Luxury benefits for travel, entertainment, wellness, and more.
[12:20:03.571] INFO (13376): Valid Text: Cash Back
[12:20:03.571] INFO (13376): Valid Text: on the purchases you make every day.
[12:20:03.571] INFO (13376): Valid Text: Find your best fit—or pair them together and earn even more.
[12:20:03.571] INFO (13376): Valid Text: The Platinum Card
[12:20:03.571] INFO (13376): Valid Text: from
[12:20:03.571] INFO (13376): Valid Text: American Express Exclusively for Morgan Stanley
[12:20:03.571] INFO (13376): Valid Text: Experience premium perks and luxury benefits.
[12:20:03.571] INFO (13376): Valid Text: Earn 100,000 Membership Rewards
[12:20:03.571] INFO (13376): Valid Text: Points
[12:20:03.571] INFO (13376): Valid Text: after you spend $6,000 on purchases on your new Card in your first 6 months of Card Membership.
[12:20:03.571] INFO (13376): Valid Text: Annual Fee $695
[12:20:03.571] INFO (13376): Valid Text: Learn More
[12:20:03.571] INFO (13376): Valid Text: Morgan Stanley clients with an eligible brokerage account can apply for The Platinum Card
[12:20:03.571] INFO (13376): Valid Text: from American Express Exclusively for Morgan Stanley
[12:20:03.571] INFO (13376): Valid Text: Apply Now
[12:20:03.571] INFO (13376): Valid Text: Offer Terms
[12:20:03.571] INFO (13376): Valid Text: Benefit Terms
[12:20:03.571] INFO (13376): Valid Text: Rates & Fees
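The log above only shows the text nodes in the order they were visited; the value returned by transformTree is a nested array. A minimal sketch of turning that nesting into indented text for a prompt follows. `renderTree` is a hypothetical helper, and the nesting in `nested` is made up for illustration (only the strings come from the log).

```typescript
type TextTree = string | TextTree[];

// Hypothetical helper: one line per string, indented two spaces for each
// level of array nesting below the root array.
function renderTree(tree: TextTree, depth = 0): string {
  if (typeof tree === "string") return "  ".repeat(depth) + tree;
  return tree
    .map((child) => renderTree(child, Array.isArray(child) ? depth + 1 : depth))
    .filter((line) => line !== "")
    .join("\n");
}

// Illustrative nesting only; the real shape depends on the page's HTML.
const nested: TextTree = [
  ["Annual Fee $695", ["Learn More", "Apply Now"]],
  "Offer Terms",
];

const promptContext = renderTree(nested);
console.log(promptContext);
```

Indentation is just one option; `JSON.stringify(tree)` would also preserve the structure, at the cost of extra brackets and quotes in the prompt.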