@michaeltelford · Last active July 23, 2025
Ruby script using Wgit to extract the meaningful content from a webpage (without the crap, e.g. cookie banners).
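
The script's only dependency is the wgit gem (any recent Ruby should do):

$ gem install wgit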
require "wgit"
# Remove the default extractors since we won't be using them.
Wgit::Document.remove_extractors
# The default name of the output file containing the clean HTML.
def default_file_name
"webpage.html"
end
# The HTML elements containing the content that we're interested in viewing.
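# Note the deliberate omissions, e.g. script, style, nav, form, button and iframe,
# which is where most of a page's crap tends to live.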
def content_elements
  Set.new(%i[
    a abbr address aside b bdi bdo blockquote caption cite
    code data del details dfn div dl em figcaption figure footer h1 h2
    h3 h4 h5 h6 header hr i img ins kbd legend main mark meter ol
    option output p pre q rb rt ruby s samp section small span strong sub
    summary sup textarea time u ul var wbr
  ])
end

# Returns an xpath query (String) to extract the meaningful content on a page.
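# The built query looks like: "//a | //abbr | //address | ... | //wbr".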
def content_xpath
  content_elements.each_with_index.reduce("") do |xpath, (el, i)|
    xpath += " | " unless i.zero?
    xpath += format("//%s", el)
  end
end

# Extracts the meaningful content on a webpage to be viewed without annoyances like popups etc.
class CleanCrawler < Wgit::Base
  start ARGV.first
  mode :crawl_url

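  # Each `extract` below defines a reader of the same name on the crawled
  # Wgit::Document, used as doc.content and doc.article in #parse.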
  extract :content, content_xpath, singleton: false, text_content_only: false
  extract :article, "//article", singleton: true, text_content_only: false

  attr_reader :file_name

  # Parse/process the crawled web document. We want to extract and write only the
  # meaningful content, not the crap around it e.g. cookie banners.
  def parse(doc)
    raise "doc.content should be an Enumerable" unless doc.content.is_a?(Enumerable)

    @file_name = ARGV[1] || default_file_name
    File.open(@file_name, "w+") do |f|
      write_html(f) do
        write_content(f, doc)
      end
    end
  end

  private

  # Write the opening HTML tags, with a CSS link to Simple.css for basic styling.
  def write_html(file)
    html_opening_tags = <<~HTML
      <html>
      <head>
      <link rel="stylesheet" href="https://cdn.simplecss.org/simple.min.css">
      </head>
    HTML

    file.write(html_opening_tags)
    yield
    file.write("</html>")
  end

  # If there's an <article> element, write it to file; otherwise wrap all of the
  # page content in one.
  def write_content(file, doc)
    if doc.article
      file.write(doc.article)
    else
      write_article(file) do
        doc.content.each { |el| file.write(el) }
      end
    end
  end

  def write_article(file)
    file.write("<article>")
    yield
    file.write("</article>")
  end
end

if __FILE__ == $0
  if ARGV.empty?
    raise "missing URL parameter, use like: ruby extract.rb http://example.com [example.html]"
  end

  crawler = CleanCrawler.run
  file_name = crawler.file_name
  file_size = File.size(file_name)
  absolute_file_path = File.expand_path(file_name)

  puts "Wrote #{file_size} bytes to:\n#{absolute_file_path}"
end
@michaeltelford (Author):
Use like:

$ ruby extract.rb http://example.com example.html
Wrote 751 bytes to:
${PWD}/example.html

The final argument is a file name and is optional, defaulting to webpage.html.

You can then open the resulting file in any browser to see a clean version of the webpage you've crawled, without annoyances like popups, subscription requests, cookie banners etc.
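
For reference, the generated file has roughly this shape (an illustrative sketch; the article body is whatever was extracted from the crawled page):

<html>
<head>
<link rel="stylesheet" href="https://cdn.simplecss.org/simple.min.css">
</head>
<article>...extracted page content...</article></html>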
