Skip to content

Instantly share code, notes, and snippets.

@chapmanjacobd
Last active December 25, 2025 23:57
Show Gist options
  • Select an option

  • Save chapmanjacobd/00054b3374566c1c390ae39d3de1f083 to your computer and use it in GitHub Desktop.

Select an option

Save chapmanjacobd/00054b3374566c1c390ae39d3de1f083 to your computer and use it in GitHub Desktop.

Have you ever wanted to do:

SELECT * FROM www.google.com

With BigQuery you kind of can. But have you ever wanted to do:

SELECT * FROM www.wikipedia.com

With Wikidata Query Service you kind of can (semantic web!, RDF!?, https://www.w3.org/wiki/SparqlImplementations).

But have you ever wanted to do:

SELECT * FROM small_website_with_some_tabular_data

Aha! Now I have something that actually might be useful for you. It even works with ShadowDOM websites (React, Vue, Angular)

Of course, if the data is actually organized as an HTML table you can use something like pandas.read_html() or:

lb site-add --extract-html-table sqlite.db https://www.w3schools.com/html/html_tables.asp

But that is pretty boring, though might still be useful:

$ sqlite-preview sqlite.db

sqlite.db

t1 (20 rows)

tag description tag_url
<table> Defines a table /tags/tag_table.asp
<th> Defines a header cell in a table /tags/tag_th.asp
<tr> Defines a row in a table /tags/tag_tr.asp
<td> Defines a cell in a table /tags/tag_td.asp
<caption> Defines a table caption /tags/tag_caption.asp

t2 (12 rows)

company contact country
Alfreds Futterkiste Maria Anders Germany
Centro comercial Moctezuma Francisco Chang Mexico
Ernst Handel Roland Mendel Austria
Island Trading Helen Bennett UK
Laughing Bacchus Winecellars Yoshi Tannamuri Canada

t3 (1 rows)

0 1
Jill Smith 50

But wait there's more!

But what if the data that we want is not an HTML table? HTML is hierarchical and not tabular.

Well... structure is still structure. We can use patterns to force tagular data into fabular data:

lb site-add --extract-html fabular.db https://www.w3schools.com/html/html_tables.asp

This creates 24 tables:

$ sqlite-tables fabular.db
head_link
head_script
body_div_div_a
body_div_div_div
body_div_div_div_a
body_div_div
body_div_a
body_div_div_h2
body_div_div_div_div_a
body_div_div_div_div_tr_th
body_div_div_div_div_tr_td
body_div_div_div_div
body_div_div_div_div_div_a_span
body_div_div_div_div_div
body_div_div_div_div_div_div_div
body_div_div_div_div_div_div_a
body_div_div_div_p
body_div_div_div_p_code
body_div_div_div_h2
body_div_div_div_div_p
body_div_div_div_tr_th
body_div_div_div_tr_td
body_script
body_script_root

I hope this brings tears to your eyes. Here are some interesting highlights:

body_div_div_div_p (10 rows)

text
HTML tables allow web developers to arrange data into rows and columns.
A table in HTML consists of table cells inside rows and columns.
Each table cell is defined by a, and a, tag.
Everything between, and, are the content of the table cell.
Each table row starts with a, and ends with a, tag.

body_div_a (41 rows)

title href text
HTML Tutorial /html/default.asp HTML
CSS Tutorial /css/default.asp CSS
JavaScript Tutorial /js/default.asp JAVASCRIPT
SQL Tutorial /sql/default.asp SQL
Python Tutorial /python/default.asp PYTHON

body_div_div_div_a (40 rows)

title href text
W3Schools on YouTube https://www.youtube.com/@w3schools
W3Schools on LinkedIn https://www.linkedin.com/company/w3schools.com/
Join the W3schools community on Discord https://discord.gg/6Z7UaRbUQM
W3Schools on Facebook https://www.facebook.com/w3schoolscom/
W3Schools on Instagram https://www.instagram.com/w3schools.com_official/

head_link (18 rows)

href
https://www.w3schools.com/favicon.ico
/apple-touch-icon.png
/favicon-32x32.png
/favicon-16x16.png
/site.webmanifest

This is not meant to be a very serious tool but it might make HTML scraping more accessible to people who only speak SQL.

lb site-add json.db https://www.w3schools.com/html/html_tables.asp

The default mode is for sites that load data incrementally (JSON, protobufs, etc). It does many the same things to wrangle hierarchical JSON data into tabular data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment