Have you ever wanted to do:
SELECT * FROM www.google.com
With BigQuery you kind of can. But have you ever wanted to do:
SELECT * FROM www.wikipedia.com
With Wikidata Query Service you kind of can (semantic web!, RDF!?, https://www.w3.org/wiki/SparqlImplementations).
But have you ever wanted to do:
SELECT * FROM small_website_with_some_tabular_data
Aha! Now I have something that actually might be useful for you. It even works with ShadowDOM websites (React, Vue, Angular)
Of course, if the data is actually organized as an HTML table you can use something like pandas.read_html() or:
lb site-add --extract-html-table sqlite.db https://www.w3schools.com/html/html_tables.asp
But that is pretty boring, though might still be useful:
$ sqlite-preview sqlite.db
| tag | description | tag_url |
|---|---|---|
<table> |
Defines a table | /tags/tag_table.asp |
<th> |
Defines a header cell in a table | /tags/tag_th.asp |
<tr> |
Defines a row in a table | /tags/tag_tr.asp |
<td> |
Defines a cell in a table | /tags/tag_td.asp |
<caption> |
Defines a table caption | /tags/tag_caption.asp |
| company | contact | country |
|---|---|---|
| Alfreds Futterkiste | Maria Anders | Germany |
| Centro comercial Moctezuma | Francisco Chang | Mexico |
| Ernst Handel | Roland Mendel | Austria |
| Island Trading | Helen Bennett | UK |
| Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada |
| 0 | 1 |
|---|---|
| Jill Smith | 50 |
But what if the data that we want is not an HTML table? HTML is hierarchical and not tabular.
Well... structure is still structure. We can use patterns to force tagular data into fabular data:
lb site-add --extract-html fabular.db https://www.w3schools.com/html/html_tables.asp
This creates 24 tables:
$ sqlite-tables fabular.db
head_link
head_script
body_div_div_a
body_div_div_div
body_div_div_div_a
body_div_div
body_div_a
body_div_div_h2
body_div_div_div_div_a
body_div_div_div_div_tr_th
body_div_div_div_div_tr_td
body_div_div_div_div
body_div_div_div_div_div_a_span
body_div_div_div_div_div
body_div_div_div_div_div_div_div
body_div_div_div_div_div_div_a
body_div_div_div_p
body_div_div_div_p_code
body_div_div_div_h2
body_div_div_div_div_p
body_div_div_div_tr_th
body_div_div_div_tr_td
body_script
body_script_root
I hope this brings tears to your eyes. Here are some interesting highlights:
| text |
|---|
| HTML tables allow web developers to arrange data into rows and columns. |
| A table in HTML consists of table cells inside rows and columns. |
| Each table cell is defined by a, and a, tag. |
| Everything between, and, are the content of the table cell. |
| Each table row starts with a, and ends with a, tag. |
| title | href | text |
|---|---|---|
| HTML Tutorial | /html/default.asp | HTML |
| CSS Tutorial | /css/default.asp | CSS |
| JavaScript Tutorial | /js/default.asp | JAVASCRIPT |
| SQL Tutorial | /sql/default.asp | SQL |
| Python Tutorial | /python/default.asp | PYTHON |
| title | href | text |
|---|---|---|
| W3Schools on YouTube | https://www.youtube.com/@w3schools | |
| W3Schools on LinkedIn | https://www.linkedin.com/company/w3schools.com/ | |
| Join the W3schools community on Discord | https://discord.gg/6Z7UaRbUQM | |
| W3Schools on Facebook | https://www.facebook.com/w3schoolscom/ | |
| W3Schools on Instagram | https://www.instagram.com/w3schools.com_official/ | |
| href |
|---|
| https://www.w3schools.com/favicon.ico |
| /apple-touch-icon.png |
| /favicon-32x32.png |
| /favicon-16x16.png |
| /site.webmanifest |
This is not meant to be a very serious tool but it might make HTML scraping more accessible to people who only speak SQL.
lb site-add json.db https://www.w3schools.com/html/html_tables.asp
The default mode is for sites that load data incrementally (JSON, protobufs, etc). It does many the same things to wrangle hierarchical JSON data into tabular data.