The other day I needed to build a small web scraper to grab new listings from a local real-estate website and store them somewhere I could easily fetch them from.
Previously I had built a similar solution on a long-running DigitalOcean droplet that stored the listings in a plain JSON file. Obviously that was a less-than-ideal setup: the server sat idle most of the time, and actually fetching the data was a hassle.
I decided this was a good chance to get to know Supabase Edge Functions a bit better and use the Postgres pg_cron extension to trigger the function periodically.
Let's get started.
- Create a new Supabase project
- Create a table for storing the scraped data
- Create a new edge function
- Create a supabase client inside the edge function
- Load the desired site using Deno's fetch
- Pump the HTML into cheerio
- Filter out the desired data
- Upsert the data into the created database table
- Create a new Postgres cron job
- Profit
If you haven't already, head over to supabase.com and create a new project. You can use the free tier for this project, since we won't be using many resources.
For this tutorial we'll scrape the good old Hacker News and store the front-page articles in our database along with some meta information. (FYI, Hacker News has an actual API, so you don't really need to scrape it, but you've got to learn on something, right?)
Let's name the project "Supanews". You'll need to set a secure password, select the region closest to you and create the new project.
Next we'll need to create the table to store the articles in.
Navigate to the "Table Editor" tab, click the "Create new table" button, and name the table "articles".
For this example we'll store the article id, title, link URL, the submitter's username, and the timestamp it was posted at, so let's create those columns. Also make sure Row Level Security is enabled to prevent unwanted goblins from mucking with your database.
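If you'd rather use SQL than the table editor, a roughly equivalent definition looks like this. This is just a sketch: the column names and types are my assumptions based on the data we're about to scrape, with the Hacker News item id as the primary key so the upsert later on can detect duplicates.
-- Sketch of the "articles" table; adjust types to taste.
create table public.articles (
  id bigint primary key, -- Hacker News item id
  title text not null,
  link text,
  author text,
  posted_at timestamptz
);

-- Keep Row Level Security on so anonymous clients can't write to the table.
alter table public.articles enable row level security;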
Supabase Edge Functions are nifty little TypeScript functions that are deployed globally at the edge and executed on the Deno runtime.
To create a new Edge Function, you'll need to have the Supabase CLI installed.
On Mac you can install the CLI with Homebrew using brew install supabase/tap/supabase, or check out the other methods of installing the CLI.
Next, log into Supabase by running supabase login in your terminal.
Then create a new folder to keep our Supabase project in with mkdir supanews and initialize a new local Supabase project in it with supabase init.
Next we'll link the newly created local project with the actual live project we created on supabase.com. For that, note the project ref, which is the cryptic string after "project/" in the URL: https://app.supabase.com/project/gwiiamgxgufguedfyzez
Copy that and run supabase link --project-ref gwiiamgxgufguedfyzez. Enter your previously created database password and we're ready to start creating some edgy functions.
Run supabase functions new scrape-hn. This creates a new TypeScript file at supanews/supabase/functions/scrape-hn/index.ts. Let's open it up.
You'll find a nice little example function along with a curl command to invoke the function locally. Note that in order to invoke the function locally, you'll have to download and run Docker, then run supabase start and supabase functions serve scrape-hn before you can call that curl command against a local URL.
To actually publish the function to the interwebs we can run supabase functions deploy scrape-hn.
This will deploy the function to Supabase and you can see it in your Supabase dashboard at https://app.supabase.com/project/YOUR-PROJECT-REF/functions
Alright, enough setup, let's actually code something.
First we'll use Deno's built-in fetch to get the Hacker News front page and read the response body as HTML text.
const website = await fetch("https://news.ycombinator.com/");
const html = await website.text();
Next we'll import Cheerio to parse markup. It provides a nice API for traversing/manipulating the resulting data structure.
import * as cheerio from "https://esm.sh/cheerio";
Then let's pass the fetched HTML into Cheerio:
const page = cheerio.load(html);
Now we need to figure out how to find and target the individual Hacker News articles in the list. For that we can right-click on an article, inspect the source, and see that the element containing a single article is a tr with a class of athing. We'll use that as a starting point.
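For reference, a single front-page entry looks roughly like this (an abridged sketch, the exact attributes may differ, but these are the class names our selectors rely on):
<!-- The article row -->
<tr class="athing" id="33774353">
  <td class="title">
    <span class="titleline">
      <a href="https://example.com/article">Article title</a>
      <span class="sitebit comhead">(<a href="from?site=example.com"><span class="sitestr">example.com</span></a>)</span>
    </span>
  </td>
</tr>
<!-- The row right after it holds the submitter and the timestamp -->
<tr>
  <td class="subtext">
    <a href="user?id=someuser" class="hnuser">someuser</a>
    <span class="age" title="2022-11-27T12:34:56">2 hours ago</span>
  </td>
</tr>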
Let's call the .html() method on the result to get the HTML content of the first match, temporarily store it in a result variable, and return it to see what we have so far.
const page = cheerio.load(html);
const result = page(".athing").html();
return new Response(JSON.stringify(result), {
headers: { "Content-Type": "application/json" },
});
Ayy, would you look at that, we have the HTML for the first article containing all the information we need. Now we just need to pluck it out using handy Cheerio methods.
Since page(".athing")
returns all matching results, let's map over them and grab the id from the matched elements.
type Article = {
id: number;
};
serve(async (req) => {
const website = await fetch("https://news.ycombinator.com/");
const html = await website.text();
const $ = cheerio.load(html);
const articles: Article[] = [];
$(".athing").map((i, el) => {
articles.push({
id: parseInt($(el).attr("id") as string),
});
});
return new Response(JSON.stringify(articles), {
headers: { "Content-Type": "application/json" },
});
});
Now once we invoke the local function again, we get back a nice array of objects with ids.
[{"id":33774353},{"id":33773043},{"id":33777802},{"id":33775390},{"id":33775085},{"id":33775464},{"id":33771445},{"id":33760814},{"id":33776796},{"id":33775143},{"id":33776346},{"id":33772578},{"id":33772876},{"id":33774241},{"id":33774920},{"id":33769920},{"id":33775334},{"id":33774386},{"id":33774248},{"id":33772566},{"id":33760837},{"id":33774373},{"id":33774822},{"id":33762625},{"id":33771881},{"id":33774685},{"id":33774085},{"id":33773786},{"id":33773972},{"id":33763087}]%
Let's tackle the article name next.
type Article = {
id: number;
title: string;
};
serve(async (req) => {
const website = await fetch("https://news.ycombinator.com/");
const html = await website.text();
const $ = cheerio.load(html);
const articles: Article[] = [];
$(".athing").map((i, el) => {
articles.push({
id: parseInt($(el).attr("id") as string),
title: $(el).find(".titleline a").children().remove().end().text(),
});
});
return new Response(JSON.stringify(articles), {
headers: { "Content-Type": "application/json" },
});
});
Noice. Let's grab the link next, since it's the href attribute of the same a element.
type Article = {
id: number;
title: string;
link: string;
};
serve(async (req) => {
const website = await fetch("https://news.ycombinator.com/");
const html = await website.text();
const $ = cheerio.load(html);
const articles: Article[] = [];
$(".athing").map((i, el) => {
articles.push({
id: parseInt($(el).attr("id") as string),
title: $(el).find(".titleline a").children().remove().end().text(),
link: $(el)
.find(".titleline a")
.children()
.remove()
.end()
.attr("href") as string,
});
});
return new Response(JSON.stringify(articles), {
headers: { "Content-Type": "application/json" },
});
});
Next let's get the author name. Since it's actually not inside the .athing element, we need to target the tr right next to it and grab the author name from there.
type Article = {
id: number;
title: string;
link: string;
author: string;
};
serve(async (req) => {
const website = await fetch("https://news.ycombinator.com/");
const html = await website.text();
const $ = cheerio.load(html);
const articles: Article[] = [];
$(".athing").map((i, el) => {
articles.push({
id: parseInt($(el).attr("id") as string),
title: $(el).find(".titleline a").children().remove().end().text(),
link: $(el)
.find(".titleline a")
.children()
.remove()
.end()
.attr("href") as string,
author: $(el).next().find(".hnuser").text(),
});
});
return new Response(JSON.stringify(articles), {
headers: { "Content-Type": "application/json" },
});
});
And finally we'll get the posting timestamp, which we'll also have to cast to a string.
type Article = {
id: number;
title: string;
link: string;
author: string;
posted_at: string;
};
serve(async (req) => {
const website = await fetch("https://news.ycombinator.com/");
const html = await website.text();
const $ = cheerio.load(html);
const articles: Article[] = [];
$(".athing").map((i, el) => {
articles.push({
id: parseInt($(el).attr("id") as string),
title: $(el).find(".titleline a").children().remove().end().text(),
link: $(el)
.find(".titleline a")
.children()
.remove()
.end()
.attr("href") as string,
author: $(el).next().find(".hnuser").text(),
posted_at: $(el).next().find(".age").attr("title") as string,
});
});
return new Response(JSON.stringify(articles), {
headers: { "Content-Type": "application/json" },
});
});
Now that we have the data, let's store it in our database. First let's import the Supabase client:
import { createClient } from "https://esm.sh/@supabase/supabase-js@2";
Then we'll initialize the client with environment variables Supabase automatically makes available for us.
const supabase = createClient(
Deno.env.get("SUPABASE_URL") ?? "",
Deno.env.get("SUPABASE_SERVICE_ROLE_KEY") ?? ""
);
And finally we'll use Supabase's upsert to push the articles into the database. Since we'll run the function periodically, we're bound to scrape articles that are already in our database. We can set ignoreDuplicates to true so that Supabase will automatically ignore entries with IDs that already exist in the database.
const { data: savedArticles } = await supabase
.from("articles")
.upsert(articles, { ignoreDuplicates: true })
.select();
We'll also replace the returned object with savedArticles so that we can see which rows were actually added to the database.
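Putting it all together, the finished index.ts looks roughly like this (the serve import and its std version come from the scaffolded template, so yours may differ slightly):
import { serve } from "https://deno.land/std@0.168.0/http/server.ts";
import * as cheerio from "https://esm.sh/cheerio";
import { createClient } from "https://esm.sh/@supabase/supabase-js@2";

type Article = {
  id: number;
  title: string;
  link: string;
  author: string;
  posted_at: string;
};

serve(async (_req) => {
  // Fetch the Hacker News front page and load it into Cheerio
  const website = await fetch("https://news.ycombinator.com/");
  const html = await website.text();
  const $ = cheerio.load(html);

  // Collect one Article per ".athing" row
  const articles: Article[] = [];
  $(".athing").map((_i, el) => {
    articles.push({
      id: parseInt($(el).attr("id") as string),
      title: $(el).find(".titleline a").children().remove().end().text(),
      link: $(el)
        .find(".titleline a")
        .children()
        .remove()
        .end()
        .attr("href") as string,
      author: $(el).next().find(".hnuser").text(),
      posted_at: $(el).next().find(".age").attr("title") as string,
    });
  });

  // Upsert into the "articles" table, skipping ids we already have
  const supabase = createClient(
    Deno.env.get("SUPABASE_URL") ?? "",
    Deno.env.get("SUPABASE_SERVICE_ROLE_KEY") ?? ""
  );
  const { data: savedArticles } = await supabase
    .from("articles")
    .upsert(articles, { ignoreDuplicates: true })
    .select();

  // Return only the rows that were actually inserted this run
  return new Response(JSON.stringify(savedArticles), {
    headers: { "Content-Type": "application/json" },
  });
});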
Now once we call the function for the first time, we'll get an array back with 30 entries (all the articles from the front page). But if we immediately run the function again, we'll get back an empty array, since all the scraped listings already exist in the database and (presumably) no new articles have been added to the front page between the two function calls.
Let's terminate our locally served function with ctrl+c and push our fancy new edge function live.
supabase functions deploy scrape-hn
And with that we can shut down our local Supabase stack with supabase stop.
Now let's write some SQL to trigger the edge function with a Postgres HTTP client.
First we need to enable the pgsql-http extension. We can do that under the "Database" tab by clicking on "Extensions", finding the HTTP block and toggling it on. Or we can go to the "SQL editor" tab, replace the placeholder query with create extension if not exists http; and hit run. You should get a "Success. No rows returned" response.
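To quickly sanity-check the extension, you can fire a test request straight from the SQL editor; http_get ships with the same extension:
-- Should return 200 if the extension can make outbound requests
select status from http_get('https://news.ycombinator.com');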
Next we can compose the SQL query that will trigger the edge function. We'll need two things first: our deployed edge function's URL and the service role key (SERVICE_KEY), both of which you can find in the Supabase dashboard. Plug them into the query below in place of the placeholders.
select status
from
http((
'POST',
'https://YOUR-PROJECT-REF.supabase.co/functions/v1/scrape-hn',
array[
http_header('Authorization','Bearer YOUR-SERVICE-ROLE-KEY')
],
'application/json',
'{}'
)::http_request);
Next let's enable the pg_cron extension
create extension if not exists pg_cron;
And wrap the previous query in a cron job that runs every 30 minutes
select
cron.schedule(
'scrape-hn-every-30-min',
'*/30 * * * *',
$$
select status
from
http((
'POST',
'https://YOUR-PROJECT-REF.supabase.co/functions/v1/scrape-hn',
array[
http_header('Authorization','Bearer YOUR-SERVICE-ROLE-KEY')
],
'application/json',
'{}'
)::http_request)
$$
);
And you're done!
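As a closing tip, if you later want to check on or remove the schedule, pg_cron keeps its bookkeeping in the cron schema. These queries are plain pg_cron, nothing specific to this project:
-- List all scheduled jobs
select jobid, jobname, schedule, active from cron.job;

-- Peek at the most recent runs and their status
select status, return_message, start_time
from cron.job_run_details
order by start_time desc
limit 5;

-- Remove the job by the name we gave it (or pass its jobid instead)
select cron.unschedule('scrape-hn-every-30-min');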