@kuldar
Created November 29, 2022 06:00

Scraping websites with Supabase

The other day I needed to build a small web scraper to get new listings from a local real-estate website and store them somewhere I could easily fetch them.

Previously I had built a similar solution on a long-running DigitalOcean droplet and stored the listings in a plain JSON file. Obviously that was a less-than-ideal setup, since the server was doing nothing most of the time, and it was a hassle to actually fetch the data.

I decided this was a good chance to get to know Supabase Edge Functions a bit better and utilize the Postgres pg_cron extension to periodically trigger the function.

Let's get started.

General overview

  1. Create a new Supabase project
  2. Create a table for storing the scraped data
  3. Create a new edge function
  4. Create a Supabase client inside the edge function
  5. Load the desired site using fetch
  6. Pump the HTML into Cheerio
  7. Filter out the desired data
  8. Upsert the data into the created database table
  9. Create a new Postgres cron job
  10. Profit

Creating a new Supabase project

If you haven't already, head over to supabase.io and create a new project. You can use the free tier for this project, since we won't be using many resources.

For this tutorial we'll try to scrape the good old Hacker News and store the articles from the front page in our database along with some meta information. (FYI, Hacker News has an actual API so you don't really need to scrape it, but you've got to learn on something, right?)

Let's name the project "Supanews". You'll need to set a secure password, select the region closest to you and create the new project.

Next we'll need to create the table to store the articles in.

Navigate to the "Table Editor" tab, click the "Create new table" button, and name the table "articles".

For this example we'll store the article id, title, link URL, the submitter's username and the timestamp it was posted at, so let's create those columns. Also make sure to have Row Level Security enabled to prevent unwanted goblins from mucking with your database.
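If you prefer SQL to clicking through the table editor, a rough equivalent to run in the SQL editor might look like the sketch below. The column names match what our scraping function will insert later; the exact column types are my assumption, so adjust them to taste.

create table articles (
  id bigint primary key,   -- Hacker News item id
  title text,              -- article title
  link text,               -- submitted URL
  author text,             -- submitter's username
  posted_at timestamptz    -- when the item was posted
);

-- Keep Row Level Security on so random visitors can't write to the table
alter table articles enable row level security;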

Creating a new edge function

Supabase Edge Functions are nifty little TypeScript functions that are deployed globally at the edge and executed on the Deno runtime.

To create a new Edge Function, you'll need to have the Supabase CLI installed.

On a Mac you can install the CLI with Homebrew (brew install supabase/tap/supabase), or check out the other methods of installing the CLI.

Next log into Supabase by running supabase login in your terminal.

Next let's create a new folder to keep our Supabase project in (mkdir supanews) and initialize a new local Supabase project in it with supabase init.

Next we'll link the newly created local project with the actual live project we created on supabase.com. For that, note the project ref, which is the cryptic string after "project/" in the URL: https://app.supabase.com/project/gwiiamgxgufguedfyzez

Copy that and run supabase link --project-ref gwiiamgxgufguedfyzez

Enter your previously created database password and we're ready to start creating some edgy functions.

Run supabase functions new scrape-hn. This creates a new TypeScript file at supanews/supabase/functions/scrape-hn/index.ts. Let's open it up.

You'll find a nice little example function along with a curl command to invoke the function locally. Note that in order to invoke the function locally, you'll have to download and run Docker, then run supabase start and supabase functions serve scrape-hn to actually be able to call that curl command against a local URL.

To actually publish the function into the interwebs we can run supabase functions deploy scrape-hn.

This will deploy the function to Supabase and you can see it in your Supabase dashboard at https://app.supabase.com/project/YOUR-PROJECT-REF/functions

Alright, enough setup, let's actually code something.

Building a scraping function

First we'll use Deno's built-in fetch to get the Hacker News front page and read its HTML.

const website = await fetch("https://news.ycombinator.com/");
const html = await website.text();

Next we'll import Cheerio to parse the markup. It provides a nice API for traversing and manipulating the resulting data structure.

import * as cheerio from "https://esm.sh/cheerio";

First let's pass the fetched HTML into Cheerio:

const page = cheerio.load(html);

Now we need to figure out how to find and target an individual Hacker News article in the list. For that we can right-click on an article, inspect the source and see that the element containing a single article is a tr with a class of athing. We'll use that as a starting point.

Let's call the .html() method on the result to get the HTML content of the first match, temporarily store it in a result variable, and return it to see what we have so far.

const page = cheerio.load(html);
const result = page(".athing").html();

return new Response(JSON.stringify(result), {
  headers: { "Content-Type": "application/json" },
});

Ayy, would you look at that, we have the HTML for the first article containing all the information we need. Now we just need to pluck it out using handy Cheerio methods.

Since page(".athing") returns all matching results, let's map over them and grab the id from the matched elements.

// serve ships with the generated function template (the pinned std version may differ in yours)
import { serve } from "https://deno.land/std@0.168.0/http/server.ts";
import * as cheerio from "https://esm.sh/cheerio";

type Article = {
  id: number;
};

serve(async (req) => {
  const website = await fetch("https://news.ycombinator.com/");
  const html = await website.text();
  const $ = cheerio.load(html);
  
  const articles: Article[] = [];

  $(".athing").map((i, el) => {
    articles.push({
      id: parseInt($(el).attr("id") as string),
    });
  });

  return new Response(JSON.stringify(articles), {
    headers: { "Content-Type": "application/json" },
  });
});

Now once we invoke the local function again, we get back a nice array of objects with ids.

[{"id":33774353},{"id":33773043},{"id":33777802},{"id":33775390},{"id":33775085},{"id":33775464},{"id":33771445},{"id":33760814},{"id":33776796},{"id":33775143},{"id":33776346},{"id":33772578},{"id":33772876},{"id":33774241},{"id":33774920},{"id":33769920},{"id":33775334},{"id":33774386},{"id":33774248},{"id":33772566},{"id":33760837},{"id":33774373},{"id":33774822},{"id":33762625},{"id":33771881},{"id":33774685},{"id":33774085},{"id":33773786},{"id":33773972},{"id":33763087}]%    

Let's tackle the article name next.

type Article = {
  id: number;
  title: string;
};

serve(async (req) => {
  const website = await fetch("https://news.ycombinator.com/");
  const html = await website.text();
  const $ = cheerio.load(html);

  const articles: Article[] = [];

  $(".athing").map((i, el) => {
    articles.push({
      id: parseInt($(el).attr("id") as string),
      title: $(el).find(".titleline a").children().remove().end().text(),
    });
  });

  return new Response(JSON.stringify(articles), {
    headers: { "Content-Type": "application/json" },
  });
});

Noice. Let's get the link next, since it's attached to the same a element.

type Article = {
  id: number;
  title: string;
  link: string;
};

serve(async (req) => {
  const website = await fetch("https://news.ycombinator.com/");
  const html = await website.text();
  const $ = cheerio.load(html);

  const articles: Article[] = [];

  $(".athing").map((i, el) => {
    articles.push({
      id: parseInt($(el).attr("id") as string),
      title: $(el).find(".titleline a").children().remove().end().text(),
      link: $(el)
        .find(".titleline a")
        .children()
        .remove()
        .end()
        .attr("href") as string,
    });
  });

  return new Response(JSON.stringify(articles), {
    headers: { "Content-Type": "application/json" },
  });
});

Next let's get the author's name. Since it's actually not inside the .athing element, we need to target the tr right next to it and grab the username from there.

type Article = {
  id: number;
  title: string;
  link: string;
  author: string;
};

serve(async (req) => {
  const website = await fetch("https://news.ycombinator.com/");
  const html = await website.text();
  const $ = cheerio.load(html);

  const articles: Article[] = [];

  $(".athing").map((i, el) => {
    articles.push({
      id: parseInt($(el).attr("id") as string),
      title: $(el).find(".titleline a").children().remove().end().text(),
      link: $(el)
        .find(".titleline a")
        .children()
        .remove()
        .end()
        .attr("href") as string,
      author: $(el).next().find(".hnuser").text(),
    });
  });

  return new Response(JSON.stringify(articles), {
    headers: { "Content-Type": "application/json" },
  });
});

And finally we'll get the date, which we'll have to cast to a string as well.

type Article = {
  id: number;
  title: string;
  link: string;
  author: string;
  posted_at: string;
};

serve(async (req) => {
  // Fetch the Hacker News front page and load the HTML into Cheerio
  const website = await fetch("https://news.ycombinator.com/");
  const html = await website.text();
  const $ = cheerio.load(html);

  const articles: Article[] = [];

  // Each story is a tr.athing; its author and timestamp live in the tr right after it
  $(".athing").map((i, el) => {
    articles.push({
      id: parseInt($(el).attr("id") as string),
      title: $(el).find(".titleline a").children().remove().end().text(),
      link: $(el)
        .find(".titleline a")
        .children()
        .remove()
        .end()
        .attr("href") as string,
      author: $(el).next().find(".hnuser").text(),
      posted_at: $(el).next().find(".age").attr("title") as string,
    });
  });

  return new Response(JSON.stringify(articles), {
    headers: { "Content-Type": "application/json" },
  });
});

Saving results into Supabase

First let's import the Supabase client:

import { createClient } from "https://esm.sh/@supabase/supabase-js@2";

Then we'll initialize the client with environment variables Supabase automatically makes available for us.

const supabase = createClient(
  Deno.env.get("SUPABASE_URL") ?? "",
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY") ?? ""
);

And finally we'll use a Supabase upsert to push the articles into the database. Since we'll run the function periodically, we're bound to scrape articles that we already have in our database. We can set ignoreDuplicates to true so that Supabase will automatically ignore entries with IDs that already exist in the table.

const { data: savedArticles } = await supabase
    .from("articles")
    .upsert(articles, { ignoreDuplicates: true })
    .select();

We'll also replace the returned object with savedArticles so that we can see which rows were actually added to the database.
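Put together, the end of the handler now returns the upsert result instead of the raw scrape. Roughly like this (just the pieces from above assembled):

// Upsert the scraped articles; rows whose id already exists are ignored
const { data: savedArticles } = await supabase
  .from("articles")
  .upsert(articles, { ignoreDuplicates: true })
  .select();

// Return only the rows that were actually inserted on this run
return new Response(JSON.stringify(savedArticles), {
  headers: { "Content-Type": "application/json" },
});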

Now once we call the function for the first time, we'll get an array back with 30 entries (all the articles from the front page). But if we immediately run the function again, we'll get back an empty array, since all the scraped listings already exist in the database and (presumably) no new articles have been added to the front page between the two calls.

Let's terminate our locally served function with Ctrl+C and push our fancy new edge function live.

supabase functions deploy scrape-hn

And with that we can shut down our local Supabase stack with supabase stop.

Cron job to trigger the Edge function

Now let's write some SQL to trigger the edge function with a Postgres HTTP client.

First we need to enable the pgsql-http extension. We can do that under the "Database" tab by clicking on "Extensions", finding the HTTP block and toggling it on. Or we can go to the "SQL editor" tab, replace the placeholder query with create extension if not exists http; and hit run. You should get a "Success. No rows returned" response.

Next we can compose the SQL query that will trigger the edge function. We'll need two things first: our edge function URL and a key for the Authorization header. (The query below uses the local development URL and the demo anon key printed by supabase start; when targeting the deployed function, swap in your project's function URL and key.)

select status
  from
    http((
      'POST',
      'http://localhost:54321/functions/v1/scrape-hn',
      array[
        http_header('Authorization','Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZS1kZW1vIiwicm9sZSI6ImFub24iLCJleHAiOjE5ODM4MTI5OTZ9.CRXP1A7WOeoJeXxjNni43kdQwgnWNReilDMblYTn_I0')
      ],
      '',
      'application/json'
    )::http_request)

Next let's enable the pg_cron extension:

create extension if not exists pg_cron;

And wrap the previous query in a cron schedule that runs every 30 minutes:

select
  cron.schedule(
    'scrape-hn-every-30-min',
    '*/30 * * * *',
    $$
    select status
      from
        http((
          'POST',
          'http://localhost:54321/functions/v1/scrape-hn',
          array[
            http_header('Authorization','Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZS1kZW1vIiwicm9sZSI6ImFub24iLCJleHAiOjE5ODM4MTI5OTZ9.CRXP1A7WOeoJeXxjNni43kdQwgnWNReilDMblYTn_I0')
          ],
          '',
          'application/json'
        )::http_request)
    $$
  );
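If you want to double-check that the job was registered, pg_cron keeps its schedules in the cron.job table, so a quick query in the SQL editor will show it:

-- List the scheduled jobs, including our scrape-hn-every-30-min entry
select * from cron.job;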

And you're done!
