underwood thunderpoot

thunderpoot / fetch_subdomains.sh
Created November 6, 2024 13:37
Shell script using curl and jq to retrieve all subdomains for a given domain from a given Common Crawl index
#!/bin/bash
# Shell script using curl and jq to retrieve all subdomains for a given domain
# from Common Crawl's most recent index or a specified crawl ID. This script
# dynamically retrieves the latest crawl ID if none is provided, fetches data
# (across multiple pages if necessary), retries failed requests, and extracts
# unique subdomains.
# Usage:
# bash fetch_subdomains.sh <domain> [crawl_id]
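The preview above stops at the usage line, so what follows is not the gist's own code. It is a minimal sketch of the flow the header comments describe (resolve the latest crawl ID, count the result pages, fetch each page with retries, reduce to unique hostnames). The function names `latest_crawl_id`, `page_count`, `fetch_page`, `extract_hosts`, and `fetch_subdomains` are hypothetical, and the use of the `fl=url` and `showNumPages=true` query parameters on the Common Crawl index server is an assumption about how one might do this, not a statement of how the original script works.

```shell
#!/bin/bash
# Sketch only: the helper names are hypothetical, not taken from the gist.

INDEX_HOST="https://index.commoncrawl.org"

# Print the ID of the newest crawl (assumes collinfo.json lists newest first).
latest_crawl_id() {
  curl -sS --fail --retry 3 "${INDEX_HOST}/collinfo.json" | jq -r '.[0].id'
}

# Print how many result pages the index reports for *.<domain>.
page_count() {
  local crawl_id="$1" domain="$2"
  curl -sS --fail --retry 3 --get \
    --data-urlencode "url=*.${domain}" \
    --data "output=json" --data "showNumPages=true" \
    "${INDEX_HOST}/${crawl_id}-index" | jq -r '.pages'
}

# Print one captured URL per line for a single result page.
# curl's --retry re-attempts transient failures before giving up.
fetch_page() {
  local crawl_id="$1" domain="$2" page="$3"
  curl -sS --fail --retry 3 --retry-delay 2 --get \
    --data-urlencode "url=*.${domain}" \
    --data "output=json" --data "fl=url" --data "page=${page}" \
    "${INDEX_HOST}/${crawl_id}-index" | jq -r '.url'
}

# Read URLs on stdin, print unique lowercase hostnames (port and userinfo removed).
extract_hosts() {
  awk -F/ 'NF >= 3 {
    host = tolower($3)
    sub(/^.*@/, "", host)      # drop user:pass@ if present
    sub(/:[0-9]+$/, "", host)  # drop :port if present
    if (host != "") print host
  }' | LC_ALL=C sort -u
}

# Tie the steps together: fetch_subdomains <domain> [crawl_id]
fetch_subdomains() {
  local domain="$1" crawl_id="${2:-}" pages page
  [ -n "$crawl_id" ] || crawl_id="$(latest_crawl_id)"
  pages="$(page_count "$crawl_id" "$domain")"
  for ((page = 0; page < pages; page++)); do
    fetch_page "$crawl_id" "$domain" "$page"
  done | extract_hosts
}
```

After sourcing the file, `fetch_subdomains example.com` would query the newest crawl, and passing a crawl ID as the second argument would pin the query to that crawl. Keeping `extract_hosts` free of network calls means it can be checked offline by piping in a few URLs.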
thunderpoot / create_cc_index_table.sql
Created November 12, 2024 19:37
SQL script that creates an external table in Amazon Athena (or a similar query engine). Contains the schema for Common Crawl's columnar index
CREATE EXTERNAL TABLE IF NOT EXISTS commoncrawl_index -- let’s create a new table with the following columns:
(
url_surtkey STRING, -- Sort-friendly URI Reordering Transform
url STRING, -- the URL (duh) including protocol (http or https)
url_host_name STRING, -- the hostname, including subdomain(s)
url_host_tld STRING, -- the top-level domain such as `.org`
url_host_registered_domain STRING, -- the registered domain name
url_host_private_domain STRING, -- private domain such as `example.com`
url_host_public_suffix STRING, -- public suffix of the domain such as `.co.uk` or `.edu`
url_protocol STRING, -- the transfer protocol used (http or https)