@kvncrw
Created April 3, 2026 19:26

KF Archive Migration Pipeline

Architecture

flowchart TD
    subgraph Crawl["basilisk-tor crawl"]
        KF["KF .onion"]
        TOR["Tor SOCKS5 Proxy"]
        POW["Tartarus PoW Solver"]
        PARSE["XenForo Parser"]
        WARC["WARC Writer"]
        EXTRACT["Archive Link Extractor"]

        KF --> TOR --> POW --> PARSE
        PARSE --> WARC
        PARSE --> EXTRACT
    end

    subgraph DB["SQLite (iSCSI SSD)"]
        LINKS["archive_links table"]
    end

    subgraph Migrate["basilisk-tor migrate"]
        CLAIM["Claim pending link"]
        CLEAN["Extract clean URL"]
        SUBMIT["Submit to Basilisk API"]
        TRACK["Update status"]

        CLAIM --> CLEAN --> SUBMIT --> TRACK
    end

    subgraph Basilisk["Basilisk Capture"]
        API["archive.nashgo.org API"]
        SCOOP["Scoop capture engine"]
        S3["Garage S3"]

        API --> SCOOP --> S3
    end

    EXTRACT --> LINKS
    LINKS --> CLAIM
    SUBMIT --> API
    TRACK --> LINKS

Status Flow

stateDiagram-v2
    [*] --> pending: crawl discovers link
    pending --> fetching: worker claims
    fetching --> captured: Basilisk returns UUID
    fetching --> failed: API error / timeout
    failed --> pending: release_stale_locks
    captured --> replaced: KF owner rewrites link
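
The transitions in the diagram can be written down as a small lookup table. This is only an illustrative sketch: the names `TRANSITIONS`, `can_transition`, and `advance` are hypothetical and are not taken from basilisk-tor itself. The `skipped` status listed in the schema has no edge in the diagram, so it is left out here.

```python
# Allowed status transitions, copied edge-for-edge from the state diagram.
# All names in this block are hypothetical, chosen for illustration only.
TRANSITIONS = {
    "pending": {"fetching"},             # worker claims
    "fetching": {"captured", "failed"},  # Basilisk returns UUID / API error or timeout
    "failed": {"pending"},               # release_stale_locks
    "captured": {"replaced"},            # KF owner rewrites link
    "replaced": set(),                   # terminal state
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the diagram has an edge current -> new."""
    return new in TRANSITIONS.get(current, set())


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the diagram has no such edge."""
    if not can_transition(current, new):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

A guard like this would reject, for example, a jump straight from `pending` to `captured` without a worker claiming the row first.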

Archive-of-Archives Strategy

flowchart LR
    LINK["archive.ph/XyZ12"]
    CHECK{Can extract original URL?}
    ORIG["Capture https://reddit.com/r/..."]
    FALLBACK["Capture https://archive.ph/XyZ12"]
    USIPS["USIPS Archive UUID"]

    LINK --> CHECK
    CHECK -->|"Wayback: yes"| ORIG --> USIPS
    CHECK -->|"archive.today short hash: no"| FALLBACK --> USIPS
| Provider | URL Pattern | Clean URL Extractable? | Strategy |
|---|---|---|---|
| archive.org | web.archive.org/web/{ts}/{url} | Yes | Capture original URL |
| archive.today | archive.ph/{hash} | No | Capture archive page |
| archive.today | archive.ph/newest/{url} | Yes | Capture original URL |
| ghostarchive | ghostarchive.org/archive/{id} | No | Capture archive page |
| google cache | webcache.googleusercontent.com/... | Yes | Capture original URL |
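
The "extract clean URL, else fall back to the archive page" rule could be sketched roughly as follows. This is an assumption-laden illustration, not the actual basilisk-tor extractor: the function names and regular expressions are hypothetical, only the two fully specified extractable patterns from the table are handled, and the Google cache pattern is skipped because it is elided above.

```python
import re
from typing import Optional

# Wayback: web.archive.org/web/{ts}/{url}. The timestamp is a run of digits,
# optionally followed by a modifier suffix such as "id_" or "if_".
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$")

# archive.today "newest" form: archive.ph/newest/{url}.
_ARCHIVE_TODAY_NEWEST = re.compile(r"^https?://archive\.ph/newest/(.+)$")


def extract_clean_url(archive_url: str) -> Optional[str]:
    """Return the original URL embedded in an archive link, or None."""
    for pattern in (_WAYBACK, _ARCHIVE_TODAY_NEWEST):
        match = pattern.match(archive_url)
        if match:
            return match.group(1)
    return None  # short hashes and opaque ids carry no original URL


def capture_target(archive_url: str) -> str:
    """Pick what to capture: the original URL if extractable, else the archive page."""
    return extract_clean_url(archive_url) or archive_url
```

With this sketch, `capture_target("https://archive.ph/XyZ12")` falls back to the archive page itself, matching the "No" rows of the table.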

CLI Commands

# Crawl KF subforum, write WARC, ingest archive links to SQLite
basilisk-tor crawl \
  "http://kiwifarmsaaf4t2h7gc3dfc5ojhmqruw2nit3uejrpiagrxeuxiyxcyd.onion/forums/lolcows.16/" \
  --proxy "socks5h://tor-proxy-pool.proxy-pool.svc.cluster.local:9050" \
  --output /data/kf-lolcows.warc.gz \
  --db /data/archive_links.db \
  --delay 2000

# Run migration worker (long-lived, polls for new links)
basilisk-tor migrate \
  --db /data/archive_links.db \
  --api-url "https://archive.nashgo.org" \
  --submit-delay 5

# Check progress
basilisk-tor stats --db /data/archive_links.db

SQLite Schema

CREATE TABLE archive_links (
    id                 INTEGER PRIMARY KEY AUTOINCREMENT,
    source_url         TEXT NOT NULL,       -- KF thread URL where found
    archive_provider   TEXT NOT NULL,       -- archive.org, archive.today, etc.
    archive_url        TEXT NOT NULL UNIQUE, -- original archive link
    clean_url          TEXT NOT NULL,       -- extracted original URL (or empty)
    migration_status   TEXT NOT NULL,       -- pending|fetching|captured|replaced|failed|skipped
    lock               INTEGER NOT NULL,    -- 1 = being processed
    error_message      TEXT,
    usips_archive_id   TEXT,               -- Basilisk UUID once captured
    extracted_at       TEXT NOT NULL,
    updated_at         TEXT NOT NULL,
    created_at         TEXT NOT NULL
)
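
Given this schema, the "claim pending link" step might look roughly like the sketch below. It is an assumption about how a worker could use the `migration_status` and `lock` columns, not the actual basilisk-tor implementation; `claim_pending` is a hypothetical name, and storing `updated_at` as an ISO 8601 UTC string is also an assumption.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def claim_pending(conn: sqlite3.Connection) -> Optional[int]:
    """Claim one pending, unlocked row and return its id, or None if none is free."""
    row = conn.execute(
        "SELECT id FROM archive_links "
        "WHERE migration_status = 'pending' AND lock = 0 "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None

    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        # The WHERE clause re-checks status and lock, so if another worker
        # claimed the row after our SELECT, this UPDATE matches zero rows.
        cur = conn.execute(
            "UPDATE archive_links "
            "SET migration_status = 'fetching', lock = 1, updated_at = ? "
            "WHERE id = ? AND migration_status = 'pending' AND lock = 0",
            (now, row[0]),
        )
    return row[0] if cur.rowcount == 1 else None
```

The guarded `UPDATE` is the part that matters: two workers may select the same candidate id, but only one of them will see `rowcount == 1`.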

What's Left (Requires KF Owner)

The captured → replaced transition requires XenForo API access to edit posts:

  1. Query SQLite for migration_status = 'captured' rows
  2. For each row: rewrite archive_url → https://archive.usips.org/archives/{usips_archive_id} in the KF post
  3. Mark as replaced in SQLite

Each rewrite is a single XenForo API call, POST /api/posts/{id}, with the updated message body. It needs an API key issued by the KF admin.
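
The three steps above could be sketched as follows. Everything here is hypothetical: the helper names, the form-encoded `message` field, and the `XF-Api-Key` header reflect my understanding of the XenForo REST API and should be checked against the forum's actual configuration. The schema stores the thread URL rather than a post id, so `post_id` is assumed to come from elsewhere. The request is only built, never sent.

```python
import sqlite3
import urllib.parse
import urllib.request
from datetime import datetime, timezone

USIPS_BASE = "https://archive.usips.org/archives/"


def rewrite_message(message: str, archive_url: str, usips_archive_id: str) -> str:
    """Step 2: swap the old archive link for the USIPS archive URL in a post body."""
    return message.replace(archive_url, USIPS_BASE + usips_archive_id)


def build_post_update(api_base: str, post_id: int, message: str,
                      api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/posts/{id} request.

    api_base is assumed to already end in "/api"; the header name and the
    form field name are assumptions about the XenForo REST API.
    """
    body = urllib.parse.urlencode({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{api_base.rstrip('/')}/posts/{post_id}",
        data=body,
        method="POST",
        headers={
            "XF-Api-Key": api_key,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def mark_replaced(conn: sqlite3.Connection, row_id: int) -> bool:
    """Step 3: move a captured row to replaced; return False if it was not captured."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE archive_links SET migration_status = 'replaced', updated_at = ? "
            "WHERE id = ? AND migration_status = 'captured'",
            (now, row_id),
        )
    return cur.rowcount == 1
```

Note that a plain `str.replace` also rewrites any longer link that merely starts with the same archive_url, so a real implementation would likely want to match on link boundaries.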
