```mermaid
flowchart TD
    subgraph Crawl["basilisk-tor crawl"]
        KF["KF .onion"]
        TOR["Tor SOCKS5 Proxy"]
        POW["Tartarus PoW Solver"]
        PARSE["XenForo Parser"]
        WARC["WARC Writer"]
        EXTRACT["Archive Link Extractor"]
        KF --> TOR --> POW --> PARSE
        PARSE --> WARC
        PARSE --> EXTRACT
    end
    subgraph DB["SQLite (iSCSI SSD)"]
        LINKS["archive_links table"]
    end
    subgraph Migrate["basilisk-tor migrate"]
        CLAIM["Claim pending link"]
        CLEAN["Extract clean URL"]
        SUBMIT["Submit to Basilisk API"]
        TRACK["Update status"]
        CLAIM --> CLEAN --> SUBMIT --> TRACK
    end
    subgraph Basilisk["Basilisk Capture"]
        API["archive.nashgo.org API"]
        SCOOP["Scoop capture engine"]
        S3["Garage S3"]
        API --> SCOOP --> S3
    end
    EXTRACT --> LINKS
    LINKS --> CLAIM
    SUBMIT --> API
    TRACK --> LINKS
```
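The "Archive Link Extractor" stage can be sketched as a scan of parsed post HTML for links pointing at known archive providers. A minimal illustration, assuming a flat regex over the page source rather than the tool's actual XenForo parser (the host list mirrors the provider table later in this doc):

```python
# Illustrative sketch of the EXTRACT stage: pull every URL out of a page and
# keep the ones that point at a known archive provider. Assumption: a plain
# regex scan is enough here; the real parser walks XenForo post bodies.
import re

ARCHIVE_HOSTS = (
    "web.archive.org",
    "archive.ph", "archive.today", "archive.is",  # archive.is = archive.today mirror
    "ghostarchive.org",
    "webcache.googleusercontent.com",
)

# Grab URLs up to whitespace, quotes, or angle brackets.
LINK_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_archive_links(html: str) -> list[str]:
    """Return every URL in the page that points at a known archive host."""
    return [u for u in LINK_RE.findall(html)
            if any(h in u for h in ARCHIVE_HOSTS)]
```

Each returned URL would then be inserted into the `archive_links` table with `migration_status = 'pending'`.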
```mermaid
stateDiagram-v2
    [*] --> pending: crawl discovers link
    pending --> fetching: worker claims
    fetching --> captured: Basilisk returns UUID
    fetching --> failed: API error / timeout
    failed --> pending: release_stale_locks
    captured --> replaced: KF owner rewrites link
```
```mermaid
flowchart LR
    LINK["archive.ph/XyZ12"]
    CHECK{"Can extract original URL?"}
    ORIG["Capture https://reddit.com/r/..."]
    FALLBACK["Capture https://archive.ph/XyZ12"]
    USIPS["USIPS Archive UUID"]
    LINK --> CHECK
    CHECK -->|"Wayback: yes"| ORIG --> USIPS
    CHECK -->|"archive.today short hash: no"| FALLBACK --> USIPS
```
| Provider | URL Pattern | Clean URL Extractable? | Strategy |
|---|---|---|---|
| archive.org | `web.archive.org/web/{ts}/{url}` | Yes | Capture original URL |
| archive.today | `archive.ph/{hash}` | No | Capture archive page |
| archive.today | `archive.ph/newest/{url}` | Yes | Capture original URL |
| ghostarchive | `ghostarchive.org/archive/{id}` | No | Capture archive page |
| google cache | `webcache.googleusercontent.com/...` | Yes | Capture original URL |
```sh
# Crawl KF subforum, write WARC, ingest archive links to SQLite
basilisk-tor crawl \
  "http://kiwifarmsaaf4t2h7gc3dfc5ojhmqruw2nit3uejrpiagrxeuxiyxcyd.onion/forums/lolcows.16/" \
  --proxy "socks5h://tor-proxy-pool.proxy-pool.svc.cluster.local:9050" \
  --output /data/kf-lolcows.warc.gz \
  --db /data/archive_links.db \
  --delay 2000

# Run migration worker (long-lived, polls for new links)
basilisk-tor migrate \
  --db /data/archive_links.db \
  --api-url "https://archive.nashgo.org" \
  --submit-delay 5

# Check progress
basilisk-tor stats --db /data/archive_links.db
```

```sql
CREATE TABLE archive_links (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_url TEXT NOT NULL,          -- KF thread URL where found
    archive_provider TEXT NOT NULL,    -- archive.org, archive.today, etc.
    archive_url TEXT NOT NULL UNIQUE,  -- original archive link
    clean_url TEXT NOT NULL,           -- extracted original URL (or empty)
    migration_status TEXT NOT NULL,    -- pending|fetching|captured|replaced|failed|skipped
    lock INTEGER NOT NULL,             -- 1 = being processed
    error_message TEXT,
    usips_archive_id TEXT,             -- Basilisk UUID once captured
    extracted_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    created_at TEXT NOT NULL
)
```

The `captured` → `replaced` transition requires XenForo API access to edit posts:
- Query SQLite for `migration_status = 'captured'` rows
- For each row, rewrite `archive_url` → `https://archive.usips.org/archives/{usips_archive_id}` in the KF post
- Mark the row as `replaced` in SQLite
This is a XenForo `POST /api/posts/{id}` call with the updated message body; it requires an API key from the KF admin.
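The rewrite step can be sketched with the standard library alone. A minimal sketch, assuming XenForo's REST conventions (an `XF-Api-Key` header and a form-encoded `message` parameter); the base URL and key are placeholders:

```python
# Sketch of the captured -> replaced rewrite: swap the archive URL for the
# USIPS one in the post body and build the XenForo edit request.
# Assumptions: XF-Api-Key header and form-encoded `message` parameter.
import urllib.request
from urllib.parse import urlencode

def rewrite_archive_link(message: str, archive_url: str, usips_id: str) -> str:
    """Replace the old archive link with its USIPS counterpart."""
    return message.replace(
        archive_url, f"https://archive.usips.org/archives/{usips_id}")

def build_edit_request(base_url: str, api_key: str, post_id: int,
                       new_message: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/posts/{id} edit request."""
    return urllib.request.Request(
        f"{base_url}/api/posts/{post_id}",
        data=urlencode({"message": new_message}).encode(),
        headers={"XF-Api-Key": api_key},
        method="POST",
    )
```

Sending is then `urllib.request.urlopen(req)`; on success, update the row's `migration_status` to `replaced`.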