Transform Eventasaurus from a system where developers manually build each scraper into an autonomous event crawler that discovers, analyzes, builds, tests, and deploys new sources with minimal human intervention — scaling from ~17 sources to hundreds.
- Motivation
- Current State Assessment
- Architecture Overview
- Pipeline Phases (Detail)
- Gap Analysis
- Gap 1: Source Quality Scoring System
- Gap 2: ML Category Classification Integration
- Gap 3: JSON-LD / Schema.org Universal Extractor
- Gap 4: Source Candidate Table & State Machine
- Gap 5: Site Analysis / Feasibility Module
- Gap 6: LLM-Based Code Generation Loop
- Gap 7: Data Validation / LLM-as-Judge
- Gap 8: Structural Fingerprinting & Self-Healing
- Gap 9: Dynamic Source Registration
- Gap 10: Performer Enrichment Hardening
- Wireframes
- Crawler Architecture Comparison
- Implementation Roadmap
- Cost & Scale Considerations
- Open Questions
## Motivation

Today, adding a new event source requires a developer to:

1. Manually inspect the target site (DevTools, network tab, page source)
2. Determine the best extraction strategy (API? JSON-LD? HTML scraping?)
3. Run `mix discovery.generate_source` to scaffold empty stubs
4. Write ~200-500 lines of custom code (client, transformer, config, jobs)
5. Manually register the source in `SourceRegistry` and `scraper.test`
6. Run end-to-end tests, debug, iterate
7. Deploy and monitor

This takes hours to days per source. At this rate, reaching 100+ sources is impractical.
The insight: We've already built most of the pipeline infrastructure — BaseJob, processing pipeline, geocoding, dedup, metrics, testing harness. The missing piece is automating the per-source customization (steps 1-6 above). Every crawler system — from Googlebot to Scrapy to Diffbot — solves this same problem: how do you go from "here's a URL" to "here's structured data" at scale?
Long-term vision: An event-specific web crawler. Like Googlebot, but instead of indexing all web content, it specifically looks for events — discovers them, extracts structured data, validates quality, and feeds them into our pipeline. Today it processes sources we point it at. Tomorrow it proactively discovers new event sources by following links and detecting event-like structured data across the web.
## Current State Assessment

| Component | Status | Automation Ready? |
|---|---|---|
| Source generator (`mix discovery.generate_source`) | Scaffolds files with TODO stubs | Partial — stubs are empty |
| BaseJob + processing pipeline (Venue → Event → Performer) | Solid, battle-tested | Yes |
| End-to-end testing (`mix scraper.test`) | Real Oban workers, pass/fail | Partial — binary, no scoring |
| MetricsTracker + 13 error categories | Per-job outcome tracking | Partial — no aggregate score |
| Geocoding orchestrator (9 providers) | Auto-fallback chain | Yes |
| Dedup (ExternalID + PostGIS proximity + fuzzy) | Same-source + cross-source | Yes |
| Category classification (CategoryClassifier) | ML model exists (BART-large-MNLI) | No — not wired into pipeline |
| Performer matching (fuzzy 0.85 threshold) | Basic but functional | Mostly |
| HTTP adapter system (Direct/Crawlbase/Zyte) | Auto-fallback on blocking | Yes |
| Source registry (`source_registry.ex`) | Compile-time map | No — manual edit required |
| Fixture recording for tests | `--fixture` flag | No — manual recording |
| Monitoring dashboards (error trends, health) | Multiple admin views | Yes |
| API discovery guide | Human-readable doc only | No — completely manual |
| Source Implementation Guide | Comprehensive, 7 steps | No — human instructions |
The downstream pipeline is remarkably autonomous. Once a transformer produces an event map with the right shape, everything after that is automatic:
event_map → Processor → VenueProcessor (geocode, dedup, create)
→ EventProcessor (category, collision, create)
→ PerformerStore (fuzzy match, enrich)
→ MetricsTracker (record outcome)
The upstream work — going from "here's a website with events" to "here's a working transformer that produces event maps" — is 100% manual. This is the automation target.
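As a concrete target for that upstream work, here is a sketch of the event map a transformer must produce. The field names follow the schema.org mapping described later in this document; the exact keys of the real `event_map` schema are an assumption here, and the values are illustrative:

```elixir
# Illustrative only — consult the real event_map shape consumed by
# Processor/EventProcessor for the authoritative keys.
event_map = %{
  title: "Warsaw Jazz Weekend",
  starts_at: ~U[2026-04-25 17:30:00Z],
  ends_at: nil,
  url: "https://warsaw-concerts.com/events/warsaw-jazz-weekend",
  venue_data: %{
    name: "Palladium",
    address: "Warszawa, Poland",
    latitude: 52.2297,
    longitude: 21.0122
  },
  performer_names: ["Kamasi Washington", "GoGo Penguin"],
  metadata: %{
    description: "Two nights of modern jazz.",
    image_url: "https://warsaw-concerts.com/img/jazz-weekend.jpg",
    min_price: 120,
    currency: "PLN"
  }
}
```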
## Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│ SOURCE CANDIDATES TABLE │
│ │
│ url: "https://example.com/events" │
│ name: "Example Events" │
│ instructions: "Events listed on /agenda, Polish language" │
│ status: pending_analysis | analyzing | feasible | generating | │
│ testing | deployed | rejected | broken │
└───────────────────────────────┬─────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌───────────────┐ ┌──────────────────┐
│ PHASE 1 │ │ PHASE 2 │ │ PHASE 3 │
│ Analysis & │ │ Extraction │ │ Code │
│ Feasibility │ │ Strategy │ │ Generation │
│ │ │ │ │ │
│ • robots.txt │ │ • JSON-LD │ │ • client.ex │
│ • Format detect │ │ • API │ │ • transformer │
│ • LLM analysis │ │ • SSR bundle │ │ • config.ex │
│ • Score 0-100 │ │ • HTML+LLM │ │ • sync_job.ex │
│ • Gate: ≥60 │ │ │ │ • registration │
└────────┬─────────┘ └───────┬───────┘ └────────┬─────────┘
│ │ │
└───────────────────┼───────────────────┘
│
▼
┌─────────────────────────────────┐
│ PHASE 4: VALIDATION LOOP │
│ │
│ • Compile check │
│ • mix scraper.test --limit 5 │
│ • Quality scoring (8 dims) │
│ • LLM-as-judge comparison │
│ • If fail → back to Phase 3 │
│ (max 5 iterations) │
└───────────────┬─────────────────┘
│
Pass (score ≥ 80)
│
▼
┌─────────────────────────────────┐
│ PHASE 5: DEPLOY & MONITOR │
│ │
│ • Auto-register in production │
│ • 7-day elevated monitoring │
│ • Structural fingerprinting │
│ • Self-healing on breakage │
│ • Health scoring via Metrics │
└─────────────────────────────────┘
## Phase 1: Analysis & Feasibility

Input: a URL + optional human instructions. Output: feasibility report with a score (0-100).
┌─────────────────────────────────────────────────┐
│ ANALYSIS PIPELINE │
│ │
│ URL ──► Fetch HTML ──► robots.txt check │
│ │ │
│ ▼ │
│ ┌─── Format Detection ───┐ │
│ │ │ │
│ │ 1. JSON-LD present? │──► score += 40 │
│ │ 2. API endpoints? │──► score += 30 │
│ │ 3. SSR data bundles? │──► score += 25 │
│ │ 4. Structured HTML? │──► score += 15 │
│ │ 5. JS-rendered only? │──► score += 5 │
│ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─── Content Assessment ──┐ │
│ │ │ │
│ │ • Event-like content? │ (LLM check) │
│ │ • Date/time patterns? │ score += 0-20 │
│ │ • Venue information? │ score += 0-10 │
│ │ • Pagination detected? │ score += 0-10 │
│ │ • Language identified? │ │
│ └──────────────────────────┘ │
│ │ │
│ ▼ │
│ Feasibility Score: 0-100 │
│ ≥ 60: proceed < 60: reject (with reason) │
└─────────────────────────────────────────────────┘
Key checks:

- `robots.txt` compliance — reject if crawling is disallowed
- Data format detection — JSON-LD is the golden path (zero per-source code needed)
- Content assessment via LLM — "Does this page list events? What fields are available?"
- Language detection — for `MultilingualDateParser` configuration
- Pagination pattern — infinite scroll, numbered pages, load-more button, API cursor
- Rendering requirements — static HTML vs JS-rendered (affects cost and complexity)
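A minimal sketch of how the additive scoring in the diagram could be computed, assuming the detection step produces a map of boolean format flags plus 0.0-1.0 content signals (the module name and map keys are hypothetical):

```elixir
defmodule FeasibilityScore do
  # Point values from the analysis pipeline diagram; only the best
  # detected format counts, since the tiers are alternatives.
  @format_points [json_ld: 40, api: 30, ssr_bundle: 25, structured_html: 15, js_only: 5]

  def compute(detection) do
    format =
      @format_points
      |> Enum.filter(fn {key, _pts} -> Map.get(detection, key, false) end)
      |> Enum.map(fn {_key, pts} -> pts end)
      |> Enum.reduce(0, &max/2)

    # Content assessment contributes up to 40 more points (LLM-scored signals).
    content =
      round(20 * Map.get(detection, :date_time_patterns, 0.0)) +
        round(10 * Map.get(detection, :venue_information, 0.0)) +
        round(10 * Map.get(detection, :pagination, 0.0))

    min(format + content, 100)
  end

  def decision(score) when score >= 60, do: :proceed
  def decision(_score), do: :reject
end
```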
## Phase 2: Extraction Strategy

Input: feasibility report from Phase 1. Output: chosen strategy + sample extracted data.
The tiered extraction model (borrowed from Crawl4AI / Diffbot patterns):
┌─────────────────────────────────────────────────┐
│ EXTRACTION STRATEGY TIERS │
│ │
│ ┌─── Tier 1: JSON-LD ────────────────────────┐ │
│ │ Cost: $0 Reliability: ★★★★★ │ │
│ │ Maintenance: Zero │ │
│ │ │ │
│ │ Parse <script type="application/ld+json"> │ │
│ │ Map schema.org Event → our event_map │ │
│ │ Handles: Event, MusicEvent, TheaterEvent, │ │
│ │ ScreeningEvent, etc. │ │
│ └─────────────────────────────────────────────┘ │
│ │ not found │
│ ▼ │
│ ┌─── Tier 2: API Endpoint ───────────────────┐ │
│ │ Cost: $0 Reliability: ★★★★☆ │ │
│ │ Maintenance: Low (API versioning) │ │
│ │ │ │
│ │ Discovered XHR/fetch endpoints from page │ │
│ │ GraphQL introspection if available │ │
│ │ Returns structured JSON already │ │
│ └─────────────────────────────────────────────┘ │
│ │ not found │
│ ▼ │
│ ┌─── Tier 3: SSR Data Bundle ────────────────┐ │
│ │ Cost: $0 Reliability: ★★★☆☆ │ │
│ │ Maintenance: Medium (framework updates) │ │
│ │ │ │
│ │ Extract __NEXT_DATA__, __NUXT__, │ │
│ │ window.__INITIAL_STATE__, etc. │ │
│ │ Already JSON — just needs field mapping │ │
│ └─────────────────────────────────────────────┘ │
│ │ not found │
│ ▼ │
│ ┌─── Tier 4: LLM HTML Extraction ───────────┐ │
│ │ Cost: ~$0.01-0.05/page Reliability: ★★★☆☆│ │
│ │ Maintenance: Self-healing via LLM │ │
│ │ │ │
│ │ Clean HTML (strip nav/footer/ads) │ │
│ │ Send to Claude: "Extract event fields" │ │
│ │ Validate output against schema │ │
│ │ Option A: Generate CSS selectors (once) │ │
│ │ Option B: LLM extract every crawl │ │
│ └─────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Strategy selection criteria:
| Strategy | When to use | Per-crawl cost | Breakage risk |
|---|---|---|---|
| JSON-LD | Site has `@type: "Event"` markup | $0 | Very low |
| API | Discoverable REST/GraphQL endpoint | $0 | Low-medium |
| SSR Bundle | Next.js/Nuxt/SvelteKit with SSR | $0 | Medium |
| LLM (selector gen) | HTML only, generate selectors once | $0 after init | Medium-high |
| LLM (runtime) | Frequently changing HTML structure | $0.01-0.05/page | Low |
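The selection cascade reduces to a `cond` over the Phase 1 analysis report; a sketch, with field names that are assumptions about that report's shape:

```elixir
defmodule StrategyChooser do
  # Hypothetical chooser mirroring the tier cascade above.
  def choose_strategy(analysis) do
    cond do
      analysis.json_ld_event_count > 0 -> :json_ld
      analysis.api_endpoints != [] -> :api
      analysis.ssr_bundle != nil -> :ssr_bundle
      analysis.html_stable? -> :llm_selector_gen   # generate CSS selectors once
      true -> :llm_runtime                         # LLM-extract on every crawl
    end
  end
end
```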
## Phase 3: Code Generation

Input: extraction strategy + sample data + feasibility report. Output: complete source directory (`client.ex`, `transformer.ex`, `config.ex`, `sync_job.ex`).
┌────────────────────────────────────────────────────────┐
│ CODE GENERATION LOOP │
│ │
│ Context Window: │
│ ┌──────────────────────────────────────────────────┐ │
│ │ • Source Implementation Guide (docs/) │ │
│ │ • 2-3 similar existing sources (few-shot) │ │
│ │ • Feasibility report + sample data │ │
│ │ • Extraction strategy + sample raw response │ │
│ │ • BaseJob source code │ │
│ │ • Processor / EventProcessor source code │ │
│ │ • Target event_map schema │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─── Generate ────────────────────────────────────┐ │
│ │ │ │
│ │ client.ex HTTP calls for chosen strategy │ │
│ │ transformer.ex Raw response → event_map │ │
│ │ config.ex URLs, rate limits, dedup │ │
│ │ sync_job.ex Orchestration (BaseJob or │ │
│ │ custom for multi-page) │ │
│ │ + SourceRegistry entry │ │
│ │ + scraper.test entry │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ For JSON-LD sources: may skip client.ex/ │
│ transformer.ex entirely — use universal extractor │
└────────────────────────────────────────────────────────┘
For JSON-LD sources (Tier 1): The universal JSON-LD extractor handles everything. Code generation may only need a thin config specifying the base URL and pagination pattern. This is the highest-leverage path — dozens of sources with near-zero per-source code.
For API/SSR sources (Tiers 2-3): LLM generates client.ex (HTTP calls) and transformer.ex (field mapping). These are relatively straightforward since the data is already structured JSON.
For HTML sources (Tier 4): Most complex. LLM generates either CSS selectors (one-time) or a runtime extraction prompt. The generated transformer includes date parsing via MultilingualDateParser.
## Phase 4: Validation Loop

Input: generated source code. Output: quality score (0-100) + pass/fail decision.
┌────────────────────────────────────────────────────────┐
│ VALIDATION LOOP │
│ │
│ Iteration 1 of max 5: │
│ │
│ ┌─── Step 1: Compile ──────────────────────────────┐ │
│ │ mix compile --warnings-as-errors │ │
│ │ If fail → feed errors to LLM → regenerate │ │
│ └──────────────────────────────────────────────────┘ │
│ │ pass │
│ ▼ │
│ ┌─── Step 2: End-to-End Test ──────────────────────┐ │
│ │ mix scraper.test <source> --limit 5 │ │
│ │ If SyncJob fails → feed errors to LLM │ │
│ └──────────────────────────────────────────────────┘ │
│ │ pass │
│ ▼ │
│ ┌─── Step 3: Quality Scoring ──────────────────────┐ │
│ │ │ │
│ │ Dimension Weight Score │ │
│ │ ───────────────── ────── ───── │ │
│ │ Field completeness 25% title+date+venue+url │ │
│ │ Date accuracy 20% parsed, not noon-UTC │ │
│ │ Venue geocoding 15% geocoded successfully │ │
│ │ Category confidence 10% ML classification avg │ │
│ │ Dedup collision 10% reasonable rate │ │
│ │ Performer extract 10% at least 1 performer │ │
│ │ Metadata richness 5% images, descriptions │ │
│ │ URL validity 5% resolvable event URLs │ │
│ │ │ │
│ │ Composite: weighted average → 0-100 │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─── Step 4: LLM-as-Judge ────────────────────────┐ │
│ │ │ │
│ │ For 3-5 sample events: │ │
│ │ • Fetch original page │ │
│ │ • Compare extracted data vs. what's on page │ │
│ │ • Flag: hallucinated fields? mismatched data? │ │
│ │ • Catch systematic errors (e.g., venue as │ │
│ │ title, wrong date interpretation) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ │ │
│ Score ≥ 80 Score < 80 │
│ │ │ │
│ ▼ ▼ │
│ PHASE 5: Deploy Feed errors + scores │
│ back to Phase 3 │
│ (iteration N+1) │
└────────────────────────────────────────────────────────┘
Quality thresholds:
| Score | Action |
|---|---|
| ≥ 80 | Auto-deploy with standard monitoring |
| 60-79 | Auto-deploy with elevated monitoring (7-day probation) |
| 40-59 | Queue for human review — too risky to auto-deploy |
| < 40 | Reject — mark candidate as rejected with reason |
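The threshold table translates directly into a gate function; a sketch (module and return shapes are hypothetical):

```elixir
defmodule DeploymentGate do
  # Mirrors the quality-threshold table above.
  def action(score) when score >= 80, do: {:auto_deploy, :standard_monitoring}
  def action(score) when score >= 60, do: {:auto_deploy, :probation}
  def action(score) when score >= 40, do: :human_review
  def action(_score), do: :reject
end
```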
## Phase 5: Deploy & Monitor

Input: validated source with quality score ≥ 60. Output: live source in production with ongoing health monitoring.
┌────────────────────────────────────────────────────────┐
│ DEPLOYMENT & ONGOING MONITORING │
│ │
│ Day 0: Deploy │
│ ┌──────────────────────────────────────────────────┐ │
│ │ • Register source in DB (dynamic registration) │ │
│ │ • Set initial crawl schedule │ │
│ │ • Record structural fingerprint of key pages │ │
│ │ • Flag as "probation" for 7 days │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Days 1-7: Probation │
│ ┌──────────────────────────────────────────────────┐ │
│ │ • Every crawl: compare quality score to initial │ │
│ │ • If score drops > 20 points → pause + alert │ │
│ │ • If 3+ consecutive crawls succeed → confidence │ │
│ │ • Human can promote to "stable" early │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Ongoing: Health Monitoring │
│ ┌──────────────────────────────────────────────────┐ │
│ │ • Structural fingerprint check each crawl │ │
│ │ • Quality score computed per run │ │
│ │ • Error rate trends via existing monitoring │ │
│ │ • Event-aware freshness: │ │
│ │ - starts_at < 72h: high crawl priority │ │
│ │ - starts_at passed: stop crawling │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Self-Healing Loop: │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Fingerprint changed? │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Run validation on existing selectors │ │
│ │ │ │ │
│ │ ├── Still works → no action │ │
│ │ │ │ │
│ │ └── Broken → Trigger re-analysis (Phase 2) │ │
│ │ │ │ │
│ │ ├── LLM regenerates selectors │ │
│ │ │ │ │
│ │ ├── Validate new selectors │ │
│ │ │ │ │
│ │ └── If 3 failures → mark "broken", │ │
│ │ alert human │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
## Gap 1: Source Quality Scoring System

- Priority: P0 — prerequisite for everything
- Effort: Medium (1-2 weeks)
- Dependencies: None (builds on existing MetricsTracker data)
Currently: `mix scraper.test` gives a binary pass/fail. MetricsTracker records per-job outcomes, but there is no aggregate source-level score.
Needed: A SourceQualityScore module that grades a scraper run across multiple dimensions:
```elixir
# Proposed API
defmodule EventasaurusDiscovery.Metrics.SourceQualityScore do
  @type dimension :: %{
          name: String.t(),
          weight: float(),
          score: float(),          # 0.0 - 1.0
          details: String.t()
        }

  @type report :: %{
          source_slug: String.t(),
          run_id: String.t(),
          composite_score: float(),  # 0-100
          dimensions: [dimension()],
          sample_size: integer(),
          computed_at: DateTime.t()
        }

  @spec score_run(String.t(), String.t()) :: {:ok, report()} | {:error, term()}
  def score_run(source_slug, run_id)

  @spec score_latest(String.t()) :: {:ok, report()} | {:error, term()}
  def score_latest(source_slug)
end
```

Scoring dimensions:
| Dimension | Weight | How Measured |
|---|---|---|
| Field completeness | 25% | % events with title + starts_at + venue + URL |
| Date accuracy | 20% | % dates that aren't noon-UTC fallback (time_tbd: false) |
| Venue geocoding | 15% | % venues that geocode to a real location |
| Category confidence | 10% | Average ML classification confidence |
| Dedup collision rate | 10% | Between 5-40% is healthy (too low with known overlap = broken dedup) |
| Performer extraction | 10% | % events with ≥ 1 performer (where applicable) |
| Metadata richness | 5% | % with images, descriptions > 50 chars |
| URL validity | 5% | % of event URLs that resolve (HTTP 200) |
Also needed: a `mix quality.score <source>` CLI task and integration into the admin dashboards.
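The composite is a plain weighted average; a sketch, assuming each dimension has already been scored 0.0-1.0:

```elixir
defmodule CompositeScore do
  # Weights from the dimension table above; they sum to 1.0.
  @weights %{
    field_completeness: 0.25,
    date_accuracy: 0.20,
    venue_geocoding: 0.15,
    category_confidence: 0.10,
    dedup_collision: 0.10,
    performer_extraction: 0.10,
    metadata_richness: 0.05,
    url_validity: 0.05
  }

  def compute(dimension_scores) do
    @weights
    |> Enum.map(fn {dim, weight} -> weight * Map.get(dimension_scores, dim, 0.0) end)
    |> Enum.sum()
    |> Kernel.*(100)
    |> round()
  end
end
```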
## Gap 2: ML Category Classification Integration

- Priority: P0 — required for autonomous sources (no rule-based mapping exists for new sources)
- Effort: Medium (1-2 weeks)
- Dependencies: CategoryClassifier already exists with BART-large-MNLI
Currently: CategoryClassifier exists but is not wired into the processing pipeline. CategoryExtractor (called from EventProcessor) uses rule-based per-source mapping only.
Needed:

- Wire `CategoryClassifier` into `CategoryExtractor` as a fallback when rule-based mapping produces no result
- Confidence threshold: only apply the ML category if confidence > 0.7
- For autonomous sources, ML classification is the primary method (no rule-based mapping will exist)
- Backtest against existing categorized events to validate accuracy before going live
- Add classification confidence to MetricsTracker metadata for quality scoring
Event Title + Description
│
▼
┌─── CategoryExtractor ───────────────────┐
│ │
│ 1. Check source-specific rule mapping │
│ (existing sources only) │
│ │ │
│ ├── Match found → use it │
│ │ │
│ └── No match ──▼ │
│ │
│ 2. ML Classification (CategoryClassifier)│
│ Input: title + description text │
│ Output: category + confidence │
│ │ │
│ ├── confidence ≥ 0.7 → use it │
│ │ │
│ └── confidence < 0.7 → "other" + │
│ flag for review │
└──────────────────────────────────────────┘
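In code, the fallback wiring might look like this. The `rule_based_category/2` helper and the `CategoryClassifier.classify/1` signature are assumptions for illustration, not the existing API:

```elixir
defmodule CategoryFallbackSketch do
  @min_confidence 0.7

  def extract_category(source_slug, title, description) do
    case rule_based_category(source_slug, title) do
      {:ok, category} ->
        {:ok, category, :rule_based}

      :no_match ->
        case CategoryClassifier.classify("#{title} #{description}") do
          {:ok, category, confidence} when confidence >= @min_confidence ->
            {:ok, category, {:ml, confidence}}

          {:ok, _category, confidence} ->
            # Below threshold: fall back to "other" and flag for human review.
            {:ok, "other", {:needs_review, confidence}}
        end
    end
  end

  # Placeholder — the real lookup lives in CategoryExtractor.
  defp rule_based_category(_source_slug, _title), do: :no_match
end
```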
## Gap 3: JSON-LD / Schema.org Universal Extractor

- Priority: P0 — highest-ROI single feature
- Effort: Medium (1-2 weeks)
- Dependencies: None
Currently: Some sources have bespoke JSON-LD parsing. No universal extractor.
Needed: A generic module that can extract events from any page with schema.org Event markup:
```elixir
defmodule EventasaurusDiscovery.Extraction.JsonLdExtractor do
  @supported_types ~w(Event MusicEvent TheaterEvent ScreeningEvent
                      DanceEvent EducationEvent SportsEvent
                      VisualArtsEvent Festival)

  @spec extract_events(String.t()) :: {:ok, [event_map()]} | {:error, term()}
  def extract_events(html)

  @spec extract_events_from_url(String.t()) :: {:ok, [event_map()]} | {:error, term()}
  def extract_events_from_url(url)
end
```

Schema.org Event → our event_map mapping:
| Schema.org field | Our field | Notes |
|---|---|---|
| `name` | `title` | |
| `startDate` | `starts_at` | ISO 8601, handle timezone |
| `endDate` | `ends_at` | Optional |
| `url` | `url` | |
| `location.name` | `venue_data.name` | |
| `location.address` | `venue_data.address` | Can be string or PostalAddress |
| `location.geo.latitude` | `venue_data.latitude` | |
| `location.geo.longitude` | `venue_data.longitude` | |
| `performer[].name` | `performer_names` | |
| `image` | `metadata.image_url` | |
| `description` | `metadata.description` | |
| `offers.price` | `metadata.min_price` | |
| `offers.priceCurrency` | `metadata.currency` | |
Why this is highest-ROI: Many event websites already emit JSON-LD because Google requires it for rich search results. A single universal extractor could add dozens of sources with zero per-source custom code. The only per-source configuration needed is the base URL and pagination pattern.
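A minimal sketch of the extractor's core, using Floki and Jason (common Elixir libraries, assumed available here). A production version must also handle `@graph` wrappers, arrays of events in one script tag, and `PostalAddress` location objects:

```elixir
defmodule JsonLdExtractorSketch do
  @supported_types ~w(Event MusicEvent TheaterEvent ScreeningEvent Festival)

  def extract_events(html) do
    {:ok, doc} = Floki.parse_document(html)

    doc
    |> Floki.find(~s(script[type="application/ld+json"]))
    |> Enum.flat_map(fn script ->
      # js: true is required — Floki.text/2 skips <script> content by default.
      case Jason.decode(Floki.text(script, js: true)) do
        {:ok, data} -> List.wrap(data)
        {:error, _} -> []
      end
    end)
    |> Enum.filter(&(&1["@type"] in @supported_types))
    |> Enum.map(&to_event_map/1)
  end

  defp to_event_map(event) do
    %{
      title: event["name"],
      starts_at: event["startDate"],
      ends_at: event["endDate"],
      url: event["url"],
      venue_data: %{
        name: get_in(event, ["location", "name"]),
        latitude: get_in(event, ["location", "geo", "latitude"]),
        longitude: get_in(event, ["location", "geo", "longitude"])
      },
      performer_names:
        event |> Map.get("performer", []) |> List.wrap() |> Enum.map(& &1["name"])
    }
  end
end
```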
## Gap 4: Source Candidate Table & State Machine

- Priority: P0 — the "crawl frontier" for source onboarding
- Effort: Small (1 week)
- Dependencies: None
Needed: a new `source_candidates` table:

```sql
CREATE TABLE source_candidates (
  id BIGSERIAL PRIMARY KEY,
  url TEXT NOT NULL,
  name TEXT NOT NULL,
  instructions TEXT,              -- human notes ("events on /agenda page")
  status TEXT NOT NULL DEFAULT 'pending_analysis',
  feasibility_score INTEGER,      -- 0-100
  quality_score INTEGER,          -- 0-100 (after test run)
  extraction_strategy TEXT,       -- json_ld, api, ssr, html_selectors, llm_runtime
  analysis_report JSONB,          -- full feasibility report
  quality_report JSONB,           -- full quality score breakdown
  generation_attempts INTEGER DEFAULT 0,
  error_log JSONB,                -- history of failures
  structural_fingerprint TEXT,    -- DOM hash for change detection
  deployed_source_id BIGINT REFERENCES sources(id),
  last_analyzed_at TIMESTAMPTZ,
  deployed_at TIMESTAMPTZ,
  inserted_at TIMESTAMPTZ NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL
);
```

State machine:
pending_analysis ──► analyzing ──► feasible ──► generating ──► testing ──► deployed
│ │ │ │
▼ ▼ ▼ ▼
rejected rejected failed ──► failed ──►
(retry if (retry if
attempts < 5) attempts < 5)
deployed ──► broken ──► re_analyzing ──► ... (self-healing loop)
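The legal transitions can be enforced with a simple adjacency map. A sketch mirroring the diagram; retry routing when `generation_attempts < 5` would layer on top of this:

```elixir
defmodule CandidateStateMachine do
  # Legal transitions, mirroring the state diagram above.
  @transitions %{
    "pending_analysis" => ~w(analyzing),
    "analyzing" => ~w(feasible rejected),
    "feasible" => ~w(generating rejected),
    "generating" => ~w(testing failed),
    "testing" => ~w(deployed failed),
    "failed" => ~w(generating rejected),
    "deployed" => ~w(broken),
    "broken" => ~w(re_analyzing),
    "re_analyzing" => ~w(feasible rejected)
  }

  def transition(%{status: from} = candidate, to) do
    if to in Map.get(@transitions, from, []) do
      {:ok, %{candidate | status: to}}
    else
      {:error, {:invalid_transition, from, to}}
    end
  end
end
```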
## Gap 5: Site Analysis / Feasibility Module

- Priority: P1 — needed for automated onboarding
- Effort: Large (2-3 weeks)
- Dependencies: Gap 4 (candidate table)
Currently: Completely manual. The API discovery guide is a human-readable document.
Needed: An Oban job that:
- Fetches the target URL + 2-3 linked pages
- Checks `robots.txt` for crawl permissions
- Detects data format:
  - Scan for `<script type="application/ld+json">` with Event types
  - Scan for XHR/fetch patterns in page source (API endpoints)
  - Scan for `__NEXT_DATA__`, `__NUXT__`, `__INITIAL_STATE__`, etc.
  - Analyze HTML structure if none of the above
- Sends cleaned page content to the LLM with a prompt like: "Analyze this page. Does it contain event listings? What fields are available (title, date, time, venue, price, performers)? Is there pagination? What language is the content in?"
- Computes the feasibility score
- Stores the report in `source_candidates.analysis_report`
The LLM analysis prompt would include:
- Examples of what we consider "good" event data
- Our required fields vs. nice-to-have fields
- Known patterns that indicate high feasibility (structured dates, clear venue names)
## Gap 6: LLM-Based Code Generation Loop

- Priority: P1 — the core automation
- Effort: Large (3-4 weeks)
- Dependencies: Gap 1 (quality scoring), Gap 3 (JSON-LD extractor), Gap 5 (analysis module)
Currently: `mix discovery.generate_source` creates empty stubs. A developer fills them in.
Needed: An agent loop (could be an Oban job chain or an external agent process) that:
- Takes the feasibility report + extraction strategy
- Selects 2-3 similar existing sources as few-shot examples based on:
- Same extraction strategy (e.g., other JSON-LD sources for a JSON-LD candidate)
- Similar content type (music events → use Bandsintown/RA as examples)
- Similar language/region
- Constructs a prompt with:
- Source Implementation Guide
- BaseJob source code
- Example sources (full client.ex + transformer.ex)
- Sample raw response from the target site
- Target event_map schema
- Generates complete source files
- Writes files to disk
- Runs validation (Phase 4)
- If validation fails, feeds error messages + quality report back and regenerates
For JSON-LD sources: This step may be trivial or unnecessary — the universal extractor (Gap 3) handles extraction, so code generation only needs a minimal config file specifying:
- Base URL(s) for crawling
- Pagination pattern (if any)
- Any source-specific field overrides
For API/HTML sources: Full code generation is needed. The key insight is that our existing sources serve as excellent few-shot examples — the LLM can see how client.ex calls an API and how transformer.ex maps the response.
## Gap 7: Data Validation / LLM-as-Judge

- Priority: P1 — prevents erroneous data from entering the DB
- Effort: Medium (1-2 weeks)
- Dependencies: None
Currently: Validation is structural only (required fields present, types correct). No semantic validation.
Needed: For new/autonomous sources, an LLM validation pass:
┌─────────────────────────────────────────────────┐
│ LLM-AS-JUDGE VALIDATION │
│ │
│ For each of 3-5 sample events: │
│ │
│ Input: │
│ ┌────────────────────────────────────────────┐ │
│ │ Extracted event_map: │ │
│ │ {title: "Jazz Night", starts_at: ..., │ │
│ │ venue: "Blue Note", ...} │ │
│ │ │ │
│ │ Original page HTML (cleaned): │ │
│ │ <div class="event">...</div> │ │
│ └────────────────────────────────────────────┘ │
│ │
│ Prompt: │
│ "Compare the extracted data to the original │
│ page. Score accuracy 0-100. Flag any: │
│ - Hallucinated fields (not on page) │
│ - Mismatched data (wrong date, wrong venue) │
│ - Systematic errors (field swap patterns) │
│ - Missing available data (on page, not │
│ extracted)" │
│ │
│ Output: │
│ ┌────────────────────────────────────────────┐ │
│ │ accuracy_score: 92 │ │
│ │ issues: [ │ │
│ │ {field: "starts_at", type: "mismatch", │ │
│ │ expected: "20:00", got: "08:00 PM"} │ │
│ │ ] │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Key principle: This runs on a sample (3-5 events) during onboarding, not on every crawl. It's a one-time quality gate, not a runtime cost.
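The judge's verdict must be machine-readable so the validation loop can gate on it. A sketch of parsing the JSON response shown in the diagram (module name hypothetical; `Jason` assumed available):

```elixir
defmodule JudgeVerdict do
  defstruct accuracy_score: 0, issues: []

  def parse(json) do
    case Jason.decode(json) do
      {:ok, %{"accuracy_score" => score} = data} ->
        {:ok, %__MODULE__{accuracy_score: score, issues: Map.get(data, "issues", [])}}

      _ ->
        # LLMs occasionally return malformed JSON — treat as a failed check.
        {:error, :unparseable_verdict}
    end
  end

  def pass?(%__MODULE__{accuracy_score: score}), do: score >= 80
end
```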
## Gap 8: Structural Fingerprinting & Self-Healing

- Priority: P2 — needed for long-term maintenance at scale
- Effort: Medium (2 weeks)
- Dependencies: Gap 6 (code generation, for regeneration capability)
Currently: When a site changes its HTML structure, scrapers break silently. Detection relies on humans noticing via monitoring dashboards.
Needed: Inspired by Kadoa's self-healing scraper pattern:

1. Fingerprinting: After each successful crawl, hash the DOM structure of key pages (listing page, detail page). Store in `source_candidates.structural_fingerprint`.
2. Change detection: Before each crawl, fetch and fingerprint. If the hash differs significantly from the stored one:
   - Still try the existing extraction logic
   - If extraction still works → update fingerprint, no action
   - If extraction fails → trigger self-healing
3. Self-healing cascade:

   Fingerprint changed + extraction failed
   │
   ├── Tier 1 source (JSON-LD): Check if JSON-LD still present
   │     └── Usually still works (JSON-LD survives redesigns)
   │
   ├── Tier 2-3 (API/SSR): Check if endpoint still responds
   │     └── Usually still works (API versioned separately from UI)
   │
   └── Tier 4 (HTML selectors): Re-analyze page structure
         ├── LLM generates new selectors
         ├── Validate on sample
         ├── If works → update source code, continue
         └── If fails 3x → mark "broken", alert human
This is why JSON-LD sources are so valuable for autonomous crawling — they're naturally resistant to site redesigns because the structured data block is independent of visual layout.
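One way to compute the fingerprint: hash only the tag-and-class skeleton of the page, so copy changes leave it stable while layout changes move it. A sketch using Floki (assumed available); the exact granularity is a tuning decision:

```elixir
defmodule StructuralFingerprint do
  def compute(html) do
    {:ok, doc} = Floki.parse_document(html)

    doc
    |> skeleton()
    |> :erlang.term_to_binary()
    |> then(&:crypto.hash(:sha256, &1))
    |> Base.encode16(case: :lower)
  end

  # Keep tag names and class attributes; drop text, comments, other attributes.
  defp skeleton(nodes) when is_list(nodes), do: Enum.map(nodes, &skeleton/1)

  defp skeleton({tag, attrs, children}) do
    classes = for {"class", value} <- attrs, do: value
    {tag, classes, skeleton(children)}
  end

  defp skeleton(_text_or_comment), do: nil
end
```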
## Gap 9: Dynamic Source Registration

- Priority: P2 — needed for zero-deployment onboarding
- Effort: Small-Medium (1 week)
- Dependencies: Gap 4 (candidate table)
Currently: SourceRegistry is a compile-time map. Adding a source requires a code change and deployment.
Needed: Move source resolution to a DB-first approach:
```elixir
# Current (compile-time)
@source_to_job %{
  "bandsintown" => EventasaurusDiscovery.Sources.Bandsintown.Jobs.SyncJob,
  # ... hardcoded
}

# Proposed (DB-first with compile-time fallback)
def get_sync_job(slug) do
  case get_from_db(slug) do
    {:ok, module_name} -> {:ok, String.to_existing_atom(module_name)}
    :not_found -> get_from_compile_time_map(slug)
  end
end
```

For autonomous sources using the universal JSON-LD extractor, the "SyncJob module" could be a generic configurable job that reads its configuration from the DB rather than from a source-specific module.
## Gap 10: Performer Enrichment Hardening

- Priority: P2 — quality improvement for autonomous sources
- Effort: Small (1 week)
- Dependencies: None
Currently: Performer matching works via name-based fuzzy match at 0.85 threshold. For new/unknown sources, performer names may come in unusual formats.
Needed:

- Name normalization before matching: strip roles ("DJ", "MC", "feat."), normalize casing, handle "Last, First" vs "First Last"
- Disambiguation assist for edge cases: "The National" (band? venue?), "Paris" (performer? city?)
- Duplicate prevention: before creating a new performer, check whether a similar one exists across all sources, not just the current source
- Confidence tracking: record match confidence in `public_event_performers` metadata
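A sketch of the normalization step; the prefix and separator patterns are illustrative, not exhaustive, and a real version needs an allow-list for names like "DJ Shadow" where the role prefix is part of the name:

```elixir
defmodule PerformerNameSketch do
  @role_prefix ~r/^(dj|mc)\s+/i
  @feat_separator ~r/\s+(?:feat\.|ft\.|featuring)\s+/i

  # "Musgraves, Kacey"          → ["Kacey Musgraves"]
  # "DJ Example feat. Someone"  → ["Example", "Someone"]
  def normalize(raw_name) do
    raw_name
    |> String.trim()
    |> flip_last_first()
    |> String.split(@feat_separator)
    |> Enum.map(&String.replace(&1, @role_prefix, ""))
    |> Enum.map(&String.trim/1)
  end

  defp flip_last_first(name) do
    case String.split(name, ", ", parts: 2) do
      [last, first] -> "#{first} #{last}"
      _ -> name
    end
  end
end
```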
## Wireframes

### Source Candidates Dashboard

The primary admin interface for managing the autonomous pipeline.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Source Candidates + Add New │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Filters: [All ▾] [Pending ▾] [Feasible ▾] [Deployed ▾] [Broken ▾] │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ● Analyzing example-events.pl │ │
│ │ Strategy: Detecting... Score: -- Attempt: 1/5 │ │
│ │ URL: https://example-events.pl/agenda │ │
│ │ Added 2 hours ago │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ ◉ Feasible warsaw-concerts.com Feasibility: 87 │ │
│ │ Strategy: JSON-LD Quality: -- Attempt: 0/5 │ │
│ │ URL: https://warsaw-concerts.com/events │ │
│ │ Analyzed 1 hour ago [Generate Source ►] [Reject ✕] │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ ◉ Testing krakow-nightlife.pl Feasibility: 72 │ │
│ │ Strategy: API endpoint Quality: 64 Attempt: 2/5 │ │
│ │ URL: https://krakow-nightlife.pl/api/events │ │
│ │ Last test: 34 min ago — "Date parsing: 3 of 5 events used │ │
│ │ noon-UTC fallback" │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ ✓ Deployed gdansk-kultura.pl Quality: 91 │ │
│ │ Strategy: JSON-LD Events: 847 Last crawl: 3h ago │ │
│ │ Deployed 12 days ago Status: Stable │ │
│ │ Health: ████████████████████░░ 91% │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ ✕ Rejected some-blog.com Feasibility: 23 │ │
│ │ Reason: "No structured event data. Blog posts mention events │ │
│ │ but no dates, venues, or machine-readable listings found." │ │
│ │ Analyzed 3 days ago [Re-analyze] [Delete] │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ ⚠ Broken old-source.com Quality: 34 (was 82) │ │
│ │ Strategy: HTML selectors Self-heal attempts: 3/3 failed │ │
│ │ "Site redesigned. JSON-LD removed. New layout uses React SPA │ │
│ │ with no SSR. Requires JS rendering." │ │
│ │ Broken since 2 days ago [Re-analyze] [Pause] [Archive] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Summary: 47 deployed │ 3 testing │ 12 feasible │ 8 rejected │
│ 2 broken │ 5 analyzing │ 4 pending │
└─────────────────────────────────────────────────────────────────────────────┘
### Feasibility Report Detail

Shown when clicking into a candidate after analysis completes.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Feasibility Report: warsaw-concerts.com │
│ Analyzed: 2026-04-11 14:32 UTC │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Overall Feasibility Score: ████████████████████░░░░░ 87/100 │
│ Recommendation: ✓ PROCEED — high confidence, JSON-LD extraction │
│ │
│ ┌─── Data Format Detection ────────────────────────────────────────────┐ │
│ │ │ │
│ │ ✓ JSON-LD: Found <script type="application/ld+json"> │ │
│ │ Types: Event (23), MusicEvent (18), TheaterEvent (5) │ │
│ │ Fields: name ✓ startDate ✓ endDate ✓ location ✓ │ │
│ │ performer ✓ offers ✓ image ✓ description ✓ │ │
│ │ Coverage: 46/46 events have all required fields │ │
│ │ │ │
│ │ ○ API Endpoints: None discovered │ │
│ │ ○ SSR Bundles: None (__NEXT_DATA__ / __NUXT__ not found) │ │
│ │ ○ HTML Structure: Available as fallback │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Site Characteristics ─────────────────────────────────────────────┐ │
│ │ │ │
│ │ Language: Polish (pl) │ │
│ │ Rendering: Static HTML (no JS required) │ │
│ │ Pagination: /events?page=N (12 pages detected) │ │
│ │ robots.txt: Crawling allowed (no Crawl-delay) │ │
│ │ Event count: ~550 events estimated │ │
│ │ Update freq: ~15 new events/week (estimated from dates) │ │
│ │ Date range: 2026-04-05 to 2026-09-15 │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Sample Extraction (3 events) ─────────────────────────────────────┐ │
│ │ │ │
│ │ Event 1: │ │
│ │ title: "Kacey Musgraves — Deeper Well Tour" │ │
│ │ starts_at: 2026-04-18T20:00:00+02:00 │ │
│ │ venue: "Torwar Hall" (52.2178, 21.0009) │ │
│ │ performers: ["Kacey Musgraves"] │ │
│ │ price: 180-450 PLN │ │
│ │ │ │
│ │ Event 2: │ │
│ │ title: "Warsaw Jazz Weekend" │ │
│ │ starts_at: 2026-04-25T19:30:00+02:00 │ │
│ │ venue: "Palladium" (52.2297, 21.0122) │ │
│ │ performers: ["Kamasi Washington", "GoGo Penguin"] │ │
│ │ price: 120-280 PLN │ │
│ │ │ │
│ │ Event 3: ... │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Risks & Notes ────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ⚠ Some events have Polish-only descriptions (no English) │ │
│ │ ⚠ Venue coordinates missing for 3/46 events (will need geocoding) │ │
│ │ ✓ Date formats are ISO 8601 — no parsing ambiguity │ │
│ │ ✓ Performer names are clean (no "feat." / "DJ" prefixes) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ [▶ Generate Source] [✕ Reject] [↻ Re-analyze] │
└─────────────────────────────────────────────────────────────────────────────┘
Shown after a test run completes (Phase 4), and accessible for deployed sources.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Quality Scorecard: warsaw-concerts.com │
│ Run: 2026-04-11 15:07 UTC │ Sample: 5 events │ Attempt: 1/5 │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Composite Score: ████████████████████████░░░░░ 88/100 → AUTO-DEPLOY │
│ │
│ ┌─── Dimension Breakdown ──────────────────────────────────────────────┐ │
│ │ │ │
│ │ Field Completeness (25%) ████████████████████ 100% 5/5 │ │
│ │ All events have title + starts_at + venue + URL │ │
│ │ │ │
│ │ Date Accuracy (20%) ████████████████████ 100% 5/5 │ │
│ │ All dates parsed with specific times (no noon-UTC fallback) │ │
│ │ │ │
│ │ Venue Geocoding (15%) ████████████████░░░░ 80% 4/5 │ │
│ │ 1 venue ("Klub Zmiana") needed geocoding — resolved via Google │ │
│ │ │ │
│ │ Category Confidence (10%) ██████████████░░░░░░ 72% avg │ │
│ │ ML classified: MusicEvent(3, avg 0.89), TheaterEvent(1, 0.73), │ │
│ │ Other(1, 0.41 — flagged for review) │ │
│ │ │ │
│ │ Dedup Collisions (10%) ████████████████████ 100% │ │
│ │ 2/5 events matched existing (Bandsintown overlap) — expected │ │
│ │ │ │
│ │ Performer Extraction (10%) ████████████████░░░░ 80% 4/5 │ │
│ │ 1 event ("Warsaw Jazz Weekend") has 2 performers extracted │ │
│ │ 1 event (art exhibition) correctly has 0 performers │ │
│ │ │ │
│ │ Metadata Richness (5%) ██████████████████░░ 90% │ │
│ │ 4/5 have images, 5/5 have descriptions > 50 chars │ │
│ │ │ │
│ │ URL Validity (5%) ████████████████████ 100% 5/5 │ │
│ │ All event URLs return HTTP 200 │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── LLM-as-Judge Validation ──────────────────────────────────────────┐ │
│ │ │ │
│ │ Accuracy: 96/100 │ │
│ │ │ │
│ │ ✓ Event 1: All fields match original page │ │
│ │ ✓ Event 2: All fields match original page │ │
│ │ ⚠ Event 3: Description truncated at 500 chars (original is 1200) │ │
│ │ ✓ Event 4: All fields match original page │ │
│ │ ✓ Event 5: All fields match original page │ │
│ │ │ │
│ │ Systematic issues: None detected │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ [▶ Deploy] [↻ Regenerate] [✕ Reject] │
└─────────────────────────────────────────────────────────────────────────────┘
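The composite score is the weighted sum of the eight dimensions above (weights total 100%). A minimal sketch of the scorer, with an illustrative module name and dimension keys; only the ≥ 80 auto-deploy gate and the 40-59 human-review band are stated elsewhere in this document, so the other threshold bands are assumptions:

```elixir
defmodule Discovery.QualityScore do
  # Illustrative weights matching the scorecard dimensions (sum to 1.0).
  @weights %{
    field_completeness: 0.25,
    date_accuracy: 0.20,
    venue_geocoding: 0.15,
    category_confidence: 0.10,
    dedup_collisions: 0.10,
    performer_extraction: 0.10,
    metadata_richness: 0.05,
    url_validity: 0.05
  }

  @doc "Weighted sum of per-dimension scores (each 0-100) into a 0-100 composite."
  def composite(scores) do
    @weights
    |> Enum.reduce(0.0, fn {dim, weight}, acc ->
      acc + weight * Map.get(scores, dim, 0)
    end)
    |> round()
  end

  # Only the >= 80 auto-deploy gate and the 40-59 human-review band
  # come from this document; the 60-79 retry band is an assumption.
  def decision(score) when score >= 80, do: :auto_deploy
  def decision(score) when score >= 60, do: :retry
  def decision(score) when score >= 40, do: :human_review
  def decision(_score), do: :reject
end
```

Missing dimensions default to 0, so a partially scored run can never inflate the composite.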
Overview of all pipeline activity — what's currently running, what's queued, recent outcomes.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Crawler Pipeline Monitor │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─── Active Jobs ──────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ● Analyzing lodz-events.pl Started 4m ago │ │
│ │ Phase 1 → Fetching pages (3/5) │ │
│ │ │ │
│ │ ● Generating poznan-muzyka.pl Started 12m ago Attempt 2 │ │
│ │ Phase 3 → LLM writing transformer.ex (retry: date parse error) │ │
│ │ │ │
│ │ ● Testing wroclaw-bilety.com Started 8m ago │ │
│ │ Phase 4 → mix scraper.test running (3/5 events processed) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Recent Outcomes (last 24h) ───────────────────────────────────────┐ │
│ │ │ │
│ │ ✓ 14:32 gdansk-kultura.pl Deployed Score: 91 JSON-LD │ │
│ │ ✓ 13:15 torun-festiwale.pl Deployed Score: 84 API │ │
│ │ ✕ 12:44 random-blog.net Rejected Feas: 18 (blog) │ │
│ │ ✓ 11:30 bielsko-biala-events.pl Deployed Score: 77 JSON-LD │ │
│ │ ⚠ 10:15 szczecin-noc.pl Testing Score: 58 HTML │ │
│ │ → Attempt 3: "Venue names extracted as addresses" │ │
│ │ ✕ 09:00 some-restaurant.pl Rejected Feas: 31 (menu) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Pipeline Stats ──────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Today This Week This Month All Time │ │
│ │ ───── ───────── ────────── ──────── │ │
│ │ Analyzed: 8 Analyzed: 34 Analyzed: 127 Analyzed: 312 │ │
│ │ Deployed: 3 Deployed: 14 Deployed: 47 Deployed: 108 │ │
│ │ Rejected: 2 Rejected: 11 Rejected: 52 Rejected: 143 │ │
│ │ Broken: 0 Broken: 1 Broken: 3 Broken: 8 │ │
│ │ Healed: 0 Healed: 1 Healed: 2 Healed: 5 │ │
│ │ │ │
│ │ Success rate: 41% │ Avg attempts: 1.8 │ Avg time: 23 min │ │
│ │ JSON-LD %: 64% │ API %: 21% │ HTML %: 15% │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Health Alerts ────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ⚠ old-source.com — Structural change detected 2h ago │ │
│ │ Self-healing in progress (attempt 1/3) │ │
│ │ │ │
│ │ ⚠ krakow-imprezy.pl — Quality score dropped: 82 → 61 │ │
│ │ Last 3 runs: geocoding failures spiked (provider issue?) │ │
│ │ │ │
│ │ ✕ warsaw-nightlife.com — Broken for 5 days, 3/3 heal attempts │ │
│ │ Needs human intervention │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Detail view when a deployed source triggers the self-healing loop.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Self-Healing: old-source.com │
│ Status: Re-analyzing (attempt 1 of 3) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─── What Happened ────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Timeline: │ │
│ │ ─────────────────────────────────────────────── │ │
│ │ Apr 9 22:00 Crawl succeeded, score 84 │ │
│ │ Apr 10 06:00 Crawl succeeded, score 82 │ │
│ │ Apr 10 14:00 Structural fingerprint changed (hash: a4f2 → c8d1) │ │
│ │ Apr 10 14:01 Crawl attempted — extraction failed │ │
│ │ Error: "CSS selector .event-card returned 0 results" │ │
│ │ Apr 10 14:02 Self-healing triggered │ │
│ │ Apr 10 14:05 Re-analysis complete: site redesigned │ │
│ │ New structure uses <article class="listing-item"> │ │
│ │ Apr 10 14:10 New selectors generated by LLM │ │
│ │ Apr 10 14:12 Validation: 4/5 events extracted correctly │ │
│ │ Apr 10 14:13 ⚠ 1 event missing venue (new layout nests it │ │
│ │ differently) │ │
│ │ Apr 10 14:15 Regenerating with updated venue selector... │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Fingerprint Diff ─────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Before (hash a4f2): After (hash c8d1): │ │
│ │ <div class="event-list"> <section class="listings"> │ │
│ │ <div class="event-card"> <article class="listing-item"> │ │
│ │ <h3 class="title"> <h2 class="listing-title"> │ │
│ │ <span class="date"> <time datetime="..."> │ │
│ │ <span class="venue"> <div class="venue-info"> │ │
│ │ <span class="venue-name"> │ │
│ │ │ │
│ │ Changes: class names, tag types, venue nesting depth │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── Actions ──────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ [Continue Healing] [Pause Source] [Manual Fix] [Archive] │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
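The fingerprint hashes only the page's structural skeleton (tag names and class attributes), so routine content updates don't change the hash but a redesign does. A sketch, assuming Floki as the HTML parser; the module name and 4-character hash truncation are illustrative:

```elixir
defmodule Discovery.Fingerprint do
  @moduledoc "Hash a page's structural skeleton, ignoring text content."

  # Assumes Floki ({tag, attrs, children} tuples) for HTML parsing.
  def compute(html) do
    html
    |> Floki.parse_document!()
    |> skeleton()
    |> :erlang.term_to_binary()
    |> then(&:crypto.hash(:sha256, &1))
    |> Base.encode16(case: :lower)
    |> binary_part(0, 4)
  end

  defp skeleton(nodes) when is_list(nodes) do
    nodes |> Enum.map(&skeleton/1) |> Enum.reject(&is_nil/1)
  end

  # Keep only tag names and class attributes; drop text, comments,
  # and all other attributes (ids, hrefs, inline styles).
  defp skeleton({tag, attrs, children}) do
    classes = for {"class", value} <- attrs, do: value
    {tag, classes, skeleton(children)}
  end

  defp skeleton(_other), do: nil
end
```

Comparing the stored hash against a fresh `compute/1` result on each crawl is what triggers the `a4f2 → c8d1` transition in the timeline above.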
Simple form for adding a new source candidate to the pipeline.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Add Source Candidate │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ URL * │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ https://example-events.pl/agenda │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Name * │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Example Events │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Instructions (optional — hints for the analyzer) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Events are on the /agenda page. Polish language. Pagination via │ │
│ │ "Load more" button. Some events span multiple days. │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Priority │
│ ○ Normal — analyze when queue is clear │
│ ◉ High — analyze next │
│ ○ Low — analyze when nothing else is pending │
│ │
│ Auto-deploy if quality score ≥ 80? │
│ [✓] Yes, auto-deploy high-quality sources │
│ [ ] No, always require manual approval │
│ │
│ [Cancel] [Add & Analyze ►] │
└─────────────────────────────────────────────────────────────────────────────┘
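The form above maps onto a candidate record. A sketch of a possible Ecto schema and changeset; field names and the status enum are illustrative (the real schema is specified in Gap 4), though the statuses mirror the dashboard summary counts:

```elixir
defmodule Discovery.SourceCandidate do
  use Ecto.Schema
  import Ecto.Changeset

  schema "source_candidates" do
    field :url, :string
    field :name, :string
    field :instructions, :string
    field :priority, Ecto.Enum, values: [:low, :normal, :high], default: :normal
    field :auto_deploy, :boolean, default: true

    # States from the dashboard summary: pending -> analyzing -> feasible
    # -> testing -> deployed, with rejected/broken as terminal/error states.
    field :status, Ecto.Enum,
      values: [:pending, :analyzing, :feasible, :testing, :deployed, :rejected, :broken],
      default: :pending

    timestamps()
  end

  def changeset(candidate, attrs) do
    candidate
    |> cast(attrs, [:url, :name, :instructions, :priority, :auto_deploy])
    |> validate_required([:url, :name])
    |> validate_format(:url, ~r/^https?:\/\//)
    |> unique_constraint(:url)
  end
end
```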
How our system maps to established crawler architectures:
| Concept | Googlebot | Scrapy | Diffbot | Eventasaurus (proposed) |
|---|---|---|---|---|
| URL frontier | Priority queue by PageRank + freshness | Scheduler with per-domain queues | N/A (on-demand) | source_candidates table + Oban priority queues |
| Fetcher | Custom HTTP + WRS (JS rendering) | Downloader + middleware | Cloud rendering | Http.Client + adapters (Direct/Crawlbase/Zyte) |
| Structure detection | JSON-LD, Microdata, RDFa parsers | Manual spider per site | ML page-type classifier | Tiered: JSON-LD → API → SSR → LLM |
| Field extraction | Schema.org field mapping | CSS/XPath selectors per spider | ML per-type models | Universal JSON-LD + per-source transformer |
| Quality validation | Index quality signals | Item pipeline validators | Confidence scores | Quality scoring (8 dims) + LLM-as-judge |
| Dedup | URL canonicalization + content fingerprint | Request fingerprint filter | Entity resolution (Knowledge Graph) | ExternalID + PostGIS proximity + fuzzy name |
| Scheduling | Crawl budget per domain | Spider scheduling | On-demand | Oban cron + event-aware priority |
| Self-healing | Re-rendering on change | Manual spider updates | Auto-adapts (ML) | Structural fingerprinting + LLM regeneration |
| Politeness | robots.txt + adaptive rate | DOWNLOAD_DELAY + AutoThrottle | Managed | Per-source rate limits + robots.txt |
Key differentiator: We're a domain-specific event crawler. Unlike Googlebot (indexes everything) or Diffbot (extracts any page type), we only care about events. This constraint means:
- We know exactly what fields we need (title, date, venue, etc.)
- We can validate much more deeply (geocoding, date parsing, category classification)
- We can cross-reference across sources (dedup, performer matching)
- Our quality standards are higher but narrower
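The tiered structure detection from the comparison table ("JSON-LD → API → SSR → LLM") amounts to trying the cheapest, most reliable strategy first and falling through. A sketch, assuming `network_log` is a list of maps captured during Phase 1 analysis; all names are illustrative:

```elixir
defmodule Discovery.TierDetector do
  @moduledoc "Pick the cheapest viable extraction strategy, in priority order."

  def detect(html, network_log) do
    cond do
      has_json_ld?(html) -> {:ok, :json_ld}
      api_endpoint(network_log) != nil -> {:ok, :api}
      has_ssr_bundle?(html) -> {:ok, :ssr_bundle}
      true -> {:ok, :llm_html}  # most expensive tier; last resort
    end
  end

  defp has_json_ld?(html),
    do: String.contains?(html, ~s(type="application/ld+json"))

  # SSR bundles from Next.js / Nuxt, as probed in the feasibility report.
  defp has_ssr_bundle?(html),
    do: String.contains?(html, "__NEXT_DATA__") or String.contains?(html, "__NUXT__")

  # Assumes the analyzer records each observed request's content type.
  defp api_endpoint(network_log),
    do: Enum.find(network_log, fn entry -> entry.content_type =~ "application/json" end)
end
```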
These are prerequisites. Without them, the pipeline can't score quality or handle categories for new sources.
| # | Gap | Effort | What it unblocks |
|---|---|---|---|
| 1 | Source Quality Scoring (Gap 1) | 2 weeks | Phase 4 validation, deployment gates, monitoring |
| 2 | ML Category Classification (Gap 2) | 2 weeks | Category data for autonomous sources |
| 3 | JSON-LD Universal Extractor (Gap 3) | 1 week | Zero-code sources, highest ROI path |
| 4 | Source Candidate Table (Gap 4) | 1 week | Pipeline state machine, admin dashboard |
Milestone A: Can manually add a JSON-LD source URL to the candidate table, and the system scores its quality. Category classification works for new events.
The autonomous pipeline core. After this phase, the system can go from URL → deployed source.
| # | Gap | Effort | What it unblocks |
|---|---|---|---|
| 5 | Site Analysis Module (Gap 5) | 3 weeks | Automated feasibility assessment |
| 6 | LLM Code Generation Loop (Gap 6) | 4 weeks | Automated source building |
| 7 | LLM-as-Judge Validation (Gap 7) | 1 week | Semantic quality gate |
Milestone B: Add a URL to the candidate table. System analyzes feasibility, generates source code, tests it, scores quality, and deploys if score ≥ 80. Full loop works for JSON-LD and API sources.
Long-term maintenance and zero-deployment operations.
| # | Gap | Effort | What it unblocks |
|---|---|---|---|
| 8 | Structural Fingerprinting (Gap 8) | 2 weeks | Self-healing on site changes |
| 9 | Dynamic Source Registration (Gap 9) | 1 week | No deployment needed for new sources |
| 10 | Performer Enrichment (Gap 10) | 1 week | Better performer data from unknown sources |
Milestone C: Sources self-heal when sites change. New sources deploy without code changes. System manages 50+ sources autonomously.
The "event-specific Googlebot" vision.
| # | Feature | Description |
|---|---|---|
| 11 | Proactive discovery | Follow links from known event sites, detect event-like structured data on new sites, propose them as candidates |
| 12 | Adaptive scheduling | Crawl frequency based on observed change rate (Microsoft Optimal Freshness model) |
| 13 | Raw response archival | WARC-like storage for re-extraction without re-crawling |
| 14 | Conditional HTTP | ETag/Last-Modified tracking to skip unchanged pages |
| 15 | Multi-region expansion | Automatically discover event sources for new cities/countries |
Milestone D: System proactively discovers and onboards new event sources. Manages 200+ sources. Crawl budget and scheduling optimized per source.
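Item 14 (conditional HTTP) is straightforward to sketch: store each URL's ETag and Last-Modified, send them back as If-None-Match / If-Modified-Since, and skip re-extraction on a 304. Assuming Req as the HTTP client (recent versions return response headers as a map of lists); the module and store shapes are illustrative:

```elixir
defmodule Discovery.ConditionalFetch do
  @moduledoc "Skip re-extraction when a page is unchanged (ETag / Last-Modified)."

  # `store` is any map of %{url => %{etag: ..., last_modified: ...}}.
  def fetch(url, store) do
    cached = Map.get(store, url, %{})

    headers =
      [{"if-none-match", cached[:etag]},
       {"if-modified-since", cached[:last_modified]}]
      |> Enum.reject(fn {_name, value} -> is_nil(value) end)

    case Req.get!(url, headers: headers) do
      %{status: 304} ->
        # Nothing changed; no extraction job enqueued.
        {:unchanged, store}

      %{status: 200, headers: resp_headers, body: body} ->
        entry = %{
          etag: first_header(resp_headers, "etag"),
          last_modified: first_header(resp_headers, "last-modified")
        }

        {:changed, body, Map.put(store, url, entry)}
    end
  end

  defp first_header(headers, name),
    do: headers |> Map.get(name, []) |> List.first()
end
```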
| Operation | When | Cost per invocation | Frequency |
|---|---|---|---|
| Site analysis (Phase 1) | Per candidate | ~$0.02-0.10 | Once per candidate |
| Code generation (Phase 3) | Per source | ~$0.10-0.50 | 1-5x per source |
| LLM-as-judge (Phase 4) | Per test run | ~$0.05-0.15 | 1-5x per source |
| Self-healing (Phase 5) | On breakage | ~$0.10-0.30 | Rare (monthly?) |
| LLM runtime extraction (Tier 4) | Per crawl | ~$0.01-0.05/page | Per event page |
For JSON-LD/API sources, LLM costs are one-time during onboarding (~$0.20-1.00 per source). For HTML/LLM-runtime sources, costs recur on every crawl: at 100 HTML sources × 50 events each × weekly crawls, that is roughly $250-1250/month.
Recommendation: Prioritize JSON-LD and API sources first. Reserve LLM runtime extraction for high-value sources where no structured data exists.
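The runtime-cost arithmetic above can be sanity-checked with a small helper; the per-page rate range is the assumed figure from the cost table, not a measured cost:

```elixir
defmodule Discovery.CostModel do
  # Assumed per-page LLM extraction cost range from the cost table.
  @per_page_usd {0.01, 0.05}

  @doc """
  Monthly runtime LLM cost for HTML sources:
  sources * events_per_source * crawls_per_month * per-page cost.
  """
  def monthly_html_cost(sources, events_per_source, crawls_per_month \\ 4) do
    pages = sources * events_per_source * crawls_per_month
    {low, high} = @per_page_usd
    {pages * low, pages * high}
  end
end

# 100 HTML sources, 50 events each, weekly crawls (~4/month):
# Discovery.CostModel.monthly_html_cost(100, 50)
# => {200.0, 1000.0} -- consistent with the ~$250-1250/month estimate
```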
| Source count | Expected composition | Monthly LLM cost | Oban job volume |
|---|---|---|---|
| 50 | 30 JSON-LD, 10 API, 10 HTML | ~$50-150 | ~50k jobs |
| 100 | 55 JSON-LD, 25 API, 20 HTML | ~$150-400 | ~120k jobs |
| 200 | 100 JSON-LD, 50 API, 50 HTML | ~$400-1200 | ~300k jobs |
| 500 | 250 JSON-LD, 125 API, 125 HTML | ~$1000-3000 | ~750k jobs |
- LLM extraction as bootstrap vs. runtime — Should LLM HTML extraction generate CSS selectors once (cheaper, but brittle) or extract at runtime on every crawl (expensive, but self-adapting)?
- Trust levels for auto-deployed sources — Should autonomous sources have a "probation" flag visible in the UI? Or is the quality score + 7-day elevated monitoring sufficient?
- Starting scope — Begin with JSON-LD-only sources (easy wins, many sites, zero per-source code) or build the full pipeline including HTML sources from day one?
- JS rendering — Some sites require JavaScript rendering. Add Playwright/headless browser capability, or continue relying on the Crawlbase proxy for JS rendering?
- Geographic scope — Start by discovering sources for cities we already serve (Warsaw, Krakow, Paris, etc.) or expand to new cities simultaneously?
- Human-in-the-loop placement — Where in the pipeline should humans be required vs. optional? Current proposal: humans only for sources scoring 40-59 and for broken sources that fail self-healing.
- Code generation approach — Oban job chain (fully in-process) vs. an external agent process (Claude Code or similar) that writes and tests code? The Oban approach is simpler but may hit context window limits for complex sources.
- Proactive discovery (Phase D) — How aggressive should the crawler be in discovering new sources? Follow links from known sources? Use search engines to find "[city] events" sites? Accept community submissions?