@holden · Created April 11, 2026 07:50
Autonomous Event Crawler Pipeline — Full Spec

Autonomous Event Crawler Pipeline

Transform Eventasaurus from a system where developers manually build each scraper into an autonomous event crawler that discovers, analyzes, builds, tests, and deploys new sources with minimal human intervention — scaling from ~17 sources to hundreds.


Motivation

Today, adding a new event source requires a developer to:

  1. Manually inspect the target site (DevTools, network tab, page source)
  2. Determine the best extraction strategy (API? JSON-LD? HTML scraping?)
  3. Run mix discovery.generate_source to scaffold empty stubs
  4. Write ~200-500 lines of custom code (client, transformer, config, jobs)
  5. Manually register the source in SourceRegistry and scraper.test
  6. Run end-to-end tests, debug, iterate
  7. Deploy and monitor

This takes hours to days per source. At this rate, reaching 100+ sources is impractical.

The insight: We've already built most of the pipeline infrastructure — BaseJob, processing pipeline, geocoding, dedup, metrics, testing harness. The missing piece is automating the per-source customization (steps 1-6 above). Every crawler system — from Googlebot to Scrapy to Diffbot — solves this same problem: how do you go from "here's a URL" to "here's structured data" at scale?

Long-term vision: An event-specific web crawler. Like Googlebot, but instead of indexing all web content, it specifically looks for events — discovers them, extracts structured data, validates quality, and feeds them into our pipeline. Today it processes sources we point it at. Tomorrow it proactively discovers new event sources by following links and detecting event-like structured data across the web.


Current State Assessment

What We Have (and its automation readiness)

| Component | Status | Automation Ready? |
|---|---|---|
| Source generator (mix discovery.generate_source) | Scaffolds files with TODO stubs | Partial — stubs are empty |
| BaseJob + processing pipeline (Venue → Event → Performer) | Solid, battle-tested | Yes |
| End-to-end testing (mix scraper.test) | Real Oban workers, pass/fail | Partial — binary, no scoring |
| MetricsTracker + 13 error categories | Per-job outcome tracking | Partial — no aggregate score |
| Geocoding orchestrator (9 providers) | Auto-fallback chain | Yes |
| Dedup (ExternalID + PostGIS proximity + fuzzy) | Same-source + cross-source | Yes |
| Category classification (CategoryClassifier) | ML model exists (BART-large-MNLI) | No — not wired into pipeline |
| Performer matching (fuzzy 0.85 threshold) | Basic but functional | Mostly |
| HTTP adapter system (Direct/Crawlbase/Zyte) | Auto-fallback on blocking | Yes |
| Source registry (source_registry.ex) | Compile-time map | No — manual edit required |
| Fixture recording for tests | --fixture flag | No — manual recording |
| Monitoring dashboards (error trends, health) | Multiple admin views | Yes |
| API discovery guide | Human-readable doc only | No — completely manual |
| Source Implementation Guide | Comprehensive, 7 steps | No — human instructions |

What's Working Autonomously Already

The downstream pipeline is remarkably autonomous. Once a transformer produces an event map with the right shape, everything after that is automatic:

event_map → Processor → VenueProcessor (geocode, dedup, create)
                      → EventProcessor (category, collision, create)
                      → PerformerStore (fuzzy match, enrich)
                      → MetricsTracker (record outcome)

What's Completely Manual

The upstream work — going from "here's a website with events" to "here's a working transformer that produces event maps" — is 100% manual. This is the automation target.


Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                       SOURCE CANDIDATES TABLE                       │
│                                                                     │
│  url: "https://example.com/events"                                 │
│  name: "Example Events"                                            │
│  instructions: "Events listed on /agenda, Polish language"         │
│  status: pending_analysis | analyzing | feasible | generating |    │
│           testing | deployed | rejected | broken                   │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            │                   │                   │
            ▼                   ▼                   ▼
  ┌──────────────────┐ ┌───────────────┐ ┌──────────────────┐
  │  PHASE 1         │ │  PHASE 2      │ │  PHASE 3         │
  │  Analysis &      │ │  Extraction   │ │  Code            │
  │  Feasibility     │ │  Strategy     │ │  Generation      │
  │                  │ │               │ │                  │
  │  • robots.txt    │ │  • JSON-LD    │ │  • client.ex     │
  │  • Format detect │ │  • API        │ │  • transformer   │
  │  • LLM analysis  │ │  • SSR bundle │ │  • config.ex     │
  │  • Score 0-100   │ │  • HTML+LLM   │ │  • sync_job.ex   │
  │  • Gate: ≥60     │ │               │ │  • registration   │
  └────────┬─────────┘ └───────┬───────┘ └────────┬─────────┘
           │                   │                   │
           └───────────────────┼───────────────────┘
                               │
                               ▼
              ┌─────────────────────────────────┐
              │  PHASE 4: VALIDATION LOOP       │
              │                                 │
              │  • Compile check                │
              │  • mix scraper.test --limit 5   │
              │  • Quality scoring (8 dims)     │
              │  • LLM-as-judge comparison      │
              │  • If fail → back to Phase 3    │
              │    (max 5 iterations)           │
              └───────────────┬─────────────────┘
                              │
                    Pass (score ≥ 80)
                              │
                              ▼
              ┌─────────────────────────────────┐
              │  PHASE 5: DEPLOY & MONITOR      │
              │                                 │
              │  • Auto-register in production  │
              │  • 7-day elevated monitoring    │
              │  • Structural fingerprinting    │
              │  • Self-healing on breakage     │
              │  • Health scoring via Metrics   │
              └─────────────────────────────────┘

Pipeline Phases (Detail)

Phase 1: Analysis & Feasibility

Input: A URL + optional human instructions
Output: Feasibility report with score (0-100)

┌─────────────────────────────────────────────────┐
│              ANALYSIS PIPELINE                   │
│                                                  │
│  URL ──► Fetch HTML ──► robots.txt check         │
│              │                                   │
│              ▼                                   │
│  ┌─── Format Detection ───┐                     │
│  │                         │                     │
│  │  1. JSON-LD present?    │──► score += 40     │
│  │  2. API endpoints?     │──► score += 30     │
│  │  3. SSR data bundles?  │──► score += 25     │
│  │  4. Structured HTML?   │──► score += 15     │
│  │  5. JS-rendered only?  │──► score += 5      │
│  └─────────────────────────┘                     │
│              │                                   │
│              ▼                                   │
│  ┌─── Content Assessment ──┐                    │
│  │                          │                    │
│  │  • Event-like content?   │  (LLM check)      │
│  │  • Date/time patterns?   │  score += 0-20    │
│  │  • Venue information?    │  score += 0-10    │
│  │  • Pagination detected?  │  score += 0-10    │
│  │  • Language identified?  │                    │
│  └──────────────────────────┘                    │
│              │                                   │
│              ▼                                   │
│  Feasibility Score: 0-100                        │
│  ≥ 60: proceed    < 60: reject (with reason)    │
└─────────────────────────────────────────────────┘

Key checks:

  • robots.txt compliance — reject if crawling is disallowed
  • Data format detection — JSON-LD is the golden path (zero per-source code needed)
  • Content assessment via LLM — "Does this page list events? What fields are available?"
  • Language detection — for MultilingualDateParser configuration
  • Pagination pattern — infinite scroll, numbered pages, load-more button, API cursor
  • Rendering requirements — static HTML vs JS-rendered (affects cost and complexity)
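The scoring arithmetic above can be sketched in a few lines. Everything here (module name, detection-map keys, the normalization of content signals to 0.0-1.0) is an illustrative assumption, not existing code:

```elixir
defmodule FeasibilityScore do
  # Format points are mutually exclusive: take the best tier available.
  @format_points [json_ld?: 40, api?: 30, ssr_bundle?: 25, structured_html?: 15, js_only?: 5]

  def score(detection, content_assessment) do
    format_score(detection) + content_score(content_assessment)
  end

  defp format_score(detection) do
    Enum.find_value(@format_points, 0, fn {key, points} ->
      if Map.get(detection, key), do: points
    end)
  end

  # Content signals are assumed pre-normalized to 0.0..1.0 by the LLM/heuristic checks.
  defp content_score(%{date_patterns: d, venue_info: v, pagination: p}) do
    round(d * 20 + v * 10 + p * 10)
  end
end

# A JSON-LD site with solid dates and venues clears the ≥ 60 gate:
# format 40 + content (20 + 8 + 10) = 78
FeasibilityScore.score(%{json_ld?: true}, %{date_patterns: 1.0, venue_info: 0.8, pagination: 1.0})
```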

Phase 2: Extraction Strategy Selection

Input: Feasibility report from Phase 1
Output: Chosen strategy + sample extracted data

The tiered extraction model (borrowed from Crawl4AI / Diffbot patterns):

┌─────────────────────────────────────────────────┐
│           EXTRACTION STRATEGY TIERS              │
│                                                  │
│  ┌─── Tier 1: JSON-LD ────────────────────────┐ │
│  │  Cost: $0     Reliability: ★★★★★           │ │
│  │  Maintenance: Zero                          │ │
│  │                                             │ │
│  │  Parse <script type="application/ld+json">  │ │
│  │  Map schema.org Event → our event_map       │ │
│  │  Handles: Event, MusicEvent, TheaterEvent,  │ │
│  │           ScreeningEvent, etc.              │ │
│  └─────────────────────────────────────────────┘ │
│          │ not found                             │
│          ▼                                       │
│  ┌─── Tier 2: API Endpoint ───────────────────┐ │
│  │  Cost: $0     Reliability: ★★★★☆           │ │
│  │  Maintenance: Low (API versioning)          │ │
│  │                                             │ │
│  │  Discovered XHR/fetch endpoints from page   │ │
│  │  GraphQL introspection if available         │ │
│  │  Returns structured JSON already            │ │
│  └─────────────────────────────────────────────┘ │
│          │ not found                             │
│          ▼                                       │
│  ┌─── Tier 3: SSR Data Bundle ────────────────┐ │
│  │  Cost: $0     Reliability: ★★★☆☆           │ │
│  │  Maintenance: Medium (framework updates)    │ │
│  │                                             │ │
│  │  Extract __NEXT_DATA__, __NUXT__,           │ │
│  │  window.__INITIAL_STATE__, etc.             │ │
│  │  Already JSON — just needs field mapping    │ │
│  └─────────────────────────────────────────────┘ │
│          │ not found                             │
│          ▼                                       │
│  ┌─── Tier 4: LLM HTML Extraction ───────────┐  │
│  │  Cost: ~$0.01-0.05/page  Reliability: ★★★☆☆│ │
│  │  Maintenance: Self-healing via LLM          │ │
│  │                                             │  │
│  │  Clean HTML (strip nav/footer/ads)          │  │
│  │  Send to Claude: "Extract event fields"     │  │
│  │  Validate output against schema             │  │
│  │  Option A: Generate CSS selectors (once)    │  │
│  │  Option B: LLM extract every crawl          │  │
│  └─────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Strategy selection criteria:

| Strategy | When to use | Per-crawl cost | Breakage risk |
|---|---|---|---|
| JSON-LD | Site has @type: "Event" markup | $0 | Very low |
| API | Discoverable REST/GraphQL endpoint | $0 | Low-medium |
| SSR Bundle | Next.js/Nuxt/SvelteKit with SSR | $0 | Medium |
| LLM (selector gen) | HTML only, generate selectors once | $0 after init | Medium-high |
| LLM (runtime) | Frequently changing HTML structure | $0.01-0.05/page | Low |
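Under these criteria, strategy choice reduces to a cascade over the Phase 1 detection results. A minimal sketch (key names and the stable-HTML heuristic are assumptions):

```elixir
defmodule StrategySelector do
  # Illustrative tier cascade over the Phase 1 detection report.
  def choose(detection) do
    cond do
      detection[:json_ld?] -> :json_ld
      detection[:api_endpoint] -> :api
      detection[:ssr_bundle] -> :ssr_bundle
      detection[:html_stable?] -> :llm_selector_gen  # generate CSS selectors once
      true -> :llm_runtime                           # LLM-extract on every crawl
    end
  end
end
```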

Phase 3: Code Generation

Input: Extraction strategy + sample data + feasibility report
Output: Complete source directory (client.ex, transformer.ex, config.ex, sync_job.ex)

┌────────────────────────────────────────────────────────┐
│                CODE GENERATION LOOP                     │
│                                                        │
│  Context Window:                                       │
│  ┌──────────────────────────────────────────────────┐  │
│  │  • Source Implementation Guide (docs/)           │  │
│  │  • 2-3 similar existing sources (few-shot)       │  │
│  │  • Feasibility report + sample data              │  │
│  │  • Extraction strategy + sample raw response     │  │
│  │  • BaseJob source code                           │  │
│  │  • Processor / EventProcessor source code        │  │
│  │  • Target event_map schema                       │  │
│  └──────────────────────────────────────────────────┘  │
│                         │                              │
│                         ▼                              │
│  ┌─── Generate ────────────────────────────────────┐   │
│  │                                                  │   │
│  │  client.ex       HTTP calls for chosen strategy │   │
│  │  transformer.ex  Raw response → event_map       │   │
│  │  config.ex       URLs, rate limits, dedup       │   │
│  │  sync_job.ex     Orchestration (BaseJob or      │   │
│  │                  custom for multi-page)          │   │
│  │  + SourceRegistry entry                         │   │
│  │  + scraper.test entry                           │   │
│  └──────────────────────────────────────────────────┘   │
│                         │                              │
│                         ▼                              │
│  For JSON-LD sources: may skip client.ex/              │
│  transformer.ex entirely — use universal extractor     │
└────────────────────────────────────────────────────────┘

For JSON-LD sources (Tier 1): The universal JSON-LD extractor handles everything. Code generation may only need a thin config specifying the base URL and pagination pattern. This is the highest-leverage path — dozens of sources with near-zero per-source code.

For API/SSR sources (Tiers 2-3): LLM generates client.ex (HTTP calls) and transformer.ex (field mapping). These are relatively straightforward since the data is already structured JSON.

For HTML sources (Tier 4): Most complex. LLM generates either CSS selectors (one-time) or a runtime extraction prompt. The generated transformer includes date parsing via MultilingualDateParser.
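As a concrete illustration of the Tier 1 "thin config" idea — the module name and every key below are hypothetical, sketching what such a config might contain:

```elixir
# Hypothetical thin config for a JSON-LD source: no client.ex or
# transformer.ex — the universal extractor does the rest.
defmodule EventasaurusDiscovery.Sources.ExampleEvents.Config do
  def config do
    %{
      slug: "example-events",
      base_urls: ["https://example.com/events"],
      pagination: {:numbered, param: "page", max_pages: 20},
      language: "pl",
      rate_limit_ms: 2_000
    }
  end
end
```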

Phase 4: Validation Loop

Input: Generated source code
Output: Quality score (0-100) + pass/fail decision

┌────────────────────────────────────────────────────────┐
│                  VALIDATION LOOP                        │
│                                                        │
│  Iteration 1 of max 5:                                 │
│                                                        │
│  ┌─── Step 1: Compile ──────────────────────────────┐  │
│  │  mix compile --warnings-as-errors                │  │
│  │  If fail → feed errors to LLM → regenerate       │  │
│  └──────────────────────────────────────────────────┘  │
│                         │ pass                         │
│                         ▼                              │
│  ┌─── Step 2: End-to-End Test ──────────────────────┐  │
│  │  mix scraper.test <source> --limit 5             │  │
│  │  If SyncJob fails → feed errors to LLM           │  │
│  └──────────────────────────────────────────────────┘  │
│                         │ pass                         │
│                         ▼                              │
│  ┌─── Step 3: Quality Scoring ──────────────────────┐  │
│  │                                                   │  │
│  │  Dimension          Weight  Score                 │  │
│  │  ─────────────────  ──────  ─────                 │  │
│  │  Field completeness  25%    title+date+venue+url  │  │
│  │  Date accuracy       20%    parsed, not noon-UTC  │  │
│  │  Venue geocoding     15%    geocoded successfully │  │
│  │  Category confidence  10%   ML classification avg │  │
│  │  Dedup collision      10%   reasonable rate       │  │
│  │  Performer extract    10%   at least 1 performer  │  │
│  │  Metadata richness     5%   images, descriptions  │  │
│  │  URL validity          5%   resolvable event URLs │  │
│  │                                                   │  │
│  │  Composite: weighted average → 0-100              │  │
│  └──────────────────────────────────────────────────┘  │
│                         │                              │
│                         ▼                              │
│  ┌─── Step 4: LLM-as-Judge ────────────────────────┐   │
│  │                                                  │   │
│  │  For 3-5 sample events:                         │   │
│  │  • Fetch original page                          │   │
│  │  • Compare extracted data vs. what's on page    │   │
│  │  • Flag: hallucinated fields? mismatched data?  │   │
│  │  • Catch systematic errors (e.g., venue as      │   │
│  │    title, wrong date interpretation)            │   │
│  └──────────────────────────────────────────────────┘   │
│                         │                              │
│              ┌──────────┴──────────┐                   │
│              │                     │                   │
│         Score ≥ 80            Score < 80               │
│              │                     │                   │
│              ▼                     ▼                   │
│      PHASE 5: Deploy      Feed errors + scores         │
│                           back to Phase 3              │
│                           (iteration N+1)              │
└────────────────────────────────────────────────────────┘

Quality thresholds:

| Score | Action |
|---|---|
| ≥ 80 | Auto-deploy with standard monitoring |
| 60-79 | Auto-deploy with elevated monitoring (7-day probation) |
| 40-59 | Queue for human review — too risky to auto-deploy |
| < 40 | Reject — mark candidate as rejected with reason |

Phase 5: Deployment & Monitoring

Input: Validated source with quality score ≥ 60
Output: Live source in production with ongoing health monitoring

┌────────────────────────────────────────────────────────┐
│              DEPLOYMENT & ONGOING MONITORING             │
│                                                        │
│  Day 0: Deploy                                         │
│  ┌──────────────────────────────────────────────────┐  │
│  │  • Register source in DB (dynamic registration)  │  │
│  │  • Set initial crawl schedule                    │  │
│  │  • Record structural fingerprint of key pages    │  │
│  │  • Flag as "probation" for 7 days                │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  Days 1-7: Probation                                   │
│  ┌──────────────────────────────────────────────────┐  │
│  │  • Every crawl: compare quality score to initial │  │
│  │  • If score drops > 20 points → pause + alert    │  │
│  │  • If 3+ consecutive crawls succeed → confidence │  │
│  │  • Human can promote to "stable" early           │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  Ongoing: Health Monitoring                            │
│  ┌──────────────────────────────────────────────────┐  │
│  │  • Structural fingerprint check each crawl       │  │
│  │  • Quality score computed per run                │  │
│  │  • Error rate trends via existing monitoring     │  │
│  │  • Event-aware freshness:                        │  │
│  │    - starts_at < 72h: high crawl priority        │  │
│  │    - starts_at passed: stop crawling             │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  Self-Healing Loop:                                    │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Fingerprint changed?                            │  │
│  │    │                                             │  │
│  │    ▼                                             │  │
│  │  Run validation on existing selectors            │  │
│  │    │                                             │  │
│  │    ├── Still works → no action                   │  │
│  │    │                                             │  │
│  │    └── Broken → Trigger re-analysis (Phase 2)    │  │
│  │         │                                        │  │
│  │         ├── LLM regenerates selectors            │  │
│  │         │                                        │  │
│  │         ├── Validate new selectors               │  │
│  │         │                                        │  │
│  │         └── If 3 failures → mark "broken",       │  │
│  │             alert human                          │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
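The event-aware freshness rule above could be sketched as follows (function name, field names, and the non-imminent tiers are assumptions):

```elixir
# Sketch: event-aware crawl priority (0 = crawl soonest; :skip = stop).
def crawl_priority(%{starts_at: starts_at}, now \\ DateTime.utc_now()) do
  hours_until = div(DateTime.diff(starts_at, now), 3600)

  cond do
    hours_until < 0 -> :skip      # event already started: stop crawling it
    hours_until <= 72 -> 0        # starts within 72h: highest priority
    hours_until <= 24 * 14 -> 1   # within two weeks
    true -> 2                     # far future: routine schedule
  end
end
```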

Gap Analysis

Gap 1: Source Quality Scoring System

Priority: P0 — prerequisite for everything
Effort: Medium (1-2 weeks)
Dependencies: None (builds on existing MetricsTracker data)

Currently: mix scraper.test gives binary pass/fail. MetricsTracker records per-job outcomes but there's no aggregate source-level score.

Needed: A SourceQualityScore module that grades a scraper run across multiple dimensions:

# Proposed API
defmodule EventasaurusDiscovery.Metrics.SourceQualityScore do
  @type dimension :: %{
    name: String.t(),
    weight: float(),
    score: float(),       # 0.0 - 1.0
    details: String.t()
  }

  @type report :: %{
    source_slug: String.t(),
    run_id: String.t(),
    composite_score: float(),  # 0-100
    dimensions: [dimension()],
    sample_size: integer(),
    computed_at: DateTime.t()
  }

  @spec score_run(String.t(), String.t()) :: {:ok, report()} | {:error, term()}
  def score_run(source_slug, run_id)

  @spec score_latest(String.t()) :: {:ok, report()} | {:error, term()}
  def score_latest(source_slug)
end

Scoring dimensions:

| Dimension | Weight | How Measured |
|---|---|---|
| Field completeness | 25% | % events with title + starts_at + venue + URL |
| Date accuracy | 20% | % dates that aren't noon-UTC fallback (time_tbd: false) |
| Venue geocoding | 15% | % venues that geocode to a real location |
| Category confidence | 10% | Average ML classification confidence |
| Dedup collision rate | 10% | 5-40% is healthy (too low with known overlap = broken dedup) |
| Performer extraction | 10% | % events with ≥ 1 performer (where applicable) |
| Metadata richness | 5% | % with images, descriptions > 50 chars |
| URL validity | 5% | % of event URLs that resolve (HTTP 200) |
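The composite is a plain weighted average over these dimensions. A sketch of the arithmetic, with example per-dimension scores (the numbers are illustrative):

```elixir
# Sketch: composite from per-dimension scores (0.0-1.0) and weights.
dimensions = [
  %{name: "field_completeness",  weight: 0.25, score: 0.95},
  %{name: "date_accuracy",       weight: 0.20, score: 0.80},
  %{name: "venue_geocoding",     weight: 0.15, score: 1.00},
  %{name: "category_confidence", weight: 0.10, score: 0.70},
  %{name: "dedup_collision",     weight: 0.10, score: 1.00},
  %{name: "performer_extract",   weight: 0.10, score: 0.60},
  %{name: "metadata_richness",   weight: 0.05, score: 0.50},
  %{name: "url_validity",        weight: 0.05, score: 1.00}
]

composite =
  dimensions
  |> Enum.reduce(0.0, fn %{weight: w, score: s}, acc -> acc + w * s end)
  |> Kernel.*(100)
  |> round()

# weighted sum 0.8525 → 85, clearing the ≥ 80 auto-deploy gate
```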

Also needed: mix quality.score <source> CLI task and integration into admin dashboards.


Gap 2: ML Category Classification Integration

Priority: P0 — required for autonomous sources (no rule-based mapping exists for new sources)
Effort: Medium (1-2 weeks)
Dependencies: CategoryClassifier already exists with BART-large-MNLI

Currently: CategoryClassifier exists but is not wired into the processing pipeline. CategoryExtractor (called from EventProcessor) uses rule-based per-source mapping only.

Needed:

  • Wire CategoryClassifier into CategoryExtractor as a fallback when rule-based mapping produces no result
  • Confidence threshold: only apply ML category if confidence > 0.7
  • For autonomous sources, ML classification is the primary method (no rule-based mapping will exist)
  • Backtest against existing categorized events to validate accuracy before going live
  • Add classification confidence to MetricsTracker metadata for quality scoring
Event Title + Description
         │
         ▼
┌─── CategoryExtractor ───────────────────┐
│                                          │
│  1. Check source-specific rule mapping   │
│     (existing sources only)              │
│     │                                    │
│     ├── Match found → use it             │
│     │                                    │
│     └── No match ──▼                     │
│                                          │
│  2. ML Classification (CategoryClassifier)│
│     Input: title + description text      │
│     Output: category + confidence        │
│     │                                    │
│     ├── confidence ≥ 0.7 → use it        │
│     │                                    │
│     └── confidence < 0.7 → "other" +     │
│         flag for review                  │
└──────────────────────────────────────────┘
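The fallback order in the diagram, as a sketch. The rule_based_mapping/2 helper and the CategoryClassifier.classify/1 return shape are assumptions for illustration:

```elixir
# Sketch of the fallback order inside CategoryExtractor.
def extract_category(source_slug, title, description) do
  case rule_based_mapping(source_slug, title) do
    {:ok, category} ->
      {:ok, category, :rule_based}

    :no_match ->
      case CategoryClassifier.classify(title <> " " <> (description || "")) do
        {:ok, category, confidence} when confidence >= 0.7 ->
          {:ok, category, {:ml, confidence}}

        {:ok, _category, confidence} ->
          # Below threshold: file under "other" and flag for human review.
          {:ok, "other", {:ml_low_confidence, confidence}}
      end
  end
end
```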

Gap 3: JSON-LD / Schema.org Universal Extractor

Priority: P0 — highest ROI single feature
Effort: Medium (1-2 weeks)
Dependencies: None

Currently: Some sources have bespoke JSON-LD parsing. No universal extractor.

Needed: A generic module that can extract events from any page with schema.org Event markup:

defmodule EventasaurusDiscovery.Extraction.JsonLdExtractor do
  @supported_types ~w(Event MusicEvent TheaterEvent ScreeningEvent
                       DanceEvent EducationEvent SportsEvent
                       VisualArtsEvent Festival)

  @spec extract_events(String.t()) :: {:ok, [event_map()]} | {:error, term()}
  def extract_events(html)

  @spec extract_events_from_url(String.t()) :: {:ok, [event_map()]} | {:error, term()}
  def extract_events_from_url(url)
end

Schema.org Event → our event_map mapping:

| Schema.org field | Our field | Notes |
|---|---|---|
| name | title | |
| startDate | starts_at | ISO 8601, handle timezone |
| endDate | ends_at | Optional |
| url | url | |
| location.name | venue_data.name | |
| location.address | venue_data.address | Can be string or PostalAddress |
| location.geo.latitude | venue_data.latitude | |
| location.geo.longitude | venue_data.longitude | |
| performer[].name | performer_names | |
| image | metadata.image_url | |
| description | metadata.description | |
| offers.price | metadata.min_price | |
| offers.priceCurrency | metadata.currency | |

Why this is highest-ROI: Many event websites already emit JSON-LD because Google requires it for rich search results. A single universal extractor could add dozens of sources with zero per-source custom code. The only per-source configuration needed is the base URL and pagination pattern.
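A sketch of the extractor's core, assuming the Floki HTML parser and the Jason JSON library (both common in Elixir); error handling, @graph unwrapping, and performer strings are elided, and the event_map keys follow the mapping table above:

```elixir
defmodule JsonLdSketch do
  @supported ~w(Event MusicEvent TheaterEvent ScreeningEvent DanceEvent
                EducationEvent SportsEvent VisualArtsEvent Festival)

  def extract_events(html) do
    {:ok, doc} = Floki.parse_document(html)

    doc
    |> Floki.find(~s(script[type="application/ld+json"]))
    |> Enum.flat_map(fn {"script", _attrs, children} ->
      case Jason.decode(Enum.join(children)) do
        {:ok, data} -> List.wrap(data)
        {:error, _} -> []
      end
    end)
    |> Enum.filter(&(&1["@type"] in @supported))
    |> Enum.map(&to_event_map/1)
  end

  defp to_event_map(event) do
    %{
      title: event["name"],
      starts_at: event["startDate"],
      ends_at: event["endDate"],
      url: event["url"],
      venue_data: %{
        name: get_in(event, ["location", "name"]),
        address: get_in(event, ["location", "address"]),
        latitude: get_in(event, ["location", "geo", "latitude"]),
        longitude: get_in(event, ["location", "geo", "longitude"])
      },
      # Assumes performer entries are objects with a "name" key.
      performer_names: event |> Map.get("performer", []) |> List.wrap() |> Enum.map(& &1["name"]),
      metadata: %{image_url: event["image"], description: event["description"]}
    }
  end
end
```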


Gap 4: Source Candidate Table & State Machine

Priority: P0 — the "crawl frontier" for source onboarding
Effort: Small (1 week)
Dependencies: None

Needed: A new source_candidates table:

CREATE TABLE source_candidates (
  id BIGSERIAL PRIMARY KEY,
  url TEXT NOT NULL,
  name TEXT NOT NULL,
  instructions TEXT,              -- human notes ("events on /agenda page")
  status TEXT NOT NULL DEFAULT 'pending_analysis',
  feasibility_score INTEGER,      -- 0-100
  quality_score INTEGER,          -- 0-100 (after test run)
  extraction_strategy TEXT,       -- json_ld, api, ssr, html_selectors, llm_runtime
  analysis_report JSONB,          -- full feasibility report
  quality_report JSONB,           -- full quality score breakdown
  generation_attempts INTEGER DEFAULT 0,
  error_log JSONB,                -- history of failures
  structural_fingerprint TEXT,    -- DOM hash for change detection
  deployed_source_id BIGINT REFERENCES sources(id),
  last_analyzed_at TIMESTAMPTZ,
  deployed_at TIMESTAMPTZ,
  inserted_at TIMESTAMPTZ NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL
);

State machine:

pending_analysis ──► analyzing ──► feasible ──► generating ──► testing ──► deployed
                         │              │            │             │
                         ▼              ▼            ▼             ▼
                      rejected      rejected    failed ──►    failed ──►
                                               (retry if      (retry if
                                               attempts < 5)  attempts < 5)

deployed ──► broken ──► re_analyzing ──► ... (self-healing loop)
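A sketch of the transition table derived from the diagram; the exact edges (e.g. which states failed may retry into) are assumptions:

```elixir
defmodule CandidateStateMachine do
  # Legal transitions, enforceable from an Ecto changeset validator.
  @transitions %{
    "pending_analysis" => ~w(analyzing),
    "analyzing"        => ~w(feasible rejected),
    "feasible"         => ~w(generating),
    "generating"       => ~w(testing failed),
    "testing"          => ~w(deployed failed),
    "failed"           => ~w(generating rejected),  # retry while attempts < 5
    "deployed"         => ~w(broken),
    "broken"           => ~w(re_analyzing),
    "re_analyzing"     => ~w(feasible rejected)
  }

  def valid_transition?(from, to), do: to in Map.get(@transitions, from, [])
end
```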

Gap 5: Site Analysis / Feasibility Module

Priority: P1 — needed for automated onboarding
Effort: Large (2-3 weeks)
Dependencies: Gap 4 (candidate table)

Currently: Completely manual. The API discovery guide is a human-readable document.

Needed: An Oban job that:

  1. Fetches the target URL + 2-3 linked pages
  2. Checks robots.txt for crawl permissions
  3. Detects data format:
    • Scan for <script type="application/ld+json"> with Event types
    • Scan for XHR/fetch patterns in page source (API endpoints)
    • Scan for __NEXT_DATA__, __NUXT__, __INITIAL_STATE__ etc.
    • Analyze HTML structure if none of the above
  4. Sends cleaned page content to LLM with prompt: "Analyze this page. Does it contain event listings? What fields are available (title, date, time, venue, price, performers)? Is there pagination? What language is the content in?"
  5. Computes feasibility score
  6. Stores report in source_candidates.analysis_report

The LLM analysis prompt would include:

  • Examples of what we consider "good" event data
  • Our required fields vs. nice-to-have fields
  • Known patterns that indicate high feasibility (structured dates, clear venue names)

Gap 6: LLM-Based Code Generation Loop

Priority: P1 — the core automation
Effort: Large (3-4 weeks)
Dependencies: Gap 1 (quality scoring), Gap 3 (JSON-LD extractor), Gap 5 (analysis module)

Currently: mix discovery.generate_source creates empty stubs. A developer fills them in.

Needed: An agent loop (could be an Oban job chain or an external agent process) that:

  1. Takes the feasibility report + extraction strategy
  2. Selects 2-3 similar existing sources as few-shot examples based on:
    • Same extraction strategy (e.g., other JSON-LD sources for a JSON-LD candidate)
    • Similar content type (music events → use Bandsintown/RA as examples)
    • Similar language/region
  3. Constructs a prompt with:
    • Source Implementation Guide
    • BaseJob source code
    • Example sources (full client.ex + transformer.ex)
    • Sample raw response from the target site
    • Target event_map schema
  4. Generates complete source files
  5. Writes files to disk
  6. Runs validation (Phase 4)
  7. If validation fails, feeds error messages + quality report back and regenerates

For JSON-LD sources: This step may be trivial or unnecessary — the universal extractor (Gap 3) handles extraction, so code generation only needs a minimal config file specifying:

  • Base URL(s) for crawling
  • Pagination pattern (if any)
  • Any source-specific field overrides

For API/HTML sources: Full code generation is needed. The key insight is that our existing sources serve as excellent few-shot examples — the LLM can see how client.ex calls an API and how transformer.ex maps the response.


Gap 7: Data Validation / LLM-as-Judge

Priority: P1 — prevents erroneous data from entering the DB
Effort: Medium (1-2 weeks)
Dependencies: None

Currently: Validation is structural only (required fields present, types correct). No semantic validation.

Needed: For new/autonomous sources, an LLM validation pass:

┌─────────────────────────────────────────────────┐
│             LLM-AS-JUDGE VALIDATION              │
│                                                  │
│  For each of 3-5 sample events:                  │
│                                                  │
│  Input:                                          │
│  ┌────────────────────────────────────────────┐  │
│  │  Extracted event_map:                      │  │
│  │  {title: "Jazz Night", starts_at: ...,     │  │
│  │   venue: "Blue Note", ...}                 │  │
│  │                                            │  │
│  │  Original page HTML (cleaned):             │  │
│  │  <div class="event">...</div>              │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  Prompt:                                         │
│  "Compare the extracted data to the original     │
│   page. Score accuracy 0-100. Flag any:          │
│   - Hallucinated fields (not on page)            │
│   - Mismatched data (wrong date, wrong venue)    │
│   - Systematic errors (field swap patterns)      │
│   - Missing available data (on page, not         │
│     extracted)"                                  │
│                                                  │
│  Output:                                         │
│  ┌────────────────────────────────────────────┐  │
│  │  accuracy_score: 92                        │  │
│  │  issues: [                                 │  │
│  │    {field: "starts_at", type: "mismatch",  │  │
│  │     expected: "20:00", got: "08:00 PM"}    │  │
│  │  ]                                         │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Key principle: This runs on a sample (3-5 events) during onboarding, not on every crawl. It's a one-time quality gate, not a runtime cost.
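
The judge step above can be sketched as follows (Python for illustration; the Elixir implementation would mirror it, and the JSON response shape and helper names are assumptions). The second helper implements the "systematic errors" check by flagging any field that fails on two or more of the sampled events:

```python
import json

def build_judge_prompt(event_map: dict, page_html: str) -> str:
    """Assemble the comparison prompt sent to the judge model."""
    return (
        "Compare the extracted data to the original page. "
        "Score accuracy 0-100 and flag hallucinated, mismatched, "
        "or missing fields. Respond as JSON: "
        '{"accuracy_score": <int>, "issues": [{"field": ..., "type": ...}]}\n\n'
        "Extracted:\n" + json.dumps(event_map, ensure_ascii=False) + "\n\n"
        "Page (cleaned HTML):\n" + page_html
    )

def systematic_issues(judgements: list[dict]) -> set[str]:
    """Flag fields that fail on 2+ of the sampled events: a repeated
    failure suggests an extractor bug, not a one-off glitch."""
    counts: dict[str, int] = {}
    for judgement in judgements:
        for issue in judgement.get("issues", []):
            counts[issue["field"]] = counts.get(issue["field"], 0) + 1
    return {field for field, n in counts.items() if n >= 2}
```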


Gap 8: Structural Fingerprinting & Self-Healing

Priority: P2 — needed for long-term maintenance at scale
Effort: Medium (2 weeks)
Dependencies: Gap 6 (code generation, for regeneration capability)

Currently: When a site changes its HTML structure, scrapers break silently. Detection relies on humans noticing via monitoring dashboards.

Needed: Inspired by Kadoa's self-healing scraper pattern:

  1. Fingerprinting: After each successful crawl, hash the DOM structure of key pages (listing page, detail page). Store in source_candidates.structural_fingerprint.

  2. Change detection: Before each crawl, fetch and fingerprint. If hash differs significantly from stored:

    • Still try existing extraction logic
    • If extraction still works → update fingerprint, no action
    • If extraction fails → trigger self-healing
  3. Self-healing cascade:

    Fingerprint changed + extraction failed
      │
      ├── Tier 1 source (JSON-LD): Check if JSON-LD still present
      │   └── Usually still works (JSON-LD survives redesigns)
      │
      ├── Tier 2-3 (API/SSR): Check if endpoint still responds
      │   └── Usually still works (API versioned separately from UI)
      │
      └── Tier 4 (HTML selectors): Re-analyze page structure
          ├── LLM generates new selectors
          ├── Validate on sample
          ├── If works → update source code, continue
          └── If fails 3x → mark "broken", alert human
    

This is why JSON-LD sources are so valuable for autonomous crawling — they're naturally resistant to site redesigns because the structured data block is independent of visual layout.
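
A minimal sketch of the fingerprinting idea (Python for illustration; a production version would use a real HTML parser rather than a regex). The page is reduced to its tag/class skeleton, so text and attribute churn don't trigger false alarms, and a Jaccard similarity over skeleton tokens provides the "differs significantly" signal:

```python
import hashlib
import re

def skeleton(html: str) -> list[str]:
    """Reduce a page to its tag/class skeleton, ignoring all text content."""
    tokens = []
    for match in re.finditer(r"<(\w+)([^>]*)>", html):
        tag = match.group(1).lower()
        cls = re.search(r'class="([^"]*)"', match.group(2))
        tokens.append(f"{tag}.{cls.group(1)}" if cls else tag)
    return tokens

def fingerprint(html: str) -> str:
    """Stable hash of the skeleton, stored per key page after each crawl."""
    return hashlib.sha256("|".join(skeleton(html)).encode()).hexdigest()[:12]

def similarity(old_html: str, new_html: str) -> float:
    """Jaccard similarity over skeleton tokens (1.0 = identical structure)."""
    old, new = set(skeleton(old_html)), set(skeleton(new_html))
    return len(old & new) / len(old | new) if old | new else 1.0
```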


Gap 9: Dynamic Source Registration

Priority: P2 — needed for zero-deployment onboarding
Effort: Small-Medium (1 week)
Dependencies: Gap 4 (candidate table)

Currently: SourceRegistry is a compile-time map. Adding a source requires a code change and deployment.

Needed: Move source resolution to a DB-first approach:

# Current (compile-time)
@source_to_job %{
  "bandsintown" => EventasaurusDiscovery.Sources.Bandsintown.Jobs.SyncJob,
  # ... hardcoded
}

# Proposed (DB-first with compile-time fallback)
def get_sync_job(slug) do
  case get_from_db(slug) do
    {:ok, module_name} -> {:ok, String.to_existing_atom(module_name)}
    :not_found -> get_from_compile_time_map(slug)
  end
end

For autonomous sources using the universal JSON-LD extractor, the "SyncJob module" could be a generic configurable job that reads its configuration from the DB rather than from a source-specific module.


Gap 10: Performer Enrichment Hardening

Priority: P2 — quality improvement for autonomous sources
Effort: Small (1 week)
Dependencies: None

Currently: Performer matching works via name-based fuzzy matching at a 0.85 threshold. For new/unknown sources, performer names may arrive in unusual formats.

Needed:

  • Name normalization before matching: strip roles ("DJ", "MC", "feat."), normalize casing, handle "Last, First" vs "First Last"
  • Disambiguation assist for edge cases: "The National" (band? venue?), "Paris" (performer? city?)
  • Duplicate prevention: Before creating a new performer, check if a similar one exists across all sources, not just the current source
  • Confidence tracking: Record match confidence in public_event_performers metadata
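
The normalization rules above can be sketched as follows (Python for illustration; the role-prefix list, rule order, and function name are assumptions, and stripping "DJ"/"MC" is deliberately aggressive, per the gap notes):

```python
import re

# Hypothetical role-prefix list; extend as edge cases appear.
ROLE_PREFIX = re.compile(r"^(?:dj|mc)\s+", re.IGNORECASE)
FEAT_SPLIT = re.compile(r"\s+(?:feat\.|ft\.|featuring)\s+", re.IGNORECASE)

def normalize_performer(name: str) -> str:
    """Normalize a raw performer string before fuzzy matching."""
    name = name.strip()
    # "feat." introduces a secondary artist; keep only the headliner here
    name = FEAT_SPLIT.split(name)[0]
    # Strip role prefixes ("DJ", "MC")
    name = ROLE_PREFIX.sub("", name)
    # "Last, First" -> "First Last" (exactly one comma, two non-empty parts)
    parts = [p.strip() for p in name.split(",")]
    if len(parts) == 2 and all(parts):
        name = f"{parts[1]} {parts[0]}"
    # Collapse internal whitespace
    return re.sub(r"\s+", " ", name).strip()
```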

Wireframes

Wireframe: Source Candidate Dashboard

The primary admin interface for managing the autonomous pipeline.

┌─────────────────────────────────────────────────────────────────────────────┐
│  Source Candidates                                              + Add New  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Filters: [All ▾] [Pending ▾] [Feasible ▾] [Deployed ▾] [Broken ▾]       │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  ● Analyzing    example-events.pl                                   │   │
│  │  Strategy: Detecting...    Score: --    Attempt: 1/5               │   │
│  │  URL: https://example-events.pl/agenda                             │   │
│  │  Added 2 hours ago                                                  │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  ◉ Feasible     warsaw-concerts.com          Feasibility: 87       │   │
│  │  Strategy: JSON-LD    Quality: --    Attempt: 0/5                  │   │
│  │  URL: https://warsaw-concerts.com/events                           │   │
│  │  Analyzed 1 hour ago       [Generate Source ►]  [Reject ✕]        │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  ◉ Testing      krakow-nightlife.pl          Feasibility: 72       │   │
│  │  Strategy: API endpoint    Quality: 64    Attempt: 2/5            │   │
│  │  URL: https://krakow-nightlife.pl/api/events                      │   │
│  │  Last test: 34 min ago — "Date parsing: 3 of 5 events used        │   │
│  │  noon-UTC fallback"                                                │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  ✓ Deployed     gdansk-kultura.pl            Quality: 91          │   │
│  │  Strategy: JSON-LD    Events: 847    Last crawl: 3h ago           │   │
│  │  Deployed 12 days ago    Status: Stable                            │   │
│  │  Health: ████████████████████░░ 91%                                │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  ✕ Rejected     some-blog.com                Feasibility: 23       │   │
│  │  Reason: "No structured event data. Blog posts mention events      │   │
│  │  but no dates, venues, or machine-readable listings found."        │   │
│  │  Analyzed 3 days ago       [Re-analyze]  [Delete]                 │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  ⚠ Broken       old-source.com               Quality: 34 (was 82) │   │
│  │  Strategy: HTML selectors    Self-heal attempts: 3/3 failed       │   │
│  │  "Site redesigned. JSON-LD removed. New layout uses React SPA     │   │
│  │  with no SSR. Requires JS rendering."                              │   │
│  │  Broken since 2 days ago   [Re-analyze]  [Pause]  [Archive]      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Summary: 47 deployed  │  3 testing  │  12 feasible  │  8 rejected        │
│           2 broken     │  5 analyzing │  4 pending                        │
└─────────────────────────────────────────────────────────────────────────────┘

Wireframe: Feasibility Analysis Report

Shown when clicking into a candidate after analysis completes.

┌─────────────────────────────────────────────────────────────────────────────┐
│  Feasibility Report: warsaw-concerts.com                                   │
│  Analyzed: 2026-04-11 14:32 UTC                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Overall Feasibility Score:  ████████████████████░░░░░ 87/100              │
│  Recommendation: ✓ PROCEED — high confidence, JSON-LD extraction           │
│                                                                             │
│  ┌─── Data Format Detection ────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  ✓ JSON-LD: Found <script type="application/ld+json">               │  │
│  │    Types: Event (23), MusicEvent (18), TheaterEvent (5)              │  │
│  │    Fields: name ✓  startDate ✓  endDate ✓  location ✓               │  │
│  │            performer ✓  offers ✓  image ✓  description ✓            │  │
│  │    Coverage: 46/46 events have all required fields                   │  │
│  │                                                                       │  │
│  │  ○ API Endpoints: None discovered                                    │  │
│  │  ○ SSR Bundles: None (__NEXT_DATA__ / __NUXT__ not found)           │  │
│  │  ○ HTML Structure: Available as fallback                             │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── Site Characteristics ─────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  Language:        Polish (pl)                                        │  │
│  │  Rendering:       Static HTML (no JS required)                      │  │
│  │  Pagination:      /events?page=N (12 pages detected)                │  │
│  │  robots.txt:      Crawling allowed (no Crawl-delay)                 │  │
│  │  Event count:     ~550 events estimated                              │  │
│  │  Update freq:     ~15 new events/week (estimated from dates)        │  │
│  │  Date range:      2026-04-05 to 2026-09-15                          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── Sample Extraction (3 events) ─────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  Event 1:                                                            │  │
│  │    title: "Kacey Musgraves — Deeper Well Tour"                      │  │
│  │    starts_at: 2026-04-18T20:00:00+02:00                             │  │
│  │    venue: "Torwar Hall" (52.2178, 21.0009)                          │  │
│  │    performers: ["Kacey Musgraves"]                                   │  │
│  │    price: 180-450 PLN                                                │  │
│  │                                                                       │  │
│  │  Event 2:                                                            │  │
│  │    title: "Warsaw Jazz Weekend"                                      │  │
│  │    starts_at: 2026-04-25T19:30:00+02:00                             │  │
│  │    venue: "Palladium" (52.2297, 21.0122)                            │  │
│  │    performers: ["Kamasi Washington", "GoGo Penguin"]                 │  │
│  │    price: 120-280 PLN                                                │  │
│  │                                                                       │  │
│  │  Event 3: ...                                                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── Risks & Notes ────────────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  ⚠ Some events have Polish-only descriptions (no English)           │  │
│  │  ⚠ Venue coordinates missing for 3/46 events (will need geocoding) │  │
│  │  ✓ Date formats are ISO 8601 — no parsing ambiguity                 │  │
│  │  ✓ Performer names are clean (no "feat." / "DJ" prefixes)          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  [▶ Generate Source]    [✕ Reject]    [↻ Re-analyze]                      │
└─────────────────────────────────────────────────────────────────────────────┘

Wireframe: Source Quality Scorecard

Shown after a test run completes (Phase 4), and accessible for deployed sources.

┌─────────────────────────────────────────────────────────────────────────────┐
│  Quality Scorecard: warsaw-concerts.com                                    │
│  Run: 2026-04-11 15:07 UTC  │  Sample: 5 events  │  Attempt: 1/5        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Composite Score:  ████████████████████████░░░░░ 88/100  → AUTO-DEPLOY    │
│                                                                             │
│  ┌─── Dimension Breakdown ──────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  Field Completeness (25%)      ████████████████████  100%  5/5      │  │
│  │  All events have title + starts_at + venue + URL                     │  │
│  │                                                                       │  │
│  │  Date Accuracy (20%)           ████████████████████  100%  5/5      │  │
│  │  All dates parsed with specific times (no noon-UTC fallback)         │  │
│  │                                                                       │  │
│  │  Venue Geocoding (15%)         ████████████████░░░░   80%  4/5      │  │
│  │  1 venue ("Klub Zmiana") needed geocoding — resolved via Google     │  │
│  │                                                                       │  │
│  │  Category Confidence (10%)     ██████████████░░░░░░   72%  avg      │  │
│  │  ML classified: MusicEvent(3, avg 0.89), TheaterEvent(1, 0.73),     │  │
│  │  Other(1, 0.41 — flagged for review)                                 │  │
│  │                                                                       │  │
│  │  Dedup Collisions (10%)        ████████████████████  100%           │  │
│  │  2/5 events matched existing (Bandsintown overlap) — expected       │  │
│  │                                                                       │  │
│  │  Performer Extraction (10%)    ████████████████░░░░   80%  4/5      │  │
│  │  1 event ("Warsaw Jazz Weekend") has 2 performers extracted         │  │
│  │  1 event (art exhibition) correctly has 0 performers                 │  │
│  │                                                                       │  │
│  │  Metadata Richness (5%)        ██████████████████░░   90%           │  │
│  │  4/5 have images, 5/5 have descriptions > 50 chars                  │  │
│  │                                                                       │  │
│  │  URL Validity (5%)             ████████████████████  100%  5/5      │  │
│  │  All event URLs return HTTP 200                                      │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── LLM-as-Judge Validation ──────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  Accuracy: 96/100                                                    │  │
│  │                                                                       │  │
│  │  ✓ Event 1: All fields match original page                          │  │
│  │  ✓ Event 2: All fields match original page                          │  │
│  │  ⚠ Event 3: Description truncated at 500 chars (original is 1200)  │  │
│  │  ✓ Event 4: All fields match original page                          │  │
│  │  ✓ Event 5: All fields match original page                          │  │
│  │                                                                       │  │
│  │  Systematic issues: None detected                                    │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  [▶ Deploy]    [↻ Regenerate]    [✕ Reject]                               │
└─────────────────────────────────────────────────────────────────────────────┘
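
The composite score on the scorecard is a weighted sum of the eight dimensions. A sketch, with weights taken from the wireframe above (the real scorer may also fold in the LLM-as-judge accuracy, so this sketch need not reproduce the exact composite shown):

```python
# Dimension weights as shown on the scorecard; each dimension scores 0-100.
WEIGHTS = {
    "field_completeness": 0.25,
    "date_accuracy": 0.20,
    "venue_geocoding": 0.15,
    "category_confidence": 0.10,
    "dedup_collisions": 0.10,
    "performer_extraction": 0.10,
    "metadata_richness": 0.05,
    "url_validity": 0.05,
}

def composite_score(dimensions: dict[str, float]) -> float:
    """Weighted sum over the eight dimensions; a missing dimension scores 0."""
    return sum(w * dimensions.get(name, 0.0) for name, w in WEIGHTS.items())
```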

Wireframe: Pipeline Monitor

Overview of all pipeline activity — what's currently running, what's queued, recent outcomes.

┌─────────────────────────────────────────────────────────────────────────────┐
│  Crawler Pipeline Monitor                                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─── Active Jobs ──────────────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  ● Analyzing    lodz-events.pl          Started 4m ago               │  │
│  │    Phase 1 → Fetching pages (3/5)                                    │  │
│  │                                                                       │  │
│  │  ● Generating   poznan-muzyka.pl        Started 12m ago  Attempt 2  │  │
│  │    Phase 3 → LLM writing transformer.ex (retry: date parse error)   │  │
│  │                                                                       │  │
│  │  ● Testing      wroclaw-bilety.com      Started 8m ago              │  │
│  │    Phase 4 → mix scraper.test running (3/5 events processed)        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── Recent Outcomes (last 24h) ───────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  ✓ 14:32  gdansk-kultura.pl        Deployed     Score: 91  JSON-LD  │  │
│  │  ✓ 13:15  torun-festiwale.pl       Deployed     Score: 84  API      │  │
│  │  ✕ 12:44  random-blog.net          Rejected     Feas: 18   (blog)   │  │
│  │  ✓ 11:30  bielsko-biala-events.pl  Deployed     Score: 77  JSON-LD  │  │
│  │  ⚠ 10:15  szczecin-noc.pl          Testing      Score: 58  HTML     │  │
│  │           → Attempt 3: "Venue names extracted as addresses"          │  │
│  │  ✕ 09:00  some-restaurant.pl       Rejected     Feas: 31   (menu)   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── Pipeline Stats ──────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  Today        This Week     This Month     All Time                 │   │
│  │  ─────        ─────────     ──────────     ────────                 │   │
│  │  Analyzed: 8  Analyzed: 34  Analyzed: 127  Analyzed: 312            │   │
│  │  Deployed: 3  Deployed: 14  Deployed: 47   Deployed: 108            │   │
│  │  Rejected: 2  Rejected: 11  Rejected: 52   Rejected: 143           │   │
│  │  Broken:   0  Broken:   1   Broken:   3    Broken:   8             │   │
│  │  Healed:   0  Healed:   1   Healed:   2    Healed:   5            │   │
│  │                                                                      │   │
│  │  Success rate: 41%  │  Avg attempts: 1.8  │  Avg time: 23 min      │   │
│  │  JSON-LD %: 64%     │  API %: 21%         │  HTML %: 15%           │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─── Health Alerts ────────────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  ⚠ old-source.com — Structural change detected 2h ago               │  │
│  │    Self-healing in progress (attempt 1/3)                            │  │
│  │                                                                       │  │
│  │  ⚠ krakow-imprezy.pl — Quality score dropped: 82 → 61              │  │
│  │    Last 3 runs: geocoding failures spiked (provider issue?)         │  │
│  │                                                                       │  │
│  │  ✕ warsaw-nightlife.com — Broken for 5 days, 3/3 heal attempts     │  │
│  │    Needs human intervention                                          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Wireframe: Self-Healing Alert View

Detail view when a deployed source triggers the self-healing loop.

┌─────────────────────────────────────────────────────────────────────────────┐
│  Self-Healing: old-source.com                                              │
│  Status: Re-analyzing (attempt 1 of 3)                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─── What Happened ────────────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  Timeline:                                                            │  │
│  │  ───────────────────────────────────────────────                      │  │
│  │  Apr 9  22:00  Crawl succeeded, score 84                             │  │
│  │  Apr 10 06:00  Crawl succeeded, score 82                             │  │
│  │  Apr 10 14:00  Structural fingerprint changed (hash: a4f2 → c8d1)   │  │
│  │  Apr 10 14:01  Crawl attempted — extraction failed                   │  │
│  │                Error: "CSS selector .event-card returned 0 results"  │  │
│  │  Apr 10 14:02  Self-healing triggered                                │  │
│  │  Apr 10 14:05  Re-analysis complete: site redesigned                │  │
│  │                New structure uses <article class="listing-item">     │  │
│  │  Apr 10 14:10  New selectors generated by LLM                       │  │
│  │  Apr 10 14:12  Validation: 4/5 events extracted correctly           │  │
│  │  Apr 10 14:13  ⚠ 1 event missing venue (new layout nests it        │  │
│  │                differently)                                          │  │
│  │  Apr 10 14:15  Regenerating with updated venue selector...          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── Fingerprint Diff ─────────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  Before (hash a4f2):              After (hash c8d1):                 │  │
│  │  <div class="event-list">         <section class="listings">         │  │
│  │    <div class="event-card">         <article class="listing-item">  │  │
│  │      <h3 class="title">              <h2 class="listing-title">     │  │
│  │      <span class="date">             <time datetime="...">          │  │
│  │      <span class="venue">            <div class="venue-info">       │  │
│  │                                         <span class="venue-name">   │  │
│  │                                                                       │  │
│  │  Changes: class names, tag types, venue nesting depth                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─── Actions ──────────────────────────────────────────────────────────┐  │
│  │                                                                       │  │
│  │  [Continue Healing]    [Pause Source]    [Manual Fix]    [Archive]   │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Wireframe: Add New Candidate

Simple form for adding a new source candidate to the pipeline.

┌─────────────────────────────────────────────────────────────────────────────┐
│  Add Source Candidate                                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  URL *                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  https://example-events.pl/agenda                                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Name *                                                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Example Events                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Instructions (optional — hints for the analyzer)                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Events are on the /agenda page. Polish language. Pagination via    │   │
│  │  "Load more" button. Some events span multiple days.               │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Priority                                                                  │
│  ○ Normal — analyze when queue is clear                                   │
│  ◉ High — analyze next                                                    │
│  ○ Low — analyze when nothing else is pending                             │
│                                                                             │
│  Auto-deploy if quality score ≥ 80?                                       │
│  [✓] Yes, auto-deploy high-quality sources                                │
│  [ ] No, always require manual approval                                    │
│                                                                             │
│                          [Cancel]    [Add & Analyze ►]                     │
└─────────────────────────────────────────────────────────────────────────────┘

Crawler Architecture Comparison

How our system maps to established crawler architectures:

| Concept | Googlebot | Scrapy | Diffbot | Eventasaurus (proposed) |
|---|---|---|---|---|
| URL frontier | Priority queue by PageRank + freshness | Scheduler with per-domain queues | N/A (on-demand) | source_candidates table + Oban priority queues |
| Fetcher | Custom HTTP + WRS (JS rendering) | Downloader + middleware | Cloud rendering | Http.Client + adapters (Direct/Crawlbase/Zyte) |
| Structure detection | JSON-LD, Microdata, RDFa parsers | Manual spider per site | ML page-type classifier | Tiered: JSON-LD → API → SSR → LLM |
| Field extraction | Schema.org field mapping | CSS/XPath selectors per spider | ML per-type models | Universal JSON-LD + per-source transformer |
| Quality validation | Index quality signals | Item pipeline validators | Confidence scores | Quality scoring (8 dims) + LLM-as-judge |
| Dedup | URL canonicalization + content fingerprint | Request fingerprint filter | Entity resolution (Knowledge Graph) | ExternalID + PostGIS proximity + fuzzy name |
| Scheduling | Crawl budget per domain | Spider scheduling | On-demand | Oban cron + event-aware priority |
| Self-healing | Re-rendering on change | Manual spider updates | Auto-adapts (ML) | Structural fingerprinting + LLM regeneration |
| Politeness | robots.txt + adaptive rate | DOWNLOAD_DELAY + AutoThrottle | Managed | Per-source rate limits + robots.txt |

Key differentiator: We're a domain-specific event crawler. Unlike Googlebot (indexes everything) or Diffbot (extracts any page type), we only care about events. This constraint means:

  • We know exactly what fields we need (title, date, venue, etc.)
  • We can validate much more deeply (geocoding, date parsing, category classification)
  • We can cross-reference across sources (dedup, performer matching)
  • Our quality standards are higher but narrower

Implementation Roadmap

Phase A — Foundation (weeks 1-6)

These are prerequisites. Without them, the pipeline can't score quality or handle categories for new sources.

| # | Gap | Effort | What it unblocks |
|---|---|---|---|
| 1 | Source Quality Scoring (Gap 1) | 2 weeks | Phase 4 validation, deployment gates, monitoring |
| 2 | ML Category Classification (Gap 2) | 2 weeks | Category data for autonomous sources |
| 3 | JSON-LD Universal Extractor (Gap 3) | 1 week | Zero-code sources, highest ROI path |
| 4 | Source Candidate Table (Gap 4) | 1 week | Pipeline state machine, admin dashboard |

Milestone A: Can manually add a JSON-LD source URL to the candidate table, and the system scores its quality. Category classification works for new events.

Phase B — Analysis & Generation (weeks 7-14)

The autonomous pipeline core. After this phase, the system can go from URL → deployed source.

| # | Gap | Effort | What it unblocks |
|---|---|---|---|
| 5 | Site Analysis Module (Gap 5) | 3 weeks | Automated feasibility assessment |
| 6 | LLM Code Generation Loop (Gap 6) | 4 weeks | Automated source building |
| 7 | LLM-as-Judge Validation (Gap 7) | 1 week | Semantic quality gate |

Milestone B: Add a URL to the candidate table. System analyzes feasibility, generates source code, tests it, scores quality, and deploys if score ≥ 80. Full loop works for JSON-LD and API sources.

Phase C — Resilience & Scale (weeks 15-20)

Long-term maintenance and zero-deployment operations.

| # | Gap | Effort | What it unblocks |
|---|---|---|---|
| 8 | Structural Fingerprinting (Gap 8) | 2 weeks | Self-healing on site changes |
| 9 | Dynamic Source Registration (Gap 9) | 1 week | No deployment needed for new sources |
| 10 | Performer Enrichment (Gap 10) | 1 week | Better performer data from unknown sources |

Milestone C: Sources self-heal when sites change. New sources deploy without code changes. System manages 50+ sources autonomously.

Phase D — Full Crawler (beyond week 20)

The "event-specific Googlebot" vision.

| # | Feature | Description |
|---|---|---|
| 11 | Proactive discovery | Follow links from known event sites, detect event-like structured data on new sites, propose them as candidates |
| 12 | Adaptive scheduling | Crawl frequency based on observed change rate (Microsoft Optimal Freshness model) |
| 13 | Raw response archival | WARC-like storage for re-extraction without re-crawling |
| 14 | Conditional HTTP | ETag/Last-Modified tracking to skip unchanged pages |
| 15 | Multi-region expansion | Automatically discover event sources for new cities/countries |

Milestone D: System proactively discovers and onboards new event sources. Manages 200+ sources. Crawl budget and scheduling optimized per source.


Cost & Scale Considerations

LLM Costs

| Operation | When | Cost per invocation | Frequency |
|---|---|---|---|
| Site analysis (Phase 1) | Per candidate | ~$0.02-0.10 | Once per candidate |
| Code generation (Phase 3) | Per source | ~$0.10-0.50 | 1-5x per source |
| LLM-as-judge (Phase 4) | Per test run | ~$0.05-0.15 | 1-5x per source |
| Self-healing (Phase 5) | On breakage | ~$0.10-0.30 | Rare (monthly?) |
| LLM runtime extraction (Tier 4) | Per crawl | ~$0.01-0.05/page | Per event page |

For JSON-LD/API sources, LLM costs are one-time during onboarding (~$0.20-1.00 per source). For HTML/LLM-runtime sources, costs recur on every crawl: at 100 HTML sources × 50 events each × weekly crawls, that's roughly $250-1250/month.

Recommendation: Prioritize JSON-LD and API sources first. Reserve LLM runtime extraction for high-value sources where no structured data exists.

Scale Projections

| Source count | Expected composition | Monthly LLM cost | Oban job volume |
|---|---|---|---|
| 50 | 30 JSON-LD, 10 API, 10 HTML | ~$50-150 | ~50k jobs |
| 100 | 55 JSON-LD, 25 API, 20 HTML | ~$150-400 | ~120k jobs |
| 200 | 100 JSON-LD, 50 API, 50 HTML | ~$400-1200 | ~300k jobs |
| 500 | 250 JSON-LD, 125 API, 125 HTML | ~$1000-3000 | ~750k jobs |

Open Questions

  1. LLM extraction as bootstrap vs. runtime — Should LLM HTML extraction generate CSS selectors once (cheaper, but brittle) or extract at runtime every crawl (expensive, but self-adapting)?

  2. Trust levels for auto-deployed sources — Should autonomous sources have a "probation" flag visible in the UI? Or is the quality score + 7-day elevated monitoring sufficient?

  3. Starting scope — Begin with JSON-LD-only sources (easy wins, many sites, zero per-source code) or build the full pipeline including HTML sources from day one?

  4. JS rendering — Some sites require JavaScript rendering. Add Playwright/headless browser capability, or continue relying on Crawlbase proxy for JS rendering?

  5. Geographic scope — Start by discovering sources for cities we already serve (Warsaw, Krakow, Paris, etc.) or expand to new cities simultaneously?

  6. Human-in-the-loop placement — Where in the pipeline should humans be required vs. optional? Current proposal: humans only for 40-59 score sources and broken sources that fail self-healing.

  7. Code generation approach — Oban job chain (fully in-process) vs. external agent process (Claude Code / similar) that writes and tests code? The Oban approach is simpler but may hit context window limits for complex sources.

  8. Proactive discovery (Phase D) — How aggressive should the crawler be in discovering new sources? Follow links from known sources? Use search engines to find "[city] events" sites? Accept community submissions?
