@lotas
Created April 24, 2026 08:32
Queue Forecasting

Queue Forecasting — Spec

Overview

A standalone tool that listens to Mozilla's Taskcluster Pulse event stream, collects task lifecycle data, and predicts how long tasks will take to run and wait in queue. Lives in tools/queue-forecasting/ within the Taskcluster monorepo.

Environment

  • Runtime: Node.js (ESM) for collection and real-time inference; Python for nightly model training
  • Database: Postgres 15 (shared with Taskcluster; all tables prefixed queue_forecast_ to avoid collisions)
  • Deployment: Docker Compose (collector + predictor + trainer + postgres)
  • Data source: Taskcluster Pulse (AMQP) — real-time lifecycle events
  • Supplemental data: Taskcluster Queue API — task definitions, queue depth

Goals

V1 (ship first)

  1. Per-task run duration prediction — given a newly pending run, predict execution time (p50/p90) using LightGBM trained on task identity, queue, tags, and other stored attributes.
  2. Per-task wait time prediction — predict queue wait time (p50/p90) using queue depth, priority, time-of-day, and queue identity. Goals 1+2 compose into an ETA.
  3. Prediction API — expose predictions for newly pending runs with model version and confidence metadata. TC UI as first consumer.

V2 (data collection starts now, features ship later)

  1. Queue-level forecasting — "if I submit to this queue now, how long will it wait?" and "what is the expected drain time for the current backlog?" Reuses the wait-time model with hypothetical inputs.
  2. Queue load prediction — predict pending count for a given queue at a given hour and day-of-week. Requires time-series queue depth data collected from V1 onward.

Non-goals

  • Predicting from the task-defined event, before a run is pending (dependency resolution is a different problem)
  • Re-predicting while a task is already running
  • Provisioning or autoscaling decisions
  • Predicting for unscheduled tasks waiting on dependencies

Data Volume (observed)

  • ~250k task runs/day (~1.1M rows over first 5 days of collection)
  • ~7.5M rows/month projected
  • ~2GB/week raw
  • tags JSONB field averages ~200 bytes per row

Data Model

The system uses a normalized two-table model separating task-level definition facts from run-level execution facts. This replaces the original single task_events table. All table names are prefixed queue_forecast_ to avoid collisions in a shared database.

queue_forecast_tasks

One row per task_id. Stores definition-time identity and metadata. Column ordering optimized for Postgres tuple alignment (8-byte, 4-byte, variable-length).

CREATE TABLE queue_forecast_tasks (
    -- 8-byte types
    task_created       TIMESTAMPTZ,
    enriched_at        TIMESTAMPTZ,

    -- 4-byte types
    max_run_time_s     INTEGER,

    -- Variable-length
    task_id            TEXT PRIMARY KEY,
    task_queue_id      TEXT,
    task_group_id      TEXT,
    scheduler_id       TEXT,
    project_id         TEXT,
    metadata_name      TEXT,
    normalized_name    TEXT,
    original_priority  TEXT,
    tags               JSONB
);

Notes:

  • normalized_name is metadata_name with trailing hash suffixes stripped (e.g. test-linux2404-64/opt-mochitest-1@a3b4c5d6e7f8 → test-linux2404-64/opt-mochitest-1). Must come from a deterministic, versioned normalization function.
  • tags is raw JSONB preserved as-is from the task definition. All tag-based feature extraction (kind, os, test-type, worker-implementation, build type) happens at training time in Python, keeping the schema deployment-agnostic.
  • enriched_at is set when the Queue API fetch fills in metadata_name and tags.
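
A minimal sketch of what a deterministic, versioned normalization function could look like. The suffix format (an "@" followed by eight or more lowercase hex characters at the end of the name) and the NORMALIZER_VERSION tag are assumptions for illustration, not part of the spec.

```python
import re

# Hypothetical version tag; bump it whenever the rule below changes so that
# stored normalized_name values can be traced to the rule that produced them.
NORMALIZER_VERSION = "v1"

# Assumption: the hash suffix is "@" plus 8 or more hex characters at the very end.
_HASH_SUFFIX = re.compile(r"@[0-9a-f]{8,}$")


def normalize_name(metadata_name):
    """Strip a trailing hash suffix from metadata_name; pass None through."""
    if metadata_name is None:
        return None
    return _HASH_SUFFIX.sub("", metadata_name)
```

Names without a matching suffix are returned unchanged, so the function is safe to apply to every task.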

queue_forecast_task_runs

One row per execution attempt (task_id, run_id).

CREATE TABLE queue_forecast_task_runs (
    -- 8-byte types
    pending_at         TIMESTAMPTZ,
    started_at         TIMESTAMPTZ,
    resolved_at        TIMESTAMPTZ,
    wait_duration_s    DOUBLE PRECISION,
    run_duration_s     DOUBLE PRECISION,

    -- 4-byte types
    run_id             INT NOT NULL,
    queue_pending      INTEGER,

    -- Variable-length
    task_id            TEXT NOT NULL
                       REFERENCES queue_forecast_tasks(task_id) ON DELETE CASCADE,
    priority_at_pending TEXT,
    reason_created     TEXT,
    reason_resolved    TEXT,

    PRIMARY KEY (task_id, run_id)
);

Notes:

  • priority_at_pending is a snapshot at enqueue time. We do not train on mutable "current priority".
  • queue_pending is the approximate queue depth snapshot nearest to pending_at, sourced from in-memory counters seeded and periodically synced from the Queue API.
  • wait_duration_s = started_at - pending_at
  • run_duration_s = resolved_at - started_at
  • Runs that never start keep both duration fields NULL.

queue_forecast_run_predictions

Every prediction is logged before the outcome is known, enabling evaluation.

CREATE TABLE queue_forecast_run_predictions (
    -- 8-byte types
    predicted_at                 TIMESTAMPTZ DEFAULT now(),
    expected_completion_time     TIMESTAMPTZ,
    guaranteed_completion_time   TIMESTAMPTZ,
    wait_p50_s                   DOUBLE PRECISION,
    wait_p90_s                   DOUBLE PRECISION,
    run_p50_s                    DOUBLE PRECISION,
    run_p90_s                    DOUBLE PRECISION,

    -- 4-byte types
    run_id                       INT NOT NULL,

    -- Variable-length
    task_id                      TEXT NOT NULL,
    model_version                TEXT NOT NULL,
    input_features               JSONB,

    PRIMARY KEY (task_id, run_id)
);

Notes:

  • expected_completion_time = pending_at + wait_p50_s + run_p50_s
  • guaranteed_completion_time = pending_at + wait_p90_s + run_p90_s
  • input_features captures the exact feature vector fed to the model, enabling post-hoc debugging ("why did the model predict 45 minutes?").
  • One prediction per run. If models are updated, the old prediction is overwritten.
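
The two completion-time formulas above can be sketched as follows. The function name compose_eta is hypothetical; the arithmetic follows the notes directly.

```python
from datetime import datetime, timedelta, timezone


def compose_eta(pending_at, wait_p50_s, wait_p90_s, run_p50_s, run_p90_s):
    """Return (expected_completion_time, guaranteed_completion_time)."""
    expected = pending_at + timedelta(seconds=wait_p50_s + run_p50_s)
    guaranteed = pending_at + timedelta(seconds=wait_p90_s + run_p90_s)
    return expected, guaranteed


pending = datetime(2026, 3, 27, 14, 0, 0, tzinfo=timezone.utc)
expected, guaranteed = compose_eta(pending, 60.0, 120.0, 600.0, 900.0)
```

Note that the sum of two p90 values is generally not the p90 of the total time. It tends to be a conservative upper estimate unless wait and run durations are strongly correlated, so "guaranteed" is best read as a label rather than a hard bound.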

Indexes

-- Training sweep: grab last N days of clean completed runs
CREATE INDEX idx_qf_task_runs_training
    ON queue_forecast_task_runs (resolved_at)
    WHERE started_at IS NOT NULL
      AND run_duration_s IS NOT NULL
      AND reason_resolved IN ('completed', 'failed');

-- Reconciler: find stuck runs
CREATE INDEX idx_qf_task_runs_unresolved
    ON queue_forecast_task_runs (pending_at)
    WHERE resolved_at IS NULL;

-- Enrichment backfill: find tasks missing metadata
CREATE INDEX idx_qf_tasks_unenriched
    ON queue_forecast_tasks (task_id)
    WHERE metadata_name IS NULL;

Collection

Architecture

Data ingestion is Pulse-first, API-reconciled. The collector subscribes to all task lifecycle events via AMQP and upserts into the normalized tables. A separate reconciler repairs missed or incomplete state via the Queue API.

The collector must not assume event order. Any event may arrive before another event for the same task or run. Every handler must:

  • upsert the row if it does not exist,
  • fill only the fields it knows,
  • avoid overwriting a later lifecycle state with an earlier one.
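
One way to satisfy these three rules is a per-field "first non-NULL value wins" merge, which in SQL would correspond to COALESCE(existing, EXCLUDED) in an upsert. The sketch below shows the idea in Python for consistency with the trainer (the collector itself is Node.js); merge_run and the dict-based row are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def merge_run(existing, update):
    """Merge an event's fields into a run row without clobbering known values."""
    row = dict(existing or {})
    for key, value in update.items():
        # Fill only fields the event actually knows, and only if still empty.
        if value is not None and row.get(key) is None:
            row[key] = value
    # Durations are derived as soon as both endpoints are known,
    # regardless of which event arrived first.
    if row.get("pending_at") and row.get("started_at"):
        row["wait_duration_s"] = (row["started_at"] - row["pending_at"]).total_seconds()
    if row.get("started_at") and row.get("resolved_at"):
        row["run_duration_s"] = (row["resolved_at"] - row["started_at"]).total_seconds()
    return row


# task-running arrives before task-pending for the same run.
t0 = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
row = merge_run(None, {"task_id": "t", "run_id": 0, "started_at": t0 + timedelta(seconds=90)})
row = merge_run(row, {"pending_at": t0})
```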

Queue Pending Counters

The collector maintains in-memory pending counts per task_queue_id:

  • Seeded from the Queue API on first encounter via taskQueueCounts()
  • Incremented on task-pending, decremented on task-running
  • Periodically synced against the API (every 60s) to correct drift
  • Snapshot written to queue_forecast_task_runs.queue_pending at task-pending time

These are approximate values — documented as such. Good enough for modeling.
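
A rough sketch of the counter logic, written in Python for consistency with the other examples here. The PendingCounters class and the fetch_pending_count callable (a stand-in for the Queue API call) are hypothetical, and whether the snapshot includes the task being enqueued is an assumption.

```python
class PendingCounters:
    """Approximate per-queue pending counts, seeded lazily and resynced."""

    def __init__(self, fetch_pending_count):
        # fetch_pending_count(task_queue_id) -> int stands in for the Queue API.
        self._fetch = fetch_pending_count
        self._counts = {}

    def _ensure_seeded(self, task_queue_id):
        if task_queue_id not in self._counts:
            self._counts[task_queue_id] = self._fetch(task_queue_id)

    def on_pending(self, task_queue_id):
        """Increment and return the snapshot to store as queue_pending."""
        self._ensure_seeded(task_queue_id)
        self._counts[task_queue_id] += 1
        return self._counts[task_queue_id]

    def on_running(self, task_queue_id):
        self._ensure_seeded(task_queue_id)
        # Clamp at zero: missed events must not drive the count negative.
        self._counts[task_queue_id] = max(0, self._counts[task_queue_id] - 1)

    def snapshot(self, task_queue_id):
        self._ensure_seeded(task_queue_id)
        return self._counts[task_queue_id]

    def sync(self):
        """Periodic drift correction: overwrite every known queue from the API."""
        for task_queue_id in list(self._counts):
            self._counts[task_queue_id] = self._fetch(task_queue_id)
```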

Event Routing

task-defined

  • UPSERT into queue_forecast_tasks
  • Extracts: task_queue_id, scheduler_id, project_id, tags
  • No row created in queue_forecast_task_runs — a run has not been enqueued yet
  • Triggers background API enrichment if metadata_name is NULL

task-pending

  • UPSERT into queue_forecast_tasks (in case task-defined was missed)
  • UPSERT into queue_forecast_task_runs for (task_id, run_id)
  • Captures: pending_at, priority_at_pending, queue_pending snapshot, reason_created
  • Triggers prediction via predictor.js, stores result in queue_forecast_run_predictions

task-running

  • UPSERT into queue_forecast_tasks (in case earlier events were missed)
  • UPDATE queue_forecast_task_runs for (task_id, run_id)
  • Captures: started_at
  • Computes wait_duration_s if pending_at is already set

task-completed / task-failed

  • UPSERT into queue_forecast_tasks
  • UPDATE queue_forecast_task_runs for (task_id, run_id)
  • Captures: resolved_at, reason_resolved
  • Computes run_duration_s if started_at is already set

task-exception

  • Same as completed/failed for runs with a run_id
  • Special case: exception with no run_id (deadline-exceeded before any run started) — update the last known run in queue_forecast_task_runs if one exists, otherwise no run row to update

task-priority-changed / task-group-priority-changed

  • Updates queue_forecast_tasks only (informational, not used for training since we snapshot priority_at_pending at enqueue time)

Background API Enrichment

On every event, if the task's metadata_name is NULL in queue_forecast_tasks:

  • Check in-memory cache first (keyed by task_id)
  • If not cached, fetch task definition from Queue API
  • Fill: metadata_name, normalized_name, original_priority, max_run_time_s, tags, task_created
  • Cache the enrichment data so subsequent run events for the same task don't require another API call
  • Concurrency-limited (max 50 in-flight fetches)
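
The cache-then-fetch flow with a concurrency cap can be sketched with asyncio. This is an illustration only: the Enricher class and the fetch_task_definition coroutine (a stand-in for the Queue API client) are assumed names, and the real collector is Node.js.

```python
import asyncio


class Enricher:
    """Cache task definitions and cap the number of in-flight fetches."""

    def __init__(self, fetch_task_definition, max_in_flight=50):
        # fetch_task_definition(task_id) is an async stand-in for the Queue API.
        self._fetch = fetch_task_definition
        self._cache = {}
        self._sem = asyncio.Semaphore(max_in_flight)

    async def enrich(self, task_id):
        if task_id in self._cache:
            return self._cache[task_id]
        async with self._sem:
            # Another event for the same task may have filled the cache
            # while this one was waiting for a slot.
            if task_id in self._cache:
                return self._cache[task_id]
            definition = await self._fetch(task_id)
        self._cache[task_id] = definition
        return definition
```

Create the Enricher inside the running event loop so the semaphore is bound to the correct loop on older Python versions. An unbounded dict is used for brevity; a production cache would need eviction.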

Reconciliation

Taskcluster Pulse is at-most-once delivery. Events can be dropped, and automated retries may not publish task-exception for dead runs. The reconciler ensures training data stays clean.

Reconciler Job

Runs as a background cron (every 15 minutes).

Stuck Runs

  1. Query queue_forecast_task_runs for rows where resolved_at IS NULL and either:
    • pending_at + max_run_time_s + 1 hour < now() (when max_run_time_s known via join to queue_forecast_tasks)
    • pending_at + INTERVAL '24 hours' < now() (fallback)
  2. Fetch true state from Queue API taskStatus() for each stuck task
  3. If API shows terminal: update queue_forecast_task_runs with correct timestamps and resolution
  4. If API shows the run was silently dropped: set reason_resolved = 'reconciler-dropped' so training explicitly excludes it
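
The stuck-run condition in step 1 can be expressed as a small predicate. This sketch reads the 24-hour rule as a fallback that applies only when max_run_time_s is unknown, which is an interpretation of the list above; is_stuck and the constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=1)       # margin added on top of max_run_time_s
FALLBACK = timedelta(hours=24)   # used when max_run_time_s is unknown


def is_stuck(pending_at, resolved_at, max_run_time_s, now):
    """True if an unresolved run has outlived its expected lifetime."""
    if resolved_at is not None or pending_at is None:
        return False
    if max_run_time_s is not None:
        return pending_at + timedelta(seconds=max_run_time_s) + GRACE < now
    return pending_at + FALLBACK < now
```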

Missing Enrichment

  1. Query queue_forecast_tasks for rows where metadata_name IS NULL and enriched_at IS NULL and task first seen more than 5 minutes ago
  2. Fetch task definition from Queue API
  3. Fill metadata fields

This merges the current backfill sweep into the reconciler — one repair job instead of two.

Machine Learning Pipeline

Algorithm: LightGBM

We use LightGBM (Light Gradient Boosting Machine), a gradient-boosted decision tree algorithm. For tabular data with high-cardinality categories (like task_queue_id and metadata_name), tree-based models typically match or outperform neural networks in accuracy while training and predicting much faster.

LightGBM builds hundreds of shallow decision trees sequentially. Each tree corrects the errors of the previous ones. This naturally captures feature interactions — for example, learning that high queue depth on weekends affects wait time differently than on weekdays.

Two Separate Models

The system trains two independent models nightly. They have different targets, training filters, feature sets, and lookback windows.

Run Duration Model (run_duration_model.onnx)

Predicts how long a task will execute once a worker picks it up.

Target: run_duration_s

Training filter:

SELECT r.run_duration_s, r.queue_pending,
       t.task_queue_id, t.metadata_name, t.normalized_name,
       t.scheduler_id, t.max_run_time_s, t.tags,
       r.pending_at
FROM queue_forecast_task_runs r
JOIN queue_forecast_tasks t ON r.task_id = t.task_id
WHERE r.resolved_at > now() - INTERVAL '30 days'
  AND r.started_at IS NOT NULL
  AND r.run_duration_s IS NOT NULL
  AND r.reason_resolved IN ('completed', 'failed')

Lookback: 30 days. Run times are tied to code and payloads, relatively stable over time.

Why include failed: A test that runs for 25 minutes then fails still took 25 minutes. Excluding failures would bias the model toward only successful (often shorter) runs.

Exclude: worker-shutdown, claim-expired, malformed-payload, reconciler-dropped — these are infrastructure artifacts, not workload runtime.

Features:

| Feature | Type | Source | Notes |
|---|---|---|---|
| metadata_name | categorical | tasks | Most specific identifier (~5k unique/day) |
| normalized_name | categorical | tasks | Groups retriggered variants |
| task_queue_id | categorical | tasks | Worker pool identity (~50-100 unique) |
| scheduler_id | categorical | tasks | Broad cohort (gecko-level-1, etc.) |
| max_run_time_s | numeric | tasks | Declared timeout, correlates with task weight |
| tags->>'kind' | categorical | tasks.tags | mochitest, build, signing, etc. |
| tags->>'test-type' | categorical | tasks.tags | mochitest, wpt, reftest, etc. |
| tags->>'os' | categorical | tasks.tags | linux, windows, macos |
| tags->>'project' | categorical | tasks.tags | try, autoland, mozilla-central |
| tags->>'worker-implementation' | categorical | tasks.tags | docker-worker vs generic-worker |

Build type extraction: The Python trainer extracts debug vs opt from metadata_name via regex (e.g. test-linux2404-64/debug-... → debug). This is one of the strongest run duration predictors in Firefox CI.
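
A possible shape for that regex, as a sketch. The pattern assumes the build type appears right after a "/" and is followed by "-" or the end of the name; the trainer's actual pattern is not given in this spec.

```python
import re

# Assumption: "<platform>/<build type>" optionally followed by "-<suite>".
_BUILD_TYPE = re.compile(r"/(debug|opt)(?:-|$)")


def extract_build_type(metadata_name):
    """Return 'debug', 'opt', or None when no build type is recognizable."""
    if not metadata_name:
        return None
    match = _BUILD_TYPE.search(metadata_name)
    return match.group(1) if match else None
```

Returning None for unrecognized names lets LightGBM treat the feature as missing rather than forcing a guess.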

Wait Time Model (wait_time_model.onnx)

Predicts how long a task will sit in queue before a worker picks it up.

Target: wait_duration_s

Training filter:

SELECT r.wait_duration_s, r.queue_pending, r.priority_at_pending,
       t.task_queue_id, t.scheduler_id, t.tags,
       r.pending_at
FROM queue_forecast_task_runs r
JOIN queue_forecast_tasks t ON r.task_id = t.task_id
WHERE r.resolved_at > now() - INTERVAL '14 days'
  AND r.started_at IS NOT NULL
  AND r.queue_pending IS NOT NULL

Lookback: 14 days. Wait times reflect current infrastructure capacity and are highly recency-sensitive. Stale capacity data hurts more than limited sample size.

Why resolution doesn't matter: Once a run started, queue wait is observed regardless of whether it later completed or failed.

Features:

| Feature | Type | Source | Notes |
|---|---|---|---|
| task_queue_id | categorical | tasks | Most important baseline |
| priority_at_pending | categorical | task_runs | Critical for scheduling order |
| queue_pending | numeric | task_runs | Backlog depth at enqueue time |
| scheduler_id | categorical | tasks | Cohort behavior |
| max_run_time_s | numeric | tasks | Task weight signal |
| tags->>'kind' | categorical | tasks.tags | Workload type |
| tags->>'os' | categorical | tasks.tags | Platform |
| tags->>'project' | categorical | tasks.tags | try vs autoland behave differently |
| hour_sin, hour_cos | numeric | derived | Cyclical encoding of hour-of-day (UTC) |
| day_sin, day_cos | numeric | derived | Cyclical encoding of day-of-week |

Cyclical time encoding:

hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
day_sin  = sin(2 * pi * day_of_week / 7)
day_cos  = cos(2 * pi * day_of_week / 7)

This ensures the model understands that 23:00 and 00:00 are adjacent, and that Sunday and Monday are adjacent rather than at opposite ends of the week.
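
The encoding can be sketched in Python as below. Using Monday = 0 for day_of_week (Python's datetime.weekday() convention) is an assumption; the spec does not say which day is zero, and the choice only rotates the encoding.

```python
import math
from datetime import datetime, timezone


def cyclical_time_features(pending_at):
    """Derive the four cyclical features from a timezone-aware timestamp."""
    ts = pending_at.astimezone(timezone.utc)
    hour = ts.hour
    day_of_week = ts.weekday()  # assumption: Monday = 0 ... Sunday = 6
    return {
        "hour_sin": math.sin(2 * math.pi * hour / 24),
        "hour_cos": math.cos(2 * math.pi * hour / 24),
        "day_sin": math.sin(2 * math.pi * day_of_week / 7),
        "day_cos": math.cos(2 * math.pi * day_of_week / 7),
    }
```

In this representation the distance between 23:00 and 00:00 is as small as between any other pair of consecutive hours, which a raw 0-23 integer cannot express.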

Training Strategy

Sliding window retrain, not incremental learning. Every night:

  1. Python trainer queries Postgres for the relevant lookback window
  2. Trains a fresh LightGBM model from scratch (discards yesterday's model)
  3. Uses objective=quantile with alpha=0.5 for p50, alpha=0.9 for p90 (two training passes per model, or a single multi-quantile model)
  4. Exports to ONNX format
  5. Writes run_duration_model.onnx and wait_time_model.onnx to a shared volume

Why not incremental: Decision tree incremental learning leads to tree bloat (slowing inference) and struggles to adapt when new queue names or task types appear. A fresh retrain automatically forgets outdated patterns.

Feature Engineering (Python)

All feature engineering happens in the Python training script:

  • Categorical handling: High-cardinality strings cast to Pandas category dtype. LightGBM handles these natively without one-hot encoding.
  • Tag extraction: tags->>'kind', tags->>'os', etc. extracted from JSONB into typed columns. Deployment-specific — only the trainer knows which tag keys matter.
  • Build type: Regex extraction of debug/opt from metadata_name.
  • Time features: Cyclical encoding derived from pending_at timestamp.
  • NULL handling: LightGBM handles NaN/NULL natively for both numeric and categorical features.

Real-Time Inference

ONNX Runtime in Node.js

The predictor.js service loads both .onnx model files into memory using onnxruntime-node. When a task-pending event arrives:

  1. Collector upserts the run into queue_forecast_task_runs
  2. Collector calls the predictor with the task/run features
  3. Predictor applies the same feature engineering as training:
    • Categorical encoding (string -> integer mapping, loaded alongside the ONNX model as a JSON sidecar file)
    • Cyclical time encoding from pending_at
    • Build type regex extraction from metadata_name
  4. Runs both models (run duration + wait time) in-memory
  5. Composes the ETA:
    • expected_completion_time = pending_at + wait_p50 + run_p50
    • guaranteed_completion_time = pending_at + wait_p90 + run_p90
  6. Writes prediction to queue_forecast_run_predictions

Inference latency target: low single-digit milliseconds per prediction. No network calls to Python. No database reads for historical stats.

Category Mapping Sidecar

LightGBM categorical features are integer-coded during training. The Python trainer must export a category_mappings.json alongside each ONNX model containing the string-to-integer mapping for every categorical feature. The Node.js predictor loads this at startup and on model reload.

Parity requirement: Float/double precision can drift between Python and ONNX inference. Automated parity tests between Python predictions and Node.js ONNX predictions are a strict requirement before any model is deployed.

Model Hot-Reload

The predictor watches the shared model volume for new .onnx files. When the nightly trainer writes a new model:

  • Predictor detects the new file (filesystem watch or polling)
  • Loads new model + category mappings into memory
  • Swaps atomically (old model serves requests until new one is ready)
  • Logs the model version transition

Cold Start Handling

When LightGBM encounters a categorical value it has never seen during training (e.g., a brand new metadata_name or task_queue_id):

  • LightGBM treats unseen categoricals as a separate "unknown" bucket and routes them through decision tree branches based on other features
  • This means a brand new task type still gets a prediction — it just relies more heavily on task_queue_id, tags, scheduler_id, and other features the model has seen
  • The input_features JSONB in queue_forecast_run_predictions should flag which features were unknown, enabling evaluation of cold-start accuracy
  • After one nightly retrain cycle, the new task type enters the training data and gets proper coverage
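
A sketch of how the predictor might apply the category mapping sidecar and flag unseen values for input_features. It is written in Python for illustration (the predictor is Node.js). The sidecar layout (feature name to value-to-integer map) and the UNKNOWN_CODE sentinel are assumptions; how the exported ONNX model treats the sentinel would need to be confirmed by the parity tests.

```python
import json

# Assumption: a sentinel code for values absent from the training set.
UNKNOWN_CODE = -1


def load_mappings(text):
    """Parse a category_mappings.json payload: {feature: {value: int}}."""
    return json.loads(text)


def encode_categoricals(features, mappings):
    """Return (encoded features, sorted names of features with unseen values)."""
    encoded, unknown = {}, []
    for name, value in features.items():
        mapping = mappings.get(name)
        if mapping is None:
            encoded[name] = value  # numeric feature, passed through unchanged
            continue
        code = mapping.get(value) if value is not None else None
        if code is None:
            unknown.append(name)
            code = UNKNOWN_CODE
        encoded[name] = code
    return encoded, sorted(unknown)


mappings = load_mappings('{"task_queue_id": {"queue-a": 0}, "metadata_name": {"known-task": 0}}')
encoded, unknown = encode_categoricals(
    {"task_queue_id": "queue-a", "metadata_name": "brand-new-task", "max_run_time_s": 3600},
    mappings,
)
```

Storing the unknown list alongside the encoded vector makes the cold-start slice in evaluation a simple filter.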

ML Pipeline Architecture Options

The sections above describe the ML algorithm, features, and training strategy independently of where training and inference run. There are three viable deployment architectures. All share the same data model, collection layer, and evaluation methodology — they differ only in who trains the model and where inference happens.

Shared Component: Daily Data Export (approaches A and B)

Both bugbug-based approaches require a daily Taskcluster task that exports training data from Postgres and publishes it as a TC artifact.

┌──────────────┐    daily TC task    ┌───────────────────────┐
│  Postgres    │ ──────────────────→ │ training_data.json.zst│
│  (collector) │   SQL query +       │ (TC artifact, 7-day   │
│              │   zstd compress     │  expiry, TC-indexed)  │
└──────────────┘                     └───────────────────────┘

  • Runs as a scheduled TC task (not in docker-compose)
  • Queries the training SQL from the run duration and wait time model sections above, exports as newline-delimited JSON compressed with zstandard (.json.zst)
  • Published as a public TC artifact, indexed via project.queue-forecasting.data.latest
  • Estimated size: 1-3 GB compressed for a 30-day window (~7.5M rows)
  • Artifact expiry: 7 days (training only needs the latest snapshot)

This aligns with bugbug's existing data pipeline pattern — every data source in bugbug (Bugzilla, Mercurial, CI failures) follows the same retrieval-task → artifact → training-task flow.

Approach A: Full bugbug Integration (training + serving)

Data flow:

Node.js collector → Postgres → daily export task → TC artifact
  → bugbug data-retrieval task downloads artifact
  → bugbug training task (XGBoost) → model stored as pickle
  → bugbug HTTP service serves predictions
  → Node.js services call bugbug HTTP API

What lives where:

| Component | Location | Owner |
|---|---|---|
| Collector, reconciler | tools/queue-forecasting/ (TC repo) | TC team |
| Data export task | tools/queue-forecasting/ (TC repo) | TC team |
| Data retrieval script | bugbug repo | bugbug team |
| Model class + training | bugbug repo | bugbug team |
| HTTP prediction endpoint | bugbug HTTP service | bugbug team |
| Prediction API (proxy) | tools/queue-forecasting/ (TC repo) | TC team |

What needs to be added to bugbug:

  1. Data retrieval script — downloads the training_data.json.zst artifact from TC index, decompresses, yields records. Similar to existing bugbug/bugzilla.py retrieval pattern.
  2. Model class — extends bugbug.model.Model, defines feature extraction from the exported task/run records. Uses XGBoost (bugbug's standard) with quantile regression for p50/p90.
  3. Training task — entry in infra/data-pipeline.yml depending on the data retrieval task.
  4. HTTP endpoint — new route in http_service/bugbug_http/app.py that accepts task features and returns wait time + run duration predictions.

Prediction flow:

  1. task-pending event arrives at collector
  2. Collector upserts run, then calls bugbug HTTP API with features
  3. bugbug API enqueues prediction job (Redis + RQ)
  4. Collector polls for result (bugbug's standard async pattern)
  5. Result written to queue_forecast_run_predictions

Pros:

  • No Python or ML code in the TC repo
  • Leverages existing Mozilla ML infrastructure (CI, monitoring, deployment, model management)
  • bugbug team already maintains training orchestration and HTTP serving
  • Existing patterns for model rollback and evaluation

Cons:

  • Network latency: bugbug uses async polling (enqueue → poll for result). At ~250k predictions/day (~3/sec sustained), each prediction incurs HTTP round-trips instead of sub-ms local inference. Batching can amortize this but adds complexity.
  • XGBoost vs LightGBM: bugbug standardizes on XGBoost. XGBoost requires manual categorical encoding (label encoding or one-hot) where LightGBM handles high-cardinality categoricals natively. Quality is comparable for tabular data, but feature engineering is more involved.
  • External service dependency: bugbug HTTP downtime means no new predictions. Stale predictions in queue_forecast_run_predictions remain available but won't update.
  • Cross-team coordination: model changes require PRs to bugbug repo and alignment with bugbug release cadence.

Cost summary:

| Cost | Estimate |
|---|---|
| Data export artifact storage | ~1-3 GB/day, 7-day expiry = ~7-21 GB peak |
| Network transfer (export → bugbug) | ~1-3 GB/day (TC-internal, free) |
| bugbug training compute | 1 TC task/day, ~10-30 min |
| HTTP API calls | ~250k/day, async polling |

Approach B: Mixed Mode (bugbug training, ONNX local inference)

Data flow:

Node.js collector → Postgres → daily export task → TC artifact
  → bugbug data-retrieval task downloads artifact
  → bugbug training task → ONNX export as TC artifact
  → Node.js predictor downloads ONNX model + category mappings
  → Local inference via onnxruntime-node

What lives where:

| Component | Location | Owner |
|---|---|---|
| Collector, reconciler, predictor | tools/queue-forecasting/ (TC repo) | TC team |
| Data export task | tools/queue-forecasting/ (TC repo) | TC team |
| Data retrieval + training | bugbug repo | bugbug team |
| ONNX model artifact | TC artifact storage | produced by bugbug |

What needs to be added to bugbug (same as A, plus):

  • ONNX export step after training. bugbug does not support ONNX today. XGBoost models can be converted via onnxmltools or skl2onnx, but this is less battle-tested than LightGBM's ONNX export path.
  • Category mapping sidecar (category_mappings.json) exported alongside the ONNX model.
  • Parity tests between Python XGBoost predictions and ONNX runtime predictions (float precision can drift).

Prediction flow:

  1. task-pending event arrives at collector
  2. Collector calls local predictor.js (same as current spec)
  3. Predictor runs ONNX model in-process, sub-ms latency
  4. Result written to queue_forecast_run_predictions

Model hot-reload:

  • Predictor polls TC index for new ONNX artifact (or watches a local volume synced from TC artifacts)
  • Loads new model + category mappings atomically

Pros:

  • Sub-ms local inference preserved — no runtime dependency on bugbug
  • Leverages bugbug's training orchestration and CI
  • Model is a static artifact — predictor is self-contained after download

Cons:

  • ONNX export is new to bugbug — needs to be implemented and maintained. Adds a capability bugbug doesn't currently have.
  • XGBoost ONNX maturity: XGBoost → ONNX conversion exists but is less mature than LightGBM → ONNX. Quantile regression ONNX export may need validation.
  • Category mapping sidecar: same complexity as approach C (the Node.js predictor must replicate categorical encoding).
  • Cross-team dependency for training changes, but not for runtime.

Cost summary:

| Cost | Estimate |
|---|---|
| Data export artifact storage | ~1-3 GB/day, 7-day expiry |
| ONNX model artifact storage | ~10-50 MB/day, 7-day expiry |
| bugbug training compute | 1 TC task/day, ~10-30 min |
| Network transfer at inference | None (local) |

Approach C: All-in-TC Standalone (current spec baseline)

Data flow:

Node.js collector → Postgres
  → Nightly Python trainer (docker-compose) queries Postgres directly
  → LightGBM training → ONNX export to shared volume
  → Node.js predictor loads ONNX, runs local inference

This is the architecture described in the preceding sections. The Python trainer lives in tools/queue-forecasting/ alongside the Node.js code, runs as a docker-compose service on a nightly cron.

Pros:

  • Full control over the entire pipeline — no external dependencies
  • LightGBM with native categorical support (no manual encoding needed for high-cardinality features like metadata_name)
  • Sub-ms local inference
  • Self-contained: one docker-compose up runs everything
  • Simpler debugging — all code in one repo

Cons:

  • Own the entire ML pipeline: training infrastructure, monitoring, model versioning, rollback
  • Python code in the TC repo (TC is primarily Node.js and Go)
  • Must build training orchestration, evaluation automation, and model management from scratch

Cost summary:

| Cost | Estimate |
|---|---|
| Training compute | docker-compose container, ~10-30 min/day |
| Storage | ONNX models on local/shared volume, ~50 MB |
| External dependencies | None |

Comparison Matrix

| Dimension | A: Full bugbug | B: Mixed mode | C: Standalone |
|---|---|---|---|
| Inference latency | ~100ms+ (HTTP poll) | Sub-ms (local ONNX) | Sub-ms (local ONNX) |
| Runtime dependency | bugbug HTTP service | None (static artifact) | None |
| Training orchestration | bugbug (existing) | bugbug (existing) | Self-built |
| ML framework | XGBoost | XGBoost | LightGBM |
| Categorical handling | Manual encoding | Manual encoding | Native |
| Python in TC repo | No | No | Yes |
| New bugbug work | Model + endpoint | Model + ONNX export | None |
| Operational ownership | Shared (TC + bugbug) | Shared (training only) | TC team only |

Recommendation

Start with Approach C (standalone) to validate the model quality and prediction pipeline end-to-end with minimal cross-team coordination. The evaluation metrics (within-2x rate, pinball loss, p90 calibration) will determine whether the ML approach works before investing in infrastructure integration. If the model proves valuable, migrate training to bugbug (Approach B) to offload pipeline maintenance, with Approach A as an option if local inference complexity becomes a burden.

API

Prediction Endpoint

GET /v1/predict/:taskId/:runId

Response:

{
  "taskId": "VGx8Q3kRTe2...",
  "runId": 0,
  "prediction": {
    "waitTime": {
      "p50_seconds": 142.3,
      "p90_seconds": 412.8
    },
    "runDuration": {
      "p50_seconds": 1823.7,
      "p90_seconds": 2401.2
    },
    "eta": {
      "expected": "2026-03-27T14:32:00Z",
      "guaranteed": "2026-03-27T15:05:00Z"
    },
    "modelVersion": "2026-03-27-nightly",
    "predictedAt": "2026-03-27T14:00:12Z"
  }
}

This endpoint reads from queue_forecast_run_predictions. If the prediction already exists (generated at task-pending time), it returns it. If the run exists but has no prediction yet (race condition or missed event), it generates one on the fly.
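
As an illustration of the mapping between a queue_forecast_run_predictions row and the response shape above, here is a small sketch. The function names are hypothetical and the row is assumed to be a dict keyed by column name with timezone-aware timestamps.

```python
from datetime import datetime, timezone


def _iso_z(ts):
    """Format a timezone-aware timestamp as UTC with a trailing Z."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_api_response(row):
    """Build the /v1/predict response body from a stored prediction row."""
    return {
        "taskId": row["task_id"],
        "runId": row["run_id"],
        "prediction": {
            "waitTime": {"p50_seconds": row["wait_p50_s"], "p90_seconds": row["wait_p90_s"]},
            "runDuration": {"p50_seconds": row["run_p50_s"], "p90_seconds": row["run_p90_s"]},
            "eta": {
                "expected": _iso_z(row["expected_completion_time"]),
                "guaranteed": _iso_z(row["guaranteed_completion_time"]),
            },
            "modelVersion": row["model_version"],
            "predictedAt": _iso_z(row["predicted_at"]),
        },
    }
```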

Queue Status Endpoint (V2)

GET /v1/queue/:taskQueueId/estimate

Returns predicted wait time for a hypothetical new task entering this queue right now, using current queue_pending count and the wait-time model. Deferred to V2 but the data model supports it from day 1.

Evaluation

Every prediction is stored in queue_forecast_run_predictions before the outcome is known. A daily evaluation job compares predictions against actuals.

Methodology

  • Strict time-split only. Never random split. Train on days 1-N, evaluate on day N+1. Random splitting leaks future information.
  • Evaluation runs automatically after each nightly training cycle.
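
A minimal sketch of a strict time split, assuming rows are dicts with a timezone-aware resolved_at and that the split is keyed on resolved_at (matching the evaluation query in this document). The time_split name is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def time_split(rows, eval_day):
    """Train on rows resolved before eval_day, evaluate on rows resolved during it."""
    start = datetime(eval_day.year, eval_day.month, eval_day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    train = [r for r in rows if r["resolved_at"] < start]
    evaluation = [r for r in rows if start <= r["resolved_at"] < end]
    return train, evaluation
```

Rows resolved after the evaluation day fall into neither set, so nothing from the future can leak into training.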

Metrics

| Metric | Description |
|---|---|
| Within-2x rate | % of eta_estimate predictions within 0.5x-2x of actual total time. Target: >80% |
| Pinball loss (p50) | Measures median prediction accuracy. Lower is better. |
| Pinball loss (p90) | Measures upper-bound prediction accuracy. |
| p90 calibration | Does the p90 prediction actually cover ~90% of observed durations? |
| Coverage | % of pending runs that received a prediction (vs cold-start fallback) |
| Fallback rate | % of predictions where key features were unseen by the model |
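
For reference, the first three kinds of metric can be computed as in the sketch below, in plain Python with no dependencies. The function names are illustrative, and "within 2x" is interpreted here as the ratio predicted/actual lying in [0.5, 2.0].

```python
def pinball_loss(actuals, preds, alpha):
    """Mean pinball (quantile) loss for quantile level alpha, e.g. 0.5 or 0.9."""
    total = 0.0
    for y, q in zip(actuals, preds):
        diff = y - q
        # Under-prediction costs alpha per unit, over-prediction (1 - alpha).
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(actuals)


def within_2x_rate(actuals, preds):
    """Fraction of predictions whose ratio to the actual is in [0.5, 2.0]."""
    hits = sum(1 for y, p in zip(actuals, preds) if y > 0 and 0.5 <= p / y <= 2.0)
    return hits / len(actuals)


def quantile_coverage(actuals, preds):
    """Fraction of actuals at or below the predicted quantile (~0.9 for a calibrated p90)."""
    return sum(1 for y, q in zip(actuals, preds) if y <= q) / len(actuals)
```

With alpha = 0.9 an under-prediction costs nine times as much as an equally sized over-prediction, which is what pushes the p90 model toward the upper tail.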

Slices

Metrics must be computed across slices, not just globally:

  • By task_queue_id (top 20 queues by volume)
  • By priority_at_pending
  • By tags->>'project' (try vs autoland vs mozilla-central)
  • By cold-start status (was metadata_name in the training set?)

A model that looks great globally but fails on the highest-volume queue is not deployable.

Evaluation Query

SELECT
    rp.task_id,
    rp.run_id,
    rp.wait_p50_s,
    rp.run_p50_s,
    rp.expected_completion_time,
    rp.guaranteed_completion_time,
    rp.model_version,
    r.wait_duration_s   AS actual_wait,
    r.run_duration_s    AS actual_run,
    r.pending_at,
    r.resolved_at,
    t.task_queue_id,
    t.tags
FROM queue_forecast_run_predictions rp
JOIN queue_forecast_task_runs r
  ON rp.task_id = r.task_id AND rp.run_id = r.run_id
JOIN queue_forecast_tasks t
  ON rp.task_id = t.task_id
WHERE r.resolved_at IS NOT NULL
  AND r.started_at IS NOT NULL
  AND r.resolved_at >= $1::date
  AND r.resolved_at < $1::date + INTERVAL '1 day'

Rollout

| Phase | What ships | Predictions visible? |
|---|---|---|
| Phase 1 | Collector, reconciler, nightly trainer, predictor | Stored only. Internal evaluation. |
| Phase 2 | Prediction API | Debug/internal consumers. TC UI behind flag. |
| Phase 3 | TC UI integration | Default-on for supported queues. |

No model is exposed to users without passing automated evaluation on the metrics above.

Data Retention

Raw Data

queue_forecast_tasks and queue_forecast_task_runs enforce a rolling 45-day retention window. 45 days provides margin beyond the 30-day training window for debugging, evaluation lookback, and reconciliation of late-arriving events.

To avoid expensive row-by-row DELETE operations:

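  • queue_forecast_task_runs is partitioned by week on pending_at using Postgres native range partitioning. Note that Postgres requires the partition key to be part of the primary key on a partitioned table, so PRIMARY KEY (task_id, run_id) would need to include pending_at.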
  • Expired data is dropped by detaching and destroying the oldest partition
  • A weekly cron handles partition management (create next week's partition, drop partitions older than 45 days)

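queue_forecast_tasks rows are not removed automatically when a run partition is dropped: the ON DELETE CASCADE on queue_forecast_task_runs.task_id only propagates deletes from tasks to runs, not the other way around. After each partition drop, a lightweight sweep deletes orphaned queue_forecast_tasks rows with no remaining queue_forecast_task_runs references.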
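A minimal sketch of what the weekly partition cron could compute, written as pure string generation with no database access. The function names are hypothetical, and the `_wYYYY_WW` naming (Monday-based `%W` week number) is an assumption inferred from the `w2026_12` example in the migration section.

```python
from datetime import date, timedelta

RETENTION_DAYS = 45  # rolling retention window from this section

def week_start(d: date) -> date:
    """Monday of the week containing d."""
    return d - timedelta(days=d.weekday())

def partition_name(parent: str, start: date) -> str:
    # Assumed naming scheme: <parent>_w<year>_<Monday-based week number>.
    return f"{parent}_w{start.strftime('%Y_%W')}"

def create_partition_sql(parent: str, start: date) -> str:
    end = start + timedelta(days=7)
    return (
        f"CREATE TABLE IF NOT EXISTS {partition_name(parent, start)} "
        f"PARTITION OF {parent} FOR VALUES FROM ('{start}') TO ('{end}');"
    )

def is_expired(start: date, today: date) -> bool:
    """Droppable only when the whole week is older than the retention window."""
    end = start + timedelta(days=7)
    return end <= today - timedelta(days=RETENTION_DAYS)

def drop_partition_sql(parent: str, start: date) -> list:
    name = partition_name(parent, start)
    return [
        f"ALTER TABLE {parent} DETACH PARTITION {name};",
        f"DROP TABLE {name};",
    ]

today = date(2026, 4, 24)
next_week = week_start(today) + timedelta(days=7)
print(create_partition_sql("queue_forecast_task_runs", next_week))
```

Checking the partition's end date rather than its start keeps a week that is only partly past the 45-day boundary until every row in it has expired.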

Predictions

queue_forecast_run_predictions follows the same 45-day retention, partitioned on predicted_at.

Model Artifacts

Keep the last 7 days of .onnx model files and category_mappings.json on the shared volume. Allows quick rollback if a nightly model degrades. Older artifacts are deleted.

Migration from task_events

The existing task_events table contains ~1.1M rows (5 days of data). This migration splits it into the normalized two-table model without data loss.

Step 1: Create the new tables

CREATE TABLE queue_forecast_tasks (
    -- 8-byte types
    task_created       TIMESTAMPTZ,
    enriched_at        TIMESTAMPTZ,

    -- 4-byte types
    max_run_time_s     INTEGER,

    -- Variable-length
    task_id            TEXT PRIMARY KEY,
    task_queue_id      TEXT,
    task_group_id      TEXT,
    scheduler_id       TEXT,
    project_id         TEXT,
    metadata_name      TEXT,
    normalized_name    TEXT,
    original_priority  TEXT,
    tags               JSONB
);

CREATE TABLE queue_forecast_task_runs (
    -- 8-byte types
    pending_at         TIMESTAMPTZ,
    started_at         TIMESTAMPTZ,
    resolved_at        TIMESTAMPTZ,
    wait_duration_s    DOUBLE PRECISION,
    run_duration_s     DOUBLE PRECISION,

    -- 4-byte types
    run_id             INT NOT NULL,
    queue_pending      INTEGER,

    -- Variable-length
    task_id            TEXT NOT NULL
                       REFERENCES queue_forecast_tasks(task_id) ON DELETE CASCADE,
    priority_at_pending TEXT,
    reason_created     TEXT,
    reason_resolved    TEXT,

    PRIMARY KEY (task_id, run_id)
);

CREATE TABLE queue_forecast_run_predictions (
    -- 8-byte types
    predicted_at                 TIMESTAMPTZ DEFAULT now(),
    expected_completion_time     TIMESTAMPTZ,
    guaranteed_completion_time   TIMESTAMPTZ,
    wait_p50_s                   DOUBLE PRECISION,
    wait_p90_s                   DOUBLE PRECISION,
    run_p50_s                    DOUBLE PRECISION,
    run_p90_s                    DOUBLE PRECISION,

    -- 4-byte types
    run_id                       INT NOT NULL,

    -- Variable-length
    task_id                      TEXT NOT NULL,
    model_version                TEXT NOT NULL,
    input_features               JSONB,

    PRIMARY KEY (task_id, run_id)
);

Step 2: Migrate the data

-- A. Populate queue_forecast_tasks
--    DISTINCT ON grabs the most complete metadata per task_id
--    (latest run_id tends to have the richest enrichment)
INSERT INTO queue_forecast_tasks (
    task_id, task_queue_id, task_group_id, scheduler_id, project_id,
    metadata_name, normalized_name, original_priority,
    max_run_time_s, tags, task_created, enriched_at
)
SELECT DISTINCT ON (task_id)
    task_id, task_queue_id, task_group_id, scheduler_id, project_id,
    metadata_name, normalized_name, original_priority,
    max_run_time_s, tags, task_created,
    CASE WHEN metadata_name IS NOT NULL THEN now() END
FROM task_events
ORDER BY task_id, run_id DESC NULLS LAST;

-- B. Populate queue_forecast_task_runs
--    Skip NULL run_id rows (task-defined placeholders with no actual run)
INSERT INTO queue_forecast_task_runs (
    task_id, run_id, priority_at_pending, reason_created, reason_resolved,
    pending_at, started_at, resolved_at, queue_pending,
    wait_duration_s, run_duration_s
)
SELECT
    task_id, run_id, priority, reason_created, reason_resolved,
    scheduled, started, resolved, queue_pending,
    wait_duration_s, run_duration_s
FROM task_events
WHERE run_id IS NOT NULL;

Step 3: Create indexes

CREATE INDEX idx_qf_task_runs_training
    ON queue_forecast_task_runs (resolved_at)
    WHERE started_at IS NOT NULL
      AND run_duration_s IS NOT NULL
      AND reason_resolved IN ('completed', 'failed');

CREATE INDEX idx_qf_task_runs_unresolved
    ON queue_forecast_task_runs (pending_at)
    WHERE resolved_at IS NULL;

CREATE INDEX idx_qf_tasks_unenriched
    ON queue_forecast_tasks (task_id)
    WHERE metadata_name IS NULL;

Step 4: Verify and cutover

-- Verify row counts
SELECT 'queue_forecast_tasks' AS tbl, count(*) FROM queue_forecast_tasks
UNION ALL
SELECT 'queue_forecast_task_runs', count(*) FROM queue_forecast_task_runs
UNION ALL
SELECT 'task_events (total)', count(*) FROM task_events
UNION ALL
SELECT 'task_events (with run_id)', count(*)
  FROM task_events WHERE run_id IS NOT NULL;

-- queue_forecast_task_runs count should match task_events-with-run_id count
-- queue_forecast_tasks count should match distinct task_id count

Once verified:

  1. Stop the collector
  2. Run the migration
  3. Deploy updated collector that writes to the new tables
  4. Verify new events land correctly
  5. Rename or drop task_events when confident

Step 5: Add partitioning (post-migration)

After the initial migration is stable, convert queue_forecast_task_runs to range-partitioned on pending_at by week. This is a separate step because partitioning an existing table requires recreating it.

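-- Create partitioned version.
-- INCLUDING ALL is not usable here: it would copy PRIMARY KEY (task_id, run_id),
-- and Postgres requires a primary key on a partitioned table to include the
-- partition key. pending_at therefore joins the key (and becomes NOT NULL).
CREATE TABLE queue_forecast_task_runs_part (
    LIKE queue_forecast_task_runs INCLUDING DEFAULTS INCLUDING CONSTRAINTS,
    PRIMARY KEY (task_id, run_id, pending_at)
) PARTITION BY RANGE (pending_at);
-- LIKE does not copy foreign keys, and the Step 3 indexes are skipped without
-- INCLUDING INDEXES; recreate both on the new table.

-- Create weekly partitions
CREATE TABLE queue_forecast_task_runs_w2026_12
    PARTITION OF queue_forecast_task_runs_part
    FOR VALUES FROM ('2026-03-23') TO ('2026-03-30');
CREATE TABLE queue_forecast_task_runs_w2026_13
    PARTITION OF queue_forecast_task_runs_part
    FOR VALUES FROM ('2026-03-30') TO ('2026-04-06');
-- ... etc

-- Migrate data, swap tables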

Deferred / Future Work

  • Queue depth time-series (queue_forecast_queue_depth_samples table) — needed for goal 5 (queue load prediction by time/day). Start collecting once V1 prediction pipeline is stable.
  • Queue drain forecasting — builds on wait-time model + queue depth data. V2 feature.
  • Trend/regression detection — daily cohort rollups comparing trailing 7-day vs 28-day quantiles. Requires stable evaluation pipeline first.
  • bugbug migration — V1 starts standalone (Approach C) to validate model quality. Once evaluation confirms the approach works, migrate training to bugbug (Approach B or A) to leverage existing Mozilla ML infrastructure. See "ML Pipeline Architecture Options" above for the full comparison. Decision point: after Phase 1 evaluation metrics are stable.
  • Shadow mode comparison — running a new model version side-by-side with production and auto-promoting only if it wins on evaluation metrics. Applicable regardless of pipeline architecture.
  • TC UI integration — wiring the prediction API into the task detail view.
  • Lando landing queue as a leading indicator — Lando's merge queue shows what's about to land and therefore what will be scheduled soon. This is a forward-looking signal the current models lack — today the wait-time model only reacts to queue_pending at enqueue time. A periodic snapshot of the Lando queue depth (and optionally the repos being landed) could feed into the wait-time model and V2 queue load prediction. Complexity: requires a new data source (Lando API), and the signal is indirect — a landing doesn't map 1:1 to specific task queues without understanding the push-to-taskgraph relationship.
  • Tree status and sheriff activity — tree closures halt new tasks, and sheriff-initiated backfills cause sudden load spikes. Both are regime changes that dramatically shift queue behavior. TreeHerder exposes tree status (open / closed / approval-required) via API. Adding tree state as a categorical feature to both models would help them distinguish normal load from closure-recovery bursts. Backfill detection is harder — may require identifying sheriff-triggered task groups via scheduler_id or push metadata.
  • Guiding principle: TC-only first, extend if needed — V1 deliberately uses only Taskcluster-internal data (Pulse events, Queue API). The evaluation pipeline (within-2x rate, pinball loss, p90 calibration) provides an objective checkpoint: if TC-only features don't meet accuracy targets after Phase 1 evaluation, that is the signal to integrate external sources like Lando and TreeHerder.

Queue Forecasting — Phase 2 Decision

Date: 2026-04-23
Companion to: trainer-spec.md, trainer-plan.md
Authors: residual-model experiment, wait-time transform variants, run-duration residual experiment

1. Decision

Proceed to productionizing the residual architecture for both run_duration and wait_time.

The residual approach — baseline percentile prediction as an input feature, LightGBM learning a log-ratio correction — outperforms both the baseline-only predictor and model-only LightGBM on both targets. It meets the MAE spec threshold for both targets and improves within-2x for both, though wait-time within-2x falls short of the stated +5pp target. The pattern of improvement is consistent across buckets, diagnosable where it isn't, and the known gaps do not block investment in the production path.

2. Evidence

Five-day holdout (Apr 18-22), cohort-matched, primary slice (reason_resolved = 'completed').

Run duration

Metric Baseline LGB-only Residual Δ vs Baseline Spec
MAE 138.8s 146.6s 130.1s −6.3% ≥5% ✅
within-2x 88.7% 89.1% 89.7% +1.0pp (MAE primary)
p90 coverage 88.0% 87.9% [85, 95]% ✅

Phase 1 classified duration as a "clean miss" because LightGBM-only lost by +5.6% MAE to the baseline's metadata_name exact-match. The residual model reverses that verdict: the same memorization becomes an input to the model rather than a competitor to it.

Wait time

Metric Baseline LGB-only Residual (log_ratio) Δ vs Baseline Spec
MAE 613.7s 539.1s 519.9s −15.3% ≥15% ✅
within-2x 51.7% 42.7% 54.6% +2.9pp +5pp ❌
p90 coverage 94.2% 85.8% [85, 95]% ✅ (edge)

Wait clears the MAE spec (−15.3%) and moves within-2x from regression to improvement, but the +2.9pp gain does not reach the original +5pp target. This is carried as a known gap, not a blocker.

Per-bucket wait breakdown

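| Bucket | n | % | Base MAE | LGB MAE | Res MAE | Base w/in-2x | LGB w/in-2x | Res w/in-2x |
|---|---|---|---|---|---|---|---|---|
| <1m | 357k | 50% | 32.0s | 35.9s | 29.4s | 43.3% | 23.5% | 44.8% |
| 1-5m | 182k | 26% | 117.6s | 82.3s | 105.3s | 67.5% | 76.1% | 69.8% |
| 5-30m | 127k | 18% | 423.2s | 455.5s | 478.3s | 62.3% | 53.7% | 65.1% |
| 30m+ | 43k | 6% | 8175s | 6956s | 6525s | 22.8% | 27.0% | 38.6% |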

Residual beats baseline on both MAE and within-2x in three buckets covering 82% of the cohort (<1m, 1-5m, 30m+). In the remaining bucket (5-30m, 18% of the cohort) it improves within-2x (+2.8pp) but loses to baseline on MAE.

Transform variants (tested, rejected)

Both additive (y_t = y - bl) and log_diff (y_t = log1p(y) - log1p(bl)) were trained and evaluated against log_ratio:

  • log_diff is algebraically identical to log_ratio; numbers match to the last decimal.
  • additive regresses MAE (+5.8%) and within-2x (−1.2pp) vs log_ratio. Only win: p90 calibration improves from 85.8% → 90.9%, closer to the ideal 90%. Logged as an option if p90 calibration becomes a higher priority than MAE.

3. Chosen design

Residual LightGBM with log_ratio transform. Both targets use the same shape:

  • Input features include the baseline p50 prediction (bl_wait_p50 for wait, bl_duration_p50 for duration) as a numeric feature, alongside the existing categorical and numeric features from the Phase 1 spec.
  • Training target: y_t = log((y + 1) / (bl + 1))
  • Inverse at inference: y_hat = exp(model_raw) * (bl + 1) - 1
  • Two quantile models per target (p50, p90), trained independently with alpha ∈ {0.5, 0.9}.
  • Baseline remains part of the serving path. The serving flow computes the baseline prediction first (percentile lookup), feeds it into the LightGBM model, inverse-transforms the output. This is not a replacement for the baseline — it is a layered system where the baseline provides memorization and the model provides correction.
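The transform pair above can be sketched in a few lines. This is a minimal illustration using only the formulas stated in this section; the function names and the sample values are hypothetical. It also checks numerically that the rejected log_diff variant is the same function as log_ratio.

```python
import math

def to_residual(y: float, bl: float) -> float:
    """Training target: y_t = log((y + 1) / (bl + 1))."""
    return math.log((y + 1.0) / (bl + 1.0))

def from_residual(model_raw: float, bl: float) -> float:
    """Inverse at inference: y_hat = exp(model_raw) * (bl + 1) - 1."""
    return math.exp(model_raw) * (bl + 1.0) - 1.0

def log_diff(y: float, bl: float) -> float:
    """The log_diff variant: log1p(y) - log1p(bl)."""
    return math.log1p(y) - math.log1p(bl)

bl = 300.0  # hypothetical baseline p50, seconds
y = 450.0   # hypothetical observed value, seconds

y_t = to_residual(y, bl)
round_trip = from_residual(y_t, bl)   # recovers y up to float error
no_correction = from_residual(0.0, bl)  # model output 0 returns the baseline

print(y_t, round_trip, no_correction)
```

A raw model output of 0 means "no correction", so the served prediction falls back to the baseline value, which is what makes the layered design degrade gracefully.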

Configs in use:

  • configs/run_duration_residual.yaml
  • configs/wait_time_residual.yaml

4. Known gaps (carried as Phase 3 optimization work)

  1. Wait within-2x below the +5pp spec target. Attained +2.9pp; target was +5pp. The gap is concentrated in two places:

    • 1-5m bucket: residual regresses vs LightGBM-only (−6.3pp) because the residual pulls toward baseline memorization, partially undoing LightGBM-only's strength there.
    • 5-30m bucket: neither variant beats baseline on within-2x by a margin large enough to move the aggregate.
  2. 5-30m wait MAE regression. Residual MAE (478s) is 13% worse than baseline (423s). The transform variants did not fix this — additive is 25% worse. Working hypothesis: the available features (queue_pending, priority, time of day, tags) saturate in this range and can't distinguish 8-minute waits from 20-minute waits. Fix is feature-side, not architecture-side. Candidate features for Phase 3: queue velocity over the last N minutes, recent p50-drift per queue, tree-closure / landing-queue signal.

  3. Wait p90 coverage at lower edge. 85.8% for log_ratio is in the acceptable [85, 95]% band but tight against the lower bound. additive gets 90.9% at the cost of MAE. Tunable either via transform choice or by training the p90 model at a slightly higher quantile (e.g. alpha = 0.92) so that its empirical coverage lands closer to 90%.

5. Next phase — production path

From trainer-spec.md §"Out of scope (deferred)":

  1. ONNX export from Python trainer for both p50 and p90 models of both targets, with category-mapping sidecar JSONs.
  2. Parity tests between Python LightGBM predictions and ONNX-runtime predictions (required before any model is deployed — float precision differences are the usual failure mode).
  3. Node.js inference wiring in src/predictor.js via onnxruntime-node: load both models at startup, replicate the FeatureBuilder transforms in JS (including the baseline-as-feature join and the log_ratio inverse), write predictions to queue_forecast_run_predictions on each task-pending event.
  4. Model hot-reload — predictor watches the models volume for new .onnx files, swaps atomically after parity check.
  5. Nightly training cron in docker-compose, producing new models daily.
  6. Baseline prediction export as a standard ops step — the current predictor.js --export-baseline-predictions mode runs once per training window; in production it would run nightly ahead of training.

Deferred / optimization:

  • 5-30m bucket feature work (carried from §4.2 above).
  • within-2x calibration improvements — either via loss-function tuning or bucket-conditional quantile choice.
  • p90 coverage tightening toward 90% (transform choice or alpha tuning).
  • XGBoost QuantileModel subclass (pluggable interface already in place; experiment once production path is stable).

Recommendation

Ship the residual architecture. The evidence is strong enough and the remaining gaps are diagnosable optimization targets rather than architectural blockers. Build the production path next; revisit within-2x and 5-30m feature work once the serving pipeline is in place and we're getting real production feedback.

Queue Forecasting — Trainer Spec (Phase 1)

Companion to spec.md. This document specifies the first pass at the Python training pipeline: a hybrid experimentation-and-production scaffold that answers "does LightGBM beat the percentile baseline, and by how much?" while laying down structure we can keep for the nightly retrain pipeline later.

Assumes familiarity with the overall design in spec.md. See that document for the broader goals, schema, and deployment architecture options.

Goal

Train LightGBM quantile regression models for both targets (run duration and queue wait time), evaluate them on held-out data, and produce enough signal to decide whether to invest in the full nightly pipeline (ONNX export, real-time inference, hot-reload).

Current baseline (numbers to beat)

Measured on 2026-04-20 holdout (src/predictor.js):

| Target | Within-2x | MAE |
|---|---|---|
| Run duration | 87.4% | 150.2s |
| Wait time | 50.3% | 193.2s |

Run duration baseline is dominated by metadata_name exact-match percentiles (93% coverage). Wait time baseline uses task_queue_id + pending_bucket, which captures only the obvious two-factor interaction; LightGBM sees more features and is expected to improve on it significantly.

Scope

In scope (Phase 1)

  • Python 3.13 trainer under tools/queue-forecasting/trainer/
  • LightGBM quantile regression for both run duration and wait time
  • Two models per target: p50 and p90 (separate trained models, same config)
  • Parquet-cached data loading from Postgres
  • Feature engineering: categorical casts, tag JSONB extraction, build-type regex, cyclical time encoding
  • Per-day holdout evaluation (MAE, within-2x, pinball loss, p90 calibration)
  • Dockerized: docker compose run --rm trainer --config ...
  • Model abstraction layer so XGBoost can be swapped in later

Out of scope (deferred)

  • ONNX export (needed only when wiring inference into Node.js)
  • Category mapping sidecar (only needed for ONNX)
  • XGBoost implementation (pluggable interface is enough for now)
  • Real-time inference path
  • Writes to queue_forecast_run_predictions
  • Nightly cron scheduling
  • Model hot-reload in the predictor
  • Parity tests between Python and ONNX runtime

Directory layout

tools/queue-forecasting/
├── src/                              # existing Node.js (collector, baseline predictor)
├── trainer/                          # NEW
│   ├── Dockerfile
│   ├── pyproject.toml
│   ├── uv.lock
│   ├── configs/
│   │   ├── run_duration.yaml
│   │   └── wait_time.yaml
│   ├── src/
│   │   ├── __init__.py
│   │   ├── data_loader.py
│   │   ├── features.py
│   │   ├── model.py
│   │   ├── train.py
│   │   └── evaluate.py
│   └── data/                         # gitignored
│       ├── cache/                    # Parquet caches
│       └── models/                   # Trained models + manifests, per run date
└── ...

Training cache and model outputs live under trainer/data/ and are volume-mounted so they persist across container runs.

Module responsibilities

Each module has one clear purpose and can be tested independently.

data_loader.py

  • Query Postgres with the config's training filter and lookback window
  • Cache result as Parquet under data/cache/<target>_lb<N>_asof<ISO8601>_<cfg8>.parquet where <cfg8> is the first 8 hex chars of sha256(canonical_json(query_shaping_config)). "Query-shaping config" includes: target, target_column, filters, selected columns (derived from categorical_features + numeric_features), and lookback/window dates. It does not include model hyperparameters or output paths. This ensures a filter or column-list change produces a different cache key automatically.
  • Subsequent loads with the same query-shape hit the cache (sub-second)
  • --refresh-cache flag forces a re-query even if the cache is present

Interface: load(config) -> pd.DataFrame
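A minimal sketch of the cache-key scheme described above. The spec does not define `canonical_json`; sorted keys with compact separators is an assumption here, as are the function names and the sample config.

```python
import hashlib
import json

def canonical_json(obj) -> str:
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def cache_key(query_shaping_config: dict) -> str:
    """First 8 hex chars of sha256 over the canonical JSON (<cfg8>)."""
    payload = canonical_json(query_shaping_config).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]

def cache_filename(target: str, lookback_days: int, as_of: str, cfg: dict) -> str:
    return f"{target}_lb{lookback_days}_asof{as_of}_{cache_key(cfg)}.parquet"

# Hypothetical query-shaping config: no hyperparameters, no output paths.
cfg = {
    "target": "wait_time",
    "target_column": "wait_duration_s",
    "filters": ["r.started_at IS NOT NULL", "r.wait_duration_s >= 0"],
    "columns": ["task_queue_id", "queue_pending"],
    "lookback_days": 14,
}
reordered = dict(reversed(list(cfg.items())))           # same content, new key order
changed = {**cfg, "filters": cfg["filters"] + ["r.queue_pending IS NOT NULL"]}

print(cache_filename("wait_time", 14, "2026-04-24T00:00:00Z", cfg))
```

Key order does not affect the hash, while any change to filters or columns does, which is the property the cache relies on.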

features.py

Stateful builder — vocabulary fit on train only, applied verbatim to val and holdout. This prevents category-code drift across splits (a queue that's code 5 in train cannot become code 8 in holdout).

  • Tag JSONB extraction (tags.kind, tags.os, tags.project, tags.test-type, tags.worker-implementation)
  • Build-type regex extraction from metadata_name (debug/opt)
  • Cyclical time encoding from pending_at (hour_sin, hour_cos, day_sin, day_cos) — only for wait time model
  • Cast categorical columns to pandas.Categorical using the fixed vocabulary learned during fit; unseen values in val/holdout become NaN (LightGBM handles natively as "unknown")
  • Record per-split stats: which features are categorical vs numeric, cardinalities, NULL rates, and — critically — per-feature unseen-rate on val/holdout (the real cold-start metric)

Interface:

from dataclasses import dataclass

import pandas as pd

@dataclass
class Split:
    X: pd.DataFrame        # feature matrix, LightGBM-ready
    y: pd.Series           # target column (guaranteed non-null by the
                           # loader's config filters)
    meta: pd.DataFrame     # non-feature columns used for slicing/reporting:
                           #   pending_at, reason_resolved, task_id, run_id
                           # resolved_at is intentionally NOT here — the
                           # evaluator never needs to check "has ground
                           # truth?" because the loader's filters enforce
                           # that upstream (see Evaluation Protocol)
    stats: dict            # per-split feature stats (cardinalities, NULL
                           # rates, unseen rates for val/holdout)

class FeatureBuilder:
    def __init__(self, config): ...
    def fit_transform(self, df: pd.DataFrame) -> Split: ...   # called once, on train
    def transform(self, df: pd.DataFrame) -> Split: ...       # called on val and holdout

The meta DataFrame is row-aligned with X and y. It carries the columns evaluate.py needs for slicing — pending_at (for per-day breakdown) and reason_resolved (for primary vs supplemental slices) — without putting them into the feature matrix. task_id and run_id are included for per-row debugging.

Typical use from train.py:

builder = FeatureBuilder(config)
train = builder.fit_transform(train_df)
val   = builder.transform(val_df)
hold  = builder.transform(hold_df)
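To make the fit-on-train rule concrete, here is a toy stand-in for the categorical part of FeatureBuilder, written without pandas. `VocabularySketch` and `unseen_rate` are hypothetical names, and the sketch assumes non-null string values; the real builder would emit pandas.Categorical columns and NaN instead of integer codes and None.

```python
class VocabularySketch:
    """Learns category codes on train and reuses them verbatim elsewhere."""

    def __init__(self):
        self.codes = None

    def fit_transform(self, values):
        # Vocabulary is fixed here, from training values only.
        self.codes = {v: i for i, v in enumerate(sorted(set(values)))}
        return self.transform(values)

    def transform(self, values):
        if self.codes is None:
            raise RuntimeError("fit_transform must be called on train first")
        # Unseen values map to None (the stand-in for NaN / "unknown").
        return [self.codes.get(v) for v in values]

def unseen_rate(encoded):
    """Share of rows whose category was not in the training vocabulary."""
    return sum(code is None for code in encoded) / len(encoded)

vocab = VocabularySketch()
train_codes = vocab.fit_transform(["q-b", "q-a", "q-b", "q-c"])
hold_codes = vocab.transform(["q-c", "q-new", "q-a", "q-new"])
print(train_codes, hold_codes, unseen_rate(hold_codes))
```

Because the mapping is frozen after `fit_transform`, "q-c" keeps the same code in every split, and the unseen rate falls out of the encoding step for free.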

model.py

Abstract interface plus LightGBM implementation. Designed so XGBoost can slot in as a sibling subclass with no changes elsewhere.

from abc import ABC, abstractmethod
from pathlib import Path

import numpy as np

class QuantileModel(ABC):
    def __init__(self, alpha: float, params: dict): ...
    @abstractmethod
    def fit(self, X_train, y_train, X_val, y_val): ...
    @abstractmethod
    def predict(self, X) -> np.ndarray: ...
    @abstractmethod
    def save(self, path: Path): ...
    @classmethod
    @abstractmethod
    def load(cls, path: Path) -> "QuantileModel": ...

class LightGBMQuantileModel(QuantileModel):
    # objective='quantile', alpha=<0.5 or 0.9>
    # Uses LightGBM's native categorical support via dtype='category'
    # Early stopping on validation set
    ...

# Deferred (no implementation yet, just pattern):
# class XGBoostQuantileModel(QuantileModel): ...

One instance = one quantile. The trainer creates two instances (p50 and p90) and saves each to its own file.

train.py

CLI entrypoint. Orchestrates:

  1. Load config YAML
  2. Call data_loader.load(config) to pull all required rows
  3. Split the DataFrame by pending_at into train_df, val_df, hold_df (see Evaluation Protocol below for bounds)
  4. train = builder.fit_transform(train_df); val = builder.transform(val_df); hold = builder.transform(hold_df). FeatureBuilder is fit once on train and applied to the other splits. No features are recomputed downstream.
  5. For each quantile in config: instantiate model, fit(train.X, train.y, val.X, val.y) (the validation split drives early stopping), save
  6. Call evaluate.evaluate(models, hold, config, baseline_dir) — the Split object carries meta.reason_resolved and meta.pending_at so the evaluator can slice to primary (completed-only) and supplemental (completed + failed) populations and compute per-day breakdowns without ever touching the raw frame
  7. Write manifest JSON alongside models

CLI:

python -m trainer.src.train --config configs/wait_time.yaml [--refresh-cache]

evaluate.py

Holdout evaluation. Reports metrics per-day and aggregate, broken out into two slices:

  • Primary (completed-only) — the apples-to-apples comparison against the Node.js percentile baseline. This is the go/no-go metric for Phase 1.
  • Supplemental (completed + failed) — full production population. Reported for visibility; not used for the go/no-go decision.

Metrics (computed identically for both slices):

  • MAE (mean absolute error, seconds)
  • Within-2x rate — defined in Evaluation Protocol below, matches the zero-handling rule in src/predictor.js
  • Pinball loss at the trained quantile (p50 model → pinball-0.5; p90 model → pinball-0.9)
  • p90 calibration: mean(actual <= pred_p90); target ~0.90
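The three non-ratio metrics can be sketched with the standard definitions. This is a minimal pure-Python illustration, not the evaluator itself; function names and sample values are hypothetical.

```python
def mae(actual, pred):
    """Mean absolute error in the unit of the inputs (seconds here)."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def pinball_loss(actual, pred, alpha):
    """Quantile (pinball) loss: under-prediction costs alpha, over-prediction 1 - alpha."""
    total = 0.0
    for a, p in zip(actual, pred):
        diff = a - p
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(actual)

def p90_calibration(actual, pred_p90):
    """Share of rows where the actual value is covered by the p90 prediction."""
    return sum(a <= p for a, p in zip(actual, pred_p90)) / len(actual)

# Hypothetical holdout rows, seconds.
actual = [100.0, 200.0, 400.0, 50.0]
pred_p50 = [120.0, 150.0, 300.0, 50.0]
pred_p90 = [150.0, 250.0, 350.0, 80.0]

print(mae(actual, pred_p50))
print(pinball_loss(actual, pred_p50, 0.5))
print(pinball_loss(actual, pred_p90, 0.9))
print(p90_calibration(actual, pred_p90))
```

At alpha = 0.5 the pinball loss is exactly half the MAE, which is a handy sanity check when wiring the p50 model into the report.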

Also reads per-day baseline JSONs (one per holdout day) and prints per-day + aggregate deltas. Baseline and trainer aggregate the same way (per-row, not per-day-mean) to keep numbers comparable.

Interface: evaluate(models, hold: Split, config, baseline_dir) -> MetricsReport

The evaluator operates purely on the already-transformed Split. It never sees hold_df and never calls FeatureBuilder, so there is no way for feature recomputation to drift from what the model was trained on. Primary vs supplemental slicing is done via hold.meta.reason_resolved; per-day slicing is done via hold.meta.pending_at.dt.floor("D").

Config files

One YAML per target. Driven entirely by config; no hardcoded feature lists in train.py.

configs/wait_time.yaml

target: wait_time
target_column: wait_duration_s

lookback_days: 14
holdout_days: 5            # configurable — will try 7 later
validation_days: 1
as_of_date: 2026-04-24     # exclusive upper bound on pending_at. See "Time bounds" below for the null rule.

filters:
  - "r.started_at IS NOT NULL"
  - "r.queue_pending IS NOT NULL"
  - "r.wait_duration_s IS NOT NULL"
  - "r.wait_duration_s >= 0"

categorical_features:
  - task_queue_id
  - scheduler_id
  - priority_at_pending
  - tags.kind
  - tags.os
  - tags.project
  - tags.worker-implementation

numeric_features:
  - queue_pending
  - max_run_time_s
  - hour_sin
  - hour_cos
  - day_sin
  - day_cos

derived_features:
  cyclical_time: { source: pending_at }

model_type: lightgbm
quantiles: [0.5, 0.9]
model_params:
  num_leaves: 63
  learning_rate: 0.05
  n_estimators: 500
  early_stopping_rounds: 20
  min_data_in_leaf: 100

configs/run_duration.yaml

Key differences from wait time:

  • target_column: run_duration_s
  • lookback_days: 30
  • Filters: reason_resolved IN ('completed', 'failed'), run_duration_s IS NOT NULL, started_at IS NOT NULL
  • Features drop priority_at_pending, queue_pending, cyclical time
  • Features add metadata_name, normalized_name, tags.test-type, build_type (derived)
  • Derived features include build_type_regex: { source: metadata_name, pattern: "(debug|opt)" }

Deliberate population mismatch between training and primary evaluation: The duration filter above keeps both completed and failed runs in the training data, because a 25-minute test that fails still took 25 minutes, and excluding failures biases training toward only successful (often shorter) runs. But the primary evaluation slice is completed only (to match the Node.js baseline, which also evaluates only completed runs). This is intentional. The supplemental slice includes failures so the model's behavior on that population is still measured; it just doesn't feed the go/no-go decision. This asymmetry is noted here so nobody assumes the training population and the primary evaluation population are the same.

Evaluation protocol

This is a pending-time forecaster: predictions are made the instant a run enters pending. All splitting, training, and evaluation is anchored on pending_at — never resolved_at.

Time bounds

All window bounds are half-open intervals [start, end) over pending_at, using UTC. as_of_date is an ISO instant (e.g. 2026-04-24T00:00:00Z) and is the exclusive upper bound. Configs may specify a date-only string (2026-04-24); the loader interprets this as T00:00:00Z of that day.

If as_of_date is null, the loader normalizes it to today's UTC midnight (the most recent past midnight relative to wall-clock time at invocation). This guarantees every holdout day is a complete [D 00:00Z, D+1 00:00Z) window and drops any partial current-day data on the floor. Consequences:

  • A training run started at any time on 2026-04-23 (UTC) with a null as_of_date resolves it to 2026-04-23T00:00:00Z. Holdout is Apr 18 → Apr 22 (five complete days). Apr 23 data is excluded entirely.
  • To include Apr 23 in evaluation, wait until Apr 24 or set as_of_date: 2026-04-24 explicitly.

Partial-day holdouts are disallowed because they silently poison per-day metric aggregation (one short day pulls down counts and distorts aggregates) and break the baseline's per-whole-day JSON contract.

Given as_of_date = A, lookback_days = L, validation_days = V, holdout_days = H:

train:   [A - (L + V + H) days,  A - (V + H) days)
val:     [A - (V + H) days,      A - H days)
holdout: [A - H days,            A)

If the database has less than L days of history, the training window start is clipped to max(train_start, earliest_available_pending_at). The materialized windows (actual start/end + row counts) are recorded in the manifest.
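A minimal sketch of the window arithmetic and the null-as_of_date rule, assuming timezone-aware datetimes. The function names `resolve_as_of` and `compute_windows` are hypothetical; clipping to the earliest available row is left out.

```python
from datetime import datetime, timedelta, timezone

def resolve_as_of(as_of, now):
    """Explicit as_of wins; null falls back to the most recent UTC midnight."""
    if as_of is not None:
        return as_of
    utc_now = now.astimezone(timezone.utc)
    return utc_now.replace(hour=0, minute=0, second=0, microsecond=0)

def compute_windows(as_of, lookback_days, validation_days, holdout_days):
    """Half-open [start, end) windows over pending_at, newest window last."""
    day = timedelta(days=1)
    hold_start = as_of - holdout_days * day
    val_start = hold_start - validation_days * day
    train_start = val_start - lookback_days * day
    return {
        "train": (train_start, val_start),
        "val": (val_start, hold_start),
        "holdout": (hold_start, as_of),
    }

# Invocation in the afternoon of 2026-04-23 UTC with as_of_date left null.
now = datetime(2026, 4, 23, 15, 30, tzinfo=timezone.utc)
as_of = resolve_as_of(None, now)
windows = compute_windows(as_of, lookback_days=14, validation_days=1, holdout_days=5)
for name, (start, end) in windows.items():
    print(name, start.isoformat(), end.isoformat())
```

With the wait-time config values (L=14, V=1, H=5) this yields a holdout of Apr 18 through Apr 22, matching the example in the consequences list above.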

Feature-available time (no leakage)

At the moment a prediction is made, only information known at pending_at is legitimate input. This matters for the baseline percentile history as well as anything a future model might use:

  • For a prediction on a run whose pending_at falls in day D, history used to compute percentiles must come from rows with resolved_at < D 00:00Z (start of D).
  • This is slightly conservative — it's the start-of-day cutoff rather than a per-row resolved_at < pending_at cutoff — but it prevents leakage and is cheap to compute. Per-row cutoffs are deferred until shown to matter.

The trainer enforces this by construction (training set is entirely before the validation window; validation is entirely before holdout). The baseline must enforce the same rule — see "Baseline export" below.
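The start-of-day cutoff can be sketched as a filter over history rows. This is an illustration under assumptions: history rows are dicts with a timezone-aware `resolved_at`, and both function names are hypothetical.

```python
from datetime import datetime, timezone

def history_cutoff(pending_at: datetime) -> datetime:
    """Start of the UTC day D that contains pending_at (D 00:00Z)."""
    utc = pending_at.astimezone(timezone.utc)
    return utc.replace(hour=0, minute=0, second=0, microsecond=0)

def usable_history(history, pending_at):
    """Keep only rows resolved strictly before the start of day D."""
    cutoff = history_cutoff(pending_at)
    return [
        row for row in history
        if row["resolved_at"] is not None and row["resolved_at"] < cutoff
    ]

pending_at = datetime(2026, 4, 20, 9, 0, tzinfo=timezone.utc)
history = [
    {"id": "a", "resolved_at": datetime(2026, 4, 19, 23, 59, tzinfo=timezone.utc)},
    # Resolved before pending_at but on the same day: dropped (conservative rule).
    {"id": "b", "resolved_at": datetime(2026, 4, 20, 8, 0, tzinfo=timezone.utc)},
    {"id": "c", "resolved_at": None},  # still running: never usable
]
kept = [row["id"] for row in usable_history(history, pending_at)]
print(kept)
```

Row "b" shows the cost of the conservative rule: it was known at prediction time but is still excluded, which is the trade-off the per-row cutoff would remove.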

Evaluation population

For each holdout day D, the evaluation set = runs where pending_at ∈ [D 00:00Z, D+1 00:00Z).

Primary evaluation filters further to reason_resolved = 'completed' — the population the Node.js baseline uses. This is the apples-to-apples slice and the go/no-go metric for Phase 1.

Supplemental evaluation adds reason_resolved = 'failed' for visibility but is not used to decide the Phase 1 outcome.

Rows without ground truth for the target are excluded upstream by the loader's config filter, not by the evaluator. Per target:

  • Wait time: the loader's filter requires started_at IS NOT NULL AND wait_duration_s IS NOT NULL. Any row satisfying these has observable wait duration, even if the run is still executing and resolved_at is NULL.
  • Run duration: the loader's filter requires reason_resolved IN ('completed', 'failed') AND run_duration_s IS NOT NULL AND started_at IS NOT NULL. These jointly imply the run is resolved.

Because the evaluator only ever sees rows that have already passed these filters, Split.meta does not need to carry resolved_at — the evaluator never has to re-check "is there ground truth?". If a future target requires different filtering, the config changes and the invariant is still maintained at load time.

Within-2x rule (zero handling)

Matches src/predictor.js:471-474 exactly:

if predicted > 0 AND actual > 0:
    within_2x = max(pred/actual, actual/pred) <= 2
else:
    row is counted in n but excluded from the within_2x numerator
    and denominator

Wait time can legitimately be zero, so without this rule the ratio is undefined. This definition is documented here as the single source of truth for both sides of the comparison.
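A minimal Python rendering of the rule as stated here (it mirrors the pseudocode above, not the Node.js source, which is not reproduced in this document). The function name and return shape are hypothetical.

```python
def within_2x_rate(preds, actuals):
    """Rows with a non-positive side count toward n but not toward the rate."""
    n = 0        # every row seen
    scored = 0   # rows where the ratio is defined
    hits = 0     # scored rows within a factor of 2
    for pred, actual in zip(preds, actuals):
        n += 1
        if pred > 0 and actual > 0:
            scored += 1
            if max(pred / actual, actual / pred) <= 2:
                hits += 1
    rate = hits / scored if scored else float("nan")
    return {"n": n, "scored": scored, "rate": rate}

# Hypothetical rows: one hit, one miss, and two rows with a zero on one side.
preds = [10.0, 30.0, 0.0, 5.0]
actuals = [15.0, 100.0, 4.0, 0.0]
result = within_2x_rate(preds, actuals)
print(result)
```

Reporting `n` and `scored` side by side makes it visible how many rows the zero-handling rule removed from the rate.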

Data loading

Query template

Per config, the loader emits a query like:

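SELECT r.<target_column> AS y,
       r.task_id,               -- carried into Split.meta for per-row debugging
       r.run_id,                -- carried into Split.meta for per-row debugging
       r.pending_at,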
       r.resolved_at,
       r.reason_resolved,
       r.queue_pending,
       r.priority_at_pending,
       t.task_queue_id,
       t.scheduler_id,
       t.metadata_name,
       t.normalized_name,
       t.max_run_time_s,
       t.tags
FROM queue_forecast_task_runs r
JOIN queue_forecast_tasks t ON r.task_id = t.task_id
WHERE r.pending_at >= $train_start
  AND r.pending_at <  $as_of_date
  AND <config.filters joined by AND>

Only the columns the config needs are selected. For the wait model, metadata_name/normalized_name are skipped; for the duration model, queue_pending/priority_at_pending are skipped. reason_resolved is always selected because the evaluator uses it (via Split.meta) to build the primary vs supplemental evaluation populations.

The query pulls the full [train_start, as_of_date) range in one go; the trainer splits into train/val/holdout after the DataFrame is loaded.

Splitting

Pure pending_at-based split using the half-open bounds defined under "Evaluation protocol":

[------- train -------][--- val ---][------ holdout ------)
train_start         val_start     hold_start            as_of

LightGBM uses (X_val, y_val) as the eval_set for early stopping. No shuffling, no random splits anywhere in the pipeline.
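The window arithmetic can be sketched as follows, assuming the three windows sit back-to-back and end at as_of_date, which is what the illustrative dates elsewhere in this document suggest. `compute_windows` and `assign_split` are hypothetical names, not the trainer's actual functions.

```python
from datetime import datetime, timedelta, timezone


def compute_windows(as_of, lookback_days, validation_days, holdout_days):
    """Derive half-open [start, end) windows working backwards from as_of."""
    hold_start = as_of - timedelta(days=holdout_days)
    val_start = hold_start - timedelta(days=validation_days)
    train_start = val_start - timedelta(days=lookback_days)
    return {
        "train": (train_start, val_start),
        "val": (val_start, hold_start),
        "holdout": (hold_start, as_of),
    }


def assign_split(pending_at, windows):
    """Return the window containing pending_at, or None if outside all of them."""
    for name, (start, end) in windows.items():
        if start <= pending_at < end:  # start inclusive, end exclusive
            return name
    return None


as_of = datetime(2026, 4, 24, tzinfo=timezone.utc)
windows = compute_windows(as_of, lookback_days=14, validation_days=1, holdout_days=5)
```

Because the bounds are half-open, a run with pending_at exactly at val_start lands in val and never in train, so no row can appear in two splits.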

Feature engineering

Tag extraction

The JSONB tags column is expected to already be a Python dict per row after pandas.read_parquet. This assumes dict values survive the Parquet round-trip, which must be verified in implementation; if they do not, convert to dict explicitly.

For each tags.<key> in the config, extract the value into a new column and cast it to Categorical. Missing values become NaN, which LightGBM handles natively.
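A minimal sketch of the extraction step, assuming each row's tags value is a dict or None. The helper name `extract_tags` and the sample rows are hypothetical; the `tags.<key>` column naming follows the manifest example in this document.

```python
import pandas as pd


def extract_tags(df: pd.DataFrame, tag_keys: list) -> pd.DataFrame:
    """Copy df and add one categorical column per requested tag key."""
    out = df.copy()
    for key in tag_keys:
        # Rows whose tags value is not a dict, or lacks the key, yield None.
        values = out["tags"].map(lambda t: t.get(key) if isinstance(t, dict) else None)
        out[f"tags.{key}"] = values.astype("category")  # None becomes NaN
    return out


df = pd.DataFrame({"tags": [{"test-type": "mochitest"}, {}, None]})
out = extract_tags(df, ["test-type"])
```

This sketch casts with the vocabulary of whatever frame it is given. Aligning val/holdout categories to the train-fit vocabulary is a separate step.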

Build type (duration model only)

df["build_type"] = df["metadata_name"].str.extract(r"/(debug|opt)[-/]", expand=False)
df["build_type"] = df["build_type"].astype("category")

Rows that don't match (e.g. non-test tasks) get NaN and LightGBM routes them through a default split.

Cyclical time (wait model only)

hour = df["pending_at"].dt.hour
dow = df["pending_at"].dt.dayofweek
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
df["day_sin"] = np.sin(2 * np.pi * dow / 7)
df["day_cos"] = np.cos(2 * np.pi * dow / 7)

Categorical handling

Every feature in categorical_features is cast to pandas.Categorical using the train-fit vocabulary via FeatureBuilder (see features.py above). LightGBM's categorical_feature='auto' detects these and uses native categorical splits — no one-hot encoding, no manual integer encoding. This is one of the main reasons we chose LightGBM over XGBoost for v1.

Unseen categorical values in val/holdout become NaN and are routed through LightGBM's default "missing" branches. FeatureBuilder tracks the unseen-rate per column per split so cold-start performance is measurable (see manifest).
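The train-fit vocabulary idea can be illustrated with plain pandas. This is a sketch under stated assumptions, not the FeatureBuilder implementation: `fit_vocab` and `apply_vocab` are hypothetical names, and the unseen rate is taken here as the fraction of all rows whose non-null value was absent from train.

```python
import pandas as pd


def fit_vocab(train_col: pd.Series) -> pd.Index:
    """Collect the distinct non-null values seen in the training split."""
    return pd.Index(train_col.dropna().unique())


def apply_vocab(col: pd.Series, vocab: pd.Index):
    """Cast col to the train vocabulary; values outside it become NaN."""
    cat = pd.Series(pd.Categorical(col, categories=vocab), index=col.index)
    unseen = col.notna() & cat.isna()  # had a value, but not one seen in train
    return cat, float(unseen.mean())


train = pd.Series(["queue-a", "queue-b", "queue-a"])
holdout = pd.Series(["queue-a", "queue-c", None, "queue-b"])
vocab = fit_vocab(train)
holdout_cat, unseen_rate = apply_vocab(holdout, vocab)
```

Keeping the unseen mask separate from the null mask distinguishes "value was missing" from "value is new since training", which is the cold-start signal.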

Evaluation

Metrics

Metrics are computed per holdout day and aggregated across the holdout window. Aggregation is per-row (concatenate all holdout rows, then compute), not a mean of per-day values; this matches how the baseline aggregates and keeps comparisons consistent.

For each quantile model:

  • n: number of predictions in the slice
  • MAE: mean(|pred - actual|)
  • Within-2x: see Evaluation Protocol above for the exact zero-handling rule (matches src/predictor.js)
  • Pinball loss at target quantile q: mean(max(q*(actual-pred), (q-1)*(actual-pred)))
  • p90 coverage (for the p90 model): mean(actual <= pred_p90) — target ~0.90

Every metric is reported twice per model: once on the primary slice (reason_resolved = 'completed') and once on the supplemental slice (reason_resolved IN ('completed', 'failed')).
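The MAE, pinball-loss, and coverage definitions above translate directly into code. The following is a dependency-free sketch with hypothetical function names and made-up sample values; a real implementation would more likely use numpy arrays.

```python
def mae(actual, pred):
    """mean(|pred - actual|)"""
    return sum(abs(p - a) for a, p in zip(actual, pred)) / len(actual)


def pinball_loss(actual, pred, q):
    """mean(max(q*(actual-pred), (q-1)*(actual-pred))) at quantile q."""
    return sum(max(q * (a - p), (q - 1) * (a - p)) for a, p in zip(actual, pred)) / len(actual)


def coverage(actual, pred_upper):
    """Fraction of rows whose actual falls at or below the predicted quantile."""
    return sum(a <= p for a, p in zip(actual, pred_upper)) / len(actual)


actual = [10.0, 20.0, 30.0]
pred_p90 = [12.0, 18.0, 30.0]
loss = pinball_loss(actual, pred_p90, q=0.9)
cov = coverage(actual, pred_p90)
err = mae(actual, pred_p90)
```

At q = 0.9, under-predicting by 2 costs 1.8 while over-predicting by 2 costs only 0.2. That asymmetry is what pushes the p90 model toward covering roughly 90% of actuals.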

Output

Numbers below are illustrative only (not from a real run):

=== Wait Time Model — Holdout Evaluation ===
Config: configs/wait_time.yaml
Windows (UTC), lookback_days=14, validation_days=1, holdout_days=5:
  train:   [2026-04-04T00Z, 2026-04-18T00Z)   14d, 2.35M rows
  val:     [2026-04-18T00Z, 2026-04-19T00Z)    1d, 171k rows
  holdout: [2026-04-19T00Z, 2026-04-24T00Z)    5d, 823k rows

--- Primary slice: reason_resolved = 'completed' ---
Per-day (MAE, w/in-2x, pinball-p50 from the p50 model; pinball-p90, p90-cov from the p90 model):
              N       MAE    w/in-2x  pinball-p50  pinball-p90  p90-cov
Apr 19 Sun   72k    142s    58.4%        71.0         38.2       88.1%
Apr 20 Mon  154k    156s    54.8%        78.1         41.9       89.4%
...
Aggregate   ...

Baseline (--pending-eval-date, completed-only, same holdout days):
  Aggregate: 193.2s MAE, 50.3% within-2x

Delta (LightGBM - baseline):
  MAE:    -22.9%
  w/in-2x: +6.8pp

--- Supplemental slice: reason_resolved IN ('completed','failed') ---
(same table structure; no baseline delta reported)

Baseline numbers are read from per-day JSON files under trainer/data/baseline/ (one per holdout day, produced by predictor.js --pending-eval-date D --output-json ...). The trainer aggregates these identically to how it aggregates its own per-day numbers, so the deltas compare like to like.

Artifacts

Model files

Per training run:

trainer/data/models/2026-04-24/
├── wait_time_p50.lgb
├── wait_time_p90.lgb
├── wait_time_manifest.json
├── run_duration_p50.lgb
├── run_duration_p90.lgb
└── run_duration_manifest.json

The directory name is the as_of_date of the training run. Models are saved in LightGBM's native text format (.lgb, via booster.save_model()), which is readable, versionable, and carries no pickle security risk.

Manifest

One JSON per target per training run. Field values below are illustrative:

{
  "target": "wait_time",
  "config_path": "configs/wait_time.yaml",
  "config_hash": "a3f1c28e",
  "trained_at": "2026-04-24T02:11:03Z",
  "model_type": "lightgbm",
  "lightgbm_version": "4.5.0",
  "windows": {
    "as_of_date": "2026-04-24T00:00:00Z",
    "lookback_days": 14,
    "validation_days": 1,
    "holdout_days": 5,
    "train":   { "start": "2026-04-04T00:00:00Z", "end": "2026-04-18T00:00:00Z", "rows": 2347102 },
    "val":     { "start": "2026-04-18T00:00:00Z", "end": "2026-04-19T00:00:00Z", "rows": 171032 },
    "holdout": { "start": "2026-04-19T00:00:00Z", "end": "2026-04-24T00:00:00Z", "rows": 823419 }
  },
  "features": {
    "categorical": [...],
    "numeric": [...],
    "cardinalities": { "task_queue_id": 73, "scheduler_id": 12, ... },
    "null_rates":    { "tags.test-type": 0.31, ... },
    "unseen_rates_holdout": { "task_queue_id": 0.002, "scheduler_id": 0.047, ... }
  },
  "model_params": {...},
  "quantiles": [0.5, 0.9],
  "evaluation": {
    "primary": {
      "slice": "reason_resolved = 'completed'",
      "per_day": [...],
      "aggregate": {...},
      "baseline_delta": {...}
    },
    "supplemental": {
      "slice": "reason_resolved IN ('completed','failed')",
      "per_day": [...],
      "aggregate": {...}
    }
  }
}

The manifest carries enough to reproduce a run and to diff one run against another.

Dockerization

docker-compose.yml additions

New trainer service:

trainer:
  build:
    context: ../..
    dockerfile: tools/queue-forecasting/trainer/Dockerfile
  depends_on:
    postgres:
      condition: service_healthy
  profiles:
    - trainer
    - full
  env_file:
    - .env
  environment:
    DATABASE_URL: postgresql://postgres@postgres:5432/forecasting
  volumes:
    - ./trainer:/app/trainer
    - ./trainer/data:/app/trainer/data
  entrypoint: ["uv", "run", "python", "-m", "trainer.src.train"]

Update the existing predictor service so it can write baseline JSONs to the same shared directory the trainer reads from:

predictor:
  # ... existing fields unchanged ...
  volumes:
    - ./src:/app/tools/queue-forecasting/src
    - ./trainer/data/baseline:/app/tools/queue-forecasting/trainer/data/baseline

The second mount is the only addition. The path /app/tools/queue-forecasting/trainer/data/baseline inside the predictor container is where --output-json writes; it resolves to ./trainer/data/baseline/ on the host, which the trainer mounts at /app/trainer/data/baseline/.

trainer/Dockerfile

FROM ghcr.io/astral-sh/uv:python3.13-bookworm-slim

WORKDIR /app/trainer
COPY tools/queue-forecasting/trainer/pyproject.toml \
     tools/queue-forecasting/trainer/uv.lock ./
RUN uv sync --frozen --no-install-project

COPY tools/queue-forecasting/trainer /app/trainer

ENV PYTHONPATH=/app

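Source is volume-mounted in compose so iteration doesn't require a rebuild; baking it into the image as well keeps the image self-contained for reproducibility. One caveat to check in implementation: the ./trainer bind mount covers /app/trainer, including the .venv that uv sync created there at build time, so uv run will use (or create) whatever environment exists in the host directory instead.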

Baseline generation (separate container)

The trainer image is Python-only. It does not contain Node.js and cannot execute src/predictor.js. Baseline JSONs are generated by the existing predictor service (which already has Node + the collector's src/ mounted) and the trainer reads the resulting files from a shared volume.

Both containers mount the same ./trainer/data/baseline/ directory:

  • predictor writes <D>.json files there via --pending-eval-date D --output-json ...
  • trainer reads them during evaluation

At startup, the trainer scans trainer/data/baseline/ for one JSON per holdout day. If any are missing, it exits with a clear error telling the user which days to generate, rather than silently skipping the baseline comparison.
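A minimal sketch of that startup check, assuming files are named `<D>.json` with D formatted as YYYY-MM-DD. The function names and the error wording are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_baselines(baseline_dir, holdout_days):
    """Return the holdout days that have no <day>.json in baseline_dir."""
    base = Path(baseline_dir)
    return [day for day in holdout_days if not (base / f"{day}.json").is_file()]


def require_baselines(baseline_dir, holdout_days):
    """Exit with an actionable message instead of skipping the comparison."""
    missing = missing_baselines(baseline_dir, holdout_days)
    if missing:
        raise SystemExit(
            "missing baseline JSON for: " + ", ".join(missing)
            + " (generate with predictor.js --pending-eval-date <day> --output-json ...)"
        )


# Demo against a throwaway directory holding only one of two expected files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "2026-04-19.json").write_text("{}")
    missing = missing_baselines(tmp, ["2026-04-19", "2026-04-20"])
```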

Orchestration script

tools/queue-forecasting/scripts/run_training.sh wraps the two-step workflow so the user only invokes one thing:

#!/usr/bin/env bash
# Usage: run_training.sh configs/wait_time.yaml
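set -euo pipefail
CONFIG="${1:?usage: run_training.sh <config.yaml>}"

# Run from tools/queue-forecasting/ so the relative host paths below
# resolve no matter where the script is invoked from.
cd "$(dirname "$0")/.."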

# Step 1: resolve the holdout days from the config (tiny helper, no DB
# access required — just parses the config and computes the window).
# Note on compose semantics: --entrypoint takes a single executable;
# positional args after the service name become that entrypoint's argv.
# Passing "uv run python ..." as a single --entrypoint string would make
# docker look for an executable literally named "uv run python ...".
HOLDOUT_DAYS=$(docker compose run --rm \
  --entrypoint uv \
  trainer \
  run python -m trainer.src.resolve_holdout_days --config "$CONFIG")

# Step 2: generate per-day baselines via the predictor service.
for d in $HOLDOUT_DAYS; do
  OUT="trainer/data/baseline/$d.json"
  if [[ -f "$OUT" ]]; then
    echo "baseline exists: $OUT"
    continue
  fi
  docker compose run --rm predictor \
    node src/predictor.js \
      --pending-eval-date "$d" \
      --output-json "/app/tools/queue-forecasting/$OUT"
done

# Step 3: train + evaluate.
docker compose run --rm trainer --config "$CONFIG"

Usage

# Full pipeline: baselines + training + evaluation
./scripts/run_training.sh configs/wait_time.yaml
./scripts/run_training.sh configs/run_duration.yaml

# Train only (skips baseline generation; errors if baselines missing)
docker compose run --rm trainer --config configs/wait_time.yaml

# Refresh Parquet cache
docker compose run --rm trainer --config configs/wait_time.yaml --refresh-cache

# Regenerate baselines and retrain (useful after predictor.js changes)
rm -f trainer/data/baseline/*.json
./scripts/run_training.sh configs/wait_time.yaml   # will now regenerate

# Drop into a shell for ad-hoc exploration
docker compose run --rm --entrypoint bash trainer

The compose update mounts ./trainer/data/baseline into the predictor service as well, so both containers see the same baseline/ directory on the host.

The trainer service follows the same profile convention as the existing collector and predictor services.

Dependencies

pyproject.toml (managed by uv):

[project]
name = "queue-forecasting-trainer"
version = "0.1.0"
requires-python = ">=3.13"
dependencies = [
    "lightgbm>=4.5.0",
    "pandas>=2.2.0",
    "numpy>=1.26.0",
    "pyarrow>=15.0.0",          # Parquet I/O
    "psycopg[binary]>=3.2.0",   # Postgres driver
    "pyyaml>=6.0.0",
    "scikit-learn>=1.5.0",      # utilities (splits, metrics)
]

[tool.uv]
dev-dependencies = [
    "pytest>=8.0",
    "ruff>=0.5.0",
]

XGBoost is deliberately not added yet — it comes in when the XGBoost QuantileModel subclass is implemented.

Baseline export from Node.js

Two additions to src/predictor.js:

--pending-eval-date D mode

Mirrors the trainer's evaluation semantics. Given date D:

  • Evaluation set: runs where pending_at ∈ [D 00:00Z, D+1 00:00Z) AND reason_resolved = 'completed' AND run_duration_s / wait_duration_s is populated (depending on target).
  • History cutoff for percentile stats: rows with resolved_at < D 00:00Z. This is the feature-available-time rule from the Evaluation Protocol section — the baseline must not peek at any resolution that happened after the prediction would have been made.
  • History lookback window: same trailing 7 days used by --date mode, but clipped to resolved_at < D 00:00Z.

The existing --date D mode (evaluate by resolved_at) remains for historical continuity, but is not used for Phase 1 go/no-go comparisons. All model-vs-baseline numbers in the trainer output come from --pending-eval-date.

--output-json <path> mode

When combined with --pending-eval-date, writes a single-day JSON blob containing raw numerators and denominators per metric. This is required for correct cross-day aggregation: the within-2x rule (Evaluation Protocol) excludes rows where either predicted or actual is ≤ 0, so its denominator differs from the overall row count. A per-day percentage cannot be re-aggregated correctly — raw counts can.

{
  "mode": "pending-eval-date",
  "eval_date": "2026-04-20",
  "eval_window": {
    "pending_start": "2026-04-20T00:00:00Z",
    "pending_end":   "2026-04-21T00:00:00Z"
  },
  "history_cutoff": "2026-04-20T00:00:00Z",
  "history_lookback_days": 7,
  "slice": "reason_resolved = 'completed'",

  "duration": {
    "n": 162041,
    "mae": {
      "eligible_n":    162041,
      "sum_abs_error": 24344560.4
    },
    "within_2x": {
      "eligible_n": 162005,
      "hit_n":      141625
    }
  },
  "wait": {
    "n": 162041,
    "mae": {
      "eligible_n":    162041,
      "sum_abs_error": 31306320.2
    },
    "within_2x": {
      "eligible_n": 159722,
      "hit_n":      80342
    }
  }
}

Field values are illustrative. n is total evaluated rows (for reporting and sanity checks). The eligible_n under each metric is the denominator actually used for that metric:

  • mae.eligible_n — rows with a valid prediction and actual (normally equals n; may be smaller if a prediction was NULL)
  • within_2x.eligible_n — rows where both predicted and actual are strictly positive (per the zero-handling rule)
  • within_2x.hit_n — rows meeting the within-2x criterion

Aggregation formulas

Across K holdout days, the trainer computes:

aggregate_mae       = sum(day.mae.sum_abs_error)   / sum(day.mae.eligible_n)
aggregate_within_2x = sum(day.within_2x.hit_n)     / sum(day.within_2x.eligible_n)

The trainer's own per-day metrics on the holdout are emitted in the same shape. This is what "aggregate identically to the baseline" means — identical formulas over the same raw-count fields.
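The two formulas can be sketched in Python over per-day dicts shaped like the JSON above. `aggregate_days` is a hypothetical name and the two sample days are made up to keep the arithmetic easy to follow.

```python
def aggregate_days(days, target):
    """Pool raw numerators and denominators across days, then divide once."""
    sum_abs_error = sum(d[target]["mae"]["sum_abs_error"] for d in days)
    mae_n = sum(d[target]["mae"]["eligible_n"] for d in days)
    hit_n = sum(d[target]["within_2x"]["hit_n"] for d in days)
    w2x_n = sum(d[target]["within_2x"]["eligible_n"] for d in days)
    return {"mae": sum_abs_error / mae_n, "within_2x": hit_n / w2x_n}


# Two made-up days of very different size.
days = [
    {"wait": {"mae": {"eligible_n": 100, "sum_abs_error": 15000.0},
              "within_2x": {"eligible_n": 90, "hit_n": 45}}},
    {"wait": {"mae": {"eligible_n": 300, "sum_abs_error": 75000.0},
              "within_2x": {"eligible_n": 300, "hit_n": 210}}},
]
agg = aggregate_days(days, "wait")
```

The pooled MAE here is 225 s, whereas the mean of the two per-day MAEs (150 s and 250 s) would be 200 s. That gap is why per-day percentages and means are not re-aggregated.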

Invocation pattern

For each holdout day, the trainer's orchestration runs:

node src/predictor.js \
  --pending-eval-date 2026-04-19 \
  --output-json trainer/data/baseline/2026-04-19.json

Five invocations for a 5-day holdout. The Python evaluator reads all five JSONs, applies the formulas above, and compares against its own aggregates computed the same way. No Python reimplementation of the percentile logic, and no aggregation drift.

Success criteria for Phase 1

Go/no-go is decided on the primary slice (reason_resolved = 'completed', pending-at-anchored, same cohort as the baseline --pending-eval-date run).

A "go" signal to invest in the full pipeline (ONNX export, inference wiring, nightly scheduling) is:

  • Wait time: MAE improves by ≥15% and within-2x improves by ≥5pp over baseline (aggregate across holdout)
  • Run duration: MAE improves by ≥5% (harder because baseline is strong)
  • Improvement is consistent across at least 3 of the 5 holdout days (not driven by a single outlier day)
  • p90 coverage is within [85%, 95%] for both models

The supplemental slice (completed + failed) is reported alongside but does not feed into the go/no-go decision — changing the population at the same time as changing the model would make "did it improve?" unanswerable.

If any of the above fail, we stop, inspect feature importance / residuals, and iterate before committing to the broader pipeline.
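The criteria above can be written as a checklist. This sketch makes two assumptions the text leaves open: "consistent" is read as the per-day MAE beating the baseline on at least 3 holdout days for each model, and improvements are measured relative to the baseline value. All names and input numbers are hypothetical.

```python
def mae_improvement(model_mae, baseline_mae):
    """Relative MAE reduction versus baseline (0.15 means 15% better)."""
    return (baseline_mae - model_mae) / baseline_mae


def phase1_go(wait, duration):
    """Evaluate the Phase 1 criteria on primary-slice aggregates."""
    checks = {
        "wait_mae": mae_improvement(wait["mae"], wait["baseline_mae"]) >= 0.15,
        "wait_within_2x": wait["within_2x"] - wait["baseline_within_2x"] >= 0.05,
        "duration_mae": mae_improvement(duration["mae"], duration["baseline_mae"]) >= 0.05,
        "consistency": all(m["days_improved"] >= 3 for m in (wait, duration)),
        "p90_coverage": all(0.85 <= m["p90_coverage"] <= 0.95 for m in (wait, duration)),
    }
    return all(checks.values()), checks


# Made-up aggregates, for illustration only.
wait = {"mae": 149.0, "baseline_mae": 193.2, "within_2x": 0.571,
        "baseline_within_2x": 0.503, "days_improved": 4, "p90_coverage": 0.89}
duration = {"mae": 140.0, "baseline_mae": 150.0, "days_improved": 3, "p90_coverage": 0.91}
go, checks = phase1_go(wait, duration)
```

Returning the individual checks alongside the verdict shows which criterion failed, which is the starting point for the inspect-and-iterate step.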

Deferred work (references spec.md for full design)

  • ONNX export + category mapping sidecar
  • XGBoost QuantileModel subclass
  • Node.js predictor integration via onnxruntime-node
  • Parity tests (Python vs ONNX)
  • queue_forecast_run_predictions writes
  • Nightly cron / scheduling
  • Model hot-reload
  • Migration to bugbug (Approaches A/B in spec.md)