A standalone tool that listens to Mozilla's Taskcluster Pulse event stream,
collects task lifecycle data, and predicts how long tasks will take to run
and wait in queue. Lives in tools/queue-forecasting/ within the Taskcluster
monorepo.
- Runtime: Node.js (ESM) for collection and real-time inference; Python for nightly model training
- Database: Postgres 15 (shared with Taskcluster; all tables prefixed
queue_forecast_to avoid collisions) - Deployment: Docker Compose (collector + predictor + trainer + postgres)
- Data source: Taskcluster Pulse (AMQP) — real-time lifecycle events
- Supplemental data: Taskcluster Queue API — task definitions, queue depth
- Per-task run duration prediction — given a newly pending run, predict execution time (p50/p90) using LightGBM trained on task identity, queue, tags, and other stored attributes.
- Per-task wait time prediction — predict queue wait time (p50/p90) using queue depth, priority, time-of-day, and queue identity. Goals 1+2 compose into an ETA.
- Prediction API — expose predictions for newly pending runs with model version and confidence metadata. TC UI as first consumer.
- Queue-level forecasting — "if I submit to this queue now, how long will it wait?" and "what is the expected drain time for the current backlog?" Reuses the wait-time model with hypothetical inputs.
- Queue load prediction — predict pending count for a given queue at a given hour and day-of-week. Requires time-series queue depth data collected from V1 onward.
- Predicting from
task-defined(dependency resolution is a different problem) - Re-predicting while a task is already running
- Provisioning or autoscaling decisions
- Predicting for unscheduled tasks waiting on dependencies
- ~250k task runs/day (~1.1M rows over first 5 days of collection)
- ~7.5M rows/month projected
- ~2GB/week raw
- tags JSONB field averages ~200 bytes per row
The system uses a normalized two-table model separating task-level definition
facts from run-level execution facts. This replaces the original single
task_events table. All table names are prefixed queue_forecast_ to
avoid collisions in a shared database.
One row per task_id. Stores definition-time identity and metadata.
Column ordering optimized for Postgres tuple alignment (8-byte, 4-byte,
variable-length).
CREATE TABLE queue_forecast_tasks (
-- 8-byte types
task_created TIMESTAMPTZ,
enriched_at TIMESTAMPTZ,
-- 4-byte types
max_run_time_s INTEGER,
-- Variable-length
task_id TEXT PRIMARY KEY,
task_queue_id TEXT,
task_group_id TEXT,
scheduler_id TEXT,
project_id TEXT,
metadata_name TEXT,
normalized_name TEXT,
original_priority TEXT,
tags JSONB
);Notes:
normalized_nameismetadata_namewith trailing hash suffixes stripped (e.g.test-linux2404-64/opt-mochitest-1@a3b4c5d6e7f8→test-linux2404-64/opt-mochitest-1). Must come from a deterministic, versioned normalization function.tagsis raw JSONB preserved as-is from the task definition. All tag-based feature extraction (kind, os, test-type, worker-implementation, build type) happens at training time in Python, keeping the schema deployment-agnostic.enriched_atis set when the Queue API fetch fills in metadata_name and tags.
One row per execution attempt (task_id, run_id).
CREATE TABLE queue_forecast_task_runs (
-- 8-byte types
pending_at TIMESTAMPTZ,
started_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ,
wait_duration_s DOUBLE PRECISION,
run_duration_s DOUBLE PRECISION,
-- 4-byte types
run_id INT NOT NULL,
queue_pending INTEGER,
-- Variable-length
task_id TEXT NOT NULL
REFERENCES queue_forecast_tasks(task_id) ON DELETE CASCADE,
priority_at_pending TEXT,
reason_created TEXT,
reason_resolved TEXT,
PRIMARY KEY (task_id, run_id)
);Notes:
priority_at_pendingis a snapshot at enqueue time. We do not train on mutable "current priority".queue_pendingis the approximate queue depth snapshot nearest topending_at, sourced from in-memory counters seeded and periodically synced from the Queue API.wait_duration_s = started_at - pending_atrun_duration_s = resolved_at - started_at- Runs that never start keep both duration fields NULL.
Every prediction is logged before the outcome is known, enabling evaluation.
CREATE TABLE queue_forecast_run_predictions (
-- 8-byte types
predicted_at TIMESTAMPTZ DEFAULT now(),
expected_completion_time TIMESTAMPTZ,
guaranteed_completion_time TIMESTAMPTZ,
wait_p50_s DOUBLE PRECISION,
wait_p90_s DOUBLE PRECISION,
run_p50_s DOUBLE PRECISION,
run_p90_s DOUBLE PRECISION,
-- 4-byte types
run_id INT NOT NULL,
-- Variable-length
task_id TEXT NOT NULL,
model_version TEXT NOT NULL,
input_features JSONB,
PRIMARY KEY (task_id, run_id)
);Notes:
expected_completion_time = pending_at + wait_p50_s + run_p50_sguaranteed_completion_time = pending_at + wait_p90_s + run_p90_sinput_featurescaptures the exact feature vector fed to the model, enabling post-hoc debugging ("why did the model predict 45 minutes?").- One prediction per run. If models are updated, the old prediction is overwritten.
-- Training sweep: grab last N days of clean completed runs
CREATE INDEX idx_qf_task_runs_training
ON queue_forecast_task_runs (resolved_at)
WHERE started_at IS NOT NULL
AND run_duration_s IS NOT NULL
AND reason_resolved IN ('completed', 'failed');
-- Reconciler: find stuck runs
CREATE INDEX idx_qf_task_runs_unresolved
ON queue_forecast_task_runs (pending_at)
WHERE resolved_at IS NULL;
-- Enrichment backfill: find tasks missing metadata
CREATE INDEX idx_qf_tasks_unenriched
ON queue_forecast_tasks (task_id)
WHERE metadata_name IS NULL;Data ingestion is Pulse-first, API-reconciled. The collector subscribes to all task lifecycle events via AMQP and upserts into the normalized tables. A separate reconciler repairs missed or incomplete state via the Queue API.
The collector must not assume event order. Any event may arrive before another event for the same task or run. Every handler must:
- upsert the row if it does not exist,
- fill only the fields it knows,
- avoid overwriting a later lifecycle state with an earlier one.
The collector maintains in-memory pending counts per task_queue_id:
- Seeded from the Queue API on first encounter via
taskQueueCounts() - Incremented on
task-pending, decremented ontask-running - Periodically synced against the API (every 60s) to correct drift
- Snapshot written to
queue_forecast_task_runs.queue_pendingattask-pendingtime
These are approximate values — documented as such. Good enough for modeling.
- UPSERT into
queue_forecast_tasks - Extracts:
task_queue_id,scheduler_id,project_id,tags - No row created in
queue_forecast_task_runs— a run has not been enqueued yet - Triggers background API enrichment if
metadata_nameis NULL
- UPSERT into
queue_forecast_tasks(in casetask-definedwas missed) - UPSERT into
queue_forecast_task_runsfor(task_id, run_id) - Captures:
pending_at,priority_at_pending,queue_pendingsnapshot,reason_created - Triggers prediction via
predictor.js, stores result inqueue_forecast_run_predictions
- UPSERT into
queue_forecast_tasks(in case earlier events were missed) - UPDATE
queue_forecast_task_runsfor(task_id, run_id) - Captures:
started_at - Computes
wait_duration_sifpending_atis already set
- UPSERT into
queue_forecast_tasks - UPDATE
queue_forecast_task_runsfor(task_id, run_id) - Captures:
resolved_at,reason_resolved - Computes
run_duration_sifstarted_atis already set
- Same as completed/failed for runs with a
run_id - Special case: exception with no
run_id(deadline-exceeded before any run started) — update the last known run inqueue_forecast_task_runsif one exists, otherwise no run row to update
- Updates
queue_forecast_tasksonly (informational, not used for training since we snapshotpriority_at_pendingat enqueue time)
On every event, if the task's metadata_name is NULL in queue_forecast_tasks:
- Check in-memory cache first (keyed by
task_id) - If not cached, fetch task definition from Queue API
- Fill:
metadata_name,normalized_name,original_priority,max_run_time_s,tags,task_created - Cache the enrichment data so subsequent run events for the same task don't require another API call
- Concurrency-limited (max 50 in-flight fetches)
Taskcluster Pulse is at-most-once delivery. Events can be dropped, and
automated retries may not publish task-exception for dead runs. The
reconciler ensures training data stays clean.
Runs as a background cron (every 15 minutes).
- Query
queue_forecast_task_runsfor rows whereresolved_at IS NULLand either:pending_at + max_run_time_s + 1 hour < now()(when max_run_time_s known via join toqueue_forecast_tasks)pending_at + INTERVAL '24 hours' < now()(fallback)
- Fetch true state from Queue API
taskStatus()for each stuck task - If API shows terminal: update
queue_forecast_task_runswith correct timestamps and resolution - If API shows the run was silently dropped: set
reason_resolved = 'reconciler-dropped'so training explicitly excludes it
- Query
queue_forecast_tasksfor rows wheremetadata_name IS NULLandenriched_at IS NULLand task first seen more than 5 minutes ago - Fetch task definition from Queue API
- Fill metadata fields
This merges the current backfill sweep into the reconciler — one repair job instead of two.
We use LightGBM (Light Gradient Boosting Machine), a gradient-boosted
decision tree algorithm. For tabular data with high-cardinality categories
(like task_queue_id and metadata_name), tree-based models outperform
neural networks in both speed and accuracy.
LightGBM builds hundreds of shallow decision trees sequentially. Each tree corrects the errors of the previous ones. This naturally captures feature interactions — for example, learning that high queue depth on weekends affects wait time differently than on weekdays.
The system trains two independent models nightly. They have different targets, training filters, feature sets, and lookback windows.
Predicts how long a task will execute once a worker picks it up.
Target: run_duration_s
Training filter:
SELECT r.run_duration_s, r.queue_pending,
t.task_queue_id, t.metadata_name, t.normalized_name,
t.scheduler_id, t.max_run_time_s, t.tags,
r.pending_at
FROM queue_forecast_task_runs r
JOIN queue_forecast_tasks t ON r.task_id = t.task_id
WHERE r.resolved_at > now() - INTERVAL '30 days'
AND r.started_at IS NOT NULL
AND r.run_duration_s IS NOT NULL
AND r.reason_resolved IN ('completed', 'failed')Lookback: 30 days. Run times are tied to code and payloads, relatively stable over time.
Why include failed: A test that runs for 25 minutes then fails still
took 25 minutes. Excluding failures would bias the model toward only
successful (often shorter) runs.
Exclude: worker-shutdown, claim-expired, malformed-payload,
reconciler-dropped — these are infrastructure artifacts, not workload
runtime.
Features:
| Feature | Type | Source | Notes |
|---|---|---|---|
metadata_name |
categorical | tasks | Most specific identifier (~5k unique/day) |
normalized_name |
categorical | tasks | Groups retriggered variants |
task_queue_id |
categorical | tasks | Worker pool identity (~50-100 unique) |
scheduler_id |
categorical | tasks | Broad cohort (gecko-level-1, etc.) |
max_run_time_s |
numeric | tasks | Declared timeout, correlates with task weight |
tags->>'kind' |
categorical | tasks.tags | mochitest, build, signing, etc. |
tags->>'test-type' |
categorical | tasks.tags | mochitest, wpt, reftest, etc. |
tags->>'os' |
categorical | tasks.tags | linux, windows, macos |
tags->>'project' |
categorical | tasks.tags | try, autoland, mozilla-central |
tags->>'worker-implementation' |
categorical | tasks.tags | docker-worker vs generic-worker |
Build type extraction: The Python trainer extracts debug vs opt from
metadata_name via regex (e.g. test-linux2404-64/debug-... → debug).
This is one of the strongest run duration predictors in Firefox CI.
Predicts how long a task will sit in queue before a worker picks it up.
Target: wait_duration_s
Training filter:
SELECT r.wait_duration_s, r.queue_pending, r.priority_at_pending,
t.task_queue_id, t.scheduler_id, t.tags,
r.pending_at
FROM queue_forecast_task_runs r
JOIN queue_forecast_tasks t ON r.task_id = t.task_id
WHERE r.resolved_at > now() - INTERVAL '14 days'
AND r.started_at IS NOT NULL
AND r.queue_pending IS NOT NULLLookback: 14 days. Wait times reflect current infrastructure capacity and are highly recency-sensitive. Stale capacity data hurts more than limited sample size.
Why resolution doesn't matter: Once a run started, queue wait is observed regardless of whether it later completed or failed.
Features:
| Feature | Type | Source | Notes |
|---|---|---|---|
task_queue_id |
categorical | tasks | Most important baseline |
priority_at_pending |
categorical | task_runs | Critical for scheduling order |
queue_pending |
numeric | task_runs | Backlog depth at enqueue time |
scheduler_id |
categorical | tasks | Cohort behavior |
max_run_time_s |
numeric | tasks | Task weight signal |
tags->>'kind' |
categorical | tasks.tags | Workload type |
tags->>'os' |
categorical | tasks.tags | Platform |
tags->>'project' |
categorical | tasks.tags | try vs autoland behave differently |
hour_sin, hour_cos |
numeric | derived | Cyclical encoding of hour-of-day (UTC) |
day_sin, day_cos |
numeric | derived | Cyclical encoding of day-of-week |
Cyclical time encoding:
hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
day_sin = sin(2 * pi * day_of_week / 7)
day_cos = cos(2 * pi * day_of_week / 7)
This ensures the model understands that 23:00 and 00:00 are adjacent, and Friday and Monday are close.
Sliding window retrain, not incremental learning. Every night:
- Python trainer queries Postgres for the relevant lookback window
- Trains a fresh LightGBM model from scratch (discards yesterday's model)
- Uses
objective=quantilewithalpha=0.5for p50,alpha=0.9for p90 (two training passes per model, or a single multi-quantile model) - Exports to ONNX format
- Writes
run_duration_model.onnxandwait_time_model.onnxto a shared volume
Why not incremental: Decision tree incremental learning leads to tree bloat (slowing inference) and struggles to adapt when new queue names or task types appear. A fresh retrain automatically forgets outdated patterns.
All feature engineering happens in the Python training script:
- Categorical handling: High-cardinality strings cast to Pandas
categorydtype. LightGBM handles these natively without one-hot encoding. - Tag extraction:
tags->>'kind',tags->>'os', etc. extracted from JSONB into typed columns. Deployment-specific — only the trainer knows which tag keys matter. - Build type: Regex extraction of
debug/optfrommetadata_name. - Time features: Cyclical encoding derived from
pending_attimestamp. - NULL handling: LightGBM handles NaN/NULL natively for both numeric and categorical features.
The predictor.js service loads both .onnx model files into memory
using onnxruntime-node. When a task-pending event arrives:
- Collector upserts the run into
queue_forecast_task_runs - Collector calls the predictor with the task/run features
- Predictor applies the same feature engineering as training:
- Categorical encoding (string -> integer mapping, loaded alongside the ONNX model as a JSON sidecar file)
- Cyclical time encoding from
pending_at - Build type regex extraction from
metadata_name
- Runs both models (run duration + wait time) in-memory
- Composes the ETA:
expected_completion_time = pending_at + wait_p50 + run_p50guaranteed_completion_time = pending_at + wait_p90 + run_p90
- Writes prediction to
queue_forecast_run_predictions
Inference latency target: low single-digit milliseconds per prediction. No network calls to Python. No database reads for historical stats.
LightGBM categorical features are integer-coded during training. The
Python trainer must export a category_mappings.json alongside each
ONNX model containing the string-to-integer mapping for every categorical
feature. The Node.js predictor loads this at startup and on model reload.
Parity requirement: Float/double precision can drift between Python and ONNX inference. Automated parity tests between Python predictions and Node.js ONNX predictions are a strict requirement before any model is deployed.
The predictor watches the shared model volume for new .onnx files.
When the nightly trainer writes a new model:
- Predictor detects the new file (filesystem watch or polling)
- Loads new model + category mappings into memory
- Swaps atomically (old model serves requests until new one is ready)
- Logs the model version transition
When LightGBM encounters a categorical value it has never seen during
training (e.g., a brand new metadata_name or task_queue_id):
- LightGBM treats unseen categoricals as a separate "unknown" bucket and routes them through decision tree branches based on other features
- This means a brand new task type still gets a prediction — it just
relies more heavily on
task_queue_id,tags,scheduler_id, and other features the model has seen - The
input_featuresJSONB inqueue_forecast_run_predictionsshould flag which features were unknown, enabling evaluation of cold-start accuracy - After one nightly retrain cycle, the new task type enters the training data and gets proper coverage
The sections above describe the ML algorithm, features, and training strategy independently of where training and inference run. There are three viable deployment architectures. All share the same data model, collection layer, and evaluation methodology — they differ only in who trains the model and where inference happens.
Both bugbug-based approaches require a daily Taskcluster task that exports training data from Postgres and publishes it as a TC artifact.
┌─────────────┐ daily TC task ┌──────────────────────┐
│ Postgres │ ──────────────────→ │ training_data.json.zst│
│ (collector) │ SQL query + │ (TC artifact, 7-day │
│ │ zstd compress │ expiry, TC-indexed) │
└─────────────┘ └──────────────────────┘
- Runs as a scheduled TC task (not in docker-compose)
- Queries the training SQL from the run duration and wait time model
sections above, exports as newline-delimited JSON compressed with
zstandard (
.json.zst) - Published as a public TC artifact, indexed via
project.queue-forecasting.data.latest - Estimated size: 1-3 GB compressed for a 30-day window (~7.5M rows)
- Artifact expiry: 7 days (training only needs the latest snapshot)
This aligns with bugbug's existing data pipeline pattern — every data source in bugbug (Bugzilla, Mercurial, CI failures) follows the same retrieval-task → artifact → training-task flow.
Data flow:
Node.js collector → Postgres → daily export task → TC artifact
→ bugbug data-retrieval task downloads artifact
→ bugbug training task (XGBoost) → model stored as pickle
→ bugbug HTTP service serves predictions
→ Node.js services call bugbug HTTP API
What lives where:
| Component | Location | Owner |
|---|---|---|
| Collector, reconciler | tools/queue-forecasting/ (TC repo) |
TC team |
| Data export task | tools/queue-forecasting/ (TC repo) |
TC team |
| Data retrieval script | bugbug repo | bugbug team |
| Model class + training | bugbug repo | bugbug team |
| HTTP prediction endpoint | bugbug HTTP service | bugbug team |
| Prediction API (proxy) | tools/queue-forecasting/ (TC repo) |
TC team |
What needs to be added to bugbug:
- Data retrieval script — downloads the
training_data.json.zstartifact from TC index, decompresses, yields records. Similar to existingbugbug/bugzilla.pyretrieval pattern. - Model class — extends
bugbug.model.Model, defines feature extraction from the exported task/run records. Uses XGBoost (bugbug's standard) with quantile regression for p50/p90. - Training task — entry in
infra/data-pipeline.ymldepending on the data retrieval task. - HTTP endpoint — new route in
http_service/bugbug_http/app.pythat accepts task features and returns wait time + run duration predictions.
Prediction flow:
task-pendingevent arrives at collector- Collector upserts run, then calls bugbug HTTP API with features
- bugbug API enqueues prediction job (Redis + RQ)
- Collector polls for result (bugbug's standard async pattern)
- Result written to
queue_forecast_run_predictions
Pros:
- No Python or ML code in the TC repo
- Leverages existing Mozilla ML infrastructure (CI, monitoring, deployment, model management)
- bugbug team already maintains training orchestration and HTTP serving
- Existing patterns for model rollback and evaluation
Cons:
- Network latency: bugbug uses async polling (enqueue → poll for result). At ~250k predictions/day (~3/sec sustained), each prediction incurs HTTP round-trips instead of sub-ms local inference. Batching can amortize this but adds complexity.
- XGBoost vs LightGBM: bugbug standardizes on XGBoost. XGBoost requires manual categorical encoding (label encoding or one-hot) where LightGBM handles high-cardinality categoricals natively. Quality is comparable for tabular data, but feature engineering is more involved.
- External service dependency: bugbug HTTP downtime means no new
predictions. Stale predictions in
queue_forecast_run_predictionsremain available but won't update. - Cross-team coordination: model changes require PRs to bugbug repo and alignment with bugbug release cadence.
Cost summary:
| Cost | Estimate |
|---|---|
| Data export artifact storage | ~1-3 GB/day, 7-day expiry = ~7-21 GB peak |
| Network transfer (export → bugbug) | ~1-3 GB/day (TC-internal, free) |
| bugbug training compute | 1 TC task/day, ~10-30 min |
| HTTP API calls | ~250k/day, async polling |
Data flow:
Node.js collector → Postgres → daily export task → TC artifact
→ bugbug data-retrieval task downloads artifact
→ bugbug training task → ONNX export as TC artifact
→ Node.js predictor downloads ONNX model + category mappings
→ Local inference via onnxruntime-node
What lives where:
| Component | Location | Owner |
|---|---|---|
| Collector, reconciler, predictor | tools/queue-forecasting/ (TC repo) |
TC team |
| Data export task | tools/queue-forecasting/ (TC repo) |
TC team |
| Data retrieval + training | bugbug repo | bugbug team |
| ONNX model artifact | TC artifact storage | produced by bugbug |
What needs to be added to bugbug (same as A, plus):
- ONNX export step after training. bugbug does not support ONNX today.
XGBoost models can be converted via
onnxmltoolsorskl2onnx, but this is less battle-tested than LightGBM's ONNX export path. - Category mapping sidecar (
category_mappings.json) exported alongside the ONNX model. - Parity tests between Python XGBoost predictions and ONNX runtime predictions (float precision can drift).
Prediction flow:
task-pendingevent arrives at collector- Collector calls local
predictor.js(same as current spec) - Predictor runs ONNX model in-process, sub-ms latency
- Result written to
queue_forecast_run_predictions
Model hot-reload:
- Predictor polls TC index for new ONNX artifact (or watches a local volume synced from TC artifacts)
- Loads new model + category mappings atomically
Pros:
- Sub-ms local inference preserved — no runtime dependency on bugbug
- Leverages bugbug's training orchestration and CI
- Model is a static artifact — predictor is self-contained after download
Cons:
- ONNX export is new to bugbug — needs to be implemented and maintained. Adds a capability bugbug doesn't currently have.
- XGBoost ONNX maturity: XGBoost → ONNX conversion exists but is less mature than LightGBM → ONNX. Quantile regression ONNX export may need validation.
- Category mapping sidecar: same complexity as approach C (the Node.js predictor must replicate categorical encoding).
- Cross-team dependency for training changes, but not for runtime.
Cost summary:
| Cost | Estimate |
|---|---|
| Data export artifact storage | ~1-3 GB/day, 7-day expiry |
| ONNX model artifact storage | ~10-50 MB/day, 7-day expiry |
| bugbug training compute | 1 TC task/day, ~10-30 min |
| Network transfer at inference | None (local) |
Data flow:
Node.js collector → Postgres
→ Nightly Python trainer (docker-compose) queries Postgres directly
→ LightGBM training → ONNX export to shared volume
→ Node.js predictor loads ONNX, runs local inference
This is the architecture described in the preceding sections. The Python
trainer lives in tools/queue-forecasting/ alongside the Node.js code,
runs as a docker-compose service on a nightly cron.
Pros:
- Full control over the entire pipeline — no external dependencies
- LightGBM with native categorical support (no manual encoding needed
for high-cardinality features like
metadata_name) - Sub-ms local inference
- Self-contained: one
docker-compose upruns everything - Simpler debugging — all code in one repo
Cons:
- Own the entire ML pipeline: training infrastructure, monitoring, model versioning, rollback
- Python code in the TC repo (TC is primarily Node.js and Go)
- Must build training orchestration, evaluation automation, and model management from scratch
Cost summary:
| Cost | Estimate |
|---|---|
| Training compute | docker-compose container, ~10-30 min/day |
| Storage | ONNX models on local/shared volume, ~50 MB |
| External dependencies | None |
| Dimension | A: Full bugbug | B: Mixed mode | C: Standalone |
|---|---|---|---|
| Inference latency | ~100ms+ (HTTP poll) | Sub-ms (local ONNX) | Sub-ms (local ONNX) |
| Runtime dependency | bugbug HTTP service | None (static artifact) | None |
| Training orchestration | bugbug (existing) | bugbug (existing) | Self-built |
| ML framework | XGBoost | XGBoost | LightGBM |
| Categorical handling | Manual encoding | Manual encoding | Native |
| Python in TC repo | No | No | Yes |
| New bugbug work | Model + endpoint | Model + ONNX export | None |
| Operational ownership | Shared (TC + bugbug) | Shared (training only) | TC team only |
Start with Approach C (standalone) to validate the model quality and prediction pipeline end-to-end with minimal cross-team coordination. The evaluation metrics (within-2x rate, pinball loss, p90 calibration) will determine whether the ML approach works before investing in infrastructure integration. If the model proves valuable, migrate training to bugbug (Approach B) to offload pipeline maintenance, with Approach A as an option if local inference complexity becomes a burden.
GET /v1/predict/:taskId/:runId
Response:
{
"taskId": "VGx8Q3kRTe2...",
"runId": 0,
"prediction": {
"waitTime": {
"p50_seconds": 142.3,
"p90_seconds": 412.8
},
"runDuration": {
"p50_seconds": 1823.7,
"p90_seconds": 2401.2
},
"eta": {
"expected": "2026-03-27T14:32:00Z",
"guaranteed": "2026-03-27T15:05:00Z"
},
"modelVersion": "2026-03-27-nightly",
"predictedAt": "2026-03-27T14:00:12Z"
}
}This endpoint reads from queue_forecast_run_predictions. If the
prediction already exists (generated at task-pending time), it returns
it. If the run exists but has no prediction yet (race condition or missed
event), it generates one on the fly.
GET /v1/queue/:taskQueueId/estimate
Returns predicted wait time for a hypothetical new task entering this
queue right now, using current queue_pending count and the wait-time
model. Deferred to V2 but the data model supports it from day 1.
Every prediction is stored in queue_forecast_run_predictions before
the outcome is known. A daily evaluation job compares predictions
against actuals.
- Strict time-split only. Never random split. Train on days 1-N, evaluate on day N+1. Random splitting leaks future information.
- Evaluation runs automatically after each nightly training cycle.
| Metric | Description |
|---|---|
| Within-2x rate | % of eta_estimate predictions within 0.5x-2x of actual total time. Target: >80% |
| Pinball loss (p50) | Measures median prediction accuracy. Lower is better. |
| Pinball loss (p90) | Measures upper-bound prediction accuracy. |
| p90 calibration | Does the p90 prediction actually cover ~90% of observed durations? |
| Coverage | % of pending runs that received a prediction (vs cold-start fallback) |
| Fallback rate | % of predictions where key features were unseen by the model |
Metrics must be computed across slices, not just globally:
- By
task_queue_id(top 20 queues by volume) - By
priority_at_pending - By
tags->>'project'(try vs autoland vs mozilla-central) - By cold-start status (was
metadata_namein the training set?)
A model that looks great globally but fails on the highest-volume queue is not deployable.
SELECT
rp.task_id,
rp.run_id,
rp.wait_p50_s,
rp.run_p50_s,
rp.expected_completion_time,
rp.guaranteed_completion_time,
rp.model_version,
r.wait_duration_s AS actual_wait,
r.run_duration_s AS actual_run,
r.pending_at,
r.resolved_at,
t.task_queue_id,
t.tags
FROM queue_forecast_run_predictions rp
JOIN queue_forecast_task_runs r
ON rp.task_id = r.task_id AND rp.run_id = r.run_id
JOIN queue_forecast_tasks t
ON rp.task_id = t.task_id
WHERE r.resolved_at IS NOT NULL
AND r.started_at IS NOT NULL
AND r.resolved_at >= $1::date
AND r.resolved_at < $1::date + INTERVAL '1 day'| Phase | What ships | Predictions visible? |
|---|---|---|
| Phase 1 | Collector, reconciler, nightly trainer, predictor | Stored only. Internal evaluation. |
| Phase 2 | Prediction API | Debug/internal consumers. TC UI behind flag. |
| Phase 3 | TC UI integration | Default-on for supported queues. |
No model is exposed to users without passing automated evaluation on the metrics above.
queue_forecast_tasks and queue_forecast_task_runs enforce a rolling
45-day retention window. 45 days provides margin beyond the 30-day
training window for debugging, evaluation lookback, and reconciliation
of late-arriving events.
To avoid expensive row-by-row DELETE operations:
queue_forecast_task_runsis partitioned by week onpending_atusing Postgres native range partitioning- Expired data is dropped by detaching and destroying the oldest partition
- A weekly cron handles partition management (create next week's partition, drop partitions older than 45 days)
queue_forecast_tasks rows are cleaned up via CASCADE when their last
associated run partition is dropped. Alternatively, a lightweight sweep
deletes orphaned queue_forecast_tasks rows with no remaining
queue_forecast_task_runs references.
queue_forecast_run_predictions follows the same 45-day retention,
partitioned on predicted_at.
Keep the last 7 days of .onnx model files and category_mappings.json
on the shared volume. Allows quick rollback if a nightly model degrades.
Older artifacts are deleted.
The existing task_events table contains ~1.1M rows (5 days of data).
This migration splits it into the normalized two-table model without
data loss.
CREATE TABLE queue_forecast_tasks (
-- 8-byte types
task_created TIMESTAMPTZ,
enriched_at TIMESTAMPTZ,
-- 4-byte types
max_run_time_s INTEGER,
-- Variable-length
task_id TEXT PRIMARY KEY,
task_queue_id TEXT,
task_group_id TEXT,
scheduler_id TEXT,
project_id TEXT,
metadata_name TEXT,
normalized_name TEXT,
original_priority TEXT,
tags JSONB
);
CREATE TABLE queue_forecast_task_runs (
-- 8-byte types
pending_at TIMESTAMPTZ,
started_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ,
wait_duration_s DOUBLE PRECISION,
run_duration_s DOUBLE PRECISION,
-- 4-byte types
run_id INT NOT NULL,
queue_pending INTEGER,
-- Variable-length
task_id TEXT NOT NULL
REFERENCES queue_forecast_tasks(task_id) ON DELETE CASCADE,
priority_at_pending TEXT,
reason_created TEXT,
reason_resolved TEXT,
PRIMARY KEY (task_id, run_id)
);
CREATE TABLE queue_forecast_run_predictions (
-- 8-byte types
predicted_at TIMESTAMPTZ DEFAULT now(),
expected_completion_time TIMESTAMPTZ,
guaranteed_completion_time TIMESTAMPTZ,
wait_p50_s DOUBLE PRECISION,
wait_p90_s DOUBLE PRECISION,
run_p50_s DOUBLE PRECISION,
run_p90_s DOUBLE PRECISION,
-- 4-byte types
run_id INT NOT NULL,
-- Variable-length
task_id TEXT NOT NULL,
model_version TEXT NOT NULL,
input_features JSONB,
PRIMARY KEY (task_id, run_id)
);-- A. Populate queue_forecast_tasks
-- DISTINCT ON grabs the most complete metadata per task_id
-- (latest run_id tends to have the richest enrichment)
INSERT INTO queue_forecast_tasks (
task_id, task_queue_id, task_group_id, scheduler_id, project_id,
metadata_name, normalized_name, original_priority,
max_run_time_s, tags, task_created, enriched_at
)
SELECT DISTINCT ON (task_id)
task_id, task_queue_id, task_group_id, scheduler_id, project_id,
metadata_name, normalized_name, original_priority,
max_run_time_s, tags, task_created,
CASE WHEN metadata_name IS NOT NULL THEN now() END
FROM task_events
ORDER BY task_id, run_id DESC NULLS LAST;
-- B. Populate queue_forecast_task_runs
-- Skip NULL run_id rows (task-defined placeholders with no actual run)
INSERT INTO queue_forecast_task_runs (
task_id, run_id, priority_at_pending, reason_created, reason_resolved,
pending_at, started_at, resolved_at, queue_pending,
wait_duration_s, run_duration_s
)
SELECT
task_id, run_id, priority, reason_created, reason_resolved,
scheduled, started, resolved, queue_pending,
wait_duration_s, run_duration_s
FROM task_events
WHERE run_id IS NOT NULL;CREATE INDEX idx_qf_task_runs_training
ON queue_forecast_task_runs (resolved_at)
WHERE started_at IS NOT NULL
AND run_duration_s IS NOT NULL
AND reason_resolved IN ('completed', 'failed');
CREATE INDEX idx_qf_task_runs_unresolved
ON queue_forecast_task_runs (pending_at)
WHERE resolved_at IS NULL;
CREATE INDEX idx_qf_tasks_unenriched
ON queue_forecast_tasks (task_id)
WHERE metadata_name IS NULL;-- Verify row counts
SELECT 'queue_forecast_tasks' AS tbl, count(*) FROM queue_forecast_tasks
UNION ALL
SELECT 'queue_forecast_task_runs', count(*) FROM queue_forecast_task_runs
UNION ALL
SELECT 'task_events (total)', count(*) FROM task_events
UNION ALL
SELECT 'task_events (with run_id)', count(*)
FROM task_events WHERE run_id IS NOT NULL;
-- queue_forecast_task_runs count should match task_events-with-run_id count
-- queue_forecast_tasks count should match distinct task_id countOnce verified:
- Stop the collector
- Run the migration
- Deploy updated collector that writes to the new tables
- Verify new events land correctly
- Rename or drop
task_eventswhen confident
After the initial migration is stable, convert queue_forecast_task_runs
to range-partitioned on pending_at by week. This is a separate step
because partitioning an existing table requires recreating it.
-- Create partitioned version
CREATE TABLE queue_forecast_task_runs_part (
LIKE queue_forecast_task_runs INCLUDING ALL
) PARTITION BY RANGE (pending_at);
-- Create weekly partitions
CREATE TABLE queue_forecast_task_runs_w2026_12
PARTITION OF queue_forecast_task_runs_part
FOR VALUES FROM ('2026-03-23') TO ('2026-03-30');
CREATE TABLE queue_forecast_task_runs_w2026_13
PARTITION OF queue_forecast_task_runs_part
FOR VALUES FROM ('2026-03-30') TO ('2026-04-06');
-- ... etc
-- Migrate data, swap tables- Queue depth time-series (
queue_forecast_queue_depth_samplestable) — needed for goal 5 (queue load prediction by time/day). Start collecting once V1 prediction pipeline is stable. - Queue drain forecasting — builds on wait-time model + queue depth data. V2 feature.
- Trend/regression detection — daily cohort rollups comparing trailing 7-day vs 28-day quantiles. Requires stable evaluation pipeline first.
- bugbug migration — V1 starts standalone (Approach C) to validate model quality. Once evaluation confirms the approach works, migrate training to bugbug (Approach B or A) to leverage existing Mozilla ML infrastructure. See "ML Pipeline Architecture Options" above for the full comparison. Decision point: after Phase 1 evaluation metrics are stable.
- Shadow mode comparison — running a new model version side-by-side with production and auto-promoting only if it wins on evaluation metrics. Applicable regardless of pipeline architecture.
- TC UI integration — wiring the prediction API into the task detail view.
- Lando landing queue as a leading indicator — Lando's merge queue
shows what's about to land and therefore what will be scheduled soon.
This is a forward-looking signal the current models lack — today the
wait-time model only reacts to
queue_pendingat enqueue time. A periodic snapshot of the Lando queue depth (and optionally the repos being landed) could feed into the wait-time model and V2 queue load prediction. Complexity: requires a new data source (Lando API), and the signal is indirect — a landing doesn't map 1:1 to specific task queues without understanding the push-to-taskgraph relationship. - Tree status and sheriff activity — tree closures halt new tasks,
and sheriff-initiated backfills cause sudden load spikes. Both are
regime changes that dramatically shift queue behavior. TreeHerder
exposes tree status (open / closed / approval-required) via API.
Adding tree state as a categorical feature to both models would help
them distinguish normal load from closure-recovery bursts. Backfill
detection is harder — may require identifying sheriff-triggered task
groups via
scheduler_idor push metadata. - Guiding principle: TC-only first, extend if needed — V1 deliberately uses only Taskcluster-internal data (Pulse events, Queue API). The evaluation pipeline (within-2x rate, pinball loss, p90 calibration) provides an objective checkpoint: if TC-only features don't meet accuracy targets after Phase 1 evaluation, that is the signal to integrate external sources like Lando and TreeHerder.