@jmealo
Created March 14, 2026 15:16
Staging Deploy Status — 2026-03-14 — Metrics Changes

Summary

Deployed 25 backend services to staging to test metrics changes. Found and fixed 5 bugs introduced by library version mismatches. All services are now running healthy.

Libraries Published

| Library | Version | Fix |
| --- | --- | --- |
| gisual-task-runner | 0.6.25 | component_timer.labels() — pass all 4 labels (category, service, component, operation) |
| gisual-http-service | 1.13.21 | ShutdownCoordinator — inspect-based metrics kwarg safety + relaxed service-utils pin |
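
The label-count failure behind the task-runner fix can be reproduced with a small stdlib-only stand-in. This is a sketch, not the gisual-prometheus-clients code: `LabeledMetric`, `try_labels`, and the label values are hypothetical, and the class models only the check that a Prometheus-style `labels()` call applies to positional values.

```python
class LabeledMetric:
    """Hypothetical stand-in for a labeled metric; only the count check is modeled."""

    def __init__(self, name, labelnames):
        self.name = name
        self.labelnames = tuple(labelnames)

    def labels(self, *labelvalues):
        # Positional label values must match the declared label names one-to-one.
        if len(labelvalues) != len(self.labelnames):
            raise ValueError("Incorrect label count")
        return dict(zip(self.labelnames, labelvalues))


# Assumed definition carrying the four labels named in the table above.
component_timer = LabeledMetric(
    "component_timer", ["category", "service", "component", "operation"]
)


def try_labels(*values):
    """Return the bound labels, or the error message if the count is wrong."""
    try:
        return component_timer.labels(*values)
    except ValueError as exc:
        return str(exc)


# Two values against a four-label metric fails; four values succeeds.
old_call = try_labels("task", "archive-tagger")
new_call = try_labels("task", "archive-tagger", "runner", "startup")
```

Because the check happens when `labels()` is called, a metric recorded at startup fails before the service does any work, which is consistent with the crash-loop described under Bugs Fixed.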

Bugs Fixed

  1. Task runner crash-loop — ValueError: Incorrect label count on component_timer startup metric. gisual-prometheus-clients redefined component_timer with 4 labels but gisual-task-runner only passed 2. Fixed in task-runner 0.6.25.

  2. intel-api 500s — TypeError: StateManager.get_lock() got an unexpected keyword argument 'op_name'. gisual-redis-client expects operation=, not op_name=. Fixed 3 call sites in intel-api.

  3. outage-validation-task-manager 500s — Same op_name→operation mismatch. Fixed 15 call sites.

  4. data-collection-api crash-loop — TypeError: ShutdownCoordinator.__init__() got an unexpected keyword argument 'metrics'. gisual-http-service 1.13.19 unconditionally passes metrics, but the installed gisual-service-utils didn't accept it. Fixed in http-service 1.13.21 with inspect-based detection.

  5. public-api / asset-outage-notifier Docker build failures — Local path dependencies in [tool.uv.sources] breaking Docker builds. Removed path deps, pinned from PyPI.
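
The inspect-based detection mentioned in item 4 could look roughly like the sketch below. The actual http-service 1.13.21 code is not shown in this report, so `accepts_kwarg`, `build_coordinator`, both coordinator classes, and the `timeout` parameter are hypothetical; only `inspect.signature` and `inspect.Parameter` are real stdlib APIs.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class OldShutdownCoordinator:
    """Hypothetical older service-utils class with no `metrics` parameter."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.metrics = None


class NewShutdownCoordinator:
    """Hypothetical newer service-utils class that accepts `metrics`."""

    def __init__(self, timeout=30.0, metrics=None):
        self.timeout = timeout
        self.metrics = metrics


def build_coordinator(cls, metrics):
    """Pass `metrics` only when the installed class accepts it."""
    kwargs = {"timeout": 10.0}
    if accepts_kwarg(cls.__init__, "metrics"):
        kwargs["metrics"] = metrics
    return cls(**kwargs)


sentinel = object()
old = build_coordinator(OldShutdownCoordinator, sentinel)
new = build_coordinator(NewShutdownCoordinator, sentinel)
```

Checking the signature at call time lets one http-service release work against both older and newer gisual-service-utils, which fits the relaxed pin noted under Libraries Published. The trade-off is that metrics are silently dropped when the older class is installed.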

Docker Images Built & Pushed

All images on docker.gisual.net.

APIs (14)

| Service | Image Tag | Fixes Applied |
| --- | --- | --- |
| data-collection-api | :619ada55 | http-service 1.13.21 (was crash-looping) |
| public-api | :c3627d6f | http-service 1.13.21, removed path dep |
| intel-api | :91e5f764 | http-service 1.13.21, op_name→operation (3 sites) |
| incidents-api | :f0d73215 | http-service 1.13.21 |
| decisions-api | :8ec21d9a | http-service 1.13.21 |
| locations-api | :adefb2c4 | http-service 1.13.21 |
| assets-api | :04efddf9 | http-service 1.13.21 |
| outage-scans-api | :4dcf5fe5 | http-service 1.13.21 |
| outage-validation-api | :9cd8222f | http-service 1.13.21 |
| predictions-api | :58129b98 | http-service 1.13.21 |
| satellite-api | :0f855172 | http-service 1.13.21 |
| usage-api | :b58e7ec3 | http-service 1.13.21 |
| web-tool-api | :159a2462 | http-service 1.13.21 |
| dependencies-api | :f9af3051 | http-service 1.13.21 |

Task Runners (7)

| Service | Image Tag | Fixes Applied |
| --- | --- | --- |
| search-retry-feeder | :59b1be6c | task-runner 0.6.25 |
| outage-validation-task-manager | :90cf62d3 | task-runner 0.6.25, op_name→operation (15 sites) |
| regional-outage-feeder | :31a5be02 | task-runner 0.6.25 |
| customer-utility-updater | :52a8db85 | task-runner 0.6.25 |
| current-incidents-cache-pruner | :87de424f | task-runner 0.6.25 |
| archive-tagger | :f0eca8d4 | task-runner 0.6.25 |
| asset-outage-notifier | :ded6a1c7 | task-runner 0.6.25, removed gisual-runtime path dep |

Consumers (4 — no code changes needed)

notification-sender, outage-updater, asset-status-updater, alarm-lifecycle-manager — running healthy on existing images.

Pipeline Status (main branch)

Services with feat/robot-review-round-01 MR (14)

| Service | Pipeline | MR |
| --- | --- | --- |
| data-collection-api | ✅ passed | !19 |
| decisions-api | ✅ passed | !114 |
| dependencies-api | ✅ passed | !6 |
| incidents-api | ✅ passed | !56 |
| locations-api | ✅ passed | !24 |
| public-api | ✅ passed | !69 |
| satellite-api | ✅ passed | !10 |
| alarm-lifecycle-manager | ✅ passed | !34 |
| search-retry-feeder | ✅ passed | !18 |
| current-incidents-cache-pruner | ✅ passed | !8 |
| asset-outage-notifier | ✅ passed | !6 |
| outage-validation-task-manager | ✅ passed | !23 |
| customer-utility-updater | ✅ passed | !6 |
| regional-outage-feeder | ✅ passed | !12 |

Services without feat/robot-review-round-01 (11)

| Service | Pipeline |
| --- | --- |
| assets-api | ✅ passed |
| intel-api | ✅ passed |
| outage-scans-api | ✅ passed |
| outage-validation-api | ✅ passed |
| predictions-api | ✅ passed |
| usage-api | ✅ passed |
| web-tool-api | ✅ passed |
| notification-sender | ✅ passed |
| outage-updater | ✅ passed |
| asset-status-updater | ✅ passed |
| archive-tagger | failed (mypy type errors — pre-existing, unrelated to deploy) |

Remaining Items

  • dependencies-api needs redeployment to staging (running old image, healthy but on stale code)
  • archive-tagger has pre-existing mypy failures (23 type errors in appdb.py / tasks.py) from stricter type stubs in updated dependencies
  • public-api shutdown errors (shutdown_opentelemetry, shutdown_logging) are non-fatal but noisy — methods not implemented in base class
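
The report does not show how the public-api shutdown errors arise. One plausible reading is that the shutdown path calls hooks that are either missing or left as stubs on the base class. Under that assumption, a shutdown loop could skip them quietly, as in the sketch below. `run_shutdown_hooks`, `BaseService`, `PublicApi`, and `shutdown_http` are hypothetical names; the two hook names in the usage lines are the ones listed in the last item above.

```python
import logging

logger = logging.getLogger("shutdown")


def run_shutdown_hooks(service, hook_names):
    """Call each hook that exists on `service`; return the names that ran."""
    ran = []
    for name in hook_names:
        hook = getattr(service, name, None)
        if not callable(hook):
            # Missing hook: record it at debug level instead of raising.
            logger.debug("skipping %s: not implemented", name)
            continue
        try:
            hook()
            ran.append(name)
        except NotImplementedError:
            # Stub inherited from the base class: also skipped quietly.
            logger.debug("skipping %s: base class stub", name)
    return ran


class BaseService:
    """Hypothetical base class with an unimplemented stub."""

    def shutdown_logging(self):
        raise NotImplementedError


class PublicApi(BaseService):
    """Hypothetical service that implements only one hook."""

    def __init__(self):
        self.http_closed = False

    def shutdown_http(self):
        self.http_closed = True


service = PublicApi()
ran = run_shutdown_hooks(
    service, ["shutdown_http", "shutdown_opentelemetry", "shutdown_logging"]
)
```

Logging skipped hooks at debug level keeps the information available without adding noise to normal shutdown logs.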