Plan: Remote GPU Processing Time Collection for CPU Orchestrator Billing

Context

The new serverless deployment routes workflow requests to CPU-only inference instances with remote execution enabled. Model blocks on these CPU hosts make HTTP requests to GPU instances for inference, and those GPU responses include X-Processing-Time headers. The downstream billing service already differentiates between CPU and GPU usage records, but it needs the GPU processing times from the CPU orchestrator's response to bill correctly: a fixed 100 ms per frame for CPU plus actual GPU time.

Currently the CPU orchestrator returns only its own wall-clock X-Processing-Time. We additionally need to return the remote GPU processing times collected during workflow execution, so the billing service has the full picture.
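For illustration, a hypothetical sketch of the billing arithmetic (the per-frame constant comes from this plan; the function name and signature are placeholders, not the billing service's actual code):

# Hypothetical: how billed time could be derived from the two pieces of data.
CPU_MS_PER_FRAME = 100

def billed_time_ms(frames: int, remote_times_s: list) -> float:
    cpu_ms = frames * CPU_MS_PER_FRAME        # fixed CPU charge per frame
    gpu_ms = sum(remote_times_s) * 1000.0     # actual GPU time, reported in seconds
    return cpu_ms + gpu_ms

print(billed_time_ms(1, [0.52, 0.45, 0.92]))  # ≈ 1990.0 ms (100 + 1890)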

Key constraints:

  • Don't change the usage collector (serves broader reporting purposes)
  • Don't change individual blocks or the execution engine's core logic
  • Gate behind an env var for the new deployment only

Verified Call Path Analysis

All 26 model blocks with remote execution use InferenceHTTPClient. Their HTTP calls flow through two paths in the SDK:

| Path | Methods | Collection point |
|------|---------|------------------|
| _post_images() / _execute_infer_from_api_request() -> execute_requests_packages() | infer(), detect_gazes(), depth_estimation(), infer_lmm(), sam2_segment_image(), ocr_image(), sam3_3d_infer(), etc. | execute_requests_packages() in executors.py |
| Direct requests.post() | get_clip_text_embeddings() (line 1341), clip_compare() (line 1444), get_perception_encoder_text_embeddings() (line 1552) | Each call site in client.py |

The two deprecated LMM blocks call OpenAI's client directly (an external API, not our GPU instances), so they are out of scope.

ContextVar Safety

The remote_processing_times ContextVar lives in inference_sdk/config.py alongside the existing execution_id ContextVar. This is safe (a minimal sketch follows the list below):

  • When the SDK is used standalone (outside the server), the ContextVar defaults to None — collection is a no-op
  • When used within the server, the middleware initializes it per-request
  • Same pattern already working in production for execution_id
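A minimal, self-contained sketch of the default-None no-op behavior, using a plain list in place of the collector class:

import contextvars
from typing import List, Optional

remote_processing_times: contextvars.ContextVar[Optional[List[float]]] = (
    contextvars.ContextVar("remote_processing_times", default=None)
)

def maybe_collect(processing_time: float) -> None:
    collector = remote_processing_times.get()
    if collector is None:       # standalone SDK usage: nothing was set
        return                  # collection is a no-op
    collector.append(processing_time)

maybe_collect(0.5)                        # safe: no collector set, silently skipped

token = remote_processing_times.set([])   # what the per-request middleware does
maybe_collect(0.5)
print(remote_processing_times.get())      # [0.5]
remote_processing_times.reset(token)      # restore for the next request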

Implementation (7 files)

1. inference_sdk/config.py — Add collector class + ContextVar

Add RemoteProcessingTimeCollector (list + threading.Lock) and remote_processing_times ContextVar next to existing execution_id.

import contextvars
import threading
from typing import List, Optional


class RemoteProcessingTimeCollector:
    """Thread-safe accumulator for X-Processing-Time values reported by GPU instances."""

    def __init__(self) -> None:
        self._times: List[float] = []
        self._lock = threading.Lock()  # step threads may add concurrently

    def add(self, processing_time: float) -> None:
        with self._lock:
            self._times.append(processing_time)

    def get_total(self) -> float:
        with self._lock:
            return sum(self._times)

    def get_times(self) -> List[float]:
        with self._lock:
            return list(self._times)  # copy, so callers cannot mutate internal state

    def has_data(self) -> bool:
        with self._lock:
            return len(self._times) > 0


# Defaults to None so standalone SDK usage (outside the server) is a no-op.
remote_processing_times: contextvars.ContextVar[Optional[RemoteProcessingTimeCollector]] = (
    contextvars.ContextVar("remote_processing_times", default=None)
)
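A quick illustration of the collector semantics; the sample values deliberately match the header example in step 7:

collector = RemoteProcessingTimeCollector()
for t in (0.52, 0.45, 0.92):
    collector.add(t)

assert collector.has_data()
print(collector.get_total())   # ≈ 1.89, feeds X-Remote-Processing-Time
print(collector.get_times())   # [0.52, 0.45, 0.92], feeds X-Remote-Processing-Times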

2. inference/core/env.py — Add env var

WORKFLOWS_REMOTE_EXECUTION_TIME_FORWARDING = str2bool(
    os.getenv("WORKFLOWS_REMOTE_EXECUTION_TIME_FORWARDING", "False")
)

3. inference_sdk/http/utils/executors.py — Collect from execute_requests_packages

After responses return from make_parallel_requests (back in the calling thread where ContextVar is set), read X-Processing-Time from each response. Place before the api_key_safe_raise_for_status loop so times are captured even on error.

# After results.extend(responses), BEFORE the raise-for-status loop, so times
# are captured even when a response carries an error status.
# (remote_processing_times comes from inference_sdk.config; PROCESSING_TIME_HEADER
# names the X-Processing-Time header.)
collector = remote_processing_times.get()
if collector is not None:
    for response in results:
        pt = response.headers.get(PROCESSING_TIME_HEADER)
        if pt is not None:
            try:
                collector.add(float(pt))
            except (ValueError, TypeError):
                # Ignore malformed header values rather than failing the request.
                pass

4. inference_sdk/http/client.py — Collect from 3 direct requests.post() calls

Add the same collection snippet after api_key_safe_raise_for_status(response=response) in:

  • get_clip_text_embeddings() (line 1346)
  • clip_compare() (line 1449)
  • get_perception_encoder_text_embeddings() (line 1559)

These already have execution_id.get() calls right above, so the pattern is established.
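For concreteness, a hypothetical, simplified shape of the change inside clip_compare() (the URL and payload are placeholders; only the collection block after the status check is the addition):

response = requests.post(url, json=payload)   # existing call, unchanged
api_key_safe_raise_for_status(response=response)

collector = remote_processing_times.get()
if collector is not None:
    pt = response.headers.get(PROCESSING_TIME_HEADER)
    if pt is not None:
        try:
            collector.add(float(pt))
        except (ValueError, TypeError):
            pass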

5. inference/core/workflows/execution_engine/v1/executor/core.py — Propagate ContextVar to step threads

Same pattern as execution_id (a standalone sketch follows the list):

  • In execute_steps() (line 133): capture remote_processing_times.get() and pass via partial
  • In safe_execute_step() (line 160): accept param and call remote_processing_times.set(collector)
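A toy, runnable version of the propagation pattern (the real functions take more parameters; only the capture-and-re-set mechanics are shown, with a plain list standing in for the collector):

import contextvars
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import List, Optional

remote_processing_times: contextvars.ContextVar[Optional[List[float]]] = (
    contextvars.ContextVar("remote_processing_times", default=None)
)

def safe_execute_step(step_name: str, collector: Optional[List[float]] = None) -> str:
    # Worker threads start with a fresh context, so re-set the captured value here.
    remote_processing_times.set(collector)
    assert remote_processing_times.get() is collector
    return step_name

def execute_steps(steps: List[str]) -> List[str]:
    # Capture in the calling thread, where the ContextVar was set...
    collector = remote_processing_times.get()
    # ...and hand it to each worker via partial, mirroring the execution_id pattern.
    run_step = partial(safe_execute_step, collector=collector)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_step, steps))

remote_processing_times.set([])
print(execute_steps(["detect", "classify"]))  # ['detect', 'classify']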

6. inference/core/workflows/core_steps/common/utils.py — Propagate ContextVar in run_in_parallel

run_in_parallel() (line 465) spawns a bare ThreadPoolExecutor used by CLIP comparison and other blocks. Capture and re-set remote_processing_times (and execution_id) in the worker threads. This is needed because clip_compare() makes a direct requests.post() inside those threads.
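A sketch of the wrapper, assuming the ContextVars are imported from inference_sdk/config.py as described above (the real signature of run_in_parallel() may differ):

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

from inference_sdk.config import execution_id, remote_processing_times

T = TypeVar("T")

def run_in_parallel(tasks: List[Callable[[], T]], max_workers: int) -> List[T]:
    # Capture both ContextVars in the calling thread, where they are set...
    collector = remote_processing_times.get()
    current_execution_id = execution_id.get()

    def _with_context(task: Callable[[], T]) -> T:
        # ...and re-set them in the worker thread before running the task.
        remote_processing_times.set(collector)
        execution_id.set(current_execution_id)
        return task()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_with_context, tasks))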

7. inference/core/interfaces/http/http_api.py — Initialize collector + add new headers

In GCPServerlessMiddleware.dispatch() (line 294):

  • Before call_next: if WORKFLOWS_REMOTE_EXECUTION_TIME_FORWARDING, create collector and set ContextVar
  • After call_next: if collector has data, add two new headers:
X-Processing-Time: 2.35                   # wall-clock CPU time (unchanged)
X-Remote-Processing-Time: 1.89            # sum of GPU remote times (new)
X-Remote-Processing-Times: 0.52,0.45,0.92 # individual GPU times (new)

Also expose both new headers in CORS config (line 394).
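A sketch of the dispatch() wiring, assuming the existing Starlette-style middleware; only the collector setup and the two new headers are additions, and the existing X-Processing-Time logic is elided. (Two-decimal formatting matches the header example above and is a choice, not a requirement.)

from starlette.middleware.base import BaseHTTPMiddleware

from inference.core.env import WORKFLOWS_REMOTE_EXECUTION_TIME_FORWARDING
from inference_sdk.config import RemoteProcessingTimeCollector, remote_processing_times

class GCPServerlessMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        collector = None
        if WORKFLOWS_REMOTE_EXECUTION_TIME_FORWARDING:
            collector = RemoteProcessingTimeCollector()
            # Tasks spawned while handling this request inherit the current
            # context, so the SDK's collection code can see the collector.
            remote_processing_times.set(collector)
        response = await call_next(request)
        if collector is not None and collector.has_data():
            response.headers["X-Remote-Processing-Time"] = f"{collector.get_total():.2f}"
            response.headers["X-Remote-Processing-Times"] = ",".join(
                f"{t:.2f}" for t in collector.get_times()
            )
        return response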

What Doesn't Change

  • Usage collector — untouched
  • X-Processing-Time header — stays as wall-clock time
  • Individual blocks — no changes to any run_remotely() methods
  • Execution engine core logic — only ContextVar propagation (same as execution_id)
  • External API calls (Anthropic, OpenAI, Google) — use their own HTTP clients, not collected
  • Existing deployments — unaffected; WORKFLOWS_REMOTE_EXECUTION_TIME_FORWARDING defaults to False

Verification

  1. Run existing tests with env var unset — no behavior change
  2. Set WORKFLOWS_REMOTE_EXECUTION_TIME_FORWARDING=True + WORKFLOWS_STEP_EXECUTION_MODE=remote, run a workflow, and verify both new headers are present alongside the unchanged X-Processing-Time (smoke-test snippet below)
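A hypothetical smoke test for step 2 (the URL and payload are placeholders for a real workflow request against the CPU orchestrator):

import requests

response = requests.post("http://localhost:9001/path/to/workflow/endpoint", json={})
print(response.headers.get("X-Processing-Time"))          # unchanged wall-clock time
print(response.headers.get("X-Remote-Processing-Time"))   # new: summed GPU time
print(response.headers.get("X-Remote-Processing-Times"))  # new: individual GPU times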