Skip to content

Instantly share code, notes, and snippets.

@cnolanminich
Created April 9, 2026 13:00
Show Gist options
  • Select an option

  • Save cnolanminich/4cd418df082e255a97db302e3fe60259 to your computer and use it in GitHub Desktop.

Select an option

Save cnolanminich/4cd418df082e255a97db302e3fe60259 to your computer and use it in GitHub Desktop.
Sizing Considerations for Dagster+ ECS non-isolated runs

Dagster+ ECS Sizing Guide: Non-Isolated Runs

Overview

With non-isolated runs, each run executes as a thread inside the long-running code server process. Memory and CPU are shared across all concurrent runs. This guide covers how to profile your runs, size your ECS tasks, configure Dagster's gRPC settings, and decide when to split into multiple replicas.


The Core Sizing Equation

total_memory = base_code_server_mem + (avg_run_peak_mem × max_concurrent_runs)
total_cpu    = base_code_server_cpu + (avg_run_cpu × max_concurrent_runs)

Step 1: Profile a Single Run

Run one job in isolation and measure its peak resource usage with a dedicated profiling asset:

import tracemalloc
import psutil
import os
from dagster import asset

@asset
def profiled_asset(context):
    tracemalloc.start()
    proc = psutil.Process(os.getpid())

    mem_before = proc.memory_info().rss / 1024**2  # MB

    # ... your actual work ...

    mem_after = proc.memory_info().rss / 1024**2
    peak_mem = tracemalloc.get_traced_memory()[1] / 1024**2
    tracemalloc.stop()

    context.log.info(f"RSS delta: {mem_after - mem_before:.1f} MB")
    context.log.info(f"Peak traced: {peak_mem:.1f} MB")

You can also use CloudWatch Container Insights if enabled. Keep this asset in a dev or staging deployment so you can materialize it before making sizing decisions.


Step 2: Estimate Peak Concurrency

avg_run_duration_seconds = 45   # measure from run history
runs_per_hour = 200             # expected throughput
avg_concurrent = (avg_run_duration_seconds / 3600) * runs_per_hour
peak_concurrent = avg_concurrent * 2.5  # ~2.5x burst headroom
print(f"Target concurrency: {peak_concurrent:.0f}")

Step 3: Size the ECS Task

Task size reference

Concurrent Runs Suggested Task Size Notes
5–10 light runs 2 vCPU / 4–8 GB Good starting point
10–25 medium runs 4 vCPU / 16 GB Common production size
25–50 runs 8 vCPU / 32 GB Consider replicas instead
50+ Multiple replicas Vertical scaling hits diminishing returns

Configure in dagster_cloud.yaml

For Dagster+ ECS, set server_resources (the long-running code server task) separately from run_resources (only used for isolated runs):

locations:
  - location_name: my-code-location
    image: my-image:latest
    code_source:
      package_name: my_package
    container_context:
      ecs:
        server_resources:   # this is the non-isolated run host
          cpu: 2048         # 2 vCPU
          memory: 8192      # 8 GB
        run_resources:      # only relevant for isolated runs
          cpu: 1024
          memory: 4096

Step 4: Configure Dagster Concurrency

Cap concurrency in Dagster to match your task capacity. Without this, ECS will OOM-kill the task under burst load in the Dagster+ UI under Deployment → Run Coordinator.


Step 5: Configure gRPC Workers

DAGSTER_GRPC_MAX_WORKERS controls the thread pool size on the code server. Set this as an environment variable on the code server ECS task — either directly in the task definition or via dagster_cloud.yaml:

container_context:
  ecs:
    env_vars:
      - DAGSTER_GRPC_MAX_WORKERS=20
    server_resources:
      cpu: 2048
      memory: 8192

How it works

The gRPC server uses a ThreadPoolExecutor. A run cannot start until a worker thread picks it up, making this the primary gate on concurrent run count. The default is Python's ThreadPoolExecutor default: num_vCPUs × 5.

The GIL constraint

The GIL is a performance ceiling, not a hard limit. It serializes Python bytecode execution across threads:

  • I/O-bound runs (DB queries, API calls, dbt Cloud): threads release the GIL while waiting, so concurrency scales well. 20–40+ concurrent runs on a single server is realistic.
  • CPU-bound runs (pandas, heavy Python computation): threads hold the GIL while computing. Throughput plateaus at roughly 1 vCPU equivalent regardless of thread count. Adding more threads adds overhead with no throughput gain past your vCPU count.

The GIL degrades throughput gradually — it doesn't crash the server. Memory is the hard wall.

Recommended rule

Set DAGSTER_GRPC_MAX_WORKERSmax_concurrent_runs so the gRPC thread pool is never the bottleneck. Then use throughput metrics (see below) to validate.

Hard limits summary

Constraint Type Effect when hit
DAGSTER_GRPC_MAX_WORKERS Hard Runs queue, don't start
ECS task memory Hard OOM kill — task restarts
Thread stack memory Soft (~8 MB/thread) Gradual memory pressure
GIL contention (CPU-bound) Performance ceiling Throughput plateaus, not crashes

Step 6: Enable Throughput Metrics

In your CloudFormation template, set:

AgentMetricsEnabled: true
CodeServerMetricsEnabled: true

This emits structured logs to CloudWatch:

{
  "request_utilization": {
    "max_concurrent_requests": 20,
    "num_running_requests": 12,
    "num_queued_requests": 3
  }
}

num_queued_requests > 0 consistently means your DAGSTER_GRPC_MAX_WORKERS is too low or your task is undersized.


When to Use Replicas vs. a Bigger Task

Use multiple replicas when

  • You want fault isolation — one OOM kills only that replica's in-flight runs, not everything
  • Runs are mostly CPU-bound — each replica is a separate process with its own GIL, so you get real parallelism
  • You want clean rolling deploys (one replica drains at a time)
  • Run duration is long (>5 min) — blast radius of a single-server OOM is too costly

Use a bigger single task when

  • Runs share expensive in-memory state (loaded ML models, large lookup tables) that would be costly to reload per process
  • Runs are short and idempotent — OOM blast radius is low and retries are cheap

Practical concurrency thresholds before splitting

Run type Single server limit Then do
Mostly I/O-bound ~20–30 concurrent Add a second replica
Mixed I/O + CPU ~10–15 concurrent 2–3 replicas
Mostly CPU-bound ~4–8 concurrent 3–4 smaller replicas
Long-running (>5 min) ~10 Replicas for fault isolation

Blast radius comparison

avg_concurrent_runs = 15

# Single server OOM: lose all 15 in-flight runs, task restarts
# 3 replicas, one OOM: lose ~5, other 2 replicas keep running immediately

Multi-replica max_concurrent_runs

With multiple replicas, the ECS agent distributes gRPC requests via round-robin DNS. Set your global max_concurrent_runs in the Dagster+ deployment settings to account for all replicas

Recommended default

Start with 2 replicas of a moderately sized task (4 vCPU / 16 GB), each handling ~15 concurrent runs. This gives you fault isolation, clean rolling deploys, and room to scale to 3–4 replicas before needing to rethink task sizing. Only consolidate to a single large server if shared in-memory state makes per-replica memory cost prohibitive.


Settings Reference

Setting Where Purpose
server_resources.cpu/memory dagster_cloud.yaml / container_context.yaml Code server ECS task size
DAGSTER_GRPC_MAX_WORKERS Code server container env var gRPC thread pool size
max_concurrent_runs Dagster+ UI Global run concurrency cap
AgentMetricsEnabled CloudFormation param Enable agent throughput logs
CodeServerMetricsEnabled CloudFormation param Enable code server throughput logs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment