Dagster+ ECS Sizing Guide: Non-Isolated Runs

Overview

With non-isolated runs, each run executes as a thread inside the long-running code server process. Memory and CPU are shared across all concurrent runs. This guide covers how to profile your runs, size your ECS tasks, configure Dagster's gRPC settings, and decide when to split into multiple replicas.

The Core Sizing Equation

total_memory = base_code_server_mem + (avg_run_peak_mem × max_concurrent_runs)
total_cpu    = base_code_server_cpu + (avg_run_cpu × max_concurrent_runs)

Step 1: Profile a Single Run

Run one job in isolation and measure its peak resource usage with a dedicated profiling asset:

import tracemalloc
import psutil
import os
from dagster import asset

@asset
def profiled_asset(context):
    tracemalloc.start()
    proc = psutil.Process(os.getpid())

    mem_before = proc.memory_info().rss / 1024**2  # MB

    # ... your actual work ...

    mem_after = proc.memory_info().rss / 1024**2
    peak_mem = tracemalloc.get_traced_memory()[1] / 1024**2
    tracemalloc.stop()

    context.log.info(f"RSS delta: {mem_after - mem_before:.1f} MB")
    context.log.info(f"Peak traced: {peak_mem:.1f} MB")

You can also use CloudWatch Container Insights if enabled. Keep this asset in a dev or staging deployment so you can materialize it before making sizing decisions.

Step 2: Estimate Peak Concurrency

avg_run_duration_seconds = 45   # measure from run history
runs_per_hour = 200             # expected throughput
avg_concurrent = (avg_run_duration_seconds / 3600) * runs_per_hour
peak_concurrent = avg_concurrent * 2.5  # ~2.5x burst headroom
print(f"Target concurrency: {peak_concurrent:.0f}")

Step 3: Size the ECS Task

Task size reference

Concurrent Runs	Suggested Task Size	Notes
5–10 light runs	2 vCPU / 4–8 GB	Good starting point
10–25 medium runs	4 vCPU / 16 GB	Common production size
25–50 runs	8 vCPU / 32 GB	Consider replicas instead
50+	Multiple replicas	Vertical scaling hits diminishing returns

Configure in `dagster_cloud.yaml`

For Dagster+ ECS, set server_resources (the long-running code server task) separately from run_resources (only used for isolated runs):

locations:
  - location_name: my-code-location
    image: my-image:latest
    code_source:
      package_name: my_package
    container_context:
      ecs:
        server_resources:   # this is the non-isolated run host
          cpu: 2048         # 2 vCPU
          memory: 8192      # 8 GB
        run_resources:      # only relevant for isolated runs
          cpu: 1024
          memory: 4096

Step 4: Configure Dagster Concurrency

Cap concurrency in Dagster to match your task capacity. Without this, ECS will OOM-kill the task under burst load in the Dagster+ UI under Deployment → Run Coordinator.

Step 5: Configure gRPC Workers

DAGSTER_GRPC_MAX_WORKERS controls the thread pool size on the code server. Set this as an environment variable on the code server ECS task — either directly in the task definition or via dagster_cloud.yaml:

container_context:
  ecs:
    env_vars:
      - DAGSTER_GRPC_MAX_WORKERS=20
    server_resources:
      cpu: 2048
      memory: 8192

How it works

The gRPC server uses a ThreadPoolExecutor. A run cannot start until a worker thread picks it up, making this the primary gate on concurrent run count. The default is Python's ThreadPoolExecutor default: num_vCPUs × 5.

The GIL constraint

The GIL is a performance ceiling, not a hard limit. It serializes Python bytecode execution across threads:

I/O-bound runs (DB queries, API calls, dbt Cloud): threads release the GIL while waiting, so concurrency scales well. 20–40+ concurrent runs on a single server is realistic.
CPU-bound runs (pandas, heavy Python computation): threads hold the GIL while computing. Throughput plateaus at roughly 1 vCPU equivalent regardless of thread count. Adding more threads adds overhead with no throughput gain past your vCPU count.

The GIL degrades throughput gradually — it doesn't crash the server. Memory is the hard wall.

Recommended rule

Set DAGSTER_GRPC_MAX_WORKERS ≥ max_concurrent_runs so the gRPC thread pool is never the bottleneck. Then use throughput metrics (see below) to validate.

Hard limits summary

Constraint	Type	Effect when hit
`DAGSTER_GRPC_MAX_WORKERS`	Hard	Runs queue, don't start
ECS task memory	Hard	OOM kill — task restarts
Thread stack memory	Soft (~8 MB/thread)	Gradual memory pressure
GIL contention (CPU-bound)	Performance ceiling	Throughput plateaus, not crashes

Step 6: Enable Throughput Metrics

In your CloudFormation template, set:

AgentMetricsEnabled: true
CodeServerMetricsEnabled: true

This emits structured logs to CloudWatch:

{
  "request_utilization": {
    "max_concurrent_requests": 20,
    "num_running_requests": 12,
    "num_queued_requests": 3
  }
}

num_queued_requests > 0 consistently means your DAGSTER_GRPC_MAX_WORKERS is too low or your task is undersized.

When to Use Replicas vs. a Bigger Task

Use multiple replicas when

You want fault isolation — one OOM kills only that replica's in-flight runs, not everything
Runs are mostly CPU-bound — each replica is a separate process with its own GIL, so you get real parallelism
You want clean rolling deploys (one replica drains at a time)
Run duration is long (>5 min) — blast radius of a single-server OOM is too costly

Use a bigger single task when

Runs share expensive in-memory state (loaded ML models, large lookup tables) that would be costly to reload per process
Runs are short and idempotent — OOM blast radius is low and retries are cheap

Practical concurrency thresholds before splitting

Run type	Single server limit	Then do
Mostly I/O-bound	~20–30 concurrent	Add a second replica
Mixed I/O + CPU	~10–15 concurrent	2–3 replicas
Mostly CPU-bound	~4–8 concurrent	3–4 smaller replicas
Long-running (>5 min)	~10	Replicas for fault isolation

Blast radius comparison

avg_concurrent_runs = 15

# Single server OOM: lose all 15 in-flight runs, task restarts
# 3 replicas, one OOM: lose ~5, other 2 replicas keep running immediately

Multi-replica `max_concurrent_runs`

With multiple replicas, the ECS agent distributes gRPC requests via round-robin DNS. Set your global max_concurrent_runs in the Dagster+ deployment settings to account for all replicas

Recommended default

Start with 2 replicas of a moderately sized task (4 vCPU / 16 GB), each handling ~15 concurrent runs. This gives you fault isolation, clean rolling deploys, and room to scale to 3–4 replicas before needing to rethink task sizing. Only consolidate to a single large server if shared in-memory state makes per-replica memory cost prohibitive.

Settings Reference

Setting	Where	Purpose
`server_resources.cpu/memory`	`dagster_cloud.yaml` / container_context.yaml	Code server ECS task size
`DAGSTER_GRPC_MAX_WORKERS`	Code server container env var	gRPC thread pool size
`max_concurrent_runs`	Dagster+ UI	Global run concurrency cap
`AgentMetricsEnabled`	CloudFormation param	Enable agent throughput logs
`CodeServerMetricsEnabled`	CloudFormation param	Enable code server throughput logs

cnolanminich/code_server_sizing.md

Select an option

No results found

Select an option

No results found

Dagster+ ECS Sizing Guide: Non-Isolated Runs

Overview

The Core Sizing Equation

Step 1: Profile a Single Run

Step 2: Estimate Peak Concurrency

Step 3: Size the ECS Task

Task size reference

Configure in `dagster_cloud.yaml`

Step 4: Configure Dagster Concurrency

Step 5: Configure gRPC Workers

How it works

The GIL constraint

Recommended rule

Hard limits summary

Step 6: Enable Throughput Metrics

When to Use Replicas vs. a Bigger Task

Use multiple replicas when

Use a bigger single task when

Practical concurrency thresholds before splitting

Blast radius comparison

Multi-replica `max_concurrent_runs`

Recommended default

Settings Reference

cnolanminich/code_server_sizing.md

Dagster+ ECS Sizing Guide: Non-Isolated Runs

Overview

The Core Sizing Equation

Step 1: Profile a Single Run

Step 2: Estimate Peak Concurrency

Step 3: Size the ECS Task

Task size reference

Configure in dagster_cloud.yaml

Step 4: Configure Dagster Concurrency

Step 5: Configure gRPC Workers

How it works

The GIL constraint

Recommended rule

Hard limits summary

Step 6: Enable Throughput Metrics

When to Use Replicas vs. a Bigger Task

Use multiple replicas when

Use a bigger single task when

Practical concurrency thresholds before splitting

Blast radius comparison

Multi-replica max_concurrent_runs

Recommended default

Settings Reference

Configure in `dagster_cloud.yaml`

Multi-replica `max_concurrent_runs`