Modal Labs Scaling Guide

Overview

Modal is a serverless cloud platform that provides automatic scaling for Python workloads. It's particularly useful for:

  • Parallel data processing
  • Machine learning training and inference
  • Hyperparameter optimization
  • Web scraping at scale
  • Batch computations requiring GPUs

Core Scaling Concepts

1. Automatic Horizontal Scaling

Modal automatically scales containers based on workload with zero configuration required.

import modal

app = modal.App("my-scaling-app")

@app.function()
def process_item(item):
    # Automatically scales to handle concurrent requests
    return expensive_computation(item)

@app.local_entrypoint()
def main():
    # Process 10,000 items in parallel across multiple containers
    results = list(process_item.map(range(10_000)))

Configuration Options:

  • min_containers: Number of warm containers to keep ready (default: 0)
  • max_containers: Upper limit on concurrent containers
  • buffer_containers: Extra idle containers kept provisioned while the function is active, to absorb bursts without queuing
  • scaledown_window: How long (in seconds) an idle container waits before shutting down

@app.function(
    min_containers=2,
    max_containers=100,
    buffer_containers=5,
    scaledown_window=120
)
def scalable_function():
    pass

2. Parallel Execution Patterns

Map for Embarrassingly Parallel Workloads

@app.function(gpu="T4")
def train_model(hyperparams):
    # Train with specific hyperparameters
    return model_score

@app.local_entrypoint()
def hyperparameter_search():
    param_grid = [
        {"lr": 0.001, "batch_size": 32},
        {"lr": 0.01, "batch_size": 64},
        # ... hundreds more combinations
    ]
    
    # Parallel execution across GPUs; map() yields only return values,
    # so pair them back up with their inputs
    for params, score in zip(param_grid, train_model.map(param_grid)):
        print(f"Params: {params}, Score: {score}")

Spawn Map for Fire-and-Forget Processing

volume = modal.Volume.from_name("my-data-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def process_batch(batch_id):
    # Process and store results externally
    data = fetch_batch(batch_id)
    results = process(data)
    save_to_volume(results)

@app.local_entrypoint()
def main():
    # Submit 100,000 jobs without waiting for results
    process_batch.spawn_map(range(100_000))

3. Concurrent Input Processing

Process multiple inputs in a single container for I/O-bound workloads:

import aiohttp

@app.function()
@modal.concurrent(max_inputs=100, target_inputs=80)
async def fetch_api_data(url: str):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.json()

# Single container handles up to 100 concurrent API calls

4. Dynamic Batching for ML Inference

Optimize GPU utilization by batching requests:

@app.function(gpu="A100")
@modal.batched(max_batch_size=32, wait_ms=100)
async def predict_batch(inputs: list[np.ndarray]) -> list[dict]:
    # Process entire batch on GPU at once
    predictions = model.predict(np.stack(inputs))
    return [{"prediction": pred} for pred in predictions]

# Individual requests automatically batched
result = predict_batch.remote(single_input)

5. Job Queue Pattern

For long-running background tasks:

@app.function(timeout=3600)  # 1 hour timeout
def process_video(video_url: str):
    # Long-running video processing
    return processed_result

def submit_job(video_url):
    # Submit job and get tracking ID
    fn = modal.Function.from_name("video-processor", "process_video")
    call = fn.spawn(video_url)
    return call.object_id

def get_result(job_id):
    call = modal.FunctionCall.from_id(job_id)
    return call.get()  # Blocks until complete
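
If the caller should not block, a minimal polling sketch (reusing the job ID from submit_job; passing timeout=0 to FunctionCall.get returns immediately and raises TimeoutError while the job is still running):

def poll_result(job_id):
    call = modal.FunctionCall.from_id(job_id)
    try:
        return call.get(timeout=0)  # returns the result if it is ready
    except TimeoutError:
        return None  # still running; check again later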

Practical Examples

Hyperparameter Search with Optuna

import optuna
import modal

app = modal.App(image=modal.Image.debian_slim()
    .pip_install("optuna", "scikit-learn", "pandas"))

@app.function()
def objective(trial_params):
    """Single trial evaluation"""
    model = create_model(trial_params)
    score = cross_validate(model)
    return score

@app.function()
def distributed_optuna_search(n_trials=1000):
    """Distributed hyperparameter optimization"""
    study = optuna.create_study(direction="maximize")
    
    # Ask for trials, then sample their parameters locally so the remote
    # objective receives plain dicts rather than Trial objects
    trials = [study.ask() for _ in range(n_trials)]
    param_sets = [
        {
            # Example search space; adjust to your model
            "lr": trial.suggest_float("lr", 1e-4, 1e-1, log=True),
            "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64]),
        }
        for trial in trials
    ]

    # Evaluate all trials in parallel
    scores = list(objective.map(param_sets))
    
    # Update study with results
    for trial, score in zip(trials, scores):
        study.tell(trial, score)
    
    return study.best_params

Grid Search Pattern

@app.function(gpu="T4")
def evaluate_model(hyperparams):
    """Evaluate single hyperparameter configuration"""
    lr, batch_size, dropout = hyperparams
    model = train_model(lr, batch_size, dropout)
    accuracy = evaluate(model)
    return (hyperparams, accuracy)

@app.local_entrypoint()
def grid_search():
    """Parallel grid search across parameter space"""
    param_grid = [
        (lr, bs, dr)
        for lr in [0.001, 0.01, 0.1]
        for bs in [16, 32, 64, 128]
        for dr in [0.1, 0.2, 0.3, 0.5]
    ]
    
    # Run all experiments in parallel
    results = list(evaluate_model.map(param_grid))
    best_params, best_score = max(results, key=lambda x: x[1])
    print(f"Best params: {best_params}, Score: {best_score}")

Large-Scale Data Fetching

@app.function()
@modal.concurrent(max_inputs=500)
async def fetch_webpage(url: str):
    """Fetch and process webpage content"""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            return extract_data(html)

@app.local_entrypoint()
def scrape_websites():
    """Scrape 100,000 websites in parallel"""
    urls = load_url_list()  # 100,000 URLs
    
    # Process with automatic scaling; pair results back up with their URLs
    for url, data in zip(urls, fetch_webpage.map(urls)):
        save_to_database(url, data)

Batch Image Processing

@app.function(
    gpu="T4",
    image=modal.Image.debian_slim()
        .pip_install("torch", "torchvision", "pillow")
)
@modal.batched(max_batch_size=64, wait_ms=50)
async def process_images(image_paths: list[str]) -> list[dict]:
    """Process batch of images on GPU"""
    images = [load_image(path) for path in image_paths]
    tensors = torch.stack([transform(img) for img in images])
    
    # Process entire batch on GPU
    with torch.no_grad():
        features = model(tensors.cuda())
    
    return [{"features": f.cpu().numpy()} for f in features]

Environment Setup

Custom Docker Images

# Complex ML environment
ml_image = (
    modal.Image.from_registry("pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime")
    .pip_install("transformers", "datasets", "wandb")
    .run_commands("apt-get update && apt-get install -y git")
)

app = modal.App(image=ml_image)

@app.function(gpu="A100")
def train_llm():
    # Use pre-configured ML environment
    pass

Mounting Volumes for Persistent Storage

volume = modal.Volume.from_name("my-data-volume")

@app.function(volumes={"/data": volume})
def process_with_storage():
    # Read/write to persistent volume
    data = pd.read_parquet("/data/input.parquet")
    results = process(data)
    results.to_parquet("/data/output.parquet")
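
If other containers need to see these writes before this one exits, a minimal sketch using the Volume's explicit commit/reload calls (reusing app, volume, and pandas from the example above):

@app.function(volumes={"/data": volume})
def write_results(results_df):
    results_df.to_parquet("/data/output.parquet")
    volume.commit()   # push the change so other containers can see it

@app.function(volumes={"/data": volume})
def read_results():
    volume.reload()   # fetch the latest committed state of the volume
    return pd.read_parquet("/data/output.parquet")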

Best Practices

  1. Choose the Right Pattern:

    • Use .map() for result collection
    • Use .spawn_map() for fire-and-forget
    • Use @modal.concurrent for I/O-bound tasks
    • Use @modal.batched for GPU inference
  2. Optimize Container Configuration:

    • Set min_containers for low-latency requirements
    • Use buffer_containers to handle burst traffic
    • Configure appropriate timeouts for long-running tasks
  3. Resource Management:

    • Pin dependency versions in images
    • Use volumes for shared data
    • Leverage GPU acceleration where beneficial
  4. Monitoring and Debugging:

    • Use Modal dashboard for monitoring
    • Implement logging within functions
    • Handle failures gracefully with retries (see the sketch after this list)
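
A minimal sketch combining the retry and container settings above; the specific values (retries=3, timeout=600, and so on) are illustrative assumptions, not recommendations:

@app.function(
    retries=3,             # retry a failed input up to 3 times
    timeout=600,           # give up on an input after 10 minutes
    min_containers=1,      # keep one warm container for low latency
    buffer_containers=2,   # extra idle containers to absorb bursts
)
def resilient_task(item):
    # Log progress so failures are easy to trace in the Modal dashboard
    print(f"processing {item}")
    return item * 2  # placeholder work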

When to Use Modal

Ideal Use Cases:

  • Hyperparameter tuning (grid search, Bayesian optimization)
  • Large-scale data processing and ETL
  • Distributed model training
  • Batch inference on GPUs
  • Web scraping and API data collection
  • Parallel simulations and Monte Carlo methods

Not Recommended For:

  • Real-time, sub-millisecond latency requirements
  • Stateful, long-lived services
  • Applications requiring specific network configurations

Quick Start Commands

# Install Modal
pip install modal

# Authenticate
modal setup

# Deploy an app
modal deploy my_app.py

# Run a function
modal run my_app.py::function_name

# Run detached (the app keeps running after you close the terminal)
modal run --detach my_app.py

Cost Optimization Tips

  1. Use .spawn_map() for large batches to avoid result serialization overhead
  2. Configure scaledown_window to balance cost vs latency
  3. Use spot instances for fault-tolerant workloads
  4. Batch small tasks to reduce container startup overhead (see the chunking sketch below)
  5. Use appropriate GPU types (T4 for inference, A100 for training)
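
For tip 4, a minimal chunking sketch; the chunk size and the toy small_task/process_chunk functions are illustrative assumptions:

def small_task(item: int) -> int:
    # Placeholder for a cheap per-item computation
    return item * item

@app.function()
def process_chunk(chunk: list[int]) -> list[int]:
    # One container call handles many small items, amortizing startup cost
    return [small_task(item) for item in chunk]

@app.local_entrypoint()
def main():
    items = list(range(1_000_000))
    chunk_size = 1_000
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    # 1,000 container calls instead of 1,000,000
    results = [r for chunk_results in process_chunk.map(chunks) for r in chunk_results]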