
@PandaWhoCodes
Created March 16, 2026 14:08
Offloading FastAPI Pipeline Processing to On-Demand Workers — Cost Analysis & Implementation Guide

Context: A FastAPI app running a sequential email→triage→PDF→LLM→save pipeline (30-60s per run). Goal: cut always-on compute costs and scale individual bottlenecks independently.


Current Architecture — What We're Working With

User Request → FastAPI (always-on server)
  → DB Setup (50-200ms)
  → Gmail Search — 13 sequential queries (4-12s)
  → Metadata Fetch — batches of 25 (3-8s)
  → LLM Triage — batches of 30 via Groq (2-4s)
  → PDF Download + PyMuPDF extraction (5-60s) ← BIGGEST BOTTLENECK
  → LLM Extraction — 3 concurrent Groq/xAI (2-5s)
  → Dedup + Enrichment (0.1-0.5s)
  → Save to DB (encrypted) (0.5-2s)
  → SSE response

Problem: The entire pipeline runs inside your web server. Even when nobody's using it, the server is burning compute. When someone IS using it, 30-60s of blocking work ties up a server instance.


Target Architecture — The Offload Pattern

User Request → FastAPI (lightweight API, can be serverless)
  → DB Setup (stays here)
  → Enqueue job → Queue (Redis/SQS)
  → Return job_id immediately (SSE for progress)

Worker (on-demand, scales to zero) picks up job:
  → Phase 1: Gmail Search + Metadata (parallelized)
  → Phase 2: LLM Triage
  → Phase 3: PDF Download + Extraction (fan-out parallel)
  → Phase 4: LLM Extraction (high concurrency)
  → Phase 5: Dedup + Save
  → Push progress via SSE/webhook/polling

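The FastAPI half of this pattern (enqueue, return a job_id, let a worker drain the queue) is small. A minimal sketch using redis-py list operations — the queue key, payload fields, and helper names here are illustrative, not taken from the original app:

```python
import json
import uuid

QUEUE_KEY = "pipeline:jobs"  # illustrative queue name

def make_job(user_id: str) -> dict:
    """Build the payload the worker will pick up.
    Keep it small: IDs and references, never the email/PDF bytes themselves."""
    return {"job_id": uuid.uuid4().hex, "user_id": user_id, "status": "queued"}

def enqueue(redis_client, user_id: str) -> str:
    """FastAPI side: push the job and return the job_id immediately."""
    job = make_job(user_id)
    redis_client.lpush(QUEUE_KEY, json.dumps(job))
    return job["job_id"]

def worker_loop(redis_client, run_pipeline):
    """Worker side: block until a job arrives, then run the full pipeline."""
    while True:
        _key, raw = redis_client.brpop(QUEUE_KEY)
        job = json.loads(raw)
        run_pipeline(job["user_id"])
```

With Modal (below) this queue layer disappears entirely — `.spawn()` plays the role of `enqueue` and Modal's scheduler plays the role of `worker_loop`.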
Platform Comparison — Real Pricing (March 2026)

Assumptions for cost estimation:

  • ~100 pipeline runs/day (moderate usage)
  • Average 45s per run
  • 512MB-1GB memory needed
  • CPU-only (no GPU required)

1. Modal (⭐ RECOMMENDED for your use case)

What it is: Serverless Python functions. Scale to zero. No containers to manage.

Pricing:

| Resource  | Cost                                    |
|-----------|-----------------------------------------|
| CPU       | $0.0000131/core/sec (~$0.047/core/hour) |
| Memory    | $0.00000222/GiB/sec (~$0.008/GiB/hour)  |
| Free tier | $30/month credits (Starter plan)        |

Cost estimate for your workload:

  • 100 runs × 45s = ~4,500 core-seconds/day (at 1 core, 1 GiB)
  • Monthly: ~135,000 core-seconds
  • CPU: 135,000 × $0.0000131 = $1.77/month
  • Memory: 135,000 × $0.00000222 = $0.30/month
  • Total: ~$2.07/month (covered by free tier!)
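The arithmetic above is easy to check in a few lines. The rates come from the pricing table; runs/day and duration are the stated assumptions:

```python
# Modal pay-per-use rates, from the pricing table above
CPU_RATE = 0.0000131   # $/core/sec
MEM_RATE = 0.00000222  # $/GiB/sec

def monthly_cost(runs_per_day: int, secs_per_run: float,
                 cores: float = 1.0, gib: float = 1.0, days: int = 30) -> float:
    """Estimated monthly Modal bill for a CPU-only workload."""
    busy_secs = runs_per_day * secs_per_run * days
    return busy_secs * (cores * CPU_RATE + gib * MEM_RATE)

print(round(monthly_cost(100, 45), 2))  # → 2.07
```

Plugging in the heaviest row of the scaling table later (10,000 runs/day) gives ~$207/month, consistent with the ~$200/mo figure quoted there.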

Why it's perfect for you:

  • Pure Python — decorate functions with @app.function(), done
  • Native FastAPI integration with Function.lookup() + .spawn()
  • Built-in job queue semantics (spawn, poll, get result)
  • Scales to zero when nobody's running pipelines
  • Can fan out PDF downloads to parallel workers trivially
  • PyMuPDF, Groq SDK, etc. just work via Image.pip_install()

Architecture with Modal:

# modal_workers.py
import modal

image = modal.Image.debian_slim().pip_install(
    "google-api-python-client", "pymupdf", "groq", "cryptography"
)
app = modal.App("policy-pipeline", image=image)

@app.function(timeout=120)
def gmail_search(user_creds: dict, queries: list[str]) -> list:
    """Run all 13 Gmail queries in parallel"""
    import asyncio
    # ... parallel search logic
    return results

@app.function(timeout=300, concurrency_limit=10)
def download_and_extract_pdf(attachment_data: dict) -> dict:
    """Download single PDF + PyMuPDF extraction"""
    # ... download + extract
    return {"text": extracted_text, "metadata": meta}

@app.function(timeout=60)
def llm_triage(emails: list[dict]) -> list[dict]:
    """Batch triage via Groq"""
    # ... groq API call
    return triage_results

@app.function(timeout=60)
def llm_extract(documents: list[dict]) -> list[dict]:
    """LLM extraction via Groq/xAI — high concurrency"""
    return extracted_policies

@app.function(timeout=300)
def run_full_pipeline(user_id: str, vault_key: str):
    """Orchestrator — runs on Modal, fans out to sub-functions"""
    # Phase 1: Gmail (parallel)
    emails = gmail_search.remote(creds, queries)
    
    # Phase 2: Triage
    triaged = llm_triage.remote(new_emails)
    
    # Phase 3: PDF downloads (FAN OUT — the big win)
    pdf_results = list(download_and_extract_pdf.map(attachments))
    
    # Phase 4: LLM extraction
    policies = llm_extract.remote(pdf_results)
    
    # Phase 5: Save (could call back to your DB or use Modal's storage)
    save_results(policies)
    return {"status": "done", "count": len(policies)}

# In your FastAPI app (a separate file/process from the Modal workers)
@app.post("/api/refresh-policies")
async def refresh_policies(user: User):
    db_setup(user)  # stays local
    
    # Offload entire pipeline to Modal
    modal_fn = modal.Function.lookup("policy-pipeline", "run_full_pipeline")
    call = await modal_fn.spawn.aio(user.id, user.vault_key)
    
    return {"job_id": call.object_id, "status": "processing"}

@app.get("/api/job/{job_id}")
async def get_job_status(job_id: str):
    call = modal.FunctionCall.from_id(job_id)
    try:
        result = await call.get.aio(timeout=0)
        return {"status": "complete", "result": result}
    except TimeoutError:
        return {"status": "processing"}

2. AWS Lambda + SQS

Pricing:

| Resource      | Cost                            |
|---------------|---------------------------------|
| Requests      | $0.20/million                   |
| Compute (x86) | $0.0000166667/GB-second         |
| SQS           | $0.40/million requests          |
| Free tier     | 1M requests + 400K GB-sec/month |

Cost estimate:

  • 100 runs × 45s × 1GB = 4,500 GB-sec/day → 135,000 GB-sec/month
  • Compute: 135,000 GB-sec is well under the 400K GB-sec free tier → $0
  • Requests: ~3,000/month, well under the 1M free tier → $0
  • Total: $0/month at this workload level

Gotchas:

  • ⚠️ 15-minute max timeout — your full pipeline (up to ~100s) fits comfortably today, but the ceiling matters if runs grow
  • ⚠️ Cold starts add 1-3s (Python runtime)
  • ⚠️ Deployment is heavier — ZIP packages, layers for PyMuPDF, IAM roles
  • ⚠️ Can't easily fan out sub-tasks from within Lambda without Step Functions
  • ⚠️ Starting Aug 2025: cold start init phase is now billed

When to choose: If you're already on AWS and want $0 cost at this scale.


3. Google Cloud Run Jobs

Pricing:

| Resource  | Cost                                    |
|-----------|-----------------------------------------|
| CPU       | $0.000024/vCPU-sec (~$0.086/vCPU-hour)  |
| Memory    | $0.0000025/GiB-sec                      |
| Free tier | 240K vCPU-sec + 450K GiB-sec/month      |

Cost estimate:

  • 135,000 vCPU-seconds/month, under the 240K vCPU-sec free tier
  • Total: $0/month at this workload level

Gotchas:

  • More container-oriented (need Dockerfile)
  • No built-in fan-out like Modal's .map()
  • Good if you're already on GCP

4. Fly.io Machines

Pricing:

| Resource        | Cost                                |
|-----------------|-------------------------------------|
| CPU             | ~$0.0027/hour (shared, 256MB)       |
| Stops when idle | Yes (scale to zero)                 |
| Storage         | $0.15/GB/month (even when stopped)  |

Cost estimate:

  • 100 runs × 45s = 4,500s/day = 1.25 hours/day
  • Monthly: 37.5 hours × $0.0027 = ~$0.10/month

When to choose: If you want more control (SSH into machines, persistent volumes) and don't mind managing containers.


5. Railway (Background Worker)

Pricing:

| Resource         | Cost                                |
|------------------|-------------------------------------|
| CPU              | $0.00000772/vCPU-sec                |
| Memory           | $0.00000386/GB-sec                  |
| Included credits | $5/month (Hobby) or $20/month (Pro) |

Cost estimate:

  • Similar to Modal pricing: ~$2-3/month (covered by plan credits)

When to choose: If you want a nice dashboard and simple deploys. But workers are always-on (no true scale-to-zero for background services).


Cost Comparison Summary

| Platform         | Monthly Cost (100 runs/day) | Scale to Zero | Setup Effort                      | Fan-out Support          |
|------------------|-----------------------------|---------------|-----------------------------------|--------------------------|
| Modal            | ~$2 (free tier covers)      | ✅ Yes        | 🟢 Easy (Python decorators)       | ✅ Built-in `.map()`     |
| AWS Lambda       | $0 (free tier)              | ✅ Yes        | 🔴 Heavy (IAM, layers, packaging) | ⚠️ Needs Step Functions  |
| Cloud Run Jobs   | $0 (free tier)              | ✅ Yes        | 🟡 Medium (Dockerfile)            | ⚠️ Manual                |
| Fly.io           | ~$0.10                      | ✅ Yes        | 🟡 Medium (Dockerfile)            | ⚠️ Manual                |
| Railway          | ~$3 (credits cover)         | ❌ No         | 🟢 Easy                           | ⚠️ Manual                |
| Always-on server | $5-25/month                 | ❌ No         | Already done                      | ❌ N/A                   |

The Big Wins — What to Offload First

Priority 1: PDF Downloads (Step 3b) — 5-60s savings

This is where you get the most bang. Currently processing PDFs in batches of 5 sequentially.

With Modal fan-out:

@app.function(timeout=120)
def download_single_pdf(gmail_creds, attachment_id, message_id):
    # Download from Gmail API
    # Extract text with PyMuPDF
    return {"text": text, "pages": pages}

# In orchestrator:
# Instead of sequential batches of 5...
# .starmap unpacks each tuple into positional args (.map would pass the tuple as one arg)
results = list(download_single_pdf.starmap(
    [(creds, att_id, msg_id) for att_id, msg_id in attachments]
))
# All PDFs download in parallel across N workers!

Impact: 60s → 5-10s (limited only by slowest single PDF)
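The same fan-out effect can be demonstrated locally with a thread pool. This is a simulation, not Modal code: each fake download just sleeps, standing in for network + PyMuPDF work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(pdf_id: int) -> str:
    time.sleep(0.05)  # stand-in for Gmail download + PyMuPDF extraction
    return f"pdf-{pdf_id}"

pdfs = range(10)

t0 = time.perf_counter()
sequential = [fake_download(p) for p in pdfs]  # one after another
seq_time = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(fake_download, pdfs))  # all at once
par_time = time.perf_counter() - t0

print(f"sequential {seq_time:.2f}s, parallel {par_time:.2f}s")
# parallel wall time ≈ the slowest single item, not the sum of all items
```

Modal's `.map()` gives you the same shape of win, except each call runs in its own container, so you are not limited by one machine's threads or bandwidth.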

Priority 2: Gmail Search (Step 1a) — 4-12s savings

13 sequential queries → 13 parallel queries.

@app.function()
def single_gmail_search(creds, query):
    return gmail_api_search(creds, query)

# Fan out all 13 queries
all_results = list(single_gmail_search.starmap(  # starmap unpacks (creds, q) tuples
    [(creds, q) for q in queries]
))

Impact: 12s → 1-2s

Priority 3: LLM Extraction (Step 3c) — 2-5s savings

Already 3x concurrent. Push to 8-10x.

@app.function(concurrency_limit=10)
def extract_single_policy(document_text, model="groq"):
    return llm_extract(document_text)

Impact: 5s → 1-2s

Total Pipeline Time After Offload

| Phase          | Before | After            |
|----------------|--------|------------------|
| DB Setup       | 200ms  | 200ms (stays)    |
| Gmail Search   | 4-12s  | 1-2s             |
| Metadata Fetch | 3-8s   | 2-4s             |
| Triage         | 2-4s   | 2-4s (API-bound) |
| PDF Download   | 5-60s  | 3-8s             |
| LLM Extract    | 2-5s   | 1-2s             |
| Dedup + Save   | 1-3s   | 1-3s             |
| Total          | 30-60s | 10-23s           |

Implementation Plan — Step by Step

Phase 1: Set up Modal (Day 1)

  1. pip install modal
  2. modal setup (one-time auth)
  3. Create modal_workers.py with your pipeline functions
  4. Test locally with modal run modal_workers.py

Phase 2: Offload PDF Downloads First (Day 1-2)

  • Extract fetch_document_text() into a Modal function
  • Use .map() for parallel downloads
  • Test with real Gmail data
  • This alone cuts 20-50s off your pipeline

Phase 3: Parallelize Gmail Search (Day 2)

  • Extract fetch_email_metadata() queries into parallel Modal calls
  • Merge results back

Phase 4: Wire Up the Orchestrator (Day 2-3)

  • Create run_full_pipeline Modal function
  • Your FastAPI endpoint becomes a thin dispatcher
  • Implement job status polling or SSE forwarding
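For the SSE-forwarding option, the FastAPI side can wrap a status poller in an async generator. A sketch under assumptions: `poll` would wrap the `modal.FunctionCall` polling from the job-status endpoint earlier, and the names here are illustrative:

```python
import asyncio
import json
from typing import AsyncIterator, Callable

async def sse_progress(poll: Callable[[], dict],
                       interval: float = 1.0) -> AsyncIterator[str]:
    """Turn a poll-for-status callable into an SSE event stream.
    Yields one `data:` frame per poll until the job reaches a terminal state."""
    while True:
        status = poll()
        yield f"data: {json.dumps(status)}\n\n"
        if status.get("status") in ("complete", "failed"):
            return
        await asyncio.sleep(interval)

# In a FastAPI route this would be returned as:
#   StreamingResponse(sse_progress(poll), media_type="text/event-stream")
```

The browser's `EventSource` then receives one event per poll, so the frontend keeps its progress UI without holding a worker connection open.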

Phase 5: Scale Down Your Server (Day 3+)

  • Your FastAPI app is now just an API gateway
  • Can move to a cheaper/smaller instance
  • Or go fully serverless (Modal can host FastAPI too!)

Queue-Based Alternative (If You Want More Control)

If you prefer a traditional queue pattern over Modal's function-calling model:

FastAPI → Redis (Celery broker) → Worker Process

Stack: FastAPI + Celery + Redis + Fly.io (worker)

# celery_tasks.py
from celery import Celery

app = Celery('pipeline', broker='redis://...')

@app.task(bind=True, max_retries=3)
def process_pipeline(self, user_id, vault_key):
    try:
        emails = fetch_emails(user_id)
        triaged = triage_emails(emails)
        pdfs = download_pdfs(triaged)  # still sequential here
        policies = extract_policies(pdfs)
        save_policies(policies)
        return {"status": "done"}
    except Exception as exc:
        # raise so the task is re-queued for retry instead of returning None
        raise self.retry(exc=exc, countdown=60)

Costs:

  • Redis (Upstash serverless): Free tier → $0/month
  • Celery worker on Fly.io: ~$0.10/month (scale to zero)
  • More operational overhead than Modal

Decision Matrix — Which Path to Take

| Factor                      | Modal                | Lambda+SQS             | Celery+Redis             |
|-----------------------------|----------------------|------------------------|--------------------------|
| Your stack (Python/FastAPI) | ✅ Perfect fit       | ✅ Good                | ✅ Good                  |
| Fan-out parallelism         | ✅ `.map()` built-in | ⚠️ Need Step Functions | ⚠️ Need Celery groups    |
| Scale to zero               | ✅ Yes               | ✅ Yes                 | ⚠️ (worker needs to run) |
| Time to implement           | 1-2 days             | 3-5 days               | 2-3 days                 |
| Vendor lock-in              | Medium               | High (AWS)             | Low                      |
| Debugging                   | Good (web dashboard) | OK (CloudWatch)        | Good (Flower)            |
| Cost at your scale          | ~$0-2/mo             | $0/mo                  | ~$0-5/mo                 |

My Recommendation

Go with Modal. Here's why:

  1. You're a Python shop — Modal is literally "add a decorator, deploy"
  2. Fan-out is your biggest win — .map() for PDFs is the single highest-impact change
  3. $30 free credits/month covers your workload 10x over
  4. Zero infra to manage — no Docker, no IAM, no VPCs
  5. Migration is incremental — offload one function at a time, not a big-bang rewrite

The only reason to pick Lambda is if you're already deep in AWS and want the $0 free tier forever. But the implementation overhead is 3x more.


Scaling Projections

| Users/Day | Runs/Day | Modal Cost   | Lambda Cost  | Always-On Server |
|-----------|----------|--------------|--------------|------------------|
| 10        | 100      | $2/mo (free) | $0/mo (free) | $7-25/mo         |
| 50        | 500      | $10/mo       | $2/mo        | $15-50/mo        |
| 200       | 2,000    | $40/mo       | $8/mo        | $50-150/mo       |
| 1,000     | 10,000   | $200/mo      | $40/mo       | $200-500/mo      |

At every scale, on-demand workers cost less than an always-on server that's idle 95% of the time.


Generated March 2026. Prices sourced from official pricing pages.
