
@PandaWhoCodes
Created March 16, 2026 14:08
Offloading FastAPI Pipeline Processing to On-Demand Workers — Cost Analysis & Implementation Guide

Context: A FastAPI app running a sequential email→triage→PDF→LLM→save pipeline (30-60s per run). Goal: cut always-on compute costs and scale individual bottlenecks independently.


Current Architecture — What We're Working With

User Request → FastAPI (always-on server)
  → DB Setup (50-200ms)
  → Gmail Search — 13 sequential queries (4-12s)
  → Metadata Fetch — batches of 25 (3-8s)
  → LLM Triage — batches of 30 via Groq (2-4s)
  → PDF Download + PyMuPDF extraction (5-60s) ← BIGGEST BOTTLENECK
  → LLM Extraction — 3 concurrent Groq/xAI (2-5s)
  → Dedup + Enrichment (0.1-0.5s)
  → Save to DB (encrypted) (0.5-2s)
  → SSE response

Problem: The entire pipeline runs inside your web server. Even when nobody's using it, the server is burning compute. When someone IS using it, 30-60s of blocking work ties up a server instance.


Target Architecture — The Offload Pattern

User Request → FastAPI (lightweight API, can be serverless)
  → DB Setup (stays here)
  → Enqueue job → Queue (Redis/SQS)
  → Return job_id immediately (SSE for progress)

Worker (on-demand, scales to zero) picks up job:
  → Phase 1: Gmail Search + Metadata (parallelized)
  → Phase 2: LLM Triage
  → Phase 3: PDF Download + Extraction (fan-out parallel)
  → Phase 4: LLM Extraction (high concurrency)
  → Phase 5: Dedup + Save
  → Push progress via SSE/webhook/polling

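The FastAPI half of this pattern (enqueue, return a job_id, let a worker drain the queue) is small. A minimal sketch using redis-py list operations — the queue key, payload fields, and helper names here are illustrative, not taken from the original app:

```python
import json
import uuid

QUEUE_KEY = "pipeline:jobs"  # illustrative queue name

def make_job(user_id: str) -> dict:
    """Build the payload the worker will pick up.
    Keep it small: IDs and references, never the email/PDF bytes themselves."""
    return {"job_id": uuid.uuid4().hex, "user_id": user_id, "status": "queued"}

def enqueue(redis_client, user_id: str) -> str:
    """FastAPI side: push the job and return the job_id immediately."""
    job = make_job(user_id)
    redis_client.lpush(QUEUE_KEY, json.dumps(job))
    return job["job_id"]

def worker_loop(redis_client, run_pipeline):
    """Worker side: block until a job arrives, then run the full pipeline."""
    while True:
        _key, raw = redis_client.brpop(QUEUE_KEY)
        job = json.loads(raw)
        run_pipeline(job["user_id"])
```

With Modal (below) this queue layer disappears entirely — `.spawn()` plays the role of `enqueue` and Modal's scheduler plays the role of `worker_loop`.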
Platform Comparison — Real Pricing (March 2026)

Assumptions for cost estimation:

  • ~100 pipeline runs/day (moderate usage)
  • Average 45s per run
  • 512MB-1GB memory needed
  • CPU-only (no GPU required)

1. Modal (⭐ RECOMMENDED for your use case)

What it is: Serverless Python functions. Scale to zero. No containers to manage.

Pricing:

| Resource  | Cost                                    |
|-----------|-----------------------------------------|
| CPU       | $0.0000131/core/sec (~$0.047/core/hour) |
| Memory    | $0.00000222/GiB/sec (~$0.008/GiB/hour)  |
| Free tier | $30/month credits (Starter plan)        |

Cost estimate for your workload:

  • 100 runs × 45s = ~4,500 core-seconds/day (at 1 core, 1 GiB)
  • Monthly: ~135,000 core-seconds
  • CPU: 135,000 × $0.0000131 = $1.77/month
  • Memory: 135,000 × $0.00000222 = $0.30/month
  • Total: ~$2.07/month (covered by free tier!)
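The arithmetic above is easy to check in a few lines. The rates come from the pricing table; runs/day and duration are the stated assumptions:

```python
# Modal pay-per-use rates, from the pricing table above
CPU_RATE = 0.0000131   # $/core/sec
MEM_RATE = 0.00000222  # $/GiB/sec

def monthly_cost(runs_per_day: int, secs_per_run: float,
                 cores: float = 1.0, gib: float = 1.0, days: int = 30) -> float:
    """Estimated monthly Modal bill for a CPU-only workload."""
    busy_secs = runs_per_day * secs_per_run * days
    return busy_secs * (cores * CPU_RATE + gib * MEM_RATE)

print(round(monthly_cost(100, 45), 2))  # → 2.07
```

Plugging in the heaviest row of the scaling table later (10,000 runs/day) gives ~$207/month, consistent with the ~$200/mo figure quoted there.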

Why it's perfect for you:

  • Pure Python — decorate functions with @app.function(), done
  • Native FastAPI integration with Function.lookup() + .spawn()
  • Built-in job queue semantics (spawn, poll, get result)
  • Scales to zero when nobody's running pipelines
  • Can fan out PDF downloads to parallel workers trivially
  • PyMuPDF, Groq SDK, etc. just work via Image.pip_install()

Architecture with Modal:

# modal_workers.py
import modal

image = modal.Image.debian_slim().pip_install(
    "google-api-python-client", "pymupdf", "groq", "cryptography"
)
app = modal.App("policy-pipeline", image=image)

@app.function(timeout=120)
def gmail_search(user_creds: dict, queries: list[str]) -> list:
    """Run all 13 Gmail queries in parallel"""
    import asyncio
    # ... parallel search logic
    return results

@app.function(timeout=300, concurrency_limit=10)
def download_and_extract_pdf(attachment_data: dict) -> dict:
    """Download single PDF + PyMuPDF extraction"""
    # ... download + extract
    return {"text": extracted_text, "metadata": meta}

@app.function(timeout=60)
def llm_triage(emails: list[dict]) -> list[dict]:
    """Batch triage via Groq"""
    # ... groq API call
    return triage_results

@app.function(timeout=60)
def llm_extract(documents: list[dict]) -> list[dict]:
    """LLM extraction via Groq/xAI — high concurrency"""
    return extracted_policies

@app.function(timeout=300)
def run_full_pipeline(user_id: str, vault_key: str):
    """Orchestrator — runs on Modal, fans out to sub-functions"""
    # Phase 1: Gmail (parallel)
    emails = gmail_search.remote(creds, queries)
    
    # Phase 2: Triage
    triaged = llm_triage.remote(new_emails)
    
    # Phase 3: PDF downloads (FAN OUT — the big win)
    pdf_results = list(download_and_extract_pdf.map(attachments))
    
    # Phase 4: LLM extraction
    policies = llm_extract.remote(pdf_results)
    
    # Phase 5: Save (could call back to your DB or use Modal's storage)
    save_results(policies)
    return {"status": "done", "count": len(policies)}

# In your FastAPI app (a separate file/process from the Modal workers)
@app.post("/api/refresh-policies")
async def refresh_policies(user: User):
    db_setup(user)  # stays local
    
    # Offload entire pipeline to Modal
    modal_fn = modal.Function.lookup("policy-pipeline", "run_full_pipeline")
    call = await modal_fn.spawn.aio(user.id, user.vault_key)
    
    return {"job_id": call.object_id, "status": "processing"}

@app.get("/api/job/{job_id}")
async def get_job_status(job_id: str):
    call = modal.FunctionCall.from_id(job_id)
    try:
        result = await call.get.aio(timeout=0)
        return {"status": "complete", "result": result}
    except TimeoutError:
        return {"status": "processing"}

2. AWS Lambda + SQS

Pricing:

| Resource      | Cost                            |
|---------------|---------------------------------|
| Requests      | $0.20/million                   |
| Compute (x86) | $0.0000166667/GB-second         |
| SQS           | $0.40/million requests          |
| Free tier     | 1M requests + 400K GB-sec/month |

Cost estimate:

  • 100 runs × 45s × 1GB = 4,500 GB-sec/day → 135,000 GB-sec/month
  • Compute: 135,000 GB-sec is well under the 400K GB-sec free tier → $0
  • Requests: ~3,000/month, well under the 1M free tier → $0
  • Total: $0/month at this workload level

Gotchas:

  • ⚠️ 15-minute max timeout — your full pipeline (up to ~100s) fits comfortably today, but the ceiling matters if runs grow
  • ⚠️ Cold starts add 1-3s (Python runtime)
  • ⚠️ Deployment is heavier — ZIP packages, layers for PyMuPDF, IAM roles
  • ⚠️ Can't easily fan out sub-tasks from within Lambda without Step Functions
  • ⚠️ Starting Aug 2025: cold start init phase is now billed

When to choose: If you're already on AWS and want $0 cost at this scale.


3. Google Cloud Run Jobs

Pricing:

| Resource  | Cost                                    |
|-----------|-----------------------------------------|
| CPU       | $0.000024/vCPU-sec (~$0.086/vCPU-hour)  |
| Memory    | $0.0000025/GiB-sec                      |
| Free tier | 240K vCPU-sec + 450K GiB-sec/month      |

Cost estimate:

  • 135,000 vCPU-seconds/month, under the 240K vCPU-sec free tier
  • Total: $0/month at this workload level

Gotchas:

  • More container-oriented (need Dockerfile)
  • No built-in fan-out like Modal's .map()
  • Good if you're already on GCP

4. Fly.io Machines

Pricing:

| Resource        | Cost                                |
|-----------------|-------------------------------------|
| CPU             | ~$0.0027/hour (shared, 256MB)       |
| Stops when idle | Yes (scale to zero)                 |
| Storage         | $0.15/GB/month (even when stopped)  |

Cost estimate:

  • 100 runs × 45s = 4,500s/day = 1.25 hours/day
  • Monthly: 37.5 hours × $0.0027 = ~$0.10/month

When to choose: If you want more control (SSH into machines, persistent volumes) and don't mind managing containers.


5. Railway (Background Worker)

Pricing:

| Resource         | Cost                                |
|------------------|-------------------------------------|
| CPU              | $0.00000772/vCPU-sec                |
| Memory           | $0.00000386/GB-sec                  |
| Included credits | $5/month (Hobby) or $20/month (Pro) |

Cost estimate:

  • Similar to Modal pricing: ~$2-3/month (covered by plan credits)

When to choose: If you want a nice dashboard and simple deploys. But workers are always-on (no true scale-to-zero for background services).


Cost Comparison Summary

| Platform         | Monthly Cost (100 runs/day) | Scale to Zero | Setup Effort                      | Fan-out Support          |
|------------------|-----------------------------|---------------|-----------------------------------|--------------------------|
| Modal            | ~$2 (free tier covers)      | ✅ Yes        | 🟢 Easy (Python decorators)       | ✅ Built-in `.map()`     |
| AWS Lambda       | $0 (free tier)              | ✅ Yes        | 🔴 Heavy (IAM, layers, packaging) | ⚠️ Needs Step Functions  |
| Cloud Run Jobs   | $0 (free tier)              | ✅ Yes        | 🟡 Medium (Dockerfile)            | ⚠️ Manual                |
| Fly.io           | ~$0.10                      | ✅ Yes        | 🟡 Medium (Dockerfile)            | ⚠️ Manual                |
| Railway          | ~$3 (credits cover)         | ❌ No         | 🟢 Easy                           | ⚠️ Manual                |
| Always-on server | $5-25/month                 | ❌ No         | Already done                      | ❌ N/A                   |

The Big Wins — What to Offload First

Priority 1: PDF Downloads (Step 3b) — 5-60s savings

This is where you get the most bang. Currently processing PDFs in batches of 5 sequentially.

With Modal fan-out:

@app.function(timeout=120)
def download_single_pdf(gmail_creds, attachment_id, message_id):
    # Download from Gmail API
    # Extract text with PyMuPDF
    return {"text": text, "pages": pages}

# In orchestrator:
# Instead of sequential batches of 5...
# .starmap unpacks each tuple into positional args (.map would pass the tuple as one arg)
results = list(download_single_pdf.starmap(
    [(creds, att_id, msg_id) for att_id, msg_id in attachments]
))
# All PDFs download in parallel across N workers!

Impact: 60s → 5-10s (limited only by slowest single PDF)
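The same fan-out effect can be demonstrated locally with a thread pool. This is a simulation, not Modal code: each fake download just sleeps, standing in for network + PyMuPDF work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(pdf_id: int) -> str:
    time.sleep(0.05)  # stand-in for Gmail download + PyMuPDF extraction
    return f"pdf-{pdf_id}"

pdfs = range(10)

t0 = time.perf_counter()
sequential = [fake_download(p) for p in pdfs]  # one after another
seq_time = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(fake_download, pdfs))  # all at once
par_time = time.perf_counter() - t0

print(f"sequential {seq_time:.2f}s, parallel {par_time:.2f}s")
# parallel wall time ≈ the slowest single item, not the sum of all items
```

Modal's `.map()` gives you the same shape of win, except each call runs in its own container, so you are not limited by one machine's threads or bandwidth.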

Priority 2: Gmail Search (Step 1a) — 4-12s savings

13 sequential queries → 13 parallel queries.

@app.function()
def single_gmail_search(creds, query):
    return gmail_api_search(creds, query)

# Fan out all 13 queries
all_results = list(single_gmail_search.starmap(  # starmap unpacks (creds, q) tuples
    [(creds, q) for q in queries]
))

Impact: 12s → 1-2s

Priority 3: LLM Extraction (Step 3c) — 2-5s savings

Already 3x concurrent. Push to 8-10x.

@app.function(concurrency_limit=10)
def extract_single_policy(document_text, model="groq"):
    return llm_extract(document_text)

Impact: 5s → 1-2s

Total Pipeline Time After Offload

| Phase          | Before | After            |
|----------------|--------|------------------|
| DB Setup       | 200ms  | 200ms (stays)    |
| Gmail Search   | 4-12s  | 1-2s             |
| Metadata Fetch | 3-8s   | 2-4s             |
| Triage         | 2-4s   | 2-4s (API-bound) |
| PDF Download   | 5-60s  | 3-8s             |
| LLM Extract    | 2-5s   | 1-2s             |
| Dedup + Save   | 1-3s   | 1-3s             |
| Total          | 30-60s | 10-23s           |

Implementation Plan — Step by Step

Phase 1: Set up Modal (Day 1)

  1. pip install modal
  2. modal setup (one-time auth)
  3. Create modal_workers.py with your pipeline functions
  4. Test locally with modal run modal_workers.py

Phase 2: Offload PDF Downloads First (Day 1-2)

  • Extract fetch_document_text() into a Modal function
  • Use .map() for parallel downloads
  • Test with real Gmail data
  • This alone cuts 20-50s off your pipeline

Phase 3: Parallelize Gmail Search (Day 2)

  • Extract fetch_email_metadata() queries into parallel Modal calls
  • Merge results back

Phase 4: Wire Up the Orchestrator (Day 2-3)

  • Create run_full_pipeline Modal function
  • Your FastAPI endpoint becomes a thin dispatcher
  • Implement job status polling or SSE forwarding
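For the SSE-forwarding option, the FastAPI side can wrap a status poller in an async generator. A sketch under assumptions: `poll` would wrap the `modal.FunctionCall` polling from the job-status endpoint earlier, and the names here are illustrative:

```python
import asyncio
import json
from typing import AsyncIterator, Callable

async def sse_progress(poll: Callable[[], dict],
                       interval: float = 1.0) -> AsyncIterator[str]:
    """Turn a poll-for-status callable into an SSE event stream.
    Yields one `data:` frame per poll until the job reaches a terminal state."""
    while True:
        status = poll()
        yield f"data: {json.dumps(status)}\n\n"
        if status.get("status") in ("complete", "failed"):
            return
        await asyncio.sleep(interval)

# In a FastAPI route this would be returned as:
#   StreamingResponse(sse_progress(poll), media_type="text/event-stream")
```

The browser's `EventSource` then receives one event per poll, so the frontend keeps its progress UI without holding a worker connection open.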

Phase 5: Scale Down Your Server (Day 3+)

  • Your FastAPI app is now just an API gateway
  • Can move to a cheaper/smaller instance
  • Or go fully serverless (Modal can host FastAPI too!)

Queue-Based Alternative (If You Want More Control)

If you prefer a traditional queue pattern over Modal's function-calling model:

FastAPI → Redis (Celery broker) → Worker Process

Stack: FastAPI + Celery + Redis + Fly.io (worker)

# celery_tasks.py
from celery import Celery

app = Celery('pipeline', broker='redis://...')

@app.task(bind=True, max_retries=3)
def process_pipeline(self, user_id, vault_key):
    try:
        emails = fetch_emails(user_id)
        triaged = triage_emails(emails)
        pdfs = download_pdfs(triaged)  # still sequential here
        policies = extract_policies(pdfs)
        save_policies(policies)
        return {"status": "done"}
    except Exception as exc:
        # raise so the task is re-queued for retry instead of returning None
        raise self.retry(exc=exc, countdown=60)

Costs:

  • Redis (Upstash serverless): Free tier → $0/month
  • Celery worker on Fly.io: ~$0.10/month (scale to zero)
  • More operational overhead than Modal

Decision Matrix — Which Path to Take

| Factor                      | Modal                | Lambda+SQS             | Celery+Redis             |
|-----------------------------|----------------------|------------------------|--------------------------|
| Your stack (Python/FastAPI) | ✅ Perfect fit       | ✅ Good                | ✅ Good                  |
| Fan-out parallelism         | ✅ `.map()` built-in | ⚠️ Need Step Functions | ⚠️ Need Celery groups    |
| Scale to zero               | ✅ Yes               | ✅ Yes                 | ⚠️ (worker needs to run) |
| Time to implement           | 1-2 days             | 3-5 days               | 2-3 days                 |
| Vendor lock-in              | Medium               | High (AWS)             | Low                      |
| Debugging                   | Good (web dashboard) | OK (CloudWatch)        | Good (Flower)            |
| Cost at your scale          | ~$0-2/mo             | $0/mo                  | ~$0-5/mo                 |

My Recommendation

Go with Modal. Here's why:

  1. You're a Python shop — Modal is literally "add a decorator, deploy"
  2. Fan-out is your biggest win — .map() for PDFs is the single highest-impact change
  3. $30 free credits/month covers your workload 10x over
  4. Zero infra to manage — no Docker, no IAM, no VPCs
  5. Migration is incremental — offload one function at a time, not a big-bang rewrite

The only reason to pick Lambda is if you're already deep in AWS and want the $0 free tier forever. But the implementation overhead is 3x more.


Scaling Projections

| Users/Day | Runs/Day | Modal Cost   | Lambda Cost  | Always-On Server |
|-----------|----------|--------------|--------------|------------------|
| 10        | 100      | $2/mo (free) | $0/mo (free) | $7-25/mo         |
| 50        | 500      | $10/mo       | $2/mo        | $15-50/mo        |
| 200       | 2,000    | $40/mo       | $8/mo        | $50-150/mo       |
| 1,000     | 10,000   | $200/mo      | $40/mo       | $200-500/mo      |

At every scale, on-demand workers cost less than an always-on server that's idle 95% of the time.


Generated March 2026. Prices sourced from official pricing pages.
