Context: A FastAPI app running a sequential email→triage→PDF→LLM→save pipeline (30-60s per run). Goal: cut always-on compute costs and scale individual bottlenecks independently.
```
User Request → FastAPI (always-on server)
  → DB Setup (50-200ms)
  → Gmail Search — 13 sequential queries (4-12s)
  → Metadata Fetch — batches of 25 (3-8s)
  → LLM Triage — batches of 30 via Groq (2-4s)
  → PDF Download + PyMuPDF extraction (5-60s)  ← BIGGEST BOTTLENECK
  → LLM Extraction — 3 concurrent Groq/xAI (2-5s)
  → Dedup + Enrichment (0.1-0.5s)
  → Save to DB (encrypted) (0.5-2s)
  → SSE response
```
Problem: The entire pipeline runs inside your web server. Even when nobody's using it, the server is burning compute. When someone IS using it, 30-60s of blocking work ties up a server instance.
```
User Request → FastAPI (lightweight API, can be serverless)
  → DB Setup (stays here)
  → Enqueue job → Queue (Redis/SQS)
  → Return job_id immediately (SSE for progress)

Worker (on-demand, scales to zero) picks up job:
  → Phase 1: Gmail Search + Metadata (parallelized)
  → Phase 2: LLM Triage
  → Phase 3: PDF Download + Extraction (fan-out parallel)
  → Phase 4: LLM Extraction (high concurrency)
  → Phase 5: Dedup + Save
  → Push progress via SSE/webhook/polling
```
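The enqueue-and-return contract above can be sketched with nothing but the standard library. This is an illustration of the semantics only, not a production queue; real deployments use Redis/SQS, and the worker runs in a separate process:

```python
import queue
import uuid

jobs: dict[str, dict] = {}    # job_id -> status/result (in prod: a DB or Redis)
q: queue.Queue = queue.Queue()

def enqueue(user_id: str) -> str:
    """API side: register the job, push it, return immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "processing"}
    q.put((job_id, user_id))
    return job_id

def worker() -> None:
    """Worker side: drain the queue and run the (stubbed) pipeline."""
    while not q.empty():
        job_id, user_id = q.get()
        # ... the 30-60s pipeline would run here, off the request path ...
        jobs[job_id] = {"status": "complete", "result": f"policies for {user_id}"}

job_id = enqueue("user-42")    # returns in microseconds, not 30-60s
worker()                       # in production this runs on a separate machine
print(jobs[job_id]["status"])  # complete
```

The API handler only pays the cost of a dict write and a queue push; everything slow happens on the worker's clock.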
Workload assumptions:
- ~100 pipeline runs/day (moderate usage)
- Average 45s per run
- 512MB-1GB memory needed
- CPU-only (no GPU required)
Modal

What it is: Serverless Python functions. Scale to zero. No containers to manage.
Pricing:
| Resource | Cost |
|---|---|
| CPU | $0.0000131/core/sec (~$0.047/core/hour) |
| Memory | $0.00000222/GiB/sec (~$0.008/GiB/hour) |
| Free tier | $30/month credits (Starter plan) |
Cost estimate for your workload:
- 100 runs × 45s at 1 core / 1 GiB = 4,500 core-seconds (and 4,500 GiB-seconds) per day
- Monthly: ~135,000 core-seconds
- CPU: 135,000 × $0.0000131 = $1.77/month
- Memory: 135,000 × $0.00000222 = $0.30/month
- Total: ~$2.07/month (covered by free tier!)
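The estimate above is straightforward to recompute from the table rates. A quick sketch (rates are the point-in-time figures quoted in this document, not a live pricing source):

```python
RUNS_PER_DAY = 100
SECONDS_PER_RUN = 45
DAYS_PER_MONTH = 30

def monthly_unit_seconds() -> int:
    """Core-seconds (equivalently, GiB-seconds at 1 GiB) per month."""
    return RUNS_PER_DAY * SECONDS_PER_RUN * DAYS_PER_MONTH

def modal_monthly_cost(cpu_rate: float = 0.0000131,
                       mem_rate: float = 0.00000222) -> float:
    """Modal cost at 1 core + 1 GiB, using the per-second rates above."""
    secs = monthly_unit_seconds()
    return secs * cpu_rate + secs * mem_rate

print(monthly_unit_seconds())           # 135000
print(round(modal_monthly_cost(), 2))   # 2.07
```

The same 135,000 figure doubles as Lambda GB-seconds at 1 GB memory, which is why the Lambda estimate later lands inside its 400K GB-second free tier.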
Why it's perfect for you:
- Pure Python — decorate functions with `@app.function()`, done
- Native FastAPI integration with `Function.lookup()` + `.spawn()`
- Built-in job queue semantics (spawn, poll, get result)
- Scales to zero when nobody's running pipelines
- Can fan out PDF downloads to parallel workers trivially
- PyMuPDF, Groq SDK, etc. just work via `Image.pip_install()`
Architecture with Modal:
```python
# modal_workers.py
import modal

image = modal.Image.debian_slim().pip_install(
    "google-api-python-client", "pymupdf", "groq", "cryptography"
)
app = modal.App("policy-pipeline", image=image)

@app.function(timeout=120)
def gmail_search(user_creds: dict, queries: list[str]) -> list:
    """Run all 13 Gmail queries in parallel"""
    import asyncio
    # ... parallel search logic
    return results

@app.function(timeout=300, concurrency_limit=10)
def download_and_extract_pdf(attachment_data: dict) -> dict:
    """Download single PDF + PyMuPDF extraction"""
    # ... download + extract
    return {"text": extracted_text, "metadata": meta}

@app.function(timeout=60)
def llm_triage(emails: list[dict]) -> list[dict]:
    """Batch triage via Groq"""
    # ... groq API call
    return triage_results

@app.function(timeout=60)
def llm_extract(documents: list[dict]) -> list[dict]:
    """LLM extraction via Groq/xAI — high concurrency"""
    # ... extraction logic
    return extracted_policies

@app.function(timeout=300)
def run_full_pipeline(user_id: str, vault_key: str):
    """Orchestrator — runs on Modal, fans out to sub-functions"""
    # Phase 1: Gmail (parallel)
    emails = gmail_search.remote(creds, queries)
    # Phase 2: Triage
    triaged = llm_triage.remote(new_emails)
    # Phase 3: PDF downloads (FAN OUT — the big win)
    pdf_results = list(download_and_extract_pdf.map(attachments))
    # Phase 4: LLM extraction
    policies = llm_extract.remote(pdf_results)
    # Phase 5: Save (could call back to your DB or use Modal's storage)
    save_results(policies)
    return {"status": "done", "count": len(policies)}
```

```python
# In your FastAPI app
@app.post("/api/refresh-policies")
async def refresh_policies(user: User):
    db_setup(user)  # stays local
    # Offload entire pipeline to Modal
    modal_fn = modal.Function.lookup("policy-pipeline", "run_full_pipeline")
    call = await modal_fn.spawn.aio(user.id, user.vault_key)
    return {"job_id": call.object_id, "status": "processing"}

@app.get("/api/job/{job_id}")
async def get_job_status(job_id: str):
    call = modal.FunctionCall.from_id(job_id)
    try:
        result = await call.get.aio(timeout=0)
        return {"status": "complete", "result": result}
    except TimeoutError:
        return {"status": "processing"}
```

AWS Lambda + SQS

Pricing:
| Resource | Cost |
|---|---|
| Requests | $0.20/million |
| Compute (x86) | $0.0000166667/GB-second |
| SQS | $0.40/million requests |
| Free tier | 1M requests + 400K GB-sec/month |
Cost estimate:
- 100 runs × 45s × 1GB = 4,500 GB-sec/day → 135,000 GB-sec/month
- 135,000 GB-sec is well under the 400K GB-sec free tier → $0
- Requests: 3,000/month = FREE
- Total: $0/month for this workload level
Gotchas:
- ⚠️ 15-minute max timeout — plenty for today's ~100s worst-case pipeline, but a hard ceiling if runs grow
- ⚠️ Cold starts add 1-3s (Python runtime)
- ⚠️ Deployment is heavier — ZIP packages, layers for PyMuPDF, IAM roles
- ⚠️ Can't easily fan out sub-tasks from within Lambda without Step Functions
- ⚠️ Starting Aug 2025: the cold-start init phase is now billed
When to choose: If you're already on AWS and want $0 cost at this scale.
Google Cloud Run Jobs

Pricing:
| Resource | Cost |
|---|---|
| CPU | $0.000024/vCPU-sec (~$0.086/vCPU-hour) |
| Memory | $0.0000025/GiB-sec |
| Free tier | 240K vCPU-sec + 450K GiB-sec/month |
Cost estimate:
- 135,000 vCPU-seconds (and 135,000 GiB-seconds)/month
- Under the 240K vCPU-sec + 450K GiB-sec free tier → $0
- Total: $0/month for this workload level
Gotchas:
- More container-oriented (need Dockerfile)
- No built-in fan-out like Modal's `.map()`
- Good if you're already on GCP
Fly.io Machines

Pricing:
| Resource | Cost |
|---|---|
| CPU | ~$0.0027/hour (shared 256MB) |
| Stops when idle | Yes (scale to zero) |
| Storage | $0.15/GB/month (even when stopped) |
Cost estimate:
- 100 runs × 45s = 4,500s/day = 1.25 hours/day
- Monthly: 37.5 hours × $0.0027 = **$0.10/month**
When to choose: If you want more control (SSH into machines, persistent volumes) and don't mind managing containers.
Railway

Pricing:
| Resource | Cost |
|---|---|
| CPU | $0.00000772/vCPU-sec |
| Memory | $0.00000386/GB-sec |
| Included credits | $5/month (Hobby) or $20/month (Pro) |
Cost estimate:
- Similar to Modal pricing: ~$2-3/month (covered by plan credits)
When to choose: If you want a nice dashboard and simple deploys. But workers are always-on (no true scale-to-zero for background services).
| Platform | Monthly Cost (100 runs/day) | Scale to Zero | Setup Effort | Fan-out Support |
|---|---|---|---|---|
| Modal | ~$2 (free tier covers) | ✅ Yes | 🟢 Easy (Python decorators) | ✅ Built-in `.map()` |
| AWS Lambda | $0 (free tier) | ✅ Yes | 🔴 Heavy (IAM, layers, packaging) | ⚠️ Needs Step Functions |
| Cloud Run Jobs | $0 (free tier) | ✅ Yes | 🟡 Medium (Dockerfile) | ❌ None built-in |
| Fly.io | ~$0.10 | ✅ Yes | 🟡 Medium (Dockerfile) | ⚠️ Manual (spawn machines) |
| Railway | ~$3 (credits cover) | ❌ No | 🟢 Easy | ❌ None built-in |
| Always-on server | $5-25/month | ❌ No | Already done | ❌ N/A |
This is where you get the most bang. Currently processing PDFs in batches of 5 sequentially.
With Modal fan-out:
```python
@app.function(timeout=120)
def download_single_pdf(gmail_creds, attachment_id, message_id):
    # Download from Gmail API
    # Extract text with PyMuPDF
    return {"text": text, "pages": pages}

# In orchestrator:
# Instead of sequential batches of 5, fan out one call per attachment.
# Note: Modal's .map() zips multiple iterables, so for a multi-argument
# function, .starmap() over tuples is the right call.
results = list(download_single_pdf.starmap(
    [(creds, att_id, msg_id) for att_id, msg_id in attachments]
))
# All PDFs download in parallel across N workers!
```

Impact: 60s → 5-10s (limited only by the slowest single PDF)
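If you want to prototype the fan-out locally before wiring up Modal, the same shape works with a thread pool, since PDF downloads are I/O-bound. A sketch; `download_one` is a stand-in for the real Gmail download + PyMuPDF extraction:

```python
from concurrent.futures import ThreadPoolExecutor

def download_one(att_id: str) -> dict:
    # Stand-in for: fetch attachment from Gmail API, extract text with PyMuPDF.
    return {"id": att_id, "text": f"extracted-{att_id}"}

def fan_out(attachment_ids: list[str], max_workers: int = 10) -> list[dict]:
    # I/O-bound work: threads overlap the network waits, so wall time
    # approaches the slowest single download instead of the sum of all.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, attachment_ids))

print(len(fan_out(["a1", "a2", "a3"])))  # 3
```

`pool.map` preserves input order, which keeps downstream dedup logic simple; the Modal version gives you the same shape but spreads the work across machines.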
13 sequential queries → 13 parallel queries.
```python
@app.function()
def single_gmail_search(creds, query):
    return gmail_api_search(creds, query)

# Fan out all 13 queries (starmap: one tuple per call)
all_results = list(single_gmail_search.starmap(
    [(creds, q) for q in queries]
))
```

Impact: 12s → 1-2s
Already 3x concurrent. Push to 8-10x.
```python
@app.function(concurrency_limit=10)
def extract_single_policy(document_text, model="groq"):
    return llm_extract(document_text)
```

Impact: 5s → 1-2s
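If this phase stays in your own process instead of Modal, the same concurrency cap can be expressed with plain asyncio. A sketch; `call_llm` is a stand-in for the real Groq/xAI request:

```python
import asyncio

async def call_llm(doc: str) -> str:
    # Stand-in for the real Groq/xAI API call.
    await asyncio.sleep(0)
    return f"policy:{doc}"

async def extract_all(docs: list[str], limit: int = 10) -> list[str]:
    # The semaphore caps in-flight requests at `limit`, mirroring
    # concurrency_limit=10 on the Modal side; results keep input order.
    sem = asyncio.Semaphore(limit)

    async def one(doc: str) -> str:
        async with sem:
            return await call_llm(doc)

    return await asyncio.gather(*(one(d) for d in docs))

print(asyncio.run(extract_all(["d1", "d2", "d3"])))  # ['policy:d1', 'policy:d2', 'policy:d3']
```

Pushing `limit` from 3 to 8-10 is safe as long as it stays under your provider's rate limits.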
| Phase | Before | After |
|---|---|---|
| DB Setup | 200ms | 200ms (stays) |
| Gmail Search | 4-12s | 1-2s |
| Metadata Fetch | 3-8s | 2-4s |
| Triage | 2-4s | 2-4s (API-bound) |
| PDF Download | 5-60s | 3-8s |
| LLM Extract | 2-5s | 1-2s |
| Dedup + Save | 1-3s | 1-3s |
| Total | 30-60s | 10-23s |
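The totals row follows from summing the per-phase ranges; a quick check of the "After" column:

```python
# Per-phase (low, high) timings in seconds after the migration,
# taken from the table above.
after = {
    "db_setup":    (0.2, 0.2),
    "gmail":       (1, 2),
    "metadata":    (2, 4),
    "triage":      (2, 4),
    "pdf":         (3, 8),
    "llm_extract": (1, 2),
    "dedup_save":  (1, 3),
}

low = sum(lo for lo, _ in after.values())
high = sum(hi for _, hi in after.values())
print(round(low, 1), round(high, 1))  # 10.2 23.2 → the table's "10-23s"
```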
Step 1: Setup
- `pip install modal`, then `modal setup` (one-time auth)
- Create `modal_workers.py` with your pipeline functions
- Test locally with `modal run modal_workers.py`

Step 2: Offload PDF processing first
- Extract `fetch_document_text()` into a Modal function
- Use `.map()` for parallel downloads
- Test with real Gmail data
- This alone cuts 20-50s off your pipeline

Step 3: Parallelize metadata fetch
- Extract `fetch_email_metadata()` queries into parallel Modal calls
- Merge results back

Step 4: Move the orchestrator
- Create a `run_full_pipeline` Modal function
- Your FastAPI endpoint becomes a thin dispatcher
- Implement job status polling or SSE forwarding

Step 5: Shrink the server
- Your FastAPI app is now just an API gateway
- Can move to a cheaper/smaller instance
- Or go fully serverless (Modal can host FastAPI too!)
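For the SSE-forwarding piece, one lightweight approach is a generator that polls job status and yields SSE-formatted frames. A transport-agnostic sketch: `poll_status` is a stand-in for the Modal `FunctionCall` lookup, and the generator can be handed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`:

```python
import json
from typing import Callable, Iterator

def sse_events(poll_status: Callable[[], dict]) -> Iterator[str]:
    """Yield SSE frames until the job reports a terminal status.

    poll_status() returns e.g. {"status": "processing"} or
    {"status": "complete", "result": ...}.
    """
    while True:
        state = poll_status()
        yield f"data: {json.dumps(state)}\n\n"  # SSE frame: data line + blank line
        if state["status"] in ("complete", "failed"):
            return
        # Real code would sleep between polls (or await an async poll).

# Simulated job that completes on the third poll:
states = iter([{"status": "processing"}, {"status": "processing"},
               {"status": "complete", "result": {"count": 7}}])
frames = list(sse_events(lambda: next(states)))
print(len(frames))  # 3
```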
If you prefer a traditional queue pattern over Modal's function-calling model:
FastAPI → Redis broker (Celery) → Worker Process
Stack: FastAPI + Celery + Redis + Fly.io (worker)
```python
# celery_tasks.py
from celery import Celery

app = Celery('pipeline', broker='redis://...')

@app.task(bind=True, max_retries=3)
def process_pipeline(self, user_id, vault_key):
    try:
        emails = fetch_emails(user_id)
        triaged = triage_emails(emails)
        pdfs = download_pdfs(triaged)  # still sequential here
        policies = extract_policies(pdfs)
        save_policies(policies)
        return {"status": "done"}
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)
```

Costs:
- Redis (Upstash serverless): Free tier → $0/month
- Celery worker on Fly.io: ~$0.10/month (scale to zero)
- More operational overhead than Modal
| Factor | Modal | Lambda+SQS | Celery+Redis |
|---|---|---|---|
| Your stack (Python/FastAPI) | ✅ Perfect fit | ✅ Good | ✅ Good |
| Fan-out parallelism | ✅ `.map()` built-in | ⚠️ Needs Step Functions | ⚠️ Manual (groups/chords) |
| Scale to zero | ✅ | ✅ | ⚠️ Depends on worker host |
| Time to implement | 1-2 days | 3-5 days | 2-3 days |
| Vendor lock-in | Medium | High (AWS) | Low |
| Debugging | Good (web dashboard) | OK (CloudWatch) | Good (Flower) |
| Cost at your scale | ~$0-2/mo | $0/mo | ~$0-5/mo |
Go with Modal. Here's why:
- You're a Python shop — Modal is literally "add a decorator, deploy"
- Fan-out is your biggest win — `.map()` for PDFs is the single highest-impact change
- $30 free credits/month covers your workload 10x over
- Zero infra to manage — no Docker, no IAM, no VPCs
- Migration is incremental — offload one function at a time, not a big-bang rewrite

The only reason to pick Lambda is if you're already deep in AWS and want the $0 free tier forever. But the implementation overhead is roughly 3x.
| Users/Day | Runs/Day | Modal Cost | Lambda Cost | Always-On Server |
|---|---|---|---|---|
| 10 | 100 | $2/mo (free) | $0/mo (free) | $7-25/mo |
| 50 | 500 | $10/mo | $2/mo | $15-50/mo |
| 200 | 2,000 | $40/mo | $8/mo | $50-150/mo |
| 1,000 | 10,000 | $200/mo | $40/mo | $200-500/mo |
At every scale, on-demand workers cost less than an always-on server that's idle 95% of the time.
Generated March 2026. Prices sourced from official pricing pages.