Context: A pipeline that fetches emails, triages them via LLM, downloads PDF attachments, extracts insurance policy details via LLM, and saves results. Currently spending ~$0.038/run on Groq Llama 4 Scout.
Per Run (1 user):
Triage (15 batches): 95,630 input + 3,589 output = $0.0117
Extract (39 calls): 193,000 input + 14,000 output = $0.0260
─────────────────────────────────────────────────────────────────
TOTAL LLM: 288,630 input + 17,589 output = $0.0377
Model: Groq Llama 4 Scout — $0.11/M input, $0.34/M output
Modal compute: $0.0018
Everything else: free
TOTAL: ~$0.04/run
Key insight: Extraction is 69% of LLM cost. That's the main target.
Your workload per run: ~289K input tokens, ~18K output tokens.
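A quick sanity check on these figures, recomputed from the token counts and the Scout prices above:

```python
PRICES = {"input": 0.11, "output": 0.34}  # Groq Llama 4 Scout, $/M tokens

def llm_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one phase at the Scout per-million-token prices."""
    return (input_tokens * PRICES["input"] + output_tokens * PRICES["output"]) / 1_000_000

triage = llm_cost(95_630, 3_589)     # ≈ $0.0117
extract = llm_cost(193_000, 14_000)  # ≈ $0.0260
total = triage + extract             # ≈ $0.0377
```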
| Provider / Model | Input $/M | Output $/M | Cost/Run | vs Current | Speed | Notes |
|---|---|---|---|---|---|---|
| Groq Llama 4 Scout (current) | $0.11 | $0.34 | $0.0377 | baseline | ⚡ Very fast | — |
| Groq Llama 3.1 8B | $0.05 | $0.08 | $0.0159 | -58% | ⚡ Very fast | Smaller model, test quality |
| Groq Llama 4 Scout + caching | $0.055 | $0.34 | $0.0220 | -42% | ⚡ Very fast | Auto 50% off cached input |
| Groq Batch API (Scout) | $0.055 | $0.17 | $0.0189 | -50% | 🐌 24h delay | Non-realtime only |
| Groq Llama 3.1 8B + Batch | $0.025 | $0.04 | $0.0080 | -79% | 🐌 24h delay | Cheapest Groq option |
| Gemini 2.0 Flash | $0.10 | $0.40 | $0.0361 | -4% | 🚀 Fast | 250 free req/day! |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | $0.0361 | -4% | 🚀 Fast | Newer, replaces 2.0 Flash Lite |
| Gemini 2.0 Flash (free tier) | $0.00 | $0.00 | $0.0000 | -100% | 🚀 Fast | 250 req/day limit |
| DeepSeek V3.2 (cache miss) | $0.28 | $0.42 | $0.0886 | +135% | 🐢 Slow | MORE expensive raw |
| DeepSeek V3.2 (cache hit) | $0.028 | $0.42 | $0.0157 | -58% | 🐢 Slow | 90% off repeated prefixes |
| Together Llama 3.1 8B Turbo | $0.18 | $0.18 | $0.0553 | +47% | 🏃 Fast | More expensive |
| Cerebras Llama 3.1 8B | $0.10 (blended) | — | $0.0307 | -19% | ⚡ Very fast | 1M free tokens/day |
| Fireworks (entry tier) | $0.10 (blended) | — | $0.0307 | -19% | 🏃 Fast | 50% off batch + cached |
Use different models for different phases based on task complexity:
TRIAGE (classification — easy task)
→ Groq Llama 3.1 8B ($0.05/$0.08 per M)
→ 95,630 × $0.05/M + 3,589 × $0.08/M = $0.0051
EXTRACTION (structured data from PDFs — harder task)
→ Gemini 2.0 Flash (free tier: 250 requests/day)
→ 39 calls per run = up to 6 free runs/day
→ Cost: $0.00
TOTAL: $0.0051/run (was $0.0377) → 86% SAVINGS
Triage is classification — "Is this email about insurance?" An 8B model handles yes/no classification just as well as a 109B MoE model. You're massively overspending here.
Extraction benefits from Gemini Flash because:
- Native PDF understanding (can process PDFs directly, no PyMuPDF needed)
- 1M token context window (send entire documents)
- Built-in structured output / JSON mode
- 98-99% accuracy reported for receipt/document extraction
- Free tier gives you 250 requests/day = enough for 6 pipeline runs
| Daily Runs | Extraction Calls | Gemini Free Covers? | Paid Overflow Cost |
|---|---|---|---|
| 1-6 | 39-234 | ✅ Yes (250/day) | $0.00 |
| 7-12 | 273-468 | ⚠️ Partial | ~$0.01-0.14/day |
| 13+ | 507+ | ❌ No | Use paid @ $0.10/$0.40 per M |
When you outgrow free tier, Gemini paid is roughly the same cost as Groq Scout anyway, so you lose nothing.
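The overflow column can be sketched as a quick model, assuming each extraction call averages ~4,950 input and ~360 output tokens (the per-document figures implied by 193K/39 and 14K/39):

```python
FREE_CALLS_PER_DAY = 250
IN_PER_CALL, OUT_PER_CALL = 4_950, 360  # ≈ 193K/39 and 14K/39 (assumes an even split)
IN_PRICE, OUT_PRICE = 0.10, 0.40        # Gemini Flash paid tier, $/M tokens

def daily_overflow_cost(runs_per_day: int, calls_per_run: int = 39) -> float:
    """Dollars per day spent past the Gemini free tier."""
    paid_calls = max(0, runs_per_day * calls_per_run - FREE_CALLS_PER_DAY)
    return paid_calls * (IN_PER_CALL * IN_PRICE + OUT_PER_CALL * OUT_PRICE) / 1_000_000

# 6 runs/day = 234 calls, fully free; 8 runs/day = 312 calls, 62 paid ≈ $0.04/day
```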
These optimizations stack with model switching. Apply them regardless of which model you use.
Groq auto-caches repeated prefixes at 50% discount. Your 39 extraction calls likely share:
- Same system prompt
- Same JSON schema definition
- Same few-shot examples
If your system prompt is ~2,000 tokens and shared across 39 calls:
- Currently: 39 × 2,000 = 78,000 tokens at full price
- With caching: 1 × 2,000 full + 38 × 2,000 at 50% = 40,000 effective tokens
- Saves ~$0.004/run at Scout's $0.11/M input rate (~$0.002 at 8B pricing), for free, just by call ordering
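The arithmetic above as a snippet, assuming Scout's $0.11/M input price (swap in $0.05/M if the calls run on the 8B model):

```python
CALLS, PREFIX = 39, 2_000
INPUT_PRICE = 0.11  # $/M, Groq Llama 4 Scout; cached input is billed at 50%

full_price_tokens = CALLS * PREFIX               # 78,000 tokens at full price
cached = PREFIX + (CALLS - 1) * PREFIX // 2      # 40,000 effective tokens
savings = (full_price_tokens - cached) * INPUT_PRICE / 1_000_000  # ≈ $0.0042/run
```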
How to maximize cache hits:
```python
# BAD — different prefix each call
messages = [
    {"role": "system", "content": f"Extract from: {doc_name}..."},  # dynamic = cache miss
    {"role": "user", "content": pdf_text},
]

# GOOD — static prefix, dynamic content at the end
messages = [
    {"role": "system", "content": "You extract insurance policy details..."},  # static = cached
    {"role": "user", "content": f"Document: {doc_name}\n\n{pdf_text}"},  # dynamic at end
]
```

Insurance PDFs are full of boilerplate. Pre-process to strip noise:
```python
import re
from collections import Counter

def compress_pdf_text(raw_text: str) -> str:
    """Strip boilerplate from insurance PDF text before LLM extraction."""
    # Collapse excessive whitespace
    text = re.sub(r'\n{3,}', '\n\n', raw_text)
    text = re.sub(r' {2,}', ' ', text)

    # Remove page headers/footers: lines repeated more than twice are
    # almost always per-page boilerplate
    lines = text.split('\n')
    line_counts = Counter(line.strip() for line in lines if line.strip())
    repeated = {line for line, count in line_counts.items() if count > 2}
    text = '\n'.join(l for l in lines if l.strip() not in repeated)

    # Remove common insurance PDF noise
    noise_patterns = [
        r'Page \d+ of \d+',
        r'CIN:?\s*[A-Z0-9]+',
        r'IRDAI\s*Reg\.?\s*No\.?.*',
        r'Toll\s*Free.*\d{4,}',
        r'www\.\S+',
        r'^\s*\d+\s*$',  # lone page numbers
    ]
    for pattern in noise_patterns:
        text = re.sub(pattern, '', text, flags=re.MULTILINE | re.IGNORECASE)
    return text.strip()
```

Impact: Insurance PDFs typically have 30-50% boilerplate. This alone can cut your 193K extraction input to ~100-130K.
If you're making 39 separate LLM calls for 39 documents, consider:
Option A: Batch multiple small documents per call
```python
# Instead of 39 calls with 1 doc each (~5K tokens per doc),
# make 8 calls with 5 docs each
prompt = """Extract insurance details from each document below.
Return a JSON array with one object per document.

---DOCUMENT 1---
{doc1_text}

---DOCUMENT 2---
{doc2_text}
...
"""
```

Impact: Reduces overhead of repeated system prompts. 39 calls → 8 calls = ~80% fewer system prompt tokens.
Option B: Two-stage extraction
```python
# Stage 1: Quick scan — extract only key identifiers (cheap, fast)
quick_prompt = "Extract ONLY: policy_number, insurer_name, type from this text. JSON only."
# ~50 output tokens per doc vs ~360 currently

# Stage 2: Full extraction — only for documents that need it.
# Skip already-cached policies, only process genuinely new ones.
```

Both Groq and Gemini support constrained JSON output:
```python
# Groq with JSON mode
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=messages,
    response_format={"type": "json_object"},  # forces valid JSON
    max_tokens=500,  # cap output — insurance details don't need 1000+ tokens
)

# Gemini with response schema
import google.generativeai as genai

model = genai.GenerativeModel('gemini-2.0-flash')
response = model.generate_content(
    prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=PolicySchema,  # enforces exact fields
    ),
)
```

Impact: Constrained output = fewer wasted output tokens. Setting max_tokens=500 for extraction (vs unlimited) prevents verbose responses.
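`PolicySchema` isn't defined anywhere in this doc; one way to express it, assuming a minimal field set, is a `TypedDict`, which the `google-generativeai` SDK accepts as a `response_schema`:

```python
from typing import TypedDict

class PolicySchema(TypedDict):
    """Assumed field set — adjust to whatever your pipeline actually stores."""
    policy_number: str
    insurer_name: str
    policy_type: str
    sum_assured: str
    premium: str
    nominee: str
```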
```python
INSURANCE_KEYWORDS = {
    'policy', 'premium', 'sum assured', 'sum insured', 'nominee',
    'insured', 'coverage', 'claim', 'rider', 'endorsement',
    'maturity', 'surrender', 'annuity', 'deductible',
}

def is_likely_insurance(text: str, threshold: int = 3) -> bool:
    """Quick check before sending to LLM."""
    text_lower = text.lower()
    matches = sum(1 for kw in INSURANCE_KEYWORDS if kw in text_lower)
    return matches >= threshold

# In pipeline:
for doc in documents:
    if not is_likely_insurance(doc.text):
        continue  # skip — don't waste LLM tokens
    policy = await extract_policy(doc.text)
```

Impact: If even 10-20% of your 39 documents are non-insurance (bank statements, random PDFs), this saves those LLM calls entirely.
For common Indian insurance providers (LIC, HDFC Life, ICICI Pru, SBI Life, Max Life), policy documents follow predictable templates. You could extract key fields with regex/layout rules for known formats and only fall back to LLM for unknown formats:
```python
KNOWN_EXTRACTORS = {
    'LIC': extract_lic_policy,  # regex-based
    'HDFC Life': extract_hdfc_policy,
    'ICICI Prudential': extract_icici_policy,
}

async def extract_policy(text: str) -> dict:  # async, since the LLM fallback is awaited
    # Try to identify the insurer
    for insurer, extractor in KNOWN_EXTRACTORS.items():
        if insurer.lower() in text.lower():
            result = extractor(text)
            if result and result.get('policy_number'):
                return result  # no LLM needed!
    # Fall back to the LLM for unknown formats
    return await llm_extract(text)
```

Impact: If 50% of policies are from known insurers, you cut LLM extraction calls in half. This is more work to build but costs literally $0 to run.
| Strategy | Triage Cost | Extraction Cost | Total/Run | Monthly (100 runs/day) |
|---|---|---|---|---|
| Current (Groq Scout) | $0.0117 | $0.0260 | $0.0377 | $113.10 |
| Model switch only (8B + Gemini free) | $0.0051 | $0.0000 | $0.0051 | $15.30 |
| + Prompt compression (30% less input) | $0.0036 | $0.0000 | $0.0036 | $10.80 |
| + Batched extraction (8 calls vs 39) | $0.0036 | $0.0000 | $0.0036 | $10.80 |
| + Groq caching (auto 50% off) | $0.0028 | $0.0000 | $0.0028 | $8.40 |
| + Regex pre-filter (20% skip) | $0.0023 | $0.0000 | $0.0023 | $6.90 |
| Full optimization (all above) | $0.0020 | $0.0000 | $0.0020 | $6.00 |
| Daily Users | Runs/Day | Current Cost/Mo | Optimized Cost/Mo | Savings |
|---|---|---|---|---|
| 1-6 | 1-6 | $1-7 | $0.01-0.03 | ~99% |
| 10 | 10 | $11 | $0.20 | 98% |
| 50 | 50 | $57 | $5.10 | 91% |
| 100 | 100 | $113 | $12.50 | 89% |
| 500 | 500 | $566 | $68 | 88% |
At 500 runs/day you'd be making ~19,500 Gemini extraction calls/day (way past free tier), so extraction runs at $0.10/$0.40 per M paid. Triage stays on Groq 8B. Still massive savings.
| # | Change | Effort | Savings | Do It? |
|---|---|---|---|---|
| 1 | Switch triage to Groq 8B | 5 min (change model name) | 58% on triage | ✅ NOW |
| 2 | Switch extraction to Gemini Flash | 2-3 hours (new client) | 100% on extraction (free tier) | ✅ NOW |
| 3 | Ensure Groq prompt caching works | 30 min (reorder prompts) | 10-15% on triage | ✅ Easy win |
| 4 | Add PDF text compression | 1-2 hours | 30-40% fewer input tokens | ✅ Good ROI |
| 5 | Set max_tokens on all calls | 5 min | Caps output waste | ✅ NOW |
| 6 | Batch extraction calls | 2-3 hours | Fewer calls, less overhead | 🟡 Medium |
| 7 | Regex pre-filter | 1 hour | Skip non-insurance docs | 🟡 Medium |
| 8 | Rule-based extractors for known insurers | 1-2 days | Eliminate LLM for 50%+ docs | 🔴 Later |
```python
# 1. Switch triage model (5 seconds)
TRIAGE_MODEL = "llama-3.1-8b-instant"  # was "llama-4-scout-17b-16e-instruct"

# 2. Set max_tokens on extraction
response = groq_client.chat.completions.create(
    model=EXTRACTION_MODEL,
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=800,  # insurance policy JSON doesn't need more
)

# 3. Add Gemini Flash for extraction (swap out Groq for extraction calls)
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # free key from ai.google.dev
gemini_model = genai.GenerativeModel('gemini-2.0-flash')

async def extract_policy_gemini(pdf_text: str) -> dict:
    response = gemini_model.generate_content(
        f"Extract insurance policy details as JSON:\n\n{pdf_text}",
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            max_output_tokens=800,
        ),
    )
    return json.loads(response.text)
```

That's it. Three changes. $0.038 → $0.005/run.
Research conducted March 2026. All prices from official provider pricing pages. Groq: groq.com/pricing | Gemini: ai.google.dev/pricing | DeepSeek: platform.deepseek.com