@PandaWhoCodes · Created March 16, 2026
Slashing LLM Token Costs for Insurance PDF Extraction — From $0.038 to $0.005/run (87% savings)

Context: A pipeline that fetches emails, triages them via LLM, downloads PDF attachments, extracts insurance policy details via LLM, and saves results. Currently spending ~$0.038/run on Groq Llama 4 Scout.


Current State — Where the Money Goes

Per Run (1 user):
  Triage  (15 batches):  95,630 input + 3,589 output  = $0.0117
  Extract (39 calls):   193,000 input + 14,000 output  = $0.0260
  ─────────────────────────────────────────────────────────────────
  TOTAL LLM:            288,630 input + 17,589 output   = $0.0377

  Model: Groq Llama 4 Scout — $0.11/M input, $0.34/M output
  Modal compute: $0.0018
  Everything else: free
  TOTAL: ~$0.04/run

Key insight: Extraction is 69% of LLM cost. That's the main target.
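The breakdown above is plain rate arithmetic; a quick sketch that reproduces it:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_per_m: float, output_per_m: float) -> float:
    """LLM cost for one run, given token counts and $-per-million-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Groq Llama 4 Scout rates: $0.11/M input, $0.34/M output
triage = run_cost(95_630, 3_589, 0.11, 0.34)
extract = run_cost(193_000, 14_000, 0.11, 0.34)
print(f"triage=${triage:.4f} extract=${extract:.4f} total=${triage + extract:.4f}")
# → triage=$0.0117 extract=$0.0260 total=$0.0377
```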


Part 1: Cheaper Model Alternatives (Drop-In Replacements)

Full Provider Comparison for Your Workload

Your workload per run: ~289K input tokens, ~18K output tokens.

| Provider / Model | Input $/M | Output $/M | Cost/Run | vs Current | Speed | Notes |
|---|---|---|---|---|---|---|
| Groq Llama 4 Scout (current) | $0.11 | $0.34 | $0.0377 | baseline | ⚡ Very fast | |
| Groq Llama 3.1 8B | $0.05 | $0.08 | $0.0159 | -58% | ⚡ Very fast | Smaller model, test quality |
| Groq Llama 4 Scout + caching | $0.055 | $0.34 | $0.0220 | -42% | ⚡ Very fast | Auto 50% off cached input |
| Groq Batch API (Scout) | $0.055 | $0.17 | $0.0189 | -50% | 🐌 24h delay | Non-realtime only |
| Groq Llama 3.1 8B + Batch | $0.025 | $0.04 | $0.0080 | -79% | 🐌 24h delay | Cheapest Groq option |
| Gemini 2.0 Flash | $0.10 | $0.40 | $0.0361 | -4% | 🚀 Fast | 250 free req/day! |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | $0.0361 | -4% | 🚀 Fast | Newer, replaces 2.0 Flash-Lite |
| Gemini 2.0 Flash (free tier) | $0.00 | $0.00 | $0.0000 | -100% | 🚀 Fast | 250 req/day limit |
| DeepSeek V3.2 (cache miss) | $0.28 | $0.42 | $0.0886 | +135% | 🐢 Slow | MORE expensive raw |
| DeepSeek V3.2 (cache hit) | $0.028 | $0.42 | $0.0157 | -58% | 🐢 Slow | 90% off repeated prefixes |
| Together Llama 3.1 8B Turbo | $0.18 | $0.18 | $0.0553 | +47% | 🏃 Fast | More expensive |
| Cerebras Llama 3.1 8B | $0.10 (blended) | | $0.0307 | -19% | ⚡ Very fast | 1M free tokens/day |
| Fireworks (entry tier) | $0.10 (blended) | | $0.0307 | -19% | 🏃 Fast | 50% off batch + cached |

Part 2: The Free/Near-Free Strategy

🏆 Best Approach: Hybrid Model Routing

Use different models for different phases based on task complexity:

TRIAGE (classification — easy task)
  → Groq Llama 3.1 8B ($0.05/$0.08 per M)
  → 95,630 × $0.05/M + 3,589 × $0.08/M = $0.0051

EXTRACTION (structured data from PDFs — harder task)
  → Gemini 2.0 Flash (free tier: 250 requests/day)
  → 39 calls per run = up to 6 free runs/day
  → Cost: $0.00

TOTAL: $0.0051/run (was $0.0377) → 86% SAVINGS
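In code, the routing is just a lookup table. A minimal sketch with the two model clients injected as plain callables (the names and wiring here are illustrative assumptions, not your pipeline's actual interfaces):

```python
from typing import Callable

LLMFn = Callable[[str], str]

def make_router(triage_llm: LLMFn, extraction_llm: LLMFn) -> Callable[[str, str], str]:
    """Dispatch each phase to the cheapest model that handles it well."""
    routes = {
        "triage": triage_llm,       # e.g. Groq llama-3.1-8b-instant
        "extract": extraction_llm,  # e.g. Gemini 2.0 Flash (free tier)
    }
    def route(phase: str, prompt: str) -> str:
        return routes[phase](prompt)
    return route

# Stand-in callables for illustration:
route = make_router(lambda p: "yes", lambda p: '{"policy_number": "ABC123"}')
print(route("triage", "Is this email about insurance?"))  # → yes
```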

Why This Works

Triage is classification — "Is this email about insurance?" An 8B model handles yes/no classification just as well as a 109B MoE model. You're massively overspending here.

Extraction benefits from Gemini Flash because:

  • Native PDF understanding (can process PDFs directly, no PyMuPDF needed)
  • 1M token context window (send entire documents)
  • Built-in structured output / JSON mode
  • 98-99% accuracy reported for receipt/document extraction
  • Free tier gives you 250 requests/day = enough for 6 pipeline runs

Scaling Beyond Free Tier

| Daily Runs | Extraction Calls | Gemini Free Covers? | Paid Overflow Cost |
|---|---|---|---|
| 1-6 | 39-234 | ✅ Yes (250/day) | $0.00 |
| 7-12 | 273-468 | ⚠️ Partial | ~$0.01-0.14/day |
| 13+ | 507+ | ❌ No | Use paid @ $0.10/$0.40 per M |

When you outgrow free tier, Gemini paid is roughly the same cost as Groq Scout anyway, so you lose nothing.
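The overflow math, sketched with per-call token figures derived from the earlier totals (193K input / 14K output across 39 calls ≈ 4,950 in / 360 out per call — these defaults are estimates):

```python
def gemini_overflow_cost(runs_per_day: int, calls_per_run: int = 39,
                         free_quota: int = 250,
                         in_tok: int = 4_950, out_tok: int = 360) -> float:
    """Daily cost of extraction calls past Gemini's free quota,
    at paid rates of $0.10/M input and $0.40/M output."""
    paid_calls = max(0, runs_per_day * calls_per_run - free_quota)
    return paid_calls * (in_tok * 0.10 + out_tok * 0.40) / 1_000_000

print(f"{gemini_overflow_cost(6):.4f}")  # 234 calls, under quota → 0.0000
print(f"{gemini_overflow_cost(7):.4f}")  # 23 paid calls → 0.0147
```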


Part 3: Reduce Token Count (Works with ANY Provider)

These optimizations stack with model switching. Apply them regardless of which model you use.

3a. Prompt Caching on Groq (FREE, automatic)

Groq auto-caches repeated prefixes at 50% discount. Your 39 extraction calls likely share:

  • Same system prompt
  • Same JSON schema definition
  • Same few-shot examples

If your system prompt is ~2,000 tokens and shared across 39 calls:

  • Currently: 39 × 2,000 = 78,000 tokens at full price
  • With caching: 1 × 2,000 full + 38 × 2,000 at 50% = 40,000 effective tokens
  • Saves ~$0.004/run for free (just by call ordering)
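As a sketch, that saving can be computed directly (assuming the 50% cached-input discount applies to the shared prefix on every call after the first):

```python
def cache_savings(calls: int, prefix_tokens: int, input_per_m: float = 0.11) -> float:
    """Dollars saved per run when (calls - 1) copies of the shared
    prefix are billed at half the normal input rate."""
    cached = (calls - 1) * prefix_tokens
    return cached * (input_per_m / 2) / 1_000_000

print(f"${cache_savings(39, 2_000):.4f} saved per run")  # 38 × 2,000 cached tokens
```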

How to maximize cache hits:

# BAD — different prefix each call
messages = [
    {"role": "system", "content": f"Extract from: {doc_name}..."},  # dynamic = cache miss
    {"role": "user", "content": pdf_text}
]

# GOOD — static prefix, dynamic content at the end
messages = [
    {"role": "system", "content": "You extract insurance policy details..."},  # static = cached
    {"role": "user", "content": f"Document: {doc_name}\n\n{pdf_text}"}  # dynamic at end
]

3b. Compress PDF Text Before Sending (40-60% token reduction)

Insurance PDFs are full of boilerplate. Pre-process to strip noise:

import re
from collections import Counter

def compress_pdf_text(raw_text: str) -> str:
    """Strip boilerplate from insurance PDF text before LLM extraction."""
    
    # Remove excessive whitespace
    text = re.sub(r'\n{3,}', '\n\n', raw_text)
    text = re.sub(r' {2,}', ' ', text)
    
    # Remove page headers/footers (lines repeated across pages)
    lines = text.split('\n')
    line_counts = Counter(line.strip() for line in lines if line.strip())
    repeated = {line for line, count in line_counts.items() if count > 2}
    text = '\n'.join(l for l in lines if l.strip() not in repeated)
    
    # Remove common insurance PDF noise
    noise_patterns = [
        r'Page \d+ of \d+',
        r'CIN:?\s*[A-Z0-9]+',
        r'IRDAI\s*Reg\.?\s*No\.?.*',
        r'Toll\s*Free.*\d{4,}',
        r'www\.\S+',
        r'^\s*\d+\s*$',  # lone page numbers
    ]
    for pattern in noise_patterns:
        # Apply each pattern to the running result, not a fresh copy
        text = re.sub(pattern, '', text, flags=re.MULTILINE | re.IGNORECASE)
    
    return text.strip()

Impact: Insurance PDFs typically have 30-50% boilerplate. This alone can cut your 193K extraction input to ~100-130K.

3c. Single-Pass Extraction (Reduce Call Count)

If you're making 39 separate LLM calls for 39 documents, consider:

Option A: Batch multiple small documents per call

# Instead of 39 calls with 1 doc each (~5K tokens per doc)
# Make 8 calls with 5 docs each

prompt = """Extract insurance details from each document below.
Return a JSON array with one object per document.

---DOCUMENT 1---
{doc1_text}

---DOCUMENT 2---  
{doc2_text}
...
"""

Impact: Reduces overhead of repeated system prompts. 39 calls → 8 calls = ~80% fewer system prompt tokens.
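A sketch of the packing step (the delimiter format mirrors the prompt above; the chunk size is a tunable assumption):

```python
def build_batched_prompts(docs: list[str], batch_size: int = 5) -> list[str]:
    """Pack several documents into one prompt each, amortizing the shared header."""
    header = ("Extract insurance details from each document below.\n"
              "Return a JSON array with one object per document.\n")
    prompts = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        body = "\n".join(f"---DOCUMENT {j + 1}---\n{text}"
                         for j, text in enumerate(batch))
        prompts.append(header + "\n" + body)
    return prompts

# 39 docs at 5 per call → 8 prompts instead of 39
prompts = build_batched_prompts([f"doc {n}" for n in range(39)])
print(len(prompts))  # → 8
```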

Option B: Two-stage extraction

# Stage 1: Quick scan — extract only key identifiers (cheap, fast)
quick_prompt = "Extract ONLY: policy_number, insurer_name, type from this text. JSON only."
# ~50 output tokens per doc vs ~360 currently

# Stage 2: Full extraction — only for documents that need it
# Skip already-cached policies, only process genuinely new ones
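The two-stage flow above can be fleshed out as follows; the stub LLM callables and the cache keyed on policy_number are assumptions for illustration:

```python
import asyncio
import json

async def two_stage_extract(doc_text: str, quick_llm, full_llm, cache: dict) -> dict:
    """Stage 1: cheap identifier scan; Stage 2: full extraction only for unseen policies."""
    quick_prompt = ("Extract ONLY: policy_number, insurer_name, type "
                    "from this text. JSON only.\n\n" + doc_text)
    ids = json.loads(await quick_llm(quick_prompt))   # ~50 output tokens
    key = ids.get("policy_number")
    if key in cache:
        return cache[key]              # already processed, skip the expensive call
    result = await full_llm(doc_text)  # full extraction, only for genuinely new docs
    cache[key] = result
    return result

# Demo with stub LLMs: the second call for the same policy hits the cache
async def demo():
    quick = lambda p: asyncio.sleep(0, '{"policy_number": "P1"}')
    full_calls = []
    async def full(t):
        full_calls.append(t)
        return {"policy_number": "P1", "premium": 12000}
    cache = {}
    await two_stage_extract("doc text", quick, full, cache)
    await two_stage_extract("doc text", quick, full, cache)
    print(len(full_calls))  # → 1

asyncio.run(demo())
```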

3d. Use Structured Output / JSON Mode

Both Groq and Gemini support constrained JSON output:

# Groq with JSON mode
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=messages,
    response_format={"type": "json_object"},  # forces valid JSON
    max_tokens=500,  # cap output — insurance details don't need 1000+ tokens
)

# Gemini with response schema
import google.generativeai as genai
model = genai.GenerativeModel('gemini-2.0-flash')
response = model.generate_content(
    prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=PolicySchema,  # enforces exact fields
    )
)

Impact: Constrained output = fewer wasted output tokens. Setting max_tokens=500 for extraction (vs unlimited) prevents verbose responses.

3e. Pre-Filter with Regex Before LLM (Skip Obviously Non-Insurance PDFs)

INSURANCE_KEYWORDS = {
    'policy', 'premium', 'sum assured', 'sum insured', 'nominee',
    'insured', 'coverage', 'claim', 'rider', 'endorsement',
    'maturity', 'surrender', 'annuity', 'deductible'
}

def is_likely_insurance(text: str, threshold: int = 3) -> bool:
    """Quick check before sending to LLM."""
    text_lower = text.lower()
    matches = sum(1 for kw in INSURANCE_KEYWORDS if kw in text_lower)
    return matches >= threshold

# In pipeline:
for doc in documents:
    if not is_likely_insurance(doc.text):
        continue  # skip — don't waste LLM tokens
    policy = await extract_policy(doc.text)

Impact: If even 10-20% of your 39 documents are non-insurance (bank statements, random PDFs), this saves those LLM calls entirely.


Part 4: The Nuclear Option — Partial Rule-Based Extraction

For common Indian insurance providers (LIC, HDFC Life, ICICI Pru, SBI Life, Max Life), policy documents follow predictable templates. You could extract key fields with regex/layout rules for known formats and only fall back to LLM for unknown formats:

KNOWN_EXTRACTORS = {
    'LIC': extract_lic_policy,      # regex-based
    'HDFC Life': extract_hdfc_policy,
    'ICICI Prudential': extract_icici_policy,
}

async def extract_policy(text: str) -> dict:
    # Try to identify insurer
    for insurer, extractor in KNOWN_EXTRACTORS.items():
        if insurer.lower() in text.lower():
            result = extractor(text)
            if result and result.get('policy_number'):
                return result  # no LLM needed!
    
    # Fall back to LLM for unknown formats
    return await llm_extract(text)

Impact: If 50% of policies are from known insurers, you cut LLM extraction calls in half. This is more work to build but costs literally $0 to run.


Part 5: Cost Projection at Scale

Per-Run Costs by Strategy

| Strategy | Triage Cost | Extraction Cost | Total/Run | Monthly (100 runs/day) |
|---|---|---|---|---|
| Current (Groq Scout) | $0.0117 | $0.0260 | $0.0377 | $113.10 |
| Model switch only (8B + Gemini free) | $0.0051 | $0.0000 | $0.0051 | $15.30 |
| + Prompt compression (30% less input) | $0.0036 | $0.0000 | $0.0036 | $10.80 |
| + Batched extraction (8 calls vs 39) | $0.0036 | $0.0000 | $0.0036 | $10.80 |
| + Groq caching (auto 50% off) | $0.0028 | $0.0000 | $0.0028 | $8.40 |
| + Regex pre-filter (20% skip) | $0.0023 | $0.0000 | $0.0023 | $6.90 |
| Full optimization (all above) | $0.0020 | $0.0000 | $0.0020 | $6.00 |

At Higher Scale (When Gemini Free Tier Runs Out)

| Daily Users | Runs/Day | Current Cost/Mo | Optimized Cost/Mo | Savings |
|---|---|---|---|---|
| 1-6 | 1-6 | $1-7 | $0.01-0.03 | ~99% |
| 10 | 10 | $11 | $0.20 | 98% |
| 50 | 50 | $57 | $5.10 | 91% |
| 100 | 100 | $113 | $12.50 | 89% |
| 500 | 500 | $566 | $68 | 88% |

At 500 runs/day you'd be making ~19,500 Gemini extraction calls/day (way past free tier), so extraction runs at $0.10/$0.40 per M paid. Triage stays on Groq 8B. Still massive savings.


Implementation Priority (Bang for Buck)

| # | Change | Effort | Savings | Do It? |
|---|---|---|---|---|
| 1 | Switch triage to Groq 8B | 5 min (change model name) | 58% on triage | ✅ NOW |
| 2 | Switch extraction to Gemini Flash | 2-3 hours (new client) | 100% on extraction (free tier) | ✅ NOW |
| 3 | Ensure Groq prompt caching works | 30 min (reorder prompts) | 10-15% on triage | ✅ Easy win |
| 4 | Add PDF text compression | 1-2 hours | 30-40% fewer input tokens | ✅ Good ROI |
| 5 | Set max_tokens on all calls | 5 min | Caps output waste | ✅ NOW |
| 6 | Batch extraction calls | 2-3 hours | Fewer calls, less overhead | 🟡 Medium |
| 7 | Regex pre-filter | 1 hour | Skip non-insurance docs | 🟡 Medium |
| 8 | Rule-based extractors for known insurers | 1-2 days | Eliminate LLM for 50%+ docs | 🔴 Later |

Quick Start — Minimum Viable Changes (30 Minutes)

# 1. Switch triage model (5 seconds)
TRIAGE_MODEL = "llama-3.1-8b-instant"  # was "llama-4-scout-17b-16e-instruct"

# 2. Set max_tokens on extraction
response = groq_client.chat.completions.create(
    model=EXTRACTION_MODEL,
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=800,  # insurance policy JSON doesn't need more
)

# 3. Add Gemini Flash for extraction (swap out Groq for extraction calls)
import json

import google.generativeai as genai
genai.configure(api_key="YOUR_GEMINI_API_KEY")  # free key from ai.google.dev

gemini_model = genai.GenerativeModel('gemini-2.0-flash')

async def extract_policy_gemini(pdf_text: str) -> dict:
    response = gemini_model.generate_content(
        f"Extract insurance policy details as JSON:\n\n{pdf_text}",
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            max_output_tokens=800,
        )
    )
    return json.loads(response.text)

That's it. Three changes. $0.038 → $0.005/run.


Research conducted March 2026. All prices from official provider pricing pages. Groq: groq.com/pricing | Gemini: ai.google.dev/pricing | DeepSeek: platform.deepseek.com
