@PandaWhoCodes · Created March 16, 2026
Slashing LLM Token Costs for Insurance PDF Extraction — From $0.038 to $0.005/run (87% savings)

Context: A pipeline that fetches emails, triages them via LLM, downloads PDF attachments, extracts insurance policy details via LLM, and saves results. Currently spending ~$0.038/run on Groq Llama 4 Scout.


Current State — Where the Money Goes

Per Run (1 user):
  Triage  (15 batches):  95,630 input + 3,589 output  = $0.0117
  Extract (39 calls):   193,000 input + 14,000 output  = $0.0260
  ─────────────────────────────────────────────────────────────────
  TOTAL LLM:            288,630 input + 17,589 output   = $0.0377

  Model: Groq Llama 4 Scout — $0.11/M input, $0.34/M output
  Modal compute: $0.0018
  Everything else: free
  TOTAL: ~$0.04/run

Key insight: Extraction is 69% of LLM cost. That's the main target.
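The breakdown above is plain rate arithmetic; a quick sketch that reproduces it:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_per_m: float, output_per_m: float) -> float:
    """LLM cost for one run, given token counts and $-per-million-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Groq Llama 4 Scout rates: $0.11/M input, $0.34/M output
triage = run_cost(95_630, 3_589, 0.11, 0.34)
extract = run_cost(193_000, 14_000, 0.11, 0.34)
print(f"triage=${triage:.4f} extract=${extract:.4f} total=${triage + extract:.4f}")
# → triage=$0.0117 extract=$0.0260 total=$0.0377
```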


Part 1: Cheaper Model Alternatives (Drop-In Replacements)

Full Provider Comparison for Your Workload

Your workload per run: ~289K input tokens, ~18K output tokens.

| Provider / Model | Input $/M | Output $/M | Cost/Run | vs Current | Speed | Notes |
|---|---|---|---|---|---|---|
| Groq Llama 4 Scout (current) | $0.11 | $0.34 | $0.0377 | baseline | ⚡ Very fast | |
| Groq Llama 3.1 8B | $0.05 | $0.08 | $0.0159 | -58% | ⚡ Very fast | Smaller model, test quality |
| Groq Llama 4 Scout + caching | $0.055 | $0.34 | $0.0220 | -42% | ⚡ Very fast | Auto 50% off cached input |
| Groq Batch API (Scout) | $0.055 | $0.17 | $0.0189 | -50% | 🐌 24h delay | Non-realtime only |
| Groq Llama 3.1 8B + Batch | $0.025 | $0.04 | $0.0080 | -79% | 🐌 24h delay | Cheapest Groq option |
| Gemini 2.0 Flash | $0.10 | $0.40 | $0.0361 | -4% | 🚀 Fast | 250 free req/day! |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | $0.0361 | -4% | 🚀 Fast | Newer, replaces 2.0 Flash-Lite |
| Gemini 2.0 Flash (free tier) | $0.00 | $0.00 | $0.0000 | -100% | 🚀 Fast | 250 req/day limit |
| DeepSeek V3.2 (cache miss) | $0.28 | $0.42 | $0.0886 | +135% | 🐢 Slow | MORE expensive raw |
| DeepSeek V3.2 (cache hit) | $0.028 | $0.42 | $0.0157 | -58% | 🐢 Slow | 90% off repeated prefixes |
| Together Llama 3.1 8B Turbo | $0.18 | $0.18 | $0.0553 | +47% | 🏃 Fast | More expensive |
| Cerebras Llama 3.1 8B | $0.10 (blended) | | $0.0307 | -19% | ⚡ Very fast | 1M free tokens/day |
| Fireworks (entry tier) | $0.10 (blended) | | $0.0307 | -19% | 🏃 Fast | 50% off batch + cached |

Part 2: The Free/Near-Free Strategy

🏆 Best Approach: Hybrid Model Routing

Use different models for different phases based on task complexity:

TRIAGE (classification — easy task)
  → Groq Llama 3.1 8B ($0.05/$0.08 per M)
  → 95,630 × $0.05/M + 3,589 × $0.08/M = $0.0051

EXTRACTION (structured data from PDFs — harder task)
  → Gemini 2.0 Flash (free tier: 250 requests/day)
  → 39 calls per run = up to 6 free runs/day
  → Cost: $0.00

TOTAL: $0.0051/run (was $0.0377) → 86% SAVINGS
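In code, the routing is just a lookup table. A minimal sketch with the two model clients injected as plain callables (the names and wiring here are illustrative assumptions, not your pipeline's actual interfaces):

```python
from typing import Callable

LLMFn = Callable[[str], str]

def make_router(triage_llm: LLMFn, extraction_llm: LLMFn) -> Callable[[str, str], str]:
    """Dispatch each phase to the cheapest model that handles it well."""
    routes = {
        "triage": triage_llm,       # e.g. Groq llama-3.1-8b-instant
        "extract": extraction_llm,  # e.g. Gemini 2.0 Flash (free tier)
    }
    def route(phase: str, prompt: str) -> str:
        return routes[phase](prompt)
    return route

# Stand-in callables for illustration:
route = make_router(lambda p: "yes", lambda p: '{"policy_number": "ABC123"}')
print(route("triage", "Is this email about insurance?"))  # → yes
```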

Why This Works

Triage is classification — "Is this email about insurance?" An 8B model handles yes/no classification just as well as a 109B MoE model. You're massively overspending here.

Extraction benefits from Gemini Flash because:

  • Native PDF understanding (can process PDFs directly, no PyMuPDF needed)
  • 1M token context window (send entire documents)
  • Built-in structured output / JSON mode
  • 98-99% accuracy reported for receipt/document extraction
  • Free tier gives you 250 requests/day = enough for 6 pipeline runs

Scaling Beyond Free Tier

| Daily Runs | Extraction Calls | Gemini Free Covers? | Paid Overflow Cost |
|---|---|---|---|
| 1-6 | 39-234 | ✅ Yes (250/day) | $0.00 |
| 7-12 | 273-468 | ⚠️ Partial | ~$0.01-0.14/day |
| 13+ | 507+ | ❌ No | Use paid @ $0.10/$0.40 per M |

When you outgrow free tier, Gemini paid is roughly the same cost as Groq Scout anyway, so you lose nothing.
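The overflow math, sketched with per-call token figures derived from the earlier totals (193K input / 14K output across 39 calls ≈ 4,950 in / 360 out per call — these defaults are estimates):

```python
def gemini_overflow_cost(runs_per_day: int, calls_per_run: int = 39,
                         free_quota: int = 250,
                         in_tok: int = 4_950, out_tok: int = 360) -> float:
    """Daily cost of extraction calls past Gemini's free quota,
    at paid rates of $0.10/M input and $0.40/M output."""
    paid_calls = max(0, runs_per_day * calls_per_run - free_quota)
    return paid_calls * (in_tok * 0.10 + out_tok * 0.40) / 1_000_000

print(f"{gemini_overflow_cost(6):.4f}")  # 234 calls, under quota → 0.0000
print(f"{gemini_overflow_cost(7):.4f}")  # 23 paid calls → 0.0147
```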


Part 3: Reduce Token Count (Works with ANY Provider)

These optimizations stack with model switching. Apply them regardless of which model you use.

3a. Prompt Caching on Groq (FREE, automatic)

Groq auto-caches repeated prefixes at 50% discount. Your 39 extraction calls likely share:

  • Same system prompt
  • Same JSON schema definition
  • Same few-shot examples

If your system prompt is ~2,000 tokens and shared across 39 calls:

  • Currently: 39 × 2,000 = 78,000 tokens at full price
  • With caching: 1 × 2,000 full + 38 × 2,000 at 50% = 40,000 effective tokens
  • Saves ~$0.004/run for free (just by call ordering)
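As a sketch, that saving can be computed directly (assuming the 50% cached-input discount applies to the shared prefix on every call after the first):

```python
def cache_savings(calls: int, prefix_tokens: int, input_per_m: float = 0.11) -> float:
    """Dollars saved per run when (calls - 1) copies of the shared
    prefix are billed at half the normal input rate."""
    cached = (calls - 1) * prefix_tokens
    return cached * (input_per_m / 2) / 1_000_000

print(f"${cache_savings(39, 2_000):.4f} saved per run")  # 38 × 2,000 cached tokens
```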

How to maximize cache hits:

# BAD — different prefix each call
messages = [
    {"role": "system", "content": f"Extract from: {doc_name}..."},  # dynamic = cache miss
    {"role": "user", "content": pdf_text}
]

# GOOD — static prefix, dynamic content at the end
messages = [
    {"role": "system", "content": "You extract insurance policy details..."},  # static = cached
    {"role": "user", "content": f"Document: {doc_name}\n\n{pdf_text}"}  # dynamic at end
]

3b. Compress PDF Text Before Sending (40-60% token reduction)

Insurance PDFs are full of boilerplate. Pre-process to strip noise:

import re
from collections import Counter

def compress_pdf_text(raw_text: str) -> str:
    """Strip boilerplate from insurance PDF text before LLM extraction."""
    
    # Remove excessive whitespace
    text = re.sub(r'\n{3,}', '\n\n', raw_text)
    text = re.sub(r' {2,}', ' ', text)
    
    # Remove page headers/footers (lines repeated across pages)
    lines = text.split('\n')
    line_counts = Counter(line.strip() for line in lines if line.strip())
    repeated = {line for line, count in line_counts.items() if count > 2}
    text = '\n'.join(l for l in lines if l.strip() not in repeated)
    
    # Remove common insurance PDF noise
    noise_patterns = [
        r'Page \d+ of \d+',
        r'CIN:?\s*[A-Z0-9]+',
        r'IRDAI\s*Reg\.?\s*No\.?.*',
        r'Toll\s*Free.*\d{4,}',
        r'www\.\S+',
        r'^\s*\d+\s*$',  # lone page numbers
    ]
    for pattern in noise_patterns:
        # Apply each pattern to the running result, not a fresh copy
        text = re.sub(pattern, '', text, flags=re.MULTILINE | re.IGNORECASE)
    
    return text.strip()

Impact: Insurance PDFs typically have 30-50% boilerplate. This alone can cut your 193K extraction input to ~100-130K.

3c. Single-Pass Extraction (Reduce Call Count)

If you're making 39 separate LLM calls for 39 documents, consider:

Option A: Batch multiple small documents per call

# Instead of 39 calls with 1 doc each (~5K tokens per doc)
# Make 8 calls with 5 docs each

prompt = """Extract insurance details from each document below.
Return a JSON array with one object per document.

---DOCUMENT 1---
{doc1_text}

---DOCUMENT 2---  
{doc2_text}
...
"""

Impact: Reduces overhead of repeated system prompts. 39 calls → 8 calls = ~80% fewer system prompt tokens.
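A sketch of the packing step (the delimiter format mirrors the prompt above; the chunk size is a tunable assumption):

```python
def build_batched_prompts(docs: list[str], batch_size: int = 5) -> list[str]:
    """Pack several documents into one prompt each, amortizing the shared header."""
    header = ("Extract insurance details from each document below.\n"
              "Return a JSON array with one object per document.\n")
    prompts = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        body = "\n".join(f"---DOCUMENT {j + 1}---\n{text}"
                         for j, text in enumerate(batch))
        prompts.append(header + "\n" + body)
    return prompts

# 39 docs at 5 per call → 8 prompts instead of 39
prompts = build_batched_prompts([f"doc {n}" for n in range(39)])
print(len(prompts))  # → 8
```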

Option B: Two-stage extraction

# Stage 1: Quick scan — extract only key identifiers (cheap, fast)
quick_prompt = "Extract ONLY: policy_number, insurer_name, type from this text. JSON only."
# ~50 output tokens per doc vs ~360 currently

# Stage 2: Full extraction — only for documents that need it
# Skip already-cached policies, only process genuinely new ones
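The two-stage flow above can be fleshed out as follows; the stub LLM callables and the cache keyed on policy_number are assumptions for illustration:

```python
import asyncio
import json

async def two_stage_extract(doc_text: str, quick_llm, full_llm, cache: dict) -> dict:
    """Stage 1: cheap identifier scan; Stage 2: full extraction only for unseen policies."""
    quick_prompt = ("Extract ONLY: policy_number, insurer_name, type "
                    "from this text. JSON only.\n\n" + doc_text)
    ids = json.loads(await quick_llm(quick_prompt))   # ~50 output tokens
    key = ids.get("policy_number")
    if key in cache:
        return cache[key]              # already processed, skip the expensive call
    result = await full_llm(doc_text)  # full extraction, only for genuinely new docs
    cache[key] = result
    return result

# Demo with stub LLMs: the second call for the same policy hits the cache
async def demo():
    quick = lambda p: asyncio.sleep(0, '{"policy_number": "P1"}')
    full_calls = []
    async def full(t):
        full_calls.append(t)
        return {"policy_number": "P1", "premium": 12000}
    cache = {}
    await two_stage_extract("doc text", quick, full, cache)
    await two_stage_extract("doc text", quick, full, cache)
    print(len(full_calls))  # → 1

asyncio.run(demo())
```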

3d. Use Structured Output / JSON Mode

Both Groq and Gemini support constrained JSON output:

# Groq with JSON mode
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=messages,
    response_format={"type": "json_object"},  # forces valid JSON
    max_tokens=500,  # cap output — insurance details don't need 1000+ tokens
)

# Gemini with response schema
import google.generativeai as genai
model = genai.GenerativeModel('gemini-2.0-flash')
response = model.generate_content(
    prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=PolicySchema,  # enforces exact fields
    )
)

Impact: Constrained output = fewer wasted output tokens. Setting max_tokens=500 for extraction (vs unlimited) prevents verbose responses.

3e. Pre-Filter with Regex Before LLM (Skip Obviously Non-Insurance PDFs)

INSURANCE_KEYWORDS = {
    'policy', 'premium', 'sum assured', 'sum insured', 'nominee',
    'insured', 'coverage', 'claim', 'rider', 'endorsement',
    'maturity', 'surrender', 'annuity', 'deductible'
}

def is_likely_insurance(text: str, threshold: int = 3) -> bool:
    """Quick check before sending to LLM."""
    text_lower = text.lower()
    matches = sum(1 for kw in INSURANCE_KEYWORDS if kw in text_lower)
    return matches >= threshold

# In pipeline:
for doc in documents:
    if not is_likely_insurance(doc.text):
        continue  # skip — don't waste LLM tokens
    policy = await extract_policy(doc.text)

Impact: If even 10-20% of your 39 documents are non-insurance (bank statements, random PDFs), this saves those LLM calls entirely.


Part 4: The Nuclear Option — Partial Rule-Based Extraction

For common Indian insurance providers (LIC, HDFC Life, ICICI Pru, SBI Life, Max Life), policy documents follow predictable templates. You could extract key fields with regex/layout rules for known formats and only fall back to LLM for unknown formats:

KNOWN_EXTRACTORS = {
    'LIC': extract_lic_policy,      # regex-based
    'HDFC Life': extract_hdfc_policy,
    'ICICI Prudential': extract_icici_policy,
}

async def extract_policy(text: str) -> dict:
    # Try to identify insurer
    for insurer, extractor in KNOWN_EXTRACTORS.items():
        if insurer.lower() in text.lower():
            result = extractor(text)
            if result and result.get('policy_number'):
                return result  # no LLM needed!
    
    # Fall back to LLM for unknown formats
    return await llm_extract(text)

Impact: If 50% of policies are from known insurers, you cut LLM extraction calls in half. This is more work to build but costs literally $0 to run.


Part 5: Cost Projection at Scale

Per-Run Costs by Strategy

| Strategy | Triage Cost | Extraction Cost | Total/Run | Monthly (100 runs/day) |
|---|---|---|---|---|
| Current (Groq Scout) | $0.0117 | $0.0260 | $0.0377 | $113.10 |
| Model switch only (8B + Gemini free) | $0.0051 | $0.0000 | $0.0051 | $15.30 |
| + Prompt compression (30% less input) | $0.0036 | $0.0000 | $0.0036 | $10.80 |
| + Batched extraction (8 calls vs 39) | $0.0036 | $0.0000 | $0.0036 | $10.80 |
| + Groq caching (auto 50% off) | $0.0028 | $0.0000 | $0.0028 | $8.40 |
| + Regex pre-filter (20% skip) | $0.0023 | $0.0000 | $0.0023 | $6.90 |
| Full optimization (all above) | $0.0020 | $0.0000 | $0.0020 | $6.00 |

At Higher Scale (When Gemini Free Tier Runs Out)

| Daily Users | Runs/Day | Current Cost/Mo | Optimized Cost/Mo | Savings |
|---|---|---|---|---|
| 1-6 | 1-6 | $1-7 | $0.01-0.03 | ~99% |
| 10 | 10 | $11 | $0.20 | 98% |
| 50 | 50 | $57 | $5.10 | 91% |
| 100 | 100 | $113 | $12.50 | 89% |
| 500 | 500 | $566 | $68 | 88% |

At 500 runs/day you'd be making ~19,500 Gemini extraction calls/day (way past free tier), so extraction runs at $0.10/$0.40 per M paid. Triage stays on Groq 8B. Still massive savings.


Implementation Priority (Bang for Buck)

| # | Change | Effort | Savings | Do It? |
|---|---|---|---|---|
| 1 | Switch triage to Groq 8B | 5 min (change model name) | 58% on triage | ✅ NOW |
| 2 | Switch extraction to Gemini Flash | 2-3 hours (new client) | 100% on extraction (free tier) | ✅ NOW |
| 3 | Ensure Groq prompt caching works | 30 min (reorder prompts) | 10-15% on triage | ✅ Easy win |
| 4 | Add PDF text compression | 1-2 hours | 30-40% fewer input tokens | ✅ Good ROI |
| 5 | Set max_tokens on all calls | 5 min | Caps output waste | ✅ NOW |
| 6 | Batch extraction calls | 2-3 hours | Fewer calls, less overhead | 🟡 Medium |
| 7 | Regex pre-filter | 1 hour | Skip non-insurance docs | 🟡 Medium |
| 8 | Rule-based extractors for known insurers | 1-2 days | Eliminate LLM for 50%+ docs | 🔴 Later |

Quick Start — Minimum Viable Changes (30 Minutes)

# 1. Switch triage model (5 seconds)
TRIAGE_MODEL = "llama-3.1-8b-instant"  # was "llama-4-scout-17b-16e-instruct"

# 2. Set max_tokens on extraction
response = groq_client.chat.completions.create(
    model=EXTRACTION_MODEL,
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=800,  # insurance policy JSON doesn't need more
)

# 3. Add Gemini Flash for extraction (swap out Groq for extraction calls)
import json

import google.generativeai as genai
genai.configure(api_key="YOUR_GEMINI_API_KEY")  # free key from ai.google.dev

gemini_model = genai.GenerativeModel('gemini-2.0-flash')

async def extract_policy_gemini(pdf_text: str) -> dict:
    response = gemini_model.generate_content(
        f"Extract insurance policy details as JSON:\n\n{pdf_text}",
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            max_output_tokens=800,
        )
    )
    return json.loads(response.text)

That's it. Three changes. $0.038 → $0.005/run.


Research conducted March 2026. All prices from official provider pricing pages. Groq: groq.com/pricing | Gemini: ai.google.dev/pricing | DeepSeek: platform.deepseek.com
