@qcam
Created April 22, 2026 10:53

Iris knowledge storage pipeline — review and improvement proposals

Iris Knowledge Pipeline — Review & Improvement Proposals

Current Architecture (as-is)

Data model

  • persona_knowledges — knowledge metadata: name, instructions (text[]), context/operator scope
  • persona_knowledge_chunks — individual chunks: content (text), embedding (pgvector 384/1024), position
  • personas_knowledge_graphs — extracted entity/relationship graph per persona (single JSONB blob)

Ingestion pipeline

entire_content (raw string)
  → Embedding.chunk(text, 390 bytes)     # byte-based splitting
  → Embedding.predict(chunk, MiniLM)     # embed raw chunk
  → insert KnowledgeChunk{content: chunk, embedding: vector, position: n}
  → (optional) GPT-4o classify + extract instructions → knowledge.instructions[]

Retrieval pipeline

query text
  → Embedding.predict(query, MiniLM)     # embed query
  → L2 distance search on persona_knowledge_chunks
  → spread: fetch position ± 1 neighbors
  → fetch instructions for each knowledge_id
  → format as <Start Knowledge>...</End Knowledge> blocks → LLM context
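
The retrieval steps above can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not the Iris implementation; `Chunk`, `retrieve_with_spread`, and the two-dimensional toy vectors are hypothetical stand-ins for the real table and embeddings.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    knowledge_id: int
    position: int
    content: str
    embedding: tuple  # toy vector; real embeddings have 384 or 1024 dimensions


def retrieve_with_spread(query_vec, chunks, top_k=1, spread=1):
    """Nearest chunks by L2 distance, plus neighbours at position +/- spread."""
    hits = sorted(chunks, key=lambda c: math.dist(query_vec, c.embedding))[:top_k]
    wanted = {
        (h.knowledge_id, h.position + offset)
        for h in hits
        for offset in range(-spread, spread + 1)
    }
    selected = [c for c in chunks if (c.knowledge_id, c.position) in wanted]
    # Return in document order so neighbouring text reads continuously.
    return sorted(selected, key=lambda c: (c.knowledge_id, c.position))


chunks = [
    Chunk(1, 0, "intro", (0.0, 0.0)),
    Chunk(1, 1, "setup", (1.0, 0.0)),
    Chunk(1, 2, "answer", (2.0, 0.0)),
    Chunk(1, 3, "follow-up", (3.0, 0.0)),
]
result = retrieve_with_spread((2.1, 0.0), chunks)
print([c.content for c in result])
```

The neighbours are fetched by position only, so they add surrounding text to the prompt without having matched the query themselves.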

Issues Identified

1. Chunk size is far too small

  • Current limit: 390 bytes (~60–80 words)
  • MiniLM supports: 512 tokens (~380 words)
  • Each chunk therefore uses only ~15–20% of the model's input capacity
  • Results in many low-coherence fragments, high retrieval noise

2. No chunk overlap

  • Chunks are non-overlapping; concepts spanning boundaries get weak signal
  • "Spread" retrieval (position ±1) partially compensates at read time, but the embeddings themselves still encode incomplete context
  • Standard practice: 10–20% overlap so each chunk is semantically self-contained
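
A sliding window with overlap might look like the sketch below. It counts words for simplicity, which is an assumption; a production splitter would more likely count tokens with the embedding model's own tokenizer. The function name and the default sizes (200 words, 30 of them shared, i.e. 15%) are illustrative.

```python
def chunk_with_overlap(words, size=200, overlap=30):
    """Split a list of words into windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
windows = chunk_with_overlap(text.split(), size=4, overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window starts with the last word of the previous one, so a phrase that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.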

3. Byte-based splitting on UTF-8 text

  • Splitter operates on raw bytes; multi-byte characters (CJK, emoji, accented chars) can be split mid-codepoint
  • Sentence boundary detection helps but doesn't fully prevent this
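
The failure mode is easy to reproduce. The sketch below is a Python illustration, not the Iris splitter: it first cuts a string at a fixed byte offset, then shows one possible fix that respects code point boundaries. Both function names are hypothetical.

```python
def split_bytes_naive(text, limit):
    """Cut the UTF-8 encoding every `limit` bytes, ignoring character boundaries."""
    data = text.encode("utf-8")
    return [data[i:i + limit] for i in range(0, len(data), limit)]


def split_utf8_safe(text, limit):
    """Split into pieces of at most `limit` UTF-8 bytes without cutting a code point."""
    pieces, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if n > limit:
            raise ValueError("limit is smaller than a single character")
        if size + n > limit:
            pieces.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        pieces.append("".join(current))
    return pieces


word = "Millésime"  # "é" takes two bytes in UTF-8
first = split_bytes_naive(word, 5)[0]
try:
    first.decode("utf-8")
except UnicodeDecodeError as err:
    print("naive split broke a character:", err.reason)
print(split_utf8_safe(word, 5))
```

This only protects individual code points. Emoji sequences and combining accents span several code points, so a splitter that must keep those intact would need grapheme-aware segmentation.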

4. No document-level context in embeddings

  • Each chunk is embedded in isolation
  • Transcripts especially suffer: pronouns and references have no context
  • "He said he'd handle it by Friday" is nearly meaningless without surrounding context

5. Embedding model mismatch risk

  • Chunks store no record of which embedding model was used
  • Switching models (e.g. MiniLM → Qwen3) silently degrades retrieval: vectors from different models are not comparable, so old chunks return meaningless similarity scores
  • No way to detect or filter by model at query time

6. Knowledge graph is a single JSONB blob per persona

  • Every extraction overwrites the previous graph
  • No history, no versioning
  • Large graphs become expensive to update atomically

7. No retrieval observability

  • No logging of what was retrieved, which chunks scored best, which model was used
  • Debugging poor retrieval requires ad-hoc instrumentation
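
One low-cost starting point is a structured log line per retrieval. The sketch below is a Python illustration under assumed names: `retrieval_log_entry` and the field names are hypothetical, and the chunk ids and distances are made-up values.

```python
import json
import time


def retrieval_log_entry(query, embedding_model, hits, now=None):
    """Serialize one retrieval as JSON; `hits` is a list of (chunk_id, distance)."""
    ranked = sorted(hits, key=lambda hit: hit[1])  # smallest distance first
    return json.dumps(
        {
            "ts": time.time() if now is None else now,
            "query": query,
            "embedding_model": embedding_model,
            "hits": [
                {"rank": rank, "chunk_id": chunk_id, "distance": round(distance, 4)}
                for rank, (chunk_id, distance) in enumerate(ranked, start=1)
            ],
        },
        ensure_ascii=False,
    )


entry = retrieval_log_entry("vintage champagne", "minilm", [(42, 0.81), (7, 0.35)], now=0.0)
print(entry)
```

Logging the model name next to the distances also makes a model mismatch visible after the fact.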

Structured Data Problem (e.g. Juhlin Wine Database)

The current pipeline is especially poor for structured content like:

Brand: Pol Roger Millésime
Vintage: 1921
Score: 96
Grape:
Pinot Noir Pinot Meunier Chardonnay
80       0           20
Comment:
Ever since I sat with Christian Pol Roger...

Problems:

  • 390-byte chunks split records mid-entry (brand/vintage in one chunk, comment in another)
  • Columnar grape percentages (80 0 20) embed with zero meaning — model has no idea what they represent
  • Score and price range land wherever the byte split happens to fall, detached from the brand and vintage they describe
  • No structured metadata for filtering (price ≤ X, score ≥ N, specific vintage lookup)

Recommended Approach: Generic LLM-Driven Pipeline

The key insight: use an LLM at ingest to normalize arbitrary content into a consistent, retrievable form. Pay the cost once at upload; benefit on every query.

Separate what you embed from what you store. Currently they're the same string. They shouldn't be.

Pipeline

Raw content (any format)
       │
       ▼
 [LLM: Analyze & Segment]          ← one call per upload
       │  detect content type
       │  identify natural unit boundaries
       └── return list of units (not byte-split)
              │
              ▼
       [LLM: Per-unit enrichment]   ← one call per unit (batchable)
              │  generate embedding_text: self-contained natural language
              │  description of what this unit contains
              └── preserve content: original unchanged text
                     │
                     ▼
              [Embed embedding_text]
              [Store {content, embedding_text, embedding, position, embedding_model}]

Stage 1 — Analyze & segment

Ask the LLM to find natural boundaries — not bytes:

Content type      | Natural unit
Structured DB     | Each record
Article           | Each section / argument
Info dump / blob  | Each topic cluster
Transcript        | Speaker turns / topics
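
One way to keep the stored text unchanged is to have the LLM return boundaries rather than rewritten text, then slice the original. The Python sketch below assumes a hypothetical response contract (a JSON list of `{"start", "end"}` character offsets); the actual prompt and response format are not specified in this review.

```python
import json


def units_from_boundaries(raw, llm_json):
    """Slice `raw` by LLM-proposed [start, end) character offsets.

    Rejects overlapping, empty, or out-of-range spans so that a bad
    response fails loudly instead of corrupting the stored content.
    """
    spans = sorted((int(s["start"]), int(s["end"])) for s in json.loads(llm_json))
    units, cursor = [], 0
    for start, end in spans:
        if start < cursor or end <= start or end > len(raw):
            raise ValueError(f"invalid span ({start}, {end})")
        units.append(raw[start:end])
        cursor = end
    return units


raw = "Brand: A\nScore: 1\n\nBrand: B\nScore: 2\n"
second = raw.index("Brand: B")
# Stand-in for the LLM response: one span per record.
response = json.dumps([{"start": 0, "end": second - 1}, {"start": second, "end": len(raw)}])
print(units_from_boundaries(raw, response))
```

Gaps between spans are allowed, so separators such as blank lines can be dropped without touching the record text.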

Stage 2 — Per-unit enrichment

Generate embedding_text: a self-contained, natural language representation of each unit.

Example — wine record:

Input:

Brand: Pol Roger | Vintage: 1921 | Score: 96
Grape: 80 0 20 (Pinot Noir / Pinot Meunier / Chardonnay)
Comment: Ever since I sat with Christian Pol Roger...

Generated embedding_text:

"Pol Roger Millésime 1921 is a Brut champagne made from 80% Pinot Noir and 20% Chardonnay, scored 96 points by Richard Juhlin. Tasting notes describe crème brûlée, butterscotch, brioche, mint chocolate, and orange blossom. Exceptional harmony and charm. Priced €57–306."

The stored content is the original, unchanged text; embedding_text is what gets embedded.
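
The split between what is embedded and what is stored can be sketched as below. This is a Python illustration with injected callables, not the Iris code: `ingest`, `segment`, `enrich`, and `embed` are hypothetical names, and the lambdas stand in for the two LLM stages and the embedding model.

```python
def ingest(raw, segment, enrich, embed, embedding_model):
    """Build one row per unit: embed the enriched text, store the original."""
    rows = []
    for position, unit in enumerate(segment(raw)):
        embedding_text = enrich(unit)
        rows.append(
            {
                "content": unit,  # original text, returned to the LLM at query time
                "embedding_text": embedding_text,  # what the vector represents
                "embedding": embed(embedding_text),
                "position": position,
                "embedding_model": embedding_model,
            }
        )
    return rows


rows = ingest(
    "Brand: X | Score: 90\n---\nBrand: Y | Score: 95",
    segment=lambda raw: [unit.strip() for unit in raw.split("---")],  # stub for Stage 1
    enrich=lambda unit: "Wine record. " + unit.replace(" | ", ", "),  # stub for Stage 2
    embed=lambda text: (float(len(text)),),  # stub for the embedding model
    embedding_model="stub-model",
)
for row in rows:
    print(row["position"], row["content"], "->", row["embedding_text"])
```

Passing the stages in as functions keeps the pipeline testable without network calls.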

Why this works across all formats

Format         | Stage 1 output       | Stage 2 output
Structured DB  | one unit per record  | natural language with all fields decoded
Article        | one unit per section | section summary with document context
Info dump      | topic clusters       | coherent description of each cluster
Transcript     | topic segments       | what was discussed, by whom, key points

The LLM normalizes variance. No format-specific parsers needed.

Schema change

KnowledgeChunk
  content:         <original unit text>       ← returned to LLM in prompt (unchanged)
  embedding_text:  <enriched description>     ← what was embedded (can be omitted long-term)
  embedding:       <vector>
  position:        n
  embedding_model: atom                       ← fix model mismatch issue
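
With embedding_model recorded per chunk, the query path can refuse to compare vectors from different models. A minimal Python sketch of that check follows; the dataclass mirrors the fields above, with a plain string standing in for the atom, and `partition_by_model` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeChunk:
    content: str
    embedding_text: str
    embedding: tuple
    position: int
    embedding_model: str  # an atom in the proposed schema; a string here


def partition_by_model(chunks, active_model):
    """Separate chunks that can be searched now from ones needing re-embedding."""
    comparable = [c for c in chunks if c.embedding_model == active_model]
    stale = [c for c in chunks if c.embedding_model != active_model]
    return comparable, stale


stored = [
    KnowledgeChunk("old text", "old text", (0.1, 0.2), 0, "minilm"),
    KnowledgeChunk("new text", "new text described", (0.3, 0.4, 0.5), 1, "qwen3"),
]
comparable, stale = partition_by_model(stored, active_model="qwen3")
print(len(comparable), "searchable,", len(stale), "to re-embed")
```

The stale list doubles as a work queue for a background re-embedding job after a model switch.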

Additional Improvements (independent of the above)

Contextual retrieval (Anthropic's technique)

For each chunk, prepend a short, LLM-generated context that situates the chunk within its document before embedding. This works even without full LLM segmentation: it is a simpler step that meaningfully improves retrieval for long-form content like transcripts.
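
In code, the technique amounts to changing the string that is embedded while leaving the stored chunk alone. The Python sketch below assumes a hypothetical `context_for` callable that wraps the LLM call; it could return a document-level summary or a sentence specific to the chunk. The context string in the example is invented for illustration.

```python
def contextualize(chunks, context_for):
    """Prefix each chunk with generated context for embedding; keep the original for storage."""
    rows = []
    for position, chunk in enumerate(chunks):
        context = context_for(chunk).strip()
        rows.append(
            {
                "content": chunk,
                "embedding_text": f"{context}\n\n{chunk}",
                "position": position,
            }
        )
    return rows


line = "He said he'd handle it by Friday."
# Stub for the LLM: a fixed, made-up context sentence.
rows_ctx = contextualize([line], lambda chunk: "From a project sync transcript about a release deadline.")
print(rows_ctx[0]["embedding_text"])
```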

Hierarchical / parent-child chunking

  • Small chunks (2–3 sentences) for precise retrieval signal
  • Larger parent chunks (400–600 words) returned as context to the LLM
  • Add parent_chunk_id to KnowledgeChunk; retrieve small, return parent
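
"Retrieve small, return parent" might be sketched as follows. This is an in-memory Python illustration with hypothetical names and toy vectors; in the real schema the lookup would go through parent_chunk_id on KnowledgeChunk.

```python
import math


def retrieve_parents(query_vec, children, parents, top_k=3):
    """Rank small child chunks by L2 distance, then return each parent text once."""
    ranked = sorted(children, key=lambda c: math.dist(query_vec, c["embedding"]))[:top_k]
    seen, results = set(), []
    for child in ranked:
        parent_id = child["parent_chunk_id"]
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


children = [
    {"embedding": (0.0, 1.0), "parent_chunk_id": "p1"},
    {"embedding": (0.1, 1.0), "parent_chunk_id": "p1"},
    {"embedding": (1.0, 0.0), "parent_chunk_id": "p2"},
]
parents = {"p1": "parent one text", "p2": "parent two text"}
print(retrieve_parents((0.0, 1.0), children, parents, top_k=2))
```

Deduplicating by parent avoids sending the same large passage to the LLM twice when several of its children match.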

Hybrid retrieval for structured data

For content with numeric/categorical metadata (scores, prices, vintages):

  • Store as queryable columns alongside the chunk
  • Route structured queries to SQL filters, semantic queries to vector search
  • Combine both for "recommend a Blanc de Blancs under €100 scoring above 92"
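
The combined case can be pictured as "filter first, then rank by distance". The Python sketch below does this in memory over made-up rows; the column names (`price_eur`, `score`) and the function are assumptions. In Postgres the same shape would be a WHERE clause on the metadata columns plus an ORDER BY on pgvector's L2 distance operator (`<->`).

```python
import math


def hybrid_search(query_vec, rows, price_below=None, score_above=None, top_k=5):
    """Apply structured filters, then rank the survivors by L2 distance."""

    def passes(row):
        if price_below is not None and row["price_eur"] >= price_below:
            return False
        if score_above is not None and row["score"] <= score_above:
            return False
        return True

    candidates = [row for row in rows if passes(row)]
    return sorted(candidates, key=lambda r: math.dist(query_vec, r["embedding"]))[:top_k]


# Toy rows with invented values, not taken from the wine database.
wines = [
    {"name": "wine A", "price_eur": 80, "score": 94, "embedding": (0.9, 0.1)},
    {"name": "wine B", "price_eur": 250, "score": 96, "embedding": (1.0, 0.0)},
    {"name": "wine C", "price_eur": 60, "score": 90, "embedding": (0.8, 0.2)},
]
best = hybrid_search((1.0, 0.0), wines, price_below=100, score_above=92)
print([w["name"] for w in best])
```

Here "wine B" is the closest vector but is excluded by price, which is exactly the case pure vector search cannot handle.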

Cost of LLM-driven ingest

  • Use Haiku for both stages — fast and cheap, sufficient for this task
  • Stage 1 can process an entire document in one call (return all segment boundaries)
  • Stage 2 can batch multiple short units per call
  • One-time cost per upload; amortized across all future queries
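
Batching for Stage 2 can be as simple as grouping units under a size budget. The sketch below is a Python illustration; the character budget is a stand-in for a token budget, and `batch_units` is a hypothetical helper.

```python
def batch_units(units, max_chars=4000):
    """Group consecutive units so each batch stays within `max_chars` where possible."""
    batches, current, size = [], [], 0
    for unit in units:
        if current and size + len(unit) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(unit)  # an oversized unit still gets a batch of its own
        size += len(unit)
    if current:
        batches.append(current)
    return batches


print(batch_units(["aaa", "bbb", "ccc"], max_chars=6))
```

Keeping units in order inside each batch makes it straightforward to map the LLM's responses back to positions.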

What to Keep From Current Pipeline

  • extract_knowledge_instructions — still useful as knowledge-level metadata
  • Async ingestion via Task.Supervisor — just add LLM stages inside the async task
  • Spread retrieval (position ±1) — still valuable even with better chunks
  • Dual-scope design (operator + context) — flexible, keep as-is

Main Change

Replace Embedding.chunk(entire_content, @embedding_model) with:

  1. LLM-driven segmentation (natural units, not byte splits)
  2. Per-unit embedding_text generation
  3. Embed embedding_text, store original content