Iris Knowledge Pipeline — Review & Improvement Proposals

Current Architecture (as-is)

Data model

persona_knowledges — knowledge metadata: name, instructions (text[]), context/operator scope
persona_knowledge_chunks — individual chunks: content (text), embedding (pgvector 384/1024), position
personas_knowledge_graphs — extracted entity/relationship graph per persona (single JSONB blob)

Ingestion pipeline

entire_content (raw string)
  → Embedding.chunk(text, 390 bytes)     # byte-based splitting
  → Embedding.predict(chunk, MiniLM)     # embed raw chunk
  → insert KnowledgeChunk{content: chunk, embedding: vector, position: n}
  → (optional) GPT-4o classify + extract instructions → knowledge.instructions[]

Retrieval pipeline

query text
  → Embedding.predict(query, MiniLM)     # embed query
  → L2 distance search on persona_knowledge_chunks
  → spread: fetch position ± 1 neighbors
  → fetch instructions for each knowledge_id
  → format as <Start Knowledge>...</End Knowledge> blocks → LLM context
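
The retrieval steps above can be sketched in a few lines. This is a Python illustration, not the project's code: the `Chunk` record, `l2_distance`, and `retrieve_with_spread` are hypothetical names, and an in-memory list stands in for the pgvector query.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    knowledge_id: int
    position: int
    content: str
    embedding: tuple


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_with_spread(query_vec, chunks, top_k=1, spread=1):
    """Return the top_k nearest chunks plus their position +/- spread neighbors."""
    hits = sorted(chunks, key=lambda c: l2_distance(query_vec, c.embedding))[:top_k]
    by_key = {(c.knowledge_id, c.position): c for c in chunks}
    selected = {}
    for hit in hits:
        for offset in range(-spread, spread + 1):
            key = (hit.knowledge_id, hit.position + offset)
            if key in by_key:
                selected[key] = by_key[key]  # dict de-duplicates overlapping neighbors
    # Return in document order so the LLM sees contiguous text
    return [selected[k] for k in sorted(selected)]
```

Sorting the result by `(knowledge_id, position)` is a design choice in this sketch; the source does not say how the real pipeline orders neighbors.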

Issues Identified

1. Chunk size is far too small

Current limit: 390 bytes (~60–80 words)
MiniLM supports: 512 tokens (~380 words)
So each chunk uses only ~15–20% of the model's input capacity
Results in many low-coherence fragments, high retrieval noise

2. No chunk overlap

Chunks are non-overlapping; concepts spanning boundaries get weak signal
"Spread" retrieval (position ±1) partially compensates but embeddings themselves encode incomplete context
Standard practice: 10–20% overlap, so a concept that straddles a boundary appears whole in at least one chunk
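
A minimal sketch of overlapping chunking, assuming chunks are measured in words for simplicity (a real implementation would more likely count tokens). `chunk_with_overlap` and its parameters are hypothetical names.

```python
def chunk_with_overlap(words, size, overlap_ratio=0.15):
    """Split a list of words into windows of `size` that share ~overlap_ratio of their words."""
    if size <= 0:
        raise ValueError("size must be positive")
    overlap = int(size * overlap_ratio)
    step = max(1, size - overlap)  # always advance by at least one word
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

With `size=4` and `overlap_ratio=0.25`, each window starts three words after the previous one, so neighboring chunks share exactly one word.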

3. Byte-based splitting on UTF-8 text

Splitter operates on raw bytes; multi-byte characters (CJK, emoji, accented chars) can be split mid-codepoint
Sentence boundary detection helps but doesn't fully prevent this
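
The failure mode is easy to reproduce. The sketch below contrasts a naive byte slice with a splitter that respects code point boundaries; both function names are hypothetical, and neither reflects the actual `Embedding.chunk` implementation.

```python
def split_bytes_naive(text, limit):
    """Slice the UTF-8 encoding every `limit` bytes, ignoring character boundaries."""
    raw = text.encode("utf-8")
    return [raw[i:i + limit] for i in range(0, len(raw), limit)]


def split_bytes_safe(text, limit):
    """Split into pieces of at most `limit` UTF-8 bytes without cutting a code point.

    Note: this protects code points only; multi-code-point graphemes
    (some emoji, combining accents) can still be separated.
    """
    pieces, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if n > limit:
            raise ValueError("limit is smaller than a single character")
        if current and size + n > limit:
            pieces.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        pieces.append("".join(current))
    return pieces
```

For `"Pol Roger Millésime"` with a 15-byte limit, the naive split ends its first piece on the first byte of `é`, which no longer decodes as UTF-8; the safe version ends the piece one character earlier.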

4. No document-level context in embeddings

Each chunk is embedded in isolation
Transcripts especially suffer: pronouns and references have no context
"He said he'd handle it by Friday" is nearly meaningless without surrounding context

5. Embedding model mismatch risk

Chunks store no record of which embedding model was used
Switching models (e.g. MiniLM → Qwen3) silently degrades retrieval — old chunks return bad similarity scores
No way to detect or filter by model at query time
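
One way to close this gap is to tag every chunk with its model and filter on that tag before ranking. The sketch below assumes each chunk is a dict with an `embedding_model` key; the field name follows the schema proposal later in this document, while `search_same_model` itself is a hypothetical helper.

```python
def search_same_model(query_vec, query_model, chunks, top_k=3):
    """Rank only the chunks embedded with the same model as the query."""
    candidates = [c for c in chunks if c["embedding_model"] == query_model]

    def sq_dist(chunk):
        # Squared L2 distance; safe here because all candidates share the query's model
        return sum((a - b) ** 2 for a, b in zip(query_vec, chunk["embedding"]))

    return sorted(candidates, key=sq_dist)[:top_k]
```

Chunks left behind by the filter are also a ready-made work list for re-embedding after a model switch.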

6. Knowledge graph is a single JSONB blob per persona

Every extraction overwrites the previous graph
No history, no versioning
Large graphs become expensive to update atomically

7. No retrieval observability

No logging of what was retrieved, which chunks scored best, which model was used
Debugging poor retrieval requires ad-hoc instrumentation
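
As a rough illustration of what such logging could capture, the sketch below builds one structured record per query. The record shape and the function names are assumptions, not an existing Iris API.

```python
import json
import logging
import time

logger = logging.getLogger("retrieval")


def retrieval_log_record(query, embedding_model, hits):
    """Build a log record; `hits` is a ranked list of (chunk_id, distance) pairs."""
    return {
        "ts": time.time(),
        "query": query,
        "embedding_model": embedding_model,
        "hits": [
            {"rank": i + 1, "chunk_id": chunk_id, "distance": round(distance, 4)}
            for i, (chunk_id, distance) in enumerate(hits)
        ],
    }


def log_retrieval(query, embedding_model, hits):
    """Emit the record as one JSON line and return it for further use."""
    record = retrieval_log_record(query, embedding_model, hits)
    logger.info(json.dumps(record))
    return record
```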

Structured Data Problem (e.g. Juhlin Wine Database)

The current pipeline is especially poor for structured content like:

Brand: Pol Roger Millésime
Vintage: 1921
Score: 96
Grape:
Pinot Noir Pinot Meunier Chardonnay
80       0           20
Comment:
Ever since I sat with Christian Pol Roger...

Problems:

390-byte chunks split records mid-entry (brand/vintage in one chunk, comment in another)
Columnar grape percentages (80 0 20) embed with zero meaning — model has no idea what they represent
Score and price range land wherever the byte split happens to fall
No structured metadata for filtering (price ≤ X, score ≥ N, specific vintage lookup)

Recommended Approach: Generic LLM-Driven Pipeline

The key insight: use an LLM at ingest to normalize arbitrary content into a consistent, retrievable form. Pay the cost once at upload; benefit on every query.

Separate what you embed from what you store. Currently they're the same string. They shouldn't be.

Pipeline

Raw content (any format)
       │
       ▼
 [LLM: Analyze & Segment]          ← one call per upload
       │  detect content type
       │  identify natural unit boundaries
       └── return list of units (not byte-split)
              │
              ▼
       [LLM: Per-unit enrichment]   ← one call per unit (batchable)
              │  generate embedding_text: self-contained natural language
              │  description of what this unit contains
              └── preserve content: original unchanged text
                     │
                     ▼
              [Embed embedding_text]
              [Store {content, embedding_text, embedding, embedding_model, position}]
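
The flow in the diagram can be sketched as a small function that takes the three stages as callables. Everything here is illustrative: `ingest`, `KnowledgeChunk` as a Python dataclass, and the `fake_*` stand-ins are hypothetical, and the real stages would call an LLM and an embedding model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class KnowledgeChunk:
    content: str                 # original unit text, returned to the LLM unchanged
    embedding_text: str          # enriched description, the string that gets embedded
    embedding: Sequence[float]
    position: int


def ingest(
    raw: str,
    segment: Callable[[str], List[str]],
    enrich: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> List[KnowledgeChunk]:
    """Segment raw content, enrich each unit, and embed the enriched text."""
    chunks = []
    for position, unit in enumerate(segment(raw)):
        embedding_text = enrich(unit)
        chunks.append(
            KnowledgeChunk(unit, embedding_text, embed(embedding_text), position)
        )
    return chunks


# Stand-ins for the LLM and embedding calls, for illustration only
def fake_segment(raw):
    return [u.strip() for u in raw.split("\n\n") if u.strip()]


def fake_enrich(unit):
    return "Record: " + unit


def fake_embed(text):
    return [float(len(text))]
```

Passing the stages in as callables keeps the pipeline testable without network access and makes it easy to swap models later.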

Stage 1 — Analyze & segment

Ask the LLM to find natural boundaries — not bytes:

Content type       Natural unit
Structured DB      Each record
Article            Each section / argument
Info dump / blob   Each topic cluster
Transcript         Speaker turns / topics
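
Because `content` must remain the original text, it is worth checking that the units the LLM returns are verbatim slices of the input. The sketch below assumes the segmentation call returns JSON with `content_type` and `units` keys; that response shape and the `parse_segments` helper are assumptions for illustration.

```python
import json


def parse_segments(raw, llm_json):
    """Parse a segmentation response and verify each unit appears verbatim, in order."""
    data = json.loads(llm_json)
    units, cursor = [], 0
    for unit in data["units"]:
        idx = raw.find(unit, cursor)
        if idx == -1:
            # The model paraphrased, reordered, or invented text
            raise ValueError(f"unit not found verbatim in source: {unit[:40]!r}")
        units.append(unit)
        cursor = idx + len(unit)
    return data["content_type"], units
```

A failed check is a signal to retry the call or fall back to a simpler splitter rather than store altered text.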

Stage 2 — Per-unit enrichment

Generate embedding_text: a self-contained, natural language representation of each unit.

Example — wine record:

Input:

Brand: Pol Roger Millésime | Vintage: 1921 | Score: 96
Grape: 80 0 20 (Pinot Noir / Pinot Meunier / Chardonnay)
Comment: Ever since I sat with Christian Pol Roger...

Generated embedding_text:

"Pol Roger Millésime 1921 is a Brut champagne made from 80% Pinot Noir and 20% Chardonnay, scored 96 points by Richard Juhlin. Tasting notes describe crème brûlée, butterscotch, brioche, mint chocolate, and orange blossom. Exceptional harmony and charm. Priced €57–306."

The stored content is the original, unchanged text; embedding_text is what gets embedded.

Why this works across all formats

Format          Stage 1 output         Stage 2 output
Structured DB   one unit per record    natural language with all fields decoded
Article         one unit per section   section summary with document context
Info dump       topic clusters         coherent description of each cluster
Transcript      topic segments         what was discussed, by whom, key points

The LLM normalizes variance. No format-specific parsers needed.

Schema change

KnowledgeChunk
  content:         <original unit text>       ← returned to LLM in prompt (unchanged)
  embedding_text:  <enriched description>     ← what was embedded (can be omitted long-term)
  embedding:       <vector>
  position:        n
  embedding_model: atom                       ← fix model mismatch issue

Additional Improvements (independent of the above)

Contextual retrieval (Anthropic's technique)

For each chunk, prepend a short LLM-generated context that situates the chunk within its source document, then embed the combined text. This works even without full LLM segmentation: it is a simpler step that meaningfully improves retrieval for long-form content like transcripts.
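
A minimal sketch of the prepend step, assuming a hypothetical `situate` callable that returns a short context string for a chunk (in practice an LLM call). Only the embedding input changes; the stored chunk text stays as it is.

```python
def build_embedding_inputs(document, chunks, situate):
    """Return one embedding input per chunk: generated context, blank line, chunk text."""
    inputs = []
    for chunk in chunks:
        context = situate(document, chunk).strip()
        # Fall back to the bare chunk if no context was produced
        inputs.append(f"{context}\n\n{chunk}" if context else chunk)
    return inputs
```

For the transcript line quoted earlier, a context such as "From a meeting about the launch." gives the embedding something to anchor the pronouns to.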

Hierarchical / parent-child chunking

Small chunks (2–3 sentences) for precise retrieval signal
Larger parent chunks (400–600 words) returned as context to the LLM
Add parent_chunk_id to KnowledgeChunk; retrieve small, return parent
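
The "retrieve small, return parent" step can be sketched as follows. The dict shapes and the `retrieve_parents` name are assumptions; only `parent_chunk_id` comes from the proposal above.

```python
def retrieve_parents(query_vec, children, parents, top_k=2):
    """Rank small child chunks, then return their de-duplicated parent texts."""

    def sq_dist(child):
        return sum((a - b) ** 2 for a, b in zip(query_vec, child["embedding"]))

    seen, result = set(), []
    for child in sorted(children, key=sq_dist)[:top_k]:
        parent_id = child["parent_chunk_id"]
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result
```

De-duplicating by parent keeps the prompt from repeating the same large block when several of its children match.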

Hybrid retrieval for structured data

For content with numeric/categorical metadata (scores, prices, vintages):

Store as queryable columns alongside the chunk
Route structured queries to SQL filters, semantic queries to vector search
Combine both for "recommend a Blanc de Blancs under €100 scoring above 92"
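
A small sketch of the combined path, using SQLite from the Python standard library as a stand-in for Postgres. The table layout, the column names (`score`, `price_eur`), and the rows are invented for illustration and are not real Juhlin data.

```python
import json
import sqlite3


def make_demo_db():
    """Create an in-memory table with made-up rows: metadata columns plus a JSON vector."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks (content TEXT, score INTEGER, price_eur REAL, embedding TEXT)"
    )
    rows = [
        ("wine A", 95, 80.0, [0.9, 0.1]),
        ("wine B", 93, 250.0, [0.8, 0.2]),
        ("wine C", 88, 40.0, [0.7, 0.3]),
        ("wine D", 94, 60.0, [0.1, 0.9]),
    ]
    conn.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?, ?)",
        [(c, s, p, json.dumps(e)) for c, s, p, e in rows],
    )
    return conn


def hybrid_search(conn, query_vec, max_price, min_score, top_k=3):
    """Apply SQL filters first, then rank the surviving rows by vector distance."""
    rows = conn.execute(
        "SELECT content, embedding FROM chunks WHERE price_eur <= ? AND score >= ?",
        (max_price, min_score),
    ).fetchall()

    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(query_vec, json.loads(row[1])))

    return [row[0] for row in sorted(rows, key=sq_dist)[:top_k]]
```

With pgvector the filter and the ranking would normally live in a single query, using a `WHERE` clause and `ORDER BY embedding <-> $1`; the two-step version here only keeps the example self-contained.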

Cost of LLM-driven ingest

Use Haiku for both stages: it is fast, cheap, and should be sufficient for this task
Stage 1 can process an entire document in one call (return all segment boundaries)
Stage 2 can batch multiple short units per call
One-time cost per upload; amortized across all future queries
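
Batching short units for Stage 2 can be as simple as a greedy grouping under a size budget. The sketch below measures size in characters as a rough proxy for tokens; `batch_units` and `max_chars` are hypothetical names.

```python
def batch_units(units, max_chars):
    """Greedily group consecutive units so each batch stays within max_chars.

    A single unit longer than max_chars is placed in a batch of its own.
    """
    batches, current, size = [], [], 0
    for unit in units:
        if current and size + len(unit) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(unit)
        size += len(unit)
    if current:
        batches.append(current)
    return batches
```

Keeping units in their original order within and across batches means `position` can still be assigned by simple enumeration after the calls return.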

What to Keep From Current Pipeline

extract_knowledge_instructions — still useful as knowledge-level metadata
Async ingestion via Task.Supervisor — just add LLM stages inside the async task
Spread retrieval (position ±1) — still valuable even with better chunks
Dual-scope design (operator + context) — flexible, keep as-is

Main Change

Replace Embedding.chunk(entire_content, @embedding_model) with:

LLM-driven segmentation (natural units, not byte splits)
Per-unit embedding_text generation
Embed embedding_text, store original content
