- `persona_knowledges`: knowledge metadata, with name, instructions (text[]), and context/operator scope
- `persona_knowledge_chunks`: individual chunks, with content (text), embedding (pgvector 384/1024), and position
- `personas_knowledge_graphs`: extracted entity/relationship graph per persona (single JSONB blob)
entire_content (raw string)
→ Embedding.chunk(text, 390 bytes) # byte-based splitting
→ Embedding.predict(chunk, MiniLM) # embed raw chunk
→ insert KnowledgeChunk{content: chunk, embedding: vector, position: n}
→ (optional) GPT-4o classify + extract instructions → knowledge.instructions[]
query text
→ Embedding.predict(query, MiniLM) # embed query
→ L2 distance search on persona_knowledge_chunks
→ spread: fetch position ± 1 neighbors
→ fetch instructions for each knowledge_id
→ format as <Start Knowledge>...</End Knowledge> blocks → LLM context
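The retrieval steps above can be sketched in a few lines. This is an illustrative in-memory version in Python, not the production code: the `Chunk` class and `retrieve_with_spread` function are hypothetical names, and a real deployment would run the distance search in the database rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    knowledge_id: int
    position: int
    content: str
    embedding: list[float]


def retrieve_with_spread(query_vec, chunks, top_k=2, spread=1):
    """Return the top_k nearest chunks plus their position +/- spread neighbors."""
    # Rank every chunk by L2 (Euclidean) distance to the query vector.
    hits = sorted(chunks, key=lambda c: math.dist(query_vec, c.embedding))[:top_k]
    # Collect the (knowledge_id, position) keys of each hit and its neighbors.
    wanted = {
        (hit.knowledge_id, hit.position + delta)
        for hit in hits
        for delta in range(-spread, spread + 1)
    }
    selected = [c for c in chunks if (c.knowledge_id, c.position) in wanted]
    # Return in document order so neighboring text reads continuously.
    return sorted(selected, key=lambda c: (c.knowledge_id, c.position))
```

Note that the neighbors are chosen by position only; their own embeddings play no part in the ranking.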
- Current limit: 390 bytes (~60–80 words)
- MiniLM supports: 512 tokens (~380 words)
- Using ~15–20% of the model's context capacity per chunk
- Results in many low-coherence fragments, high retrieval noise
- Chunks are non-overlapping; concepts spanning boundaries get weak signal
- "Spread" retrieval (position ±1) partially compensates but embeddings themselves encode incomplete context
- Common practice: 10–20% overlap, so a concept that straddles a boundary appears whole in at least one chunk
- Splitter operates on raw bytes; multi-byte characters (CJK, emoji, accented chars) can be split mid-codepoint
- Sentence boundary detection helps but doesn't fully prevent this
- Each chunk is embedded in isolation
- Transcripts especially suffer: pronouns and references have no context
- "He said he'd handle it by Friday" is nearly meaningless without surrounding context
- Chunks store no record of which embedding model was used
- Switching models (e.g. MiniLM → Qwen3) silently degrades retrieval: vectors from different models are not comparable, so old chunks return meaningless similarity scores
- No way to detect or filter by model at query time
- Every extraction overwrites the previous graph
- No history, no versioning
- Large graphs become expensive to update atomically
- No logging of what was retrieved, which chunks scored best, which model was used
- Debugging poor retrieval requires ad-hoc instrumentation
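Two of the problems above, mid-codepoint splits and missing overlap, can be illustrated with a small Python sketch. The `chunk_text` function is hypothetical and is not the existing `Embedding.chunk`. It keeps the 390-byte budget but walks the text one code point at a time and repeats a fraction of each chunk at the start of the next. It does not protect multi-code-point grapheme clusters (for example some emoji sequences), and it ignores sentence boundaries.

```python
def chunk_text(text: str, max_bytes: int = 390, overlap_ratio: float = 0.15) -> list[str]:
    """Split text into chunks of at most max_bytes UTF-8 bytes, never inside a
    code point, repeating roughly overlap_ratio of each chunk in the next one."""
    chunks = []
    start = 0
    while start < len(text):
        end, size = start, 0
        # Grow the chunk one code point at a time until the byte budget is hit.
        while end < len(text):
            width = len(text[end].encode("utf-8"))
            if size + width > max_bytes:
                break
            size += width
            end += 1
        if end == start:
            raise ValueError("max_bytes is smaller than a single character")
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by the overlap, but always advance by at least one character.
        overlap_chars = int((end - start) * overlap_ratio)
        start = max(end - overlap_chars, start + 1)
    return chunks


# Why raw byte slicing is unsafe: the two-byte "é" is cut in half.
broken = "caf\u00e9".encode("utf-8")[:4]  # b'caf\xc3', not valid UTF-8 on its own
```

Calling `broken.decode("utf-8")` raises `UnicodeDecodeError`, which is the failure mode a byte-based splitter risks on CJK, emoji, and accented text.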
The current pipeline is especially poor for structured content like:
Brand: Pol Roger Millésime
Vintage: 1921
Score: 96
Grape:
Pinot Noir Pinot Meunier Chardonnay
80 0 20
Comment:
Ever since I sat with Christian Pol Roger...
Problems:
- 390-byte chunks split records mid-entry (brand/vintage in one chunk, comment in another)
- Columnar grape percentages (`80 0 20`) embed with zero meaning; the model has no idea what they represent
- Score and price range land wherever the byte split happens
- No structured metadata for filtering (price ≤ X, score ≥ N, specific vintage lookup)
The key insight: use an LLM at ingest to normalize arbitrary content into a consistent, retrievable form. Pay the cost once at upload; benefit on every query.
Separate what you embed from what you store. Currently they're the same string. They shouldn't be.
Raw content (any format)
│
▼
[LLM: Analyze & Segment] ← one call per upload
│ detect content type
│ identify natural unit boundaries
└── return list of units (not byte-split)
│
▼
[LLM: Per-unit enrichment] ← one call per unit (batchable)
│ generate embedding_text: self-contained natural language
│ description of what this unit contains
└── preserve content: original unchanged text
│
▼
[Embed embedding_text]
[Store {content, embedding, position}]
Ask the LLM to find natural boundaries — not bytes:
| Content type | Natural unit |
|---|---|
| Structured DB | Each record |
| Article | Each section / argument |
| Info dump / blob | Each topic cluster |
| Transcript | Speaker turns / topics |
Generate embedding_text: a self-contained, natural language representation of each unit.
Example — wine record:
Input:
Brand: Pol Roger | Vintage: 1921 | Score: 96
Grape: 80 0 20 (Pinot Noir / Pinot Meunier / Chardonnay)
Comment: Ever since I sat with Christian Pol Roger...
Generated embedding_text:
"Pol Roger Millésime 1921 is a Brut champagne made from 80% Pinot Noir and 20% Chardonnay, scored 96 points by Richard Juhlin. Tasting notes describe crème brûlée, butterscotch, brioche, mint chocolate, and orange blossom. Exceptional harmony and charm. Priced €57–306."
The stored `content` is the original, unchanged text; `embedding_text` is what gets embedded.
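A minimal Python sketch of the two-stage flow, under stated assumptions: `segment`, `enrich`, and `embed` are hypothetical callables standing in for the Stage 1 LLM call, the Stage 2 LLM call, and the embedding model. None of them is an existing function in the codebase. The point is only that the vector is computed from `embedding_text` while `content` stays untouched.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KnowledgeChunk:
    content: str           # original unit text, returned to the LLM unchanged
    embedding_text: str    # enriched description, used only to compute the vector
    embedding: list[float]
    position: int


def ingest(
    raw: str,
    segment: Callable[[str], list[str]],   # Stage 1: raw content -> natural units
    enrich: Callable[[str], str],          # Stage 2: unit -> self-contained description
    embed: Callable[[str], list[float]],   # embedding model
) -> list[KnowledgeChunk]:
    chunks = []
    for position, unit in enumerate(segment(raw)):
        embedding_text = enrich(unit)
        # Embed the enriched text, but store the original unit as content.
        chunks.append(
            KnowledgeChunk(unit, embedding_text, embed(embedding_text), position)
        )
    return chunks
```

Because the three stages are passed in as functions, the flow can be exercised with simple stubs before any LLM is wired up.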
| Format | Stage 1 output | Stage 2 output |
|---|---|---|
| Structured DB | one unit per record | natural language with all fields decoded |
| Article | one unit per section | section summary with document context |
| Info dump | topic clusters | coherent description of each cluster |
| Transcript | topic segments | what was discussed, by whom, key points |
The LLM normalizes variance. No format-specific parsers needed.
KnowledgeChunk
content: <original unit text> ← returned to LLM in prompt (unchanged)
embedding_text: <enriched description> ← what was embedded (can be omitted long-term)
embedding: <vector>
position: n
embedding_model: atom ← fix model mismatch issue
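To show how the `embedding_model` field could be used at query time, here is a small Python sketch. `StoredChunk` and `split_by_model` are hypothetical names, and the model labels are placeholders. Chunks embedded with a different model than the query are set aside for re-embedding instead of being scored.

```python
from dataclasses import dataclass


@dataclass
class StoredChunk:
    content: str
    embedding: list[float]
    embedding_model: str  # label of the model that produced the vector


def split_by_model(chunks, query_model):
    """Separate chunks comparable with the query vector from stale ones."""
    usable = [c for c in chunks if c.embedding_model == query_model]
    stale = [c for c in chunks if c.embedding_model != query_model]
    return usable, stale
```

Only `usable` should go to the distance search; `stale` is a natural work queue for a background re-embedding job.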
For each chunk, prepend a short document-level summary before embedding. Works even without full LLM segmentation — it's a simpler step that meaningfully improves retrieval for long-form content like transcripts.
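As a rough Python illustration, the prefixing step can be a single string operation. The function name and the "Document context:" label are assumptions, not an established format. How the summary is produced (an LLM call, or a manual description) is left open.

```python
def contextual_embedding_text(doc_summary: str, chunk: str) -> str:
    """Prepend a short document-level summary to a chunk before embedding it."""
    return f"Document context: {doc_summary.strip()}\n\n{chunk.strip()}"


# Made-up summary, for illustration only.
text = contextual_embedding_text(
    "Transcript of a planning call about a product launch.",
    "He said he'd handle it by Friday",
)
```

Only the string passed to the embedding model changes; the chunk returned to the LLM at query time can stay as the original text.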
- Small chunks (2–3 sentences) for precise retrieval signal
- Larger parent chunks (400–600 words) returned as context to the LLM
- Add `parent_chunk_id` to `KnowledgeChunk`; retrieve small, return parent
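The "retrieve small, return parent" idea can be sketched as follows in Python. `ChildChunk` and `retrieve_parents` are hypothetical names, and parents are held in a plain dict keyed by id for brevity. Several child hits that share a parent collapse into one returned parent.

```python
import math
from dataclasses import dataclass


@dataclass
class ChildChunk:
    id: int
    parent_chunk_id: int
    embedding: list[float]


def retrieve_parents(query_vec, children, parents, top_k=3):
    """Rank small child chunks by L2 distance, then return their distinct
    parent texts in rank order."""
    ranked = sorted(children, key=lambda c: math.dist(query_vec, c.embedding))
    seen, result = set(), []
    for child in ranked[:top_k]:
        if child.parent_chunk_id not in seen:
            seen.add(child.parent_chunk_id)
            result.append(parents[child.parent_chunk_id])
    return result
```

Deduplicating by parent keeps the prompt from carrying the same 400–600 word passage more than once.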
For content with numeric/categorical metadata (scores, prices, vintages):
- Store as queryable columns alongside the chunk
- Route structured queries to SQL filters, semantic queries to vector search
- Combine both for "recommend a Blanc de Blancs under €100 scoring above 92"
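A hedged sketch of the combined query, using Python's built-in `sqlite3` as a stand-in for the real database. The table layout, column names (`style`, `price_eur`, `score`), and rows are all placeholders invented for the example. With Postgres and pgvector the filter and the distance ordering could presumably live in one SQL statement; here the ranking is done in Python to keep the example self-contained.

```python
import json
import math
import sqlite3


def hybrid_search(conn, query_vec, style, max_price, min_score, top_k=3):
    """Apply structured filters in SQL, then rank the survivors by L2 distance."""
    rows = conn.execute(
        "SELECT content, embedding FROM chunks "
        "WHERE style = ? AND price_eur < ? AND score > ?",
        (style, max_price, min_score),
    ).fetchall()
    ranked = sorted(rows, key=lambda row: math.dist(query_vec, json.loads(row[1])))
    return [content for content, _ in ranked[:top_k]]


# Placeholder data, not real wines, prices, or scores.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks "
    "(content TEXT, style TEXT, price_eur REAL, score INTEGER, embedding TEXT)"
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
    [
        ("wine A", "Blanc de Blancs", 80.0, 94, json.dumps([0.1, 0.9])),
        ("wine B", "Blanc de Blancs", 150.0, 96, json.dumps([0.1, 0.8])),
        ("wine C", "Blanc de Blancs", 60.0, 93, json.dumps([0.7, 0.2])),
        ("wine D", "Rosé", 70.0, 95, json.dumps([0.1, 0.9])),
    ],
)
results = hybrid_search(conn, [0.0, 1.0], "Blanc de Blancs", max_price=100, min_score=92)
```

In this toy data, "wine B" is excluded by price and "wine D" by style before any vector comparison happens, which is the main benefit of filtering first.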
- Use Haiku for both stages — fast and cheap, sufficient for this task
- Stage 1 can process an entire document in one call (return all segment boundaries)
- Stage 2 can batch multiple short units per call
- One-time cost per upload; amortized across all future queries
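Batching short units for Stage 2 could look like the Python sketch below. `batch_units` is a hypothetical helper, and the character budget is a crude stand-in for a token budget, which would depend on the chosen model.

```python
def batch_units(units: list[str], max_chars: int = 4000) -> list[list[str]]:
    """Group consecutive units into batches whose combined length stays within
    max_chars. A unit longer than max_chars gets a batch of its own."""
    batches, current, size = [], [], 0
    for unit in units:
        # Flush the current batch if adding this unit would exceed the budget.
        if current and size + len(unit) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(unit)
        size += len(unit)
    if current:
        batches.append(current)
    return batches
```

Each batch then maps to one enrichment call, with the units kept in order so positions can be assigned afterwards.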
- `extract_knowledge_instructions`: still useful as knowledge-level metadata
- Async ingestion via `Task.Supervisor`: just add LLM stages inside the async task
- Spread retrieval (position ±1): still valuable even with better chunks
- Dual-scope design (operator + context) — flexible, keep as-is
Replace Embedding.chunk(entire_content, @embedding_model) with:
- LLM-driven segmentation (natural units, not byte splits)
- Per-unit `embedding_text` generation
- Embed `embedding_text`, store original `content`