Created
April 18, 2026 13:57
| # AI Knowledge Assistant – Product + Documents RAG Platform | |
| ## Purpose | |
| Build a multi-tenant AI knowledge assistant that can: | |
| - answer questions using organization-specific content | |
| - use both structured product content and uploaded PDFs | |
| - ingest website content from WordPress via REST API | |
| - keep content fresh via sync jobs and webhooks | |
| - provide full traceability for every answer | |
| - allow admins to manage what the AI knows | |
| - support an internal admin testing experience | |
| - support embedding on external websites such as WordPress | |
| This is **not** just a chatbot UI. It is an **admin-controlled AI knowledge system** with ingestion, synchronization, retrieval, traceability, and embed capabilities. | |
| --- | |
| ## Phase 1 Scope | |
| Phase 1 must include: | |
| 1. Authentication using Better Auth | |
| 2. Admin panel | |
| 3. Organization-scoped knowledge management | |
| 4. PDF upload and ingestion | |
| 5. WordPress content ingestion via REST API | |
| 6. WooCommerce product ingestion | |
| 7. Webhook-driven freshness where possible | |
| 8. Background workers for ingestion/re-indexing | |
| 9. Chat testing inside admin | |
| 10. Public/full-page embed endpoint for website use | |
| 11. Traceable answers that show exactly which sources were used | |
| Phase 1 does **not** need: | |
| - complex analytics dashboards | |
| - billing/subscriptions | |
| - role hierarchies beyond admin/member, unless they prove necessary | |
| - fine-tuning custom models | |
| - direct live querying of WordPress or WooCommerce during user chat | |
| - long-term conversation memory or summarization (a short sliding window of recent turns IS included — see Extensions §8) | |
| - direct editing of chunk-level embeddings | |
| - multi-language support (deferred; admin and end users operate in a single language) | |
| --- | |
| ## Core Product Concept | |
| The system should be modeled around **knowledge**, not around a specific industry. | |
| Do **not** design the product around WooCommerce products only. | |
| The correct abstraction is: | |
| ```text | |
| Sources → Documents → Chunks → Retrieval → LLM Answer | |
| ``` | |
| This lets the same product work for: | |
| - a WooCommerce store with product descriptions | |
| - an events business with event descriptions | |
| - a plugin company with technical docs and PDFs | |
| - a blog-heavy website with article search and guidance | |
| Admins should feel they are managing **what the AI knows**, not embeddings or vector internals. | |
| --- | |
| ## High-Level Architecture | |
| ```text | |
| Next.js App | |
| ├── Admin UI | |
| ├── Public Chat Page | |
| ├── Embedded Chat Page | |
| ├── API Routes | |
| └── Better Auth | |
| Background Worker | |
| ├── PDF ingestion jobs | |
| ├── WordPress sync jobs | |
| ├── WooCommerce sync jobs | |
| ├── Re-index jobs | |
| └── webhook follow-up jobs | |
| PostgreSQL | |
| ├── auth tables | |
| ├── sources | |
| ├── documents | |
| ├── chunks | |
| ├── products or knowledge records | |
| ├── chat logs / retrieval logs | |
| └── settings | |
| Redis + BullMQ | |
| ├── ingestion queue | |
| ├── sync queue | |
| ├── webhook queue | |
| └── reindex queue | |
| LLM Provider | |
| └── OpenAI | |
| ``` | |
| --- | |
| ## Retrieval Model (RAG) | |
| This product uses **Retrieval-Augmented Generation (RAG)**. | |
| ### Correct RAG flow | |
| 1. User asks a question | |
| 2. The question is converted to an embedding vector | |
| 3. The system searches stored chunk embeddings in Postgres using pgvector | |
| 4. The system returns the most relevant chunks | |
| 5. Those chunks, along with relevant metadata, are sent to the LLM | |
| 6. The LLM generates an answer using only that context | |
| 7. The API returns: | |
| - the answer | |
| - the list of sources/chunks used | |
| ### Important rule | |
| Do **not** fetch live WordPress pages or live WooCommerce products during chat requests. | |
| All external content must be ingested ahead of time into our database. | |
| That means: | |
| - WordPress REST API is a **content source** | |
| - WooCommerce REST API or webhooks are **content sources** | |
| - PDFs are **content sources** | |
| But the chat system always queries **our own database**, not the remote systems directly. | |
| --- | |
| ## Vector Search – Detailed Explanation | |
| Claude should understand exactly how vector search is expected to work. | |
| ### Goal | |
| Vector search finds text by **semantic meaning**, not just exact keywords. | |
| Example: | |
| - User asks: “Which plugin helps me reward customers for inviting friends?” | |
| - The system may retrieve chunks mentioning: | |
| - referral programs | |
| - invite friends | |
| - customer rewards | |
| - affiliate-like cashback for referrals | |
| Even if the exact wording does not match. | |
| ### Embeddings | |
| Every chunk of text is converted into a vector using an embedding model. | |
| Example: | |
| ```text | |
| "Referral rewards are given after the referred order is completed" | |
| → [0.123, -0.918, 0.442, ...] | |
| ``` | |
| The same is done for the user question. | |
| ### Storage | |
| Each chunk record stores: | |
| - content text | |
| - embedding vector | |
| - document ID | |
| - metadata | |
| ### Query flow | |
| When the user asks a question: | |
| 1. Generate embedding for the question | |
| 2. Compare that vector against stored chunk vectors in Postgres | |
| 3. Return the nearest chunks using pgvector similarity search | |
| 4. Send only the top relevant chunks to the LLM | |
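To make step 3 concrete, here is a minimal in-memory sketch of what a nearest-neighbour search does. In the real system pgvector performs this inside Postgres (its `<=>` operator computes cosine distance); the names `StoredChunk`, `cosineDistance`, and `nearestChunks` are illustrative assumptions, not part of any library.

```typescript
// Illustrative only: an in-memory stand-in for a pgvector similarity query.
interface StoredChunk {
  id: string;
  content: string;
  embedding: number[];
}

interface ScoredChunk extends StoredChunk {
  distance: number; // cosine distance: 0 = same direction, 2 = opposite
}

function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero-length vector");
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and keep the k closest.
function nearestChunks(query: number[], chunks: StoredChunk[], k: number): ScoredChunk[] {
  return chunks
    .map((c) => ({ ...c, distance: cosineDistance(query, c.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}
```

A brute-force scan like this is only useful for unit tests and for reasoning about scores; production retrieval should stay in Postgres.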
| ### Why chunking matters | |
| Do **not** embed whole PDFs or whole pages as one large unit. | |
| Instead: | |
| - split content into chunks of around 500–800 tokens | |
| - include small overlap between chunks | |
| This improves: | |
| - retrieval precision | |
| - traceability | |
| - answer quality | |
| ### What vector search returns | |
| Vector search returns the top relevant chunks, typically 3–8 depending on tuning. | |
| Each returned result should include: | |
| - chunk content | |
| - score/distance | |
| - document title | |
| - document ID | |
| - source type | |
| - source URL if available | |
| - external ID if available | |
| ### Important distinction | |
| - Vector search = retrieval | |
| - LLM = answer generation | |
| The vector database does **not** answer questions by itself. | |
| It only returns the most relevant context. | |
| ### Quality rules | |
| - only search active documents | |
| - always filter by organization (required for tenant isolation) | |
| - optionally filter by source type | |
| - optionally boost product-type documents if the mode is product recommendation | |
| - return metadata with every retrieval result for debugging and traceability | |
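One possible way to express these rules is a small query builder that produces parameterized SQL. This is a sketch under assumptions: the table and column names follow the suggested data model in this document, the `<=>` operator is pgvector's cosine distance, and `buildChunkSearchQuery` and its option names are hypothetical.

```typescript
interface SearchOptions {
  organizationId: string;
  queryEmbedding: number[];
  limit: number;
  documentTypes?: string[]; // e.g. ["woo_product"] to narrow by document type
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

function buildChunkSearchQuery(opts: SearchOptions): SqlQuery {
  // pgvector accepts a bracketed text literal such as "[0.1,0.2]" cast to vector.
  const values: unknown[] = [`[${opts.queryEmbedding.join(",")}]`, opts.organizationId];
  // Tenant scoping and the active-only rule are always applied.
  const where = ["c.organization_id = $2", "d.status = 'active'"];
  if (opts.documentTypes && opts.documentTypes.length > 0) {
    values.push(opts.documentTypes);
    where.push(`d.type = ANY($${values.length}::text[])`);
  }
  values.push(opts.limit);
  const text = `
    SELECT c.id AS chunk_id, c.content, c.metadata,
           d.id AS document_id, d.title, d.type, d.source_url, d.external_id,
           c.embedding <=> $1::vector AS distance
    FROM chunks c
    JOIN documents d ON d.id = c.document_id
    WHERE ${where.join(" AND ")}
    ORDER BY c.embedding <=> $1::vector
    LIMIT $${values.length}`;
  return { text, values };
}
```

Returning `{ text, values }` keeps the SQL visible in logs and in the admin debug panel, which fits the debuggability goal.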
| --- | |
| ## Why This Is a Knowledge Platform, Not a Hardcoded Product Chatbot | |
| The data model and admin UI must be generic enough so that one organization can use it for: | |
| - WooCommerce products | |
| - WordPress posts/pages | |
| - uploaded PDFs | |
| - pasted manual knowledge | |
| Another organization may use it for: | |
| - event descriptions | |
| - help center content | |
| - manuals | |
| - training documents | |
| Therefore, the frontend admin should not primarily display “products” as the central abstraction. | |
| The central abstraction should be: | |
| - Sources | |
| - Documents | |
| - Test Chat | |
| A WooCommerce product is just one type of document. | |
| A WordPress page is just one type of document. | |
| A PDF is just one type of document. | |
| --- | |
| ## Tech Stack | |
| ### Required | |
| - Next.js (App Router) | |
| - Node.js | |
| - PostgreSQL | |
| - pgvector | |
| - Better Auth | |
| - Redis | |
| - BullMQ | |
| - OpenAI API | |
| ### Optional utilities | |
| - pdf-parse or equivalent for PDFs | |
| - html-to-text or equivalent for stripping HTML | |
| - zod for validation | |
| - drizzle or direct SQL (avoid heavy abstraction if possible) | |
| ### Recommendation | |
| Do not over-abstract the core retrieval logic behind a large framework if it reduces debuggability. | |
| Keep these steps explicit: | |
| - embed() | |
| - searchChunks() | |
| - buildPrompt() | |
| - callLLM() | |
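A minimal sketch of how those four steps might be wired together with injected dependencies, so each step can be logged, swapped, or faked in tests. The interface shapes and the name `answerQuestion` are assumptions for illustration; the real functions would call OpenAI and Postgres.

```typescript
interface RetrievedChunk {
  chunkId: string;
  documentId: string;
  title: string;
  content: string;
  distance: number;
}

// Each step is an explicit, replaceable function rather than framework magic.
interface RagDeps {
  embed(text: string): Promise<number[]>;
  searchChunks(orgId: string, embedding: number[], limit: number): Promise<RetrievedChunk[]>;
  buildPrompt(question: string, chunks: RetrievedChunk[]): string;
  callLLM(prompt: string): Promise<string>;
}

interface RagAnswer {
  answer: string;
  sources: RetrievedChunk[];
}

async function answerQuestion(
  deps: RagDeps,
  orgId: string,
  question: string,
  limit = 5,
): Promise<RagAnswer> {
  const embedding = await deps.embed(question);
  const chunks = await deps.searchChunks(orgId, embedding, limit);
  const prompt = deps.buildPrompt(question, chunks);
  const answer = await deps.callLLM(prompt);
  // The retrieved chunks are returned alongside the answer for traceability.
  return { answer, sources: chunks };
}
```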
| --- | |
| ## Multi-Tenancy | |
| The platform should be organization-scoped from the start. | |
| Every important record must belong to an organization: | |
| - sources | |
| - documents | |
| - chunks | |
| - settings | |
| - chat tests / logs | |
| - webhook configs | |
| - sync runs | |
| Even if Phase 1 only has a few organizations, this is the correct foundation. | |
| --- | |
| ## Authentication | |
| Use **Better Auth** with email/password and database-backed sessions. | |
| ### Auth requirements | |
| - login page | |
| - protected admin routes | |
| - organization-aware user access | |
| - role field available for future use | |
| ### User creation | |
| For now, users may be inserted manually into the database or seeded via a script. | |
| No public registration flow is required unless explicitly added later. | |
| --- | |
| ## Core Data Model | |
| The most important design decision is to introduce a top-level **Source** abstraction. | |
| ### Sources | |
| A source represents where knowledge comes from. | |
| Examples: | |
| - WooCommerce source | |
| - WordPress source | |
| - PDF source collection | |
| - Manual content source | |
| Suggested fields: | |
| - id | |
| - organization_id | |
| - name | |
| - type (`woocommerce`, `wordpress`, `pdf`, `manual`) | |
| - status (`active`, `disabled`, `syncing`, `error`) | |
| - config JSONB | |
| - last_sync_at | |
| - created_at | |
| - updated_at | |
| Examples of `config`: | |
| - WordPress base URL, auth, selected post types | |
| - WooCommerce store URL and API credentials | |
| - PDF settings if needed | |
| - manual source metadata | |
| ### Documents | |
| Documents are the normalized content records that the AI actually knows about. | |
| Every ingested thing becomes a document. | |
| Examples: | |
| - a WooCommerce product description | |
| - a WordPress page | |
| - a blog article | |
| - a PDF file | |
| - a manually created knowledge entry | |
| Suggested fields: | |
| - id | |
| - organization_id | |
| - source_id | |
| - title | |
| - type (`woo_product`, `wp_post`, `wp_page`, `pdf`, `manual`, `event`, etc.) | |
| - status (`active`, `disabled`, `draft`, `syncing`, `error`, `deleted`) | |
| - source_url | |
| - external_id | |
| - raw_content | |
| - normalized_content | |
| - override_content (nullable; for future editable overrides) | |
| - sync_hash | |
| - last_synced_at | |
| - metadata JSONB | |
| - created_at | |
| - updated_at | |
| Notes: | |
| - `external_id` stores remote IDs like Woo product ID or WP post ID | |
| - `source_url` stores the public URL if available | |
| - `raw_content` may contain HTML or extracted text | |
| - `normalized_content` is the cleaned text used for chunking | |
| - `override_content` is optional for later, allowing admin-written replacements | |
| - `sync_hash` helps detect content changes | |
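As an illustration of how `sync_hash` could work, the sketch below hashes the normalized content with SHA-256 and compares it to the stored value. The whitespace canonicalization and the function names `computeSyncHash` and `needsReindex` are assumptions, not a prescribed design.

```typescript
import { createHash } from "node:crypto";

function computeSyncHash(normalizedContent: string): string {
  // Collapse whitespace so purely cosmetic changes do not force a re-embed.
  const canonical = normalizedContent.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// A document with no stored hash (first sync) always needs indexing.
function needsReindex(storedHash: string | null, incomingContent: string): boolean {
  return storedHash !== computeSyncHash(incomingContent);
}
```

Skipping unchanged documents this way avoids paying for embeddings on every scheduled sync.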
| ### Chunks | |
| Chunks are the searchable retrieval units. | |
| Suggested fields: | |
| - id | |
| - organization_id | |
| - document_id | |
| - content | |
| - embedding vector | |
| - chunk_index | |
| - token_count | |
| - metadata JSONB | |
| - created_at | |
| Chunk metadata may include: | |
| - title | |
| - source_type | |
| - source_url | |
| - external_id | |
| - document_type | |
| - product SKU if applicable | |
| - product/category tags if applicable | |
| ### Settings | |
| Organization-level assistant settings. | |
| Suggested fields: | |
| - id | |
| - organization_id | |
| - system_prompt | |
| - mode (`recommendation`, `support`, `search`) | |
| - retrieval_limit | |
| - response_style | |
| - created_at | |
| - updated_at | |
| ### Retrieval Logs / Chat Logs | |
| These are important for debugging and trust. | |
| Suggested fields: | |
| - id | |
| - organization_id | |
| - user_id nullable | |
| - session_id nullable | |
| - query | |
| - final_answer | |
| - used_document_ids JSONB | |
| - used_chunk_ids JSONB | |
| - retrieval_debug JSONB | |
| - mode | |
| - created_at | |
| These logs let us inspect why the assistant answered the way it did. | |
| --- | |
| ## Product-Specific Knowledge vs Generic Knowledge | |
| Do not create a completely separate system for products. | |
| Instead, products should be represented as documents with structured metadata. | |
| For example, a WooCommerce product document can include metadata like: | |
| - product_id | |
| - sku | |
| - price | |
| - categories | |
| - tags | |
| - permalink | |
| - stock status if needed | |
| - short description | |
| - full description | |
| This lets retrieval work across: | |
| - product descriptions | |
| - PDFs | |
| - manuals | |
| - blog posts | |
| In future, ranking can prefer products when the question appears commercial. | |
| --- | |
| ## Ingestion Sources | |
| Phase 1 needs these source types: | |
| ### 1. PDF Upload | |
| Admins can upload PDFs in the admin panel. | |
| Flow: | |
| 1. upload PDF | |
| 2. save file (local or object storage) | |
| 3. create document record | |
| 4. extract text | |
| 5. clean text | |
| 6. chunk text | |
| 7. create embeddings | |
| 8. store chunks | |
| ### 2. WordPress Content via REST API | |
| Use the WordPress REST API as an ingestion source. | |
| Do **not** query it live during user chat. | |
| Use it to fetch content on sync. | |
| Endpoints may include: | |
| - `/wp-json/wp/v2/posts` | |
| - `/wp-json/wp/v2/pages` | |
| - optionally custom post types | |
| The content returned is usually HTML and must be cleaned into text. | |
| We should support: | |
| - manual sync | |
| - scheduled sync | |
| - webhook-triggered sync where possible | |
| ### 3. WooCommerce Products | |
| Use WooCommerce as an ingestion source for product knowledge. | |
| Product content can come from: | |
| - name | |
| - short description | |
| - full description | |
| - attributes | |
| - categories/tags | |
| - possibly FAQ/meta if needed later | |
| We should support: | |
| - initial full sync | |
| - incremental sync | |
| - webhook-triggered updates | |
| ### 4. Manual Knowledge Entries | |
| Admin can create knowledge entries directly inside the app. | |
| Useful for: | |
| - support notes | |
| - “things the AI should say” | |
| - event descriptions | |
| - internal definitions | |
| These become documents and get chunked/indexed like any other source. | |
| --- | |
| ## Content Normalization | |
| All content sources must be normalized before chunking. | |
| ### PDF normalization | |
| - extract text | |
| - remove repeated headers/footers when possible | |
| - normalize whitespace | |
| ### WordPress normalization | |
| - fetch rendered HTML from REST API | |
| - remove HTML tags | |
| - strip navigation-like artifacts if present | |
| - normalize whitespace | |
| - preserve useful headings if possible | |
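The sketch below shows the intent of these steps with plain regular expressions. It is deliberately naive and only an assumption of what the cleanup could look like; a real implementation should use html-to-text or an equivalent parser, since regexes do not handle malformed or deeply nested HTML well. `htmlToText` is a hypothetical helper name.

```typescript
function htmlToText(html: string): string {
  const entities: Record<string, string> = {
    "&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&nbsp;": " ",
  };
  return html
    // Drop blocks whose content is never useful as knowledge.
    .replace(/<(script|style|nav)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Keep headings visible as markdown-style lines.
    .replace(
      /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_m: string, level: string, inner: string) => `\n\n${"#".repeat(Number(level))} ${inner}\n\n`,
    )
    // Turn block-level closers and line breaks into paragraph breaks.
    .replace(/<\/(p|div|li|tr)>|<br\s*\/?>/gi, "\n\n")
    // Remove all remaining tags.
    .replace(/<[^>]+>/g, "")
    // Decode a handful of common entities in a single pass.
    .replace(/&(amp|lt|gt|quot|#39|nbsp);/g, (m) => entities[m] ?? m)
    // Normalize whitespace.
    .replace(/[ \t]+/g, " ")
    .replace(/ ?\n ?/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```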
| ### WooCommerce normalization | |
| - combine selected product fields into a single normalized text representation | |
| - preserve product name prominently | |
| - optionally include structured metadata in metadata JSONB, not necessarily inline in text | |
| ### Manual content normalization | |
| - save text as-is after basic cleanup | |
| --- | |
| ## Chunking Strategy | |
| Suggested defaults: | |
| - chunk target size: 500–800 tokens | |
| - overlap: 50–100 tokens | |
| Each document is split into ordered chunks. | |
| Store `chunk_index` to preserve sequence. | |
| The chunking utility should aim to split on logical boundaries where possible: | |
| - headings | |
| - paragraphs | |
| - list boundaries | |
| Avoid splitting in the middle of sentences if possible. | |
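A minimal chunker sketch following these defaults. It makes two simplifying assumptions that should be stated loudly: tokens are approximated by whitespace-separated words (a real tokenizer would give different counts), and only paragraph boundaries are honored, not headings or sentences. `chunkText` and `TextChunk` are hypothetical names.

```typescript
interface TextChunk {
  chunkIndex: number;
  content: string;
  tokenCount: number; // approximate: counted in words, not model tokens
}

function chunkText(text: string, maxTokens = 650, overlapTokens = 75): TextChunk[] {
  if (overlapTokens >= maxTokens) throw new Error("overlap must be smaller than maxTokens");
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter((p) => p.length > 0);
  const chunks: TextChunk[] = [];
  let current: string[] = []; // words in the chunk being built
  let fresh = 0; // words in `current` that are not carried-over overlap

  const flush = (): void => {
    if (fresh === 0) return; // nothing new since the last chunk
    chunks.push({ chunkIndex: chunks.length, content: current.join(" "), tokenCount: current.length });
    // Seed the next chunk with the tail of this one.
    current = overlapTokens > 0 ? current.slice(-overlapTokens) : [];
    fresh = 0;
  };

  for (const paragraph of paragraphs) {
    const words = paragraph.split(/\s+/);
    // Prefer breaking on a paragraph boundary when the paragraph does not fit.
    if (fresh > 0 && current.length + words.length > maxTokens) flush();
    for (const word of words) {
      if (current.length >= maxTokens) flush(); // oversized paragraph: hard split
      current.push(word);
      fresh++;
    }
  }
  flush();
  return chunks;
}
```

Note that this sketch joins words with single spaces, so paragraph breaks inside a chunk are flattened.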
| --- | |
| ## Embedding Strategy | |
| Use OpenAI embeddings. | |
| Suggested initial model: | |
| - `text-embedding-3-small` | |
| Rules: | |
| - generate embeddings for every chunk | |
| - generate embedding for every user query | |
| - store embeddings in pgvector | |
| - re-embed chunks when content changes | |
| --- | |
| ## Retrieval Strategy | |
| The retrieval pipeline should be explicit and debuggable. | |
| ### Standard retrieval flow | |
| 1. load organization settings | |
| 2. embed the user query | |
| 3. query chunks within the same organization | |
| 4. restrict to active documents only | |
| 5. optionally filter by mode or source type | |
| 6. return top relevant chunks with metadata | |
| 7. build prompt from those chunks | |
| 8. call LLM | |
| 9. return answer and sources | |
| ### Retrieval rules | |
| - default top-k should be configurable | |
| - retrieval results should include similarity score/distance | |
| - future ranking can bias: | |
| - product docs for recommendation mode | |
| - PDF/manual docs for support mode | |
| - blog/article docs for search mode | |
| ### Important rule | |
| Never send too much context to the model. | |
| Prefer a curated set of top chunks over large text dumps. | |
| --- | |
| ## Traceability Requirements | |
| This is a core product requirement. | |
| Every answer should be explainable. | |
| ### API response should include | |
| - answer text | |
| - source list | |
| - optionally retrieval debug data for admin-only test mode | |
| ### Source item structure | |
| Each source item should include at least: | |
| - document_id | |
| - document_title | |
| - chunk_id | |
| - source_type | |
| - source_url if available | |
| - snippet excerpt | |
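For illustration, the source item shape and a mapper from retrieval rows might look like the sketch below. The field names mirror the list above; `RetrievalRow`, `toSourceItems`, and the 160-character snippet length are assumptions.

```typescript
interface RetrievalRow {
  chunk_id: string;
  document_id: string;
  title: string;
  source_type: string;
  source_url: string | null;
  content: string;
}

interface SourceItem {
  document_id: string;
  document_title: string;
  chunk_id: string;
  source_type: string;
  source_url: string | null;
  snippet: string;
}

function toSourceItems(rows: RetrievalRow[], snippetLength = 160): SourceItem[] {
  return rows.map((row) => {
    // Flatten whitespace so the excerpt renders on one line.
    const flat = row.content.replace(/\s+/g, " ").trim();
    const snippet =
      flat.length > snippetLength ? flat.slice(0, snippetLength).trimEnd() + "…" : flat;
    return {
      document_id: row.document_id,
      document_title: row.title,
      chunk_id: row.chunk_id,
      source_type: row.source_type,
      source_url: row.source_url,
      snippet,
    };
  });
}
```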
| ### Admin test mode should additionally show | |
| - similarity score or rank | |
| - full retrieved chunks | |
| - which prompt mode was used | |
| - final prompt preview if needed for debugging | |
| - whether content came from products, pages, PDFs, or manual knowledge | |
| --- | |
| ## Freshness and Sync Strategy | |
| We need both sync jobs and webhooks. | |
| ### Why both are needed | |
| - sync jobs provide a reliable reconciliation baseline that catches anything webhooks miss | |
| - webhooks give near-real-time freshness | |
| ### WooCommerce freshness | |
| Register product-related webhooks where appropriate, such as: | |
| - product.created | |
| - product.updated | |
| - possibly product.deleted | |
| On webhook: | |
| 1. validate webhook | |
| 2. identify product | |
| 3. fetch latest product data if necessary | |
| 4. upsert corresponding document | |
| 5. delete old chunks | |
| 6. regenerate chunks and embeddings | |
| Also support scheduled reconciliation sync to catch missed webhooks. | |
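Step 1 ("validate webhook") can be sketched as follows. WooCommerce signs each delivery with a base64-encoded HMAC-SHA256 of the raw request body, sent in the `X-WC-Webhook-Signature` header. The function name `isValidWooSignature` is an assumption, and the caller is assumed to pass the raw, unparsed body string.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

function isValidWooSignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader) return false;
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHeader, "base64");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The signature must be checked against the exact bytes received; re-serializing parsed JSON before hashing will usually produce a mismatch.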
| ### WordPress freshness | |
| WordPress does not provide a strong native webhook system by default. | |
| Recommended approaches: | |
| #### Preferred | |
| Use a plugin such as WP Webhooks to notify our system when posts/pages change. | |
| #### Fallback | |
| Run scheduled sync using REST API and compare `modified` timestamps or a content hash. | |
| On webhook or sync update: | |
| 1. find changed page/post | |
| 2. fetch latest content from REST API | |
| 3. upsert document | |
| 4. replace chunks | |
| ### PDF freshness | |
| PDFs are manually managed. | |
| Freshness actions: | |
| - upload new file | |
| - replace file | |
| - delete/disable file | |
| - re-index file | |
| --- | |
| ## Admin UX Philosophy | |
| Admins should manage **knowledge**, not vectors. | |
| Do not expose embeddings or low-level AI jargon as the primary interface. | |
| The core admin experience should center around: | |
| 1. Sources | |
| 2. Documents | |
| 3. Test Chat | |
| 4. Settings | |
| --- | |
| ## Admin Frontend Information Architecture | |
| ### 1. Dashboard | |
| Simple overview: | |
| - total active sources | |
| - total active documents | |
| - total chunks | |
| - last sync status | |
| - recent ingestion errors | |
| - quick links to test chat and manage content | |
| ### 2. Sources Page | |
| Purpose: manage integrations and knowledge sources. | |
| Display a list of sources such as: | |
| - WooCommerce Store | |
| - WordPress Site | |
| - PDFs | |
| - Manual Content | |
| Each source card or row should show: | |
| - source name | |
| - type | |
| - status | |
| - last sync time | |
| - sync health | |
| - actions: | |
| - sync now | |
| - configure | |
| - disable | |
| - view documents | |
| Important: sources are the top-level admin abstraction for data origin. | |
| ### 3. Documents Page | |
| Purpose: manage actual knowledge items. | |
| This is the most important page. | |
| Table/list fields: | |
| - title | |
| - source | |
| - type | |
| - status | |
| - last synced | |
| - actions | |
| Actions: | |
| - view detail | |
| - enable/disable | |
| - resync/reindex | |
| - delete | |
| - optionally edit (future) | |
| This page must work well across industries. | |
| Examples visible in the same UI: | |
| - a WooCommerce product description | |
| - a WordPress blog article | |
| - a PDF manual | |
| - a manual knowledge note | |
| - an event description | |
| Do not hardcode the page title or UX around “products” only. | |
| ### 4. Document Detail Page | |
| Purpose: inspect what the assistant knows for a specific item. | |
| Should show: | |
| - title | |
| - source info | |
| - source type | |
| - source URL | |
| - external ID | |
| - last synced | |
| - document status | |
| Sections: | |
| #### Content Preview | |
| Show normalized content preview. | |
| #### Chunks | |
| Show chunk list in order. | |
| For each chunk, display: | |
| - chunk index | |
| - snippet | |
| - token count if available | |
| - metadata | |
| - optional embedding debug only if needed for internal dev | |
| #### Actions | |
| - reindex | |
| - disable | |
| - delete | |
| - future: override/edit content | |
| This page is crucial for debugging misinformation. | |
| ### 5. Test Chat Page | |
| This is a must-have for Phase 1. | |
| Purpose: | |
| - allow admin to test the assistant before embedding publicly | |
| - inspect sources used | |
| - validate whether retrieval quality is good | |
| UI should include: | |
| - chat input | |
| - response output | |
| - sources panel | |
| - debug panel | |
| Debug panel may show: | |
| - retrieved documents | |
| - retrieved chunks | |
| - similarity scores | |
| - prompt mode | |
| - final prompt preview (optional) | |
| - source types involved | |
| This is where traceability becomes usable. | |
| ### 6. Settings Page | |
| At minimum include: | |
| - system prompt editor | |
| - mode selector | |
| - retrieval limit | |
| - maybe source preferences later | |
| The system prompt must be editable in admin. | |
| --- | |
| ## Should Admin Be Able to Edit Content? | |
| ### Recommendation for Phase 1 | |
| Use **read-only synced documents** plus **manual knowledge entries**. | |
| That means: | |
| - WooCommerce and WordPress content are primarily synced from source | |
| - admin can enable/disable/reindex them | |
| - admin can add manual entries for exceptions or important clarifications | |
| ### Optional future feature | |
| Add `override_content` to documents so admin can override imported content without editing the source platform. | |
| But this does not need to be in MVP unless explicitly requested. | |
| --- | |
| ## Public Chat / Embed Strategy | |
| The system must support both: | |
| 1. admin-only test chat | |
| 2. public/full-page embedded chat | |
| ### Recommended Phase 1 embed approach | |
| Expose a public full-page chat route that can be embedded via iframe into WordPress. | |
| Example: | |
| ```html | |
| <iframe src="https://app.example.com/embed/{organization-or-assistant-key}"></iframe> | |
| ``` | |
| This is much simpler and faster than building a full JavaScript widget first. | |
| ### Phase 1 also includes a JS widget | |
| In addition to the full-page iframe, ship a lightweight JavaScript widget: | |
| ```html | |
| <script src="https://app.example.com/widget.js?key=..."></script> | |
| ``` | |
| The widget renders a floating chat bubble that opens into a chat panel on any page. Both embed modes (iframe and widget) point at the same public chat API. See Extensions §11 for detailed widget requirements. | |
| ### Why not just give WP an API endpoint? | |
| That is possible, but then the WordPress side must build and maintain the UI. | |
| For MVP, our app should own the chat UI. | |
| --- | |
| ## Public Assistant Security Model | |
| Public embed routes should not expose organization internals. | |
| Use an assistant/public token or embed key tied to an organization or assistant configuration. | |
| The public route should know: | |
| - which organization’s data to query | |
| - which prompt/settings to use | |
| Do not use the admin session for embeds. | |
| --- | |
| ## Prompting Strategy | |
| The system prompt must be organization-configurable. | |
| ### Initial modes | |
| - `recommendation` | |
| - `support` | |
| - `search` | |
| ### Example behavior | |
| #### recommendation | |
| Favor recommending relevant products/services when available. | |
| #### support | |
| Favor accurate technical/support answers from PDFs/manual docs. | |
| #### search | |
| Favor website/blog/article discovery and concise answers with references. | |
| ### Prompt rules | |
| The prompt should emphasize: | |
| - only use provided context | |
| - if the answer is not in context, say so | |
| - do not fabricate | |
| - keep answer aligned with chosen mode | |
| - cite or summarize relevant source-backed details | |
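One way these rules could be turned into a prompt is sketched below. The exact wording of the rules and mode hints, and the names `buildPrompt`, `PromptChunk`, and `MODE_HINTS`, are assumptions for illustration; the organization's editable system prompt is prepended unchanged.

```typescript
type Mode = "recommendation" | "support" | "search";

interface PromptChunk {
  title: string;
  sourceUrl: string | null;
  content: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

const MODE_HINTS: Record<Mode, string> = {
  recommendation: "Favor recommending relevant products or services when the context contains them.",
  support: "Favor accurate technical and support answers.",
  search: "Favor concise answers that point to the relevant pages or articles.",
};

function buildPrompt(systemPrompt: string, mode: Mode, question: string, chunks: PromptChunk[]): ChatMessage[] {
  // Number each chunk so the model can refer back to it.
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title}${c.sourceUrl ? ` (${c.sourceUrl})` : ""}\n${c.content}`)
    .join("\n\n");
  const rules = [
    "Answer using only the numbered context below.",
    "If the answer is not in the context, say that you do not know.",
    "Do not fabricate details.",
    "Refer to sources by their number, for example [1].",
    MODE_HINTS[mode],
  ].join("\n");
  return [
    { role: "system", content: `${systemPrompt}\n\n${rules}` },
    { role: "user", content: `Context:\n${context || "(no relevant context found)"}\n\nQuestion: ${question}` },
  ];
}
```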
| --- | |
| ## Background Jobs | |
| Use BullMQ workers for non-trivial operations. | |
| Required jobs: | |
| - ingest-pdf | |
| - sync-wordpress | |
| - sync-woocommerce | |
| - reindex-document | |
| - process-webhook | |
| Do not block the request-response cycle with heavy embedding work if avoidable. | |
| Typical flow: | |
| 1. API request creates source or upload | |
| 2. enqueue job | |
| 3. worker processes | |
| 4. source/document statuses update | |
| --- | |
| ## Recommended Status Fields | |
| ### Source status | |
| - active | |
| - disabled | |
| - syncing | |
| - error | |
| ### Document status | |
| - active | |
| - disabled | |
| - syncing | |
| - error | |
| - deleted | |
| These statuses should drive admin visibility and retrieval filters. | |
| Only active documents should participate in retrieval. | |
| --- | |
| ## Suggested API Surface | |
| This is indicative, not strict. | |
| ### Auth | |
| - `POST /api/auth/...` via Better Auth | |
| ### Sources | |
| - `GET /api/admin/sources` | |
| - `POST /api/admin/sources` | |
| - `GET /api/admin/sources/:id` | |
| - `PUT /api/admin/sources/:id` | |
| - `POST /api/admin/sources/:id/sync` | |
| ### Documents | |
| - `GET /api/admin/documents` | |
| - `GET /api/admin/documents/:id` | |
| - `POST /api/admin/documents/manual` | |
| - `POST /api/admin/documents/:id/reindex` | |
| - `PUT /api/admin/documents/:id/status` | |
| - `DELETE /api/admin/documents/:id` | |
| ### Uploads | |
| - `POST /api/admin/upload/pdf` | |
| ### Webhooks | |
| - `POST /api/webhooks/woocommerce/:sourceId` | |
| - `POST /api/webhooks/wordpress/:sourceId` | |
| ### Chat | |
| - `POST /api/chat` (admin/internal) | |
| - `POST /api/embed/:key/chat` (public embed) | |
| - `GET /embed/:key` (full-page public chat UI) | |
| ### Settings | |
| - `GET /api/admin/settings` | |
| - `PUT /api/admin/settings` | |
| --- | |
| ## WordPress Ingestion Design | |
| ### Source configuration | |
| A WordPress source should support config like: | |
| - base URL | |
| - REST auth if needed | |
| - selected content types: | |
| - posts | |
| - pages | |
| - maybe custom post types later | |
| - sync mode: | |
| - manual | |
| - scheduled | |
| - webhook + scheduled fallback | |
| ### Sync flow | |
| 1. fetch content from REST API | |
| 2. extract: | |
| - id | |
| - title | |
| - slug | |
| - link | |
| - modified date | |
| - content.rendered | |
| 3. clean HTML into normalized text | |
| 4. upsert document | |
| 5. delete old chunks | |
| 6. regenerate chunks + embeddings | |
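Step 2 can be illustrated with a small mapper from a WordPress REST item to a document upsert payload. The sketch uses the `modified_gmt` field for the modified date (WordPress returns it without a timezone suffix, so `Z` is appended); `DocumentUpsert` and `mapWpItemToDocument` are hypothetical names, and HTML cleanup is assumed to happen in a separate normalization step.

```typescript
// Subset of the fields returned by /wp-json/wp/v2/posts and /pages.
interface WpRestItem {
  id: number;
  slug: string;
  link: string;
  modified_gmt: string; // e.g. "2024-05-01T10:00:00" (UTC, no offset)
  title: { rendered: string };
  content: { rendered: string };
}

interface DocumentUpsert {
  organization_id: string;
  source_id: string;
  type: "wp_post" | "wp_page";
  external_id: string;
  title: string;
  source_url: string;
  raw_content: string; // still HTML; normalized later
  remote_modified_at: string; // ISO 8601, UTC
  metadata: { slug: string };
}

function mapWpItemToDocument(
  item: WpRestItem,
  organizationId: string,
  sourceId: string,
  type: "wp_post" | "wp_page",
): DocumentUpsert {
  return {
    organization_id: organizationId,
    source_id: sourceId,
    type,
    external_id: String(item.id),
    title: item.title.rendered,
    source_url: item.link,
    raw_content: item.content.rendered,
    remote_modified_at: new Date(`${item.modified_gmt}Z`).toISOString(),
    metadata: { slug: item.slug },
  };
}
```

Upserting on `(source_id, external_id)` would keep repeated syncs idempotent, though that key choice is also an assumption.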
| ### Live chat rule | |
| Again: the WP REST API is for ingestion only, not runtime Q&A. | |
| --- | |
| ## WooCommerce Ingestion Design | |
| ### Source configuration | |
| A WooCommerce source should support config like: | |
| - store URL | |
| - API credentials | |
| - optional source filters if needed later | |
| ### Sync flow | |
| For each product: | |
| 1. fetch product fields | |
| 2. build normalized text from relevant fields | |
| 3. upsert document with type `woo_product` | |
| 4. replace chunks | |
| Potential product text composition: | |
| - product name | |
| - short description | |
| - full description | |
| - selected attributes or tags if helpful | |
| Do not clutter normalized text with too much raw structured data. | |
| Store structured values in metadata where appropriate. | |
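A sketch of that split between prose and structured values. The `WooProduct` interface is a subset of the WooCommerce REST product fields; descriptions are assumed to be already stripped of HTML, and `buildProductDocument` is a hypothetical helper name.

```typescript
interface WooProduct {
  id: number;
  name: string;
  sku: string;
  price: string;
  permalink: string;
  short_description: string; // assumed already cleaned to plain text
  description: string; // assumed already cleaned to plain text
  categories: { name: string }[];
  tags: { name: string }[];
}

interface ProductDocument {
  normalizedContent: string;
  metadata: Record<string, unknown>;
}

function buildProductDocument(p: WooProduct): ProductDocument {
  const categories = p.categories.map((c) => c.name);
  const tags = p.tags.map((t) => t.name);
  // Prose goes into the searchable text, led by the product name.
  const sections = [
    `Product: ${p.name}`,
    p.short_description.trim(),
    p.description.trim(),
    categories.length > 0 ? `Categories: ${categories.join(", ")}` : "",
  ].filter((s) => s.length > 0);
  return {
    normalizedContent: sections.join("\n\n"),
    // Structured values stay in metadata instead of cluttering the text.
    metadata: { product_id: p.id, sku: p.sku, price: p.price, categories, tags, permalink: p.permalink },
  };
}
```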
| --- | |
| ## Manual Knowledge Design | |
| Manual entries are important because synced source content is not always enough. | |
| Allow admin to create a manual document with: | |
| - title | |
| - content | |
| - type `manual` | |
| - status active/disabled | |
| This gives the organization a way to teach the AI extra information without editing the external platforms. | |
| --- | |
| ## Traceability UX Requirements | |
| Traceability is a core promise. | |
| ### Public chat | |
| Public users may see: | |
| - answer | |
| - concise “Sources” list | |
| ### Admin test chat | |
| Admins should see richer traceability: | |
| - answer | |
| - sources | |
| - retrieved chunks | |
| - similarity rank/score | |
| - prompt mode | |
| - debug metadata | |
| This is what lets the team inspect misinformation and decide whether to: | |
| - disable a document | |
| - fix source content | |
| - add manual knowledge | |
| - adjust prompt/settings | |
| --- | |
| ## What We Reuse Conceptually From the Existing SaaS Architecture | |
| The existing SaaS architecture principles are highly reusable: | |
| ### Reusable principles | |
| - multi-tenant scoping | |
| - source-of-truth mindset | |
| - background jobs | |
| - explicit sync model | |
| - webhook + scheduled reconciliation | |
| - auditability | |
| - admin operational visibility | |
| ### Equivalent mapping | |
| Old inventory-oriented concepts map to AI knowledge concepts like this: | |
| - store integration → source integration | |
| - synced product/order record → document | |
| - stock movement logs → retrieval/chat logs | |
| - sync jobs → ingestion jobs | |
| - health dashboards → source sync status and ingestion status | |
| - admin debug tooling → test chat + source inspection | |
| --- | |
| ## Implementation Philosophy | |
| Claude should build this with these priorities: | |
| 1. clarity over over-engineering | |
| 2. debuggability over magic | |
| 3. source-first knowledge management | |
| 4. generic abstractions instead of industry-specific assumptions | |
| 5. strong admin visibility from day one | |
| The admin should be able to answer: | |
| - What does the AI know? | |
| - Where did that answer come from? | |
| - Which document caused bad information? | |
| - How do I disable or fix it? | |
| - Is my WordPress/WooCommerce content synced and fresh? | |
| --- | |
| ## Deliverables Expected From Claude | |
| Claude should build: | |
| ### Backend / infrastructure | |
| - Better Auth integration | |
| - organization-aware DB schema | |
| - source/document/chunk models | |
| - pgvector support | |
| - ingestion workers | |
| - chat API | |
| - embed/public chat API | |
| - webhook endpoints | |
| - sync logic for WordPress and WooCommerce | |
| ### Admin UI | |
| - login | |
| - dashboard | |
| - sources page | |
| - documents list | |
| - document detail | |
| - test chat | |
| - settings page | |
| ### Public UI | |
| - full-page embeddable chat route | |
| ### Behavior | |
| - PDF upload and ingestion | |
| - WordPress sync via REST API | |
| - WooCommerce sync | |
| - chunking and embedding | |
| - retrieval + answer generation | |
| - traceable source-backed responses | |
| --- | |
| ## Non-Negotiable Rules | |
| 1. Do not query live WordPress or WooCommerce during chat requests | |
| 2. Only query our own indexed data during chat | |
| 3. Always return traceability info for admin testing | |
| 4. Only active documents participate in retrieval | |
| 5. All data must be organization-scoped | |
| 6. Keep retrieval pipeline explicit and inspectable | |
| 7. Build the system as a generic knowledge platform, not a Woo-only chatbot | |
| --- | |
| ## Extensions: Production Requirements for Real Use Cases | |
| This section overrides and extends earlier phase definitions based on the two confirmed Phase 1 clients: | |
| 1. A **WooCommerce store** — product recommendations and commerce Q&A | |
| 2. A **WordPress nightlife events site** — what's on tonight / this weekend / which venue / ticket links | |
| These extensions are **required for Phase 1**, not optional. Where this section conflicts with earlier text, this section wins. | |
| --- | |
| ### 1. Time-Aware Retrieval (events) | |
| Purpose: event content is only useful before the event ends. | |
| #### Required document fields (first-class columns, not JSONB) | |
| - `event_start` (timestamptz, nullable) | |
| - `event_end` (timestamptz, nullable) | |
| - `venue` (text, nullable) | |
| - `city` (text, nullable) | |
| #### Status behavior | |
| - A document with `event_end < now()` auto-transitions to `expired` status. | |
| - `expired` documents are excluded from retrieval by default. | |
| - Expired documents remain visible in admin so recurring events can be restored/extended. | |
| #### Query-time behavior | |
| - Parse temporal intent in the user query *before* vector search: | |
| - "tonight" → today, from now until end of day | |
| - "tomorrow" → next calendar day | |
| - "this weekend" → upcoming Saturday + Sunday | |
| - weekday names → upcoming occurrence of that weekday | |
| - "next week" → Monday–Sunday of next week | |
| - explicit dates → parsed to a range | |
| - If temporal intent is detected: apply `WHERE event_start BETWEEN ... AND ...` *before* vector search. | |
| - If no temporal intent: default to future events only (`event_end >= now()`). | |
| #### Non-event documents | |
| - For documents without `event_start` (products, PDFs, manual knowledge, blog posts), temporal filters are skipped — missing dates count as "always valid", not "expired". | |
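The phrase-to-range rules above can be sketched as a small deterministic parser. This is a minimal illustration, not a complete implementation: the function names (`day_bounds`, `parse_temporal_intent`) are hypothetical, explicit-date parsing is left out, and treating "today" as a valid match for a weekday name or for "this weekend" is an assumption the spec does not settle.

```python
from datetime import datetime, time, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def day_bounds(day):
    """Return (start, end) of the calendar day containing `day`, in its timezone."""
    start = datetime.combine(day.date(), time.min, tzinfo=day.tzinfo)
    return start, start + timedelta(days=1) - timedelta(microseconds=1)


def parse_temporal_intent(query, now):
    """Map a few common phrases to an (event_start_from, event_start_to) window.

    Returns None when no temporal phrase is found, so the caller can fall
    back to the default "future events only" filter.
    """
    q = query.lower()
    if "tonight" in q:
        return now, day_bounds(now)[1]
    if "tomorrow" in q:
        return day_bounds(now + timedelta(days=1))
    if "this weekend" in q:
        if now.weekday() == 6:  # already Sunday: only the rest of today is left
            return now, day_bounds(now)[1]
        # Upcoming Saturday (today if it is Saturday) through Sunday.
        saturday = now + timedelta(days=(5 - now.weekday()) % 7)
        return day_bounds(saturday)[0], day_bounds(saturday + timedelta(days=1))[1]
    if "next week" in q:
        monday = now + timedelta(days=7 - now.weekday())
        return day_bounds(monday)[0], day_bounds(monday + timedelta(days=6))[1]
    for index, name in enumerate(WEEKDAYS):
        if name in q:
            # Upcoming occurrence; today counts if it is that weekday (assumption).
            return day_bounds(now + timedelta(days=(index - now.weekday()) % 7))
    return None
```

The returned window would feed the `event_start BETWEEN ... AND ...` pre-filter. `now` should be passed in the organization's local timezone, since "tonight" in UTC and "tonight" at the venue can differ.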
| --- | |
| ### 2. Custom Post Types, ACF, and Arbitrary Meta Fields | |
| The client is the site administrator on the WP side. Assume they can expose anything via REST. The ingestion config must take advantage of that. | |
| #### WordPress source configuration (extended) | |
| Each WordPress source must let the admin configure: | |
| - Base URL | |
| - Auth (application password or bearer token) | |
| - Selected post types — arbitrary (`post`, `page`, `tribe_events`, `event`, `product`, custom slugs) | |
| - **Per-post-type field mapping:** | |
| - **Fields concatenated into normalized text** (e.g. `title`, `content.rendered`, `acf.description`, `meta.venue_description`) | |
| - **Fields mapped to structured metadata columns** (e.g. `acf.event_date` → `event_start`, `acf.venue_name` → `venue`, `acf.ticket_url` → `metadata.ticket_url`, `acf.price` → `metadata.price`, `_thumbnail` → `primary_image_url`) | |
| - **Fields ignored** | |
| This mapping lives on the source record (JSONB config) and is editable in admin. | |
| #### WooCommerce meta / attributes | |
| Same pattern: admin maps product attributes, ACF, and custom meta into either normalized text or structured metadata. Examples: | |
| - `attributes.color`, `attributes.size` → `metadata.tags` (filterable) | |
| - ACF product fields → text or metadata | |
| - Custom meta (warranty, return policy snippets) → text | |
| - `images[0].src` → `primary_image_url` | |
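A rough sketch of how a stored field mapping might be applied to one REST record during ingestion. The config shape (`text_fields`, `structured_fields`) and the helper names are assumptions made for illustration, and list indexes are written as dotted paths (`images.0.src`) rather than the `images[0].src` notation used above.

```python
def get_path(record, path):
    """Resolve a dotted path such as 'content.rendered' or 'images.0.src'."""
    current = record
    for part in path.split("."):
        if isinstance(current, list):
            if not part.isdigit() or int(part) >= len(current):
                return None
            current = current[int(part)]
        elif isinstance(current, dict):
            current = current.get(part)
        else:
            return None
        if current is None:
            return None
    return current


def apply_field_mapping(record, mapping):
    """Split a source record into normalized text, document columns, and metadata."""
    values = (get_path(record, path) for path in mapping["text_fields"])
    text_parts = [str(value) for value in values if value]
    columns, metadata = {}, {}
    for source_path, target in mapping["structured_fields"].items():
        value = get_path(record, source_path)
        if value is None:
            continue  # missing source fields are skipped, not stored as null
        if target.startswith("metadata."):
            metadata[target[len("metadata."):]] = value
        else:
            columns[target] = value
    return {"text": "\n\n".join(text_parts), "columns": columns, "metadata": metadata}
```

Fields that appear in neither list are ignored implicitly, which matches the "Fields ignored" bucket without needing a third config key.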
| --- | |
| ### 3. Sitemap Ingestion | |
| #### Scope: per-sitemap, not whole-site | |
| Do **not** crawl the whole site from `/sitemap.xml`. That pulls noise: `/cart`, `/checkout`, `/my-account`, archive pages, paginated loops. | |
| Instead: | |
| - Admin specifies one or more specific sitemap URLs. | |
| - Most WP SEO plugins (Yoast, Rank Math, SEOPress) split sitemaps per post type: | |
| - `sitemap-posts.xml` | |
| - `sitemap-pages.xml` | |
| - `sitemap-products.xml` | |
| - `sitemap-events.xml` / `sitemap-tribe_events.xml` | |
| - Admin picks exactly which sitemaps are in scope. | |
| #### Optional URL filters | |
| Each sitemap config may include: | |
| - `include_patterns` (regex or glob allowlist) | |
| - `exclude_patterns` (regex or glob denylist) | |
| Applied after sitemap fetch, before ingestion. | |
| #### Sitemap is a fallback, not the primary path | |
| When REST API exposes the content type cleanly, REST is preferred — richer fields, faster incremental sync. Use sitemap when: | |
| - A post type isn't exposed in REST | |
| - Orphaned URLs aren't returned by REST filters | |
| - Rendered HTML is easier to extract than raw post content | |
| Sitemap crawl: fetch URL → extract main content HTML → clean to text → ingest. | |
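As an illustration of the "filter after fetch, before ingestion" step, the sketch below extracts `<loc>` entries from a sitemap and applies the allowlist and denylist. It assumes the patterns are regular expressions (glob support would need a translation step) and that the sitemap is a plain `urlset` document, not a sitemap index; the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Extract <loc> values from a urlset sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def filter_urls(urls, include_patterns=(), exclude_patterns=()):
    """Keep URLs matching any include pattern (if any are set) and no exclude pattern."""
    includes = [re.compile(pattern) for pattern in include_patterns]
    excludes = [re.compile(pattern) for pattern in exclude_patterns]
    kept = []
    for url in urls:
        if includes and not any(pattern.search(url) for pattern in includes):
            continue
        if any(pattern.search(url) for pattern in excludes):
            continue
        kept.append(url)
    return kept
```

An empty `include_patterns` list means "allow everything in this sitemap", so an admin who only sets excludes still gets the expected behavior.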
| --- | |
| ### 4. Hybrid Search (Vector + Keyword) | |
| Pure cosine similarity misses exact matches: product names, SKUs, DJ names, event titles, specific dates. | |
| #### Implementation | |
| - Maintain pgvector embeddings (semantic) | |
| - Maintain a Postgres `tsvector` column on chunks (and/or documents) for full-text search (lexical) | |
| - On every query: | |
| 1. Run vector search (top-k, e.g. k=20) | |
| 2. Run `tsvector` search (top-k, e.g. k=20) | |
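| 3. Merge with **Reciprocal Rank Fusion (RRF)** using the canonical formula: `score(d) = Σ_i 1 / (k + rank_i(d))`, summed over the result lists in which document `d` appears, with the RRF constant `k = 60` (unrelated to the top-k cutoff in steps 1 and 2). | |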
| 4. Take top N after fusion (e.g. N=6) into the LLM prompt. | |
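The fusion step above can be sketched in a few lines. This assumes each search returns an ordered list of chunk IDs, best first, with 1-based ranks; the function name and the tie-break by ID are illustrative choices, not part of the spec.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60, top_n=6):
    """Fuse ranked ID lists with RRF: score(d) = sum of 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, item_id in enumerate(results, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]
```

Because RRF only looks at ranks, the raw cosine distances and `ts_rank` values never need to be put on a common scale, which is the main reason to prefer it over a weighted sum of scores.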
| #### Weighting | |
| - Default: equal weight RRF | |
| - `recommendation` mode: boost documents with `document_role = 'product'` (boosting switches on role, not origin `type`; see §18) | |
| - `events` mode (new — see §9): boost documents with `event_start` inside the query's date window | |
| --- | |
| ### 5. Query-Time Metadata Filters | |
| RAG alone cannot answer "techno events under €30 this weekend" or "red dresses under €50 in stock." | |
| #### Required | |
| Retrieval must accept structured filters applied to document/chunk metadata: | |
| - Date range (`event_start` between X and Y) | |
| - Price range (`metadata.price` between X and Y, if indexed) | |
| - Category / tag (from metadata) | |
| - Stock status (Woo) | |
| - City / venue | |
| Filters are applied as SQL `WHERE` clauses *before* vector + keyword search. | |
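One way the pre-filter could be assembled is as a parameterized `WHERE` clause built from the extracted filter object. The sketch below assumes psycopg-style named placeholders, a `d` alias for the documents table, a numeric `metadata.price`, and a boolean `metadata.in_stock`; none of these names are fixed by this document.

```python
def build_prefilter(org_id, filters):
    """Return (where_sql, params) for the document pre-filter.

    Values are only ever passed as parameters, never interpolated into SQL.
    """
    clauses = ["d.organization_id = %(org_id)s", "d.status = 'active'"]
    params = {"org_id": org_id}
    if "date_range" in filters:
        # Documents without event_start are not events and stay eligible (see §1).
        clauses.append(
            "(d.event_start IS NULL OR d.event_start BETWEEN %(date_from)s AND %(date_to)s)"
        )
        params["date_from"], params["date_to"] = filters["date_range"]
    if "price_max" in filters:
        clauses.append("(d.metadata->>'price')::numeric <= %(price_max)s")
        params["price_max"] = filters["price_max"]
    if "city" in filters:
        clauses.append("d.city = %(city)s")
        params["city"] = filters["city"]
    if filters.get("stock_only"):
        clauses.append("(d.metadata->>'in_stock')::boolean")
    return " AND ".join(clauses), params
```

The same clause string can be shared by the vector query and the keyword query so both searches see exactly the same candidate set.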
| #### Extracting filters from the query | |
| Split by filter type — do not run everything through an LLM, and do not rely on regex for everything: | |
| 1. **Rule-based parser for temporal intent** (`tonight`, `tomorrow`, weekday names, "this weekend", explicit dates). Deterministic, no LLM call, near-zero latency. Dates are simple and high-frequency — a rule layer is more reliable than an LLM here. | |
| 2. **LLM-based query planner for everything else** (price ranges, categories, stock, city, complex intent). A small prompt extracts `{price_range, categories, stock_only, city, ...}` from the user question. | |
| Both run in parallel; results are merged into a single filter object. Log the extracted filters in retrieval debug. | |
| --- | |
| ### 6. Structured Responses (Cards) | |
| The chat API must return structured output, not just prose. | |
| #### Response shape | |
| ```json | |
| { | |
| "answer": "Three events match 'techno Friday'.", | |
| "cards": [ | |
| { | |
| "type": "event", | |
| "document_id": "...", | |
| "title": "Amelie Lens @ Warehouse", | |
| "image": "https://.../image.jpg", | |
| "url": "https://.../event/amelie-lens", | |
| "date": "2026-04-24T22:00:00Z", | |
| "venue": "Warehouse, Belgrade", | |
| "price": "€25", | |
| "cta": { "label": "Get tickets", "url": "https://tickets.example/..." } | |
| }, | |
| { | |
| "type": "product", | |
| "document_id": "...", | |
| "title": "Red Summer Dress", | |
| "image": "https://.../dress.jpg", | |
| "url": "https://shop.example/product/red-dress", | |
| "price": "€39.99", | |
| "in_stock": true, | |
| "cta": { "label": "Add to cart", "url": "..." } | |
| } | |
| ], | |
| "sources": [ ... ] | |
| } | |
| ``` | |
| #### Card population | |
| Cards are built from the **top-N documents after hybrid retrieval + RRF fusion** — not from whichever documents the LLM happens to "reference" in its prose (parsing answer text for citations is fragile and inconsistent). | |
| **Do not ask the LLM to generate card JSON** — it hallucinates fields. | |
| 1. LLM generates the `answer` text only. | |
| 2. Server code builds `cards[]` from the metadata of the top-N fused retrieval results (the same documents passed to the LLM as context). | |
| 3. Both are returned together. | |
| Card types supported in Phase 1: `event`, `product`, `article` (generic link with title/excerpt/image). | |
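A minimal sketch of server-side card building, assuming each retrieved document is available as a dict with the columns described in this spec plus a `metadata` dict. The `document_role` field is the one defined in §18; the `url` key and the exact CTA label are assumptions for illustration.

```python
def build_card(doc):
    """Build one card purely from stored document fields; the LLM is not involved."""
    role = doc.get("document_role")
    card_type = role if role in ("event", "product") else "article"
    metadata = doc.get("metadata") or {}
    card = {
        "type": card_type,
        "document_id": doc["id"],
        "title": doc["title"],
        "image": doc.get("primary_image_url"),
        "url": doc.get("url"),
    }
    if card_type == "event":
        card.update(date=doc.get("event_start"), venue=doc.get("venue"), price=metadata.get("price"))
        if metadata.get("ticket_url"):
            card["cta"] = {"label": "Get tickets", "url": metadata["ticket_url"]}
    elif card_type == "product":
        card.update(price=metadata.get("price"), in_stock=metadata.get("in_stock"))
    return card


def build_cards(fused_docs):
    """Cards follow the order of the fused retrieval results passed to the LLM."""
    return [build_card(doc) for doc in fused_docs]
```

Since every field comes from the database row, a card can be missing a value but can never contain an invented one.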
| --- | |
| ### 7. Images and Media | |
| Add to the documents table: | |
| - `primary_image_url` (text, nullable) | |
| Populated during ingestion: | |
| - Woo: `images[0].src` | |
| - WP: featured image URL (requires `_embed=1` or explicit media fetch) | |
| - Events: featured image or ACF-mapped hero image | |
| - PDF / manual: null (unless admin attaches one) | |
| The embed UI renders cards with images where available. | |
| --- | |
| ### 8. Conversation Memory (Sliding Window + Query Rewriting) | |
| Phase 1 includes short conversation memory. Long-term / summary memory is deferred. | |
| #### Sliding window | |
| - Keep the last **N = 6 messages** (3 user + 3 assistant messages, i.e. the last 3 exchanges) in the LLM prompt. | |
| - Drop older messages from prompt context (they're still stored in chat logs for admin review). | |
| - 6 is a good default. Expose as an org setting if needed later. | |
| #### Do not send full history every turn | |
| - Cost and latency grow linearly with history length. | |
| - Older turns drag retrieval in wrong directions. | |
| - 6 messages cover "cheaper ones", "tomorrow?", "is it wheelchair accessible?", "tell me more about the second one". | |
| #### Query rewriting for retrieval (important) | |
| Embedding "cheaper ones" alone retrieves nothing useful. Before retrieval: | |
| 1. If the latest user message is a follow-up (short, referential: "cheaper", "tomorrow?", "the second one"), run a cheap LLM call that rewrites it into a standalone question using the last 2–4 turns. | |
| - Example: "cheaper ones" + prior context "techno events this Friday" → "cheaper techno events this Friday" | |
| 2. Embed and search using the rewritten query. | |
| 3. The rewritten query also feeds the filter extractor (§5). | |
| This is small but high-impact. Log both the original and rewritten query in chat logs. | |
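The window and the follow-up check might look like the sketch below. The word list, the six-word cutoff, and the `rewrite` callable (a stand-in for the cheap LLM call) are all assumptions, and "last 2–4 turns" is read here as up to four preceding messages.

```python
import re

# Hypothetical cue words that suggest the message leans on earlier context.
REFERENTIAL = re.compile(
    r"\b(cheaper|tomorrow|it|that|those|them|ones?|first|second)\b", re.IGNORECASE
)


def sliding_window(messages, n=6):
    """Return the last n messages for the LLM prompt; older ones stay in chat logs."""
    return messages[-n:]


def looks_like_follow_up(message, max_words=6):
    """Heuristic: short and referential means it probably needs rewriting."""
    return len(message.split()) <= max_words and bool(REFERENTIAL.search(message))


def retrieval_query(messages, rewrite):
    """Pick the text to embed: the latest message, or a standalone rewrite of it."""
    latest = messages[-1]["content"]
    if len(messages) > 1 and looks_like_follow_up(latest):
        context = messages[-5:-1]  # up to the four messages before the latest one
        return rewrite(latest, context)
    return latest
```

Gating the rewrite behind a cheap heuristic keeps the extra LLM call off the path for first messages and for questions that are already self-contained.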
| --- | |
| ### 9. System Prompt Presets | |
| Admins are typically non-technical. A blank textarea is a trap. Ship presets. | |
| #### What a preset is | |
| A preset bundles: | |
| - A **mode** (`recommendation`, `support`, `search`, plus the new `events`) | |
| - A **pre-written system prompt template** with `{brand}` and other placeholders | |
| - **Default retrieval settings** (top-k, filters, boosts) | |
| - **Default card types** to render | |
| #### Presets shipped in Phase 1 | |
| **1. E-commerce / Shopping Assistant** — for the Woo client | |
| - Mode: `recommendation` | |
| - Prompt (excerpt): *"You are a shopping assistant for {brand}. Recommend only products present in the provided context. Reference each recommendation by name. Never invent SKUs, prices, or stock status. If nothing matches, say so and suggest the closest alternative. Keep replies short."* | |
| - Card type: `product` | |
| - Boost: `woo_product` documents | |
| **2. Events Concierge** — for the nightlife client | |
| - Mode: `events` | |
| - Prompt (excerpt): *"You are an events concierge for {brand}. Recommend events matching the user's date and interest, using only the provided context. Always show events with date, venue, and ticket link. If no events match the requested date, say so clearly and suggest the closest alternatives in date. Never invent events or venues. Keep replies short."* | |
| - Card type: `event` | |
| - Default filter: future events only, unless user explicitly asks about past events | |
| - Boost: `event` documents | |
| **3. Support / Helpdesk** | |
| - Mode: `support` | |
| - Prompt (excerpt): *"You are a support agent for {brand}. Answer using the provided documentation only. If the answer is not in context, say 'I don't have that information' and suggest contacting support. Quote the relevant snippet when helpful. Never guess."* | |
| - Card type: `article` | |
| - Boost: `pdf`, `manual`, `wp_page` | |
| Admin picks a preset, edits `{brand}` and any wording they want, saves. They can always go fully custom later. | |
| --- | |
| ### 10. Manual Knowledge Authoring UX | |
| Manual documents are how admins teach the AI things that aren't on their site yet: dress code, age policy, refund policy, event FAQs, venue directions, "please never recommend the discontinued X line". | |
| #### Editor requirements | |
| - Markdown or rich-text editor (not a `<textarea>`) | |
| - Fields: title, body, optional tags, optional expiration date | |
| - Save creates a document with `type = 'manual'`; chunks + embeds immediately | |
| - Edit re-chunks and re-embeds on save | |
| #### Organization / discovery | |
| - Tag manual docs (e.g. `policy`, `faq`, `venue-info`) | |
| - Filter the documents page by tag and by `type = 'manual'` | |
| - Allow duplicating an existing manual doc as a template | |
| --- | |
| ### 11. Embeddable JS Widget (Phase 1) | |
| The iframe full-page embed still ships, but the JS widget is also Phase 1. | |
| #### Widget requirements | |
| - A single `<script src="https://app.example.com/widget.js?key=...">` tag | |
| - Floating chat bubble (position configurable) | |
| - Opens into a chat panel | |
| - Renders cards (event, product, article) with images and CTAs | |
| - Branded to match the org's primary color + logo (configured in admin) | |
| - Mobile responsive | |
| - No jQuery / framework bloat — vanilla JS or Preact. Target **<100KB gzipped** for the full widget bundle (realistic for chat UI + cards; push lower if practical). | |
| #### Widget vs iframe | |
| - iframe = full-page dedicated embed | |
| - Widget = chat bubble overlay on existing pages | |
| Both call the same public chat API. | |
| --- | |
| ### 12. Public Embed Security | |
| #### Required Phase 1 protections | |
| 1. **Origin allowlist**: each public key has an allowed origins list. Chat API checks `Origin`/`Referer`. Mismatch = reject. | |
| 2. **Rate limiting**: | |
| - Per key: e.g. 60 req/min, 1000/day (defaults, org-configurable) | |
| - Per IP: e.g. 10 req/min | |
| - Return 429 with `Retry-After` | |
| 3. **Key rotation**: admin can rotate; the old key stops working after a grace period. | |
| 4. **Input size caps**: max user message length (e.g. 1000 chars), max history length sent. | |
| 5. **No admin-debug leakage**: public response must not include retrieval debug info. | |
| 6. **CORS**: configured to match origin allowlist, not `*`. | |
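Two of the protections above, the origin allowlist and the rate limit, can be sketched as follows. This is an in-memory illustration only: a real deployment with several workers would need shared storage for the counters, and the class and function names are hypothetical.

```python
import math
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


def origin_allowed(origin_header, allowed_origins):
    """Compare scheme://host[:port] of the Origin (or Referer) header to the allowlist."""
    if not origin_header:
        return False
    parts = urlsplit(origin_header)
    origin = f"{parts.scheme}://{parts.netloc}".lower()
    return origin in {allowed.lower().rstrip("/") for allowed in allowed_origins}


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each bucket (key or IP)."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.hits = defaultdict(deque)

    def check(self, bucket):
        """Return (allowed, retry_after_seconds); the second value feeds Retry-After."""
        now = self.clock()
        hits = self.hits[bucket]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop requests that have left the window
        if len(hits) >= self.limit:
            return False, max(1, math.ceil(self.window - (now - hits[0])))
        hits.append(now)
        return True, 0
```

Matching on the exact origin string, not on a substring or suffix, is what stops a host like `shop.example.evil.com` from passing as `shop.example`.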
| #### Optional Phase 1 | |
| - CAPTCHA after N abusive requests from the same IP | |
| - Basic prompt-injection filter on user input (strip obvious patterns; log-only) | |
| --- | |
| ### 13. User Feedback and Admin Feedback Review | |
| #### Public chat UI | |
| After each assistant response, show 👍 / 👎 buttons. On click: | |
| - `POST /api/embed/v1/:key/feedback` (versioned per §24) with `{ message_id, rating, optional_comment }` | |
| - Persist to `chat_feedback` table | |
| #### Schema: `chat_feedback` | |
| - `id` | |
| - `organization_id` | |
| - `chat_log_id` (links to the retrieval/chat log) | |
| - `rating` (`up` | `down`) | |
| - `comment` (text, nullable) | |
| - `user_session_id` (nullable) | |
| - `created_at` | |
| #### Admin "Feedback" page | |
| A dedicated admin page, newest first: | |
| - Columns: timestamp, rating, user query, assistant answer, comment | |
| - Row click → full retrieval trace (same view as admin Test Chat debug panel) | |
| - Filters: rating = down, date range, source types used | |
| - Quick actions from the row: | |
| - Disable one of the cited documents | |
| - Open the cited document for editing | |
| - Add a manual knowledge entry to correct the answer | |
| - Mark feedback as "addressed" | |
| This is the closed loop: real users flag bad answers → admin sees context → admin fixes knowledge → the next answer is better. | |
| --- | |
| ### 14. Cost and Usage Controls | |
| Each organization should have: | |
| - Monthly token budget (embeddings + completions) | |
| - Current usage counter, reset monthly | |
| - Soft warning at 80%, hard stop at 100% (configurable) | |
| - Per-request token log (tokens in, tokens out, model, cost estimate) | |
| - Admin view: "Usage this month" with simple chart and cost estimate | |
| This protects against runaway traffic and makes the economics obvious. | |
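The soft-warning and hard-stop thresholds reduce to a small check that can run before each embedding or completion call. The status names and function signature below are illustrative assumptions.

```python
def budget_status(used_tokens, monthly_budget, warn_ratio=0.8):
    """Classify an organization's usage as 'ok', 'warning', or 'blocked'."""
    if monthly_budget <= 0 or used_tokens >= monthly_budget:
        return "blocked"  # hard stop at 100% (or when no budget is configured)
    if used_tokens >= warn_ratio * monthly_budget:
        return "warning"  # soft warning, requests still go through
    return "ok"
```

A `blocked` result would short-circuit the chat request with a clear message rather than failing partway through retrieval.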
| --- | |
| ### 15. Updated Data Model Summary | |
| Compared to the earlier Core Data Model, add: | |
| #### documents (new columns) | |
| - `document_role` text not null — behavior role (`product` | `event` | `article` | `support` | `manual`). Drives retrieval boosting, card rendering, and UI filtering. See §18. | |
| - `source_modified_at` timestamptz null — the `modified` timestamp as reported by the source (WP `modified`, Woo `date_modified`). Drives freshness conflict resolution. See §23. | |
| - `event_start` timestamptz null | |
| - `event_end` timestamptz null | |
| - `venue` text null | |
| - `city` text null | |
| - `primary_image_url` text null | |
| - `tags` text[] null (manual docs + mapped from source) | |
| - `expires_at` timestamptz null (manual docs with expiry) | |
| #### sources (extended config JSONB) | |
| - For WP: `post_types[]`, `field_mapping{}`, `sitemap_urls[]`, `include_patterns[]`, `exclude_patterns[]` | |
| - For Woo: `attribute_mapping{}`, `meta_mapping{}` | |
| #### new table: `chat_feedback` | |
| As described in §13. | |
| #### new table: `usage_ledger` (or columns on organization) | |
| Monthly tokens and costs. | |
| #### new table: `assistant_keys` | |
| Public embed keys with origin allowlist, rate limits, rotation state. | |
| --- | |
| ### 16. Updated Retrieval Pipeline | |
| Replace the earlier retrieval flow with: | |
| 1. Load organization settings and mode preset. | |
| 2. **Rewrite query** if it's a short follow-up (using last 2–4 turns). | |
| 3. **Extract filters** from rewritten query (date range, price, category, stock, city). | |
| 4. **Pre-filter SQL**: select active, non-expired documents in the org, matching filters. | |
| 5. Run **vector search** (pgvector) on the pre-filtered set, top 20. | |
| 6. Run **keyword search** (tsvector) on the pre-filtered set, top 20. | |
| 7. **Fuse** with RRF, take top 6. | |
| 8. Apply mode-specific boosts (product docs for `recommendation`, event docs for `events`). | |
| 9. Build prompt: preset system prompt + last 6 conversation messages + retrieved chunks. | |
| 10. Call LLM → generate `answer`. | |
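| 11. Build `cards[]` server-side from metadata of the top-N fused retrieval results (the same documents passed to the LLM as context), not from whatever the answer text appears to reference (§6). | |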
| 12. Return `{ answer, cards, sources, debug? }`. | |
| 13. Log to `chat_logs`: original query, rewritten query, extracted filters, retrieved chunk IDs, scores, final prompt (admin debug only). | |
| --- | |
| ### 17. Updated Non-Negotiable Rules | |
| Add to the original list: | |
| 8. Expired event documents must not appear in retrieval unless explicitly requested. | |
| 9. Manual knowledge and synced content retrieve through the same pipeline — no separate path. | |
| 10. Card structured data is built server-side from top-N fused retrieval results, not generated by the LLM and not derived from LLM answer text. | |
| 11. Public embed endpoints must enforce origin allowlist + rate limits. | |
| 12. Every chat response must include traceable source IDs. | |
| 13. Every chat request MUST persist retrieval debug: original query, rewritten query, extracted filters, retrieved chunk IDs, fusion scores, final prompt. | |
| 14. Failed ingestion (embedding error, normalization failure) marks the document `status = error`. Prior successful chunks are preserved and keep serving retrieval; partial chunk writes are never committed. | |
| 15. No LLM call is made when retrieval returns zero qualifying chunks — return a deterministic fallback response instead (§20). | |
| 16. All public and admin API routes are versioned from day one (`/api/admin/v1/...`, `/api/embed/v1/...`). v1 contracts are frozen once deployed; breaking changes go to v2. | |
| 17. Freshness conflicts resolve on `source_modified_at`, not `last_synced_at`. Late-arriving webhooks carrying older data are ignored. | |
| --- | |
| ### 18. Document Role (behavior, separate from type) | |
| `type` tracks **origin** (`woo_product`, `wp_post`, `wp_page`, `pdf`, `manual`, `event`, custom CPT slugs). | |
| `document_role` tracks **behavior** — how retrieval, ranking, and card UI should treat it. | |
| Allowed values: | |
| - `product` — renders as product card, boosted in `recommendation` mode | |
| - `event` — renders as event card, date filters apply, boosted in `events` mode | |
| - `article` — renders as article card, used in `search` mode | |
| - `support` — renders as article card, boosted in `support` mode | |
| - `manual` — admin-authored; admin picks the effective role on creation | |
| Why the split matters: | |
| - A WP CPT `tribe_events` post and a manually-entered event both get `document_role = 'event'` and flow through the same retrieval/card path. | |
| - A WP `page` documenting return policy gets `document_role = 'support'` even though `type = wp_page`. | |
| - Retrieval boosting, filtering, and card rendering switch on `document_role`, not on `type`. | |
| How it's set: | |
| - Source field mapping (§2) specifies the role per post type during ingestion. | |
| - Manual docs let the admin choose a role at creation. | |
| - Sensible defaults: `woo_product` → `product`; `tribe_events` / `event` CPT → `event`; PDFs → `support`; WP posts → `article`; WP pages → `article` unless the admin overrides. | |
| --- | |
| ### 19. Mode → Role Priority Ranking | |
| Modes determine which roles are boosted in hybrid retrieval (applied **after** RRF fusion, not as a hard filter): | |
| | Mode | Priority order (highest first) | | |
| | ---- | ------------------------------ | | |
| | `recommendation` | `product` > `support` > `article` > `manual` | | |
| | `events` | `event` > `article` > `manual` | | |
| | `support` | `support` > `manual` > `article` | | |
| | `search` | neutral (no role boost) | | |
| Boost is applied by multiplying the fused score by a role factor (e.g. preferred role ×1.3, next ×1.1, others ×1.0), then re-ranking the top-N. Do **not** exclude non-preferred roles — a product FAQ (role `support`) should still appear for a recommendation-mode query if it's genuinely the best match, just ranked slightly lower than equivalent products. | |
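A sketch of the boost step, using the priority table and the example factors above (×1.3 for the first role, ×1.1 for the second, ×1.0 otherwise). The data shapes and names are assumptions; the input is taken to be `(doc_id, role, fused_score)` tuples from the RRF step.

```python
MODE_ROLE_PRIORITY = {
    "recommendation": ["product", "support", "article", "manual"],
    "events": ["event", "article", "manual"],
    "support": ["support", "manual", "article"],
    "search": [],  # neutral: no role boost
}
BOOST_BY_POSITION = [1.3, 1.1]  # example factors; everything else gets 1.0


def role_factor(mode, role):
    """Multiplier for a role under a mode, based on its position in the priority list."""
    priority = MODE_ROLE_PRIORITY.get(mode, [])
    if role in priority:
        position = priority.index(role)
        if position < len(BOOST_BY_POSITION):
            return BOOST_BY_POSITION[position]
    return 1.0


def apply_role_boost(mode, fused):
    """Re-rank (doc_id, role, fused_score) tuples; nothing is ever filtered out."""
    boosted = [(doc_id, role, score * role_factor(mode, role)) for doc_id, role, score in fused]
    return sorted(boosted, key=lambda item: item[2], reverse=True)
```

With factors this close to 1.0, a strong `support` hit can still outrank a weak `product` hit, which is the behavior the paragraph above asks for.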
| --- | |
| ### 20. No-Results Fallback | |
| If hybrid retrieval returns zero chunks above the configured similarity threshold, or the pre-filter removes everything: | |
| 1. **Do not call the LLM with empty context** — it will hallucinate. | |
| 2. Return a deterministic fallback: | |
| - `answer`: preset-configured fallback text (e.g. *"I don't have information about that. Want me to search more broadly?"*) | |
| - `cards`: empty | |
| - `sources`: empty | |
| - `debug.reason`: `"no_results_above_threshold"` or `"pre_filter_empty"` | |
| 3. Optionally re-run retrieval **without filters** and offer the top results as "closest matches" in a second response, clearly labeled as such ("No exact matches — here are some alternatives"). | |
| Log every zero-result event to a dedicated admin view. These are the highest-value signals for what knowledge is missing. | |
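The guard in front of the LLM call could be as small as the sketch below. It assumes each retrieved chunk carries a `score` comparable to the configured threshold and that the caller knows how many documents survived the pre-filter; the function name is hypothetical.

```python
def fallback_response(chunks, prefilter_count, threshold, fallback_text):
    """Return the deterministic fallback payload, or None if the LLM should be called."""
    if prefilter_count == 0:
        reason = "pre_filter_empty"
    elif not any(chunk["score"] >= threshold for chunk in chunks):
        reason = "no_results_above_threshold"
    else:
        return None
    return {"answer": fallback_text, "cards": [], "sources": [], "debug": {"reason": reason}}
```

Returning `None` for the normal path keeps the decision in one place: the chat handler calls the LLM only when this function has nothing to say.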
| --- | |
| ### 21. Re-Indexing Triggers | |
| A document must be re-embedded when: | |
| - `sync_hash` of the source content changes (update detected on sync) | |
| - The source's field mapping changes (admin edits what's included in normalized text) | |
| - The chunking strategy changes (global setting change) | |
| - The embedding model changes (global setting change) | |
| - Admin clicks "Reindex" manually | |
| - A previously failed embedding is retried | |
| Re-embedding flow: | |
| 1. Mark document `status = syncing` | |
| 2. Normalize content from the latest source data | |
| 3. Chunk | |
| 4. Embed all chunks (atomic: all-or-nothing) | |
| 5. **In one transaction**: delete old chunks, insert new chunks | |
| 6. Mark document `status = active` | |
| 7. On failure at any step: leave old chunks in place, set `status = error` (§22) | |
| Do not re-embed on every sync — only when `sync_hash` differs. Idempotent syncs are cheap. | |
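One possible shape for the change check is sketched below. It assumes `sync_hash` is a SHA-256 over the normalized output (so a field-mapping change alters the hash by itself) and that each document records the embedding model and a chunking version it was indexed with; those two field names are hypothetical.

```python
import hashlib
import json


def compute_sync_hash(normalized_text, structured_fields):
    """Stable SHA-256 over normalized content; key order does not affect the result."""
    payload = json.dumps(
        {"text": normalized_text, "fields": structured_fields},
        sort_keys=True,
        ensure_ascii=False,
        default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_reindex(doc, new_hash, embedding_model, chunking_version, force=False):
    """True if any re-indexing trigger applies to this document."""
    return (
        force  # admin clicked "Reindex"
        or doc["status"] == "error"  # retry of a previously failed embedding
        or doc["sync_hash"] != new_hash
        or doc["embedding_model"] != embedding_model
        or doc["chunking_version"] != chunking_version
    )
```

When `needs_reindex` is false the sync can stop after the hash comparison, which is what makes repeated syncs of unchanged content cheap.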
| --- | |
| ### 22. Partial Ingestion and Error Handling | |
| #### Document-level failure | |
| If ingestion fails at any step (fetch, normalize, chunk, embed): | |
| - Set `status = error` | |
| - Store reason in `metadata.last_error` and `metadata.last_error_at` | |
| - **Preserve prior chunks** — the previous successful version continues serving retrieval until the next successful reindex. | |
| - Surface the error in admin (sources page + document detail) | |
| Exception: if this is the first-ever ingestion (no prior chunks), the document enters `error` with zero chunks and is excluded from retrieval. | |
| #### Chunk-level failure | |
| If one chunk fails to embed but others succeed (e.g. transient API error): | |
| - Retry with exponential backoff up to N times (e.g. N=3) | |
| - If it still fails: mark document `status = error`, do **not** replace prior chunks | |
| - Never partially commit — all chunks for a document are inserted together or none are | |
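The retry-then-fail behavior can be sketched as below, reading "up to N times" as one initial attempt plus N retries. `embed` is a hypothetical callable wrapping the embedding API, and the delays (0.5s, 1s, 2s) are example values.

```python
import time


def embed_all_or_nothing(chunks, embed, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Embed every chunk or raise.

    The caller swaps old chunks for new ones only when this returns, so a
    failure on any chunk leaves the previous version fully in place.
    """
    vectors = []
    for chunk in chunks:
        for attempt in range(max_retries + 1):
            try:
                vectors.append(embed(chunk))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # caller marks the document status = error
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    return vectors
```

Collecting all vectors in memory before touching the database is what makes the later delete-and-insert transaction all-or-nothing.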
| #### Sync-level failure | |
| If a source sync fails (e.g. WP API down): | |
| - Existing documents stay as-is and remain in retrieval | |
| - Mark the source `status = error` and record the reason | |
| - Next scheduled sync retries automatically | |
| --- | |
| ### 23. Freshness Conflict Resolution | |
| Webhooks and scheduled syncs can race. Resolve by **source-side** timestamp, not our ingestion timestamp. | |
| Rule: | |
| - Each document stores `source_modified_at` — the `modified` timestamp as reported by the source (WP `modified`, Woo `date_modified`). | |
| - An incoming update (from webhook or sync) is applied only if its `source_modified_at` is **strictly newer** than the stored value. | |
| - Stale webhooks (delivered late, carrying older data) are silently ignored and logged. | |
| - If the incoming update lacks `source_modified_at` (rare), fall back to comparing our `last_synced_at`. | |
| Why not `last_synced_at`: `last_synced_at` is *when we synced*, not *when the source changed*. A late webhook could otherwise clobber newer content we already have. | |
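The main rule fits in a few lines. The sketch assumes both timestamps are already parsed into timezone-aware datetimes; the `last_synced_at` fallback for updates without a source timestamp is deliberately left to the caller, and the function name is hypothetical.

```python
def should_apply_update(stored_modified_at, incoming_modified_at):
    """Apply an update only if the source says it is strictly newer than what we hold."""
    if stored_modified_at is None:
        return True  # first time we see this document
    if incoming_modified_at is None:
        # The fallback on last_synced_at is out of scope for this sketch.
        raise ValueError("incoming update has no source_modified_at")
    return incoming_modified_at > stored_modified_at
```

The strict comparison also makes duplicate webhook deliveries harmless: a second delivery with the same timestamp is skipped and never triggers a re-embed.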
| --- | |
| ### 24. API Versioning | |
| All public and admin routes are versioned from day one: | |
| - `/api/admin/v1/...` | |
| - `/api/embed/v1/:key/chat` | |
| - `/api/embed/v1/:key/feedback` | |
| - `/api/webhooks/v1/woocommerce/:sourceId` | |
| - `/api/webhooks/v1/wordpress/:sourceId` | |
| Rules: | |
| - v1 contracts are frozen once deployed. Breaking changes go to v2. | |
| - Additive changes (new optional fields in responses, new optional request parameters) do not require a version bump. | |
| - `widget.js` pins a default API version but can be overridden per embed during migration periods. | |
| --- | |
| ## Nice-to-Have Future Features (Not Required for Phase 1) | |
| - content overrides on synced documents | |
| - source tagging and filtering beyond manual-doc tags | |
| - deep analytics on user questions (beyond up/down feedback) | |
| - per-mode retrieval tuning dashboard | |
| - role permissions beyond admin/member | |
| - long-term conversation memory / summarization across sessions | |
| - source confidence tuning | |
| - assistant personas per organization beyond presets | |
| - CAPTCHA / advanced abuse protection | |
| --- | |
| ## Final Summary | |
| We are building a **multi-tenant, admin-controlled AI knowledge assistant platform**. | |
| It ingests: | |
| - WooCommerce products | |
| - WordPress content | |
| - PDFs | |
| - manual knowledge | |
| It normalizes them into: | |
| - sources | |
| - documents | |
| - chunks | |
| It answers questions through: | |
| - hybrid retrieval: pgvector similarity plus Postgres full-text search, fused with RRF | |
| - LLM generation using retrieved context only | |
| It provides: | |
| - full traceability | |
| - admin testing | |
| - source/document management | |
| - public embeddable chat | |
| This should be implemented as a generic, reusable knowledge system with explicit ingestion and retrieval pipelines. |