Vectorizing and Indexing Structured JSON for RAG

Introduction

Retrieval-Augmented Generation (RAG) systems rely heavily on high-quality, efficiently retrievable vector embeddings. Structured JSON can be an effective source for vectorization, provided the structure is leveraged appropriately.

This document outlines best practices, potential pitfalls, and implementation examples for vectorizing and indexing structured JSON data, with an emphasis on downstream use in RAG pipelines.

When is JSON a Good Source for Vectorization?

JSON is a great candidate for vectorization if:

The schema is consistent across entries.
Fields are semantically rich (e.g., descriptions, reviews, bios).
Metadata fields are useful for filtering or reranking (e.g., date, category).
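
The first condition can be checked before anything is embedded. The following is a minimal sketch, assuming entries use the four fields shown in the sample object in this document; `REQUIRED_FIELDS` and `find_schema_problems` are hypothetical names chosen for illustration, not part of any library.

```python
# Hypothetical expected schema: field name -> required Python type.
REQUIRED_FIELDS = {
    "title": str,
    "description": str,
    "category": str,
    "tags": list,
}


def find_schema_problems(entries):
    """Return (entry_position, message) pairs for entries that deviate."""
    problems = []
    for i, entry in enumerate(entries):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in entry:
                problems.append((i, f"missing field '{field}'"))
            elif not isinstance(entry[field], expected_type):
                problems.append((i, f"field '{field}' is not {expected_type.__name__}"))
    return problems


entries = [
    {"title": "AI for Healthcare", "description": "A course.",
     "category": "Education", "tags": ["AI"]},
    {"title": "Untitled", "category": "Education", "tags": "AI"},
]
problems = find_schema_problems(entries)
for position, message in problems:
    print(f"entry {position}: {message}")
```

Entries that fail the check can be repaired, skipped, or routed to a fallback formatter, depending on how strict the pipeline needs to be.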

However, simply flattening or concatenating JSON values into a single string often leads to suboptimal embeddings, because the key that gives each value its meaning is lost.

Best Practices

1. Schema-Aware Vectorization

Extract the semantically meaningful fields and keep their labels in the text that is passed to the embedding model.

{
  "title": "AI for Healthcare",
  "description": "A course that teaches medical professionals how to use AI tools.",
  "category": "Education",
  "tags": ["AI", "Healthcare", "Course"]
}

Transform this into labeled plain text before vectorization:

Title: AI for Healthcare
Description: A course that teaches medical professionals how to use AI tools.
Category: Education
Tags: AI, Healthcare, Course
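
The transformation above can be sketched as a small function. This is one possible approach, not a fixed recipe: `FIELD_LABELS` and `format_entry` are hypothetical names, and skipping empty fields is an assumption about how missing data should be handled.

```python
# Hypothetical field order and display labels; adjust to the real schema.
FIELD_LABELS = [
    ("title", "Title"),
    ("description", "Description"),
    ("category", "Category"),
    ("tags", "Tags"),
]


def format_entry(entry):
    """Render selected fields as 'Label: value' lines for embedding."""
    lines = []
    for key, label in FIELD_LABELS:
        value = entry.get(key)
        if value is None or value == "" or value == []:
            continue  # skip absent or empty fields instead of emitting "Label: "
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


entry = {
    "title": "AI for Healthcare",
    "description": "A course that teaches medical professionals how to use AI tools.",
    "category": "Education",
    "tags": ["AI", "Healthcare", "Course"],
}
text = format_entry(entry)
print(text)
```

Keeping the label list in one place means a schema change only has to be reflected in `FIELD_LABELS`, not in every format string.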

2. Use Field-Specific Embedding

Embed each field separately and store the resulting vectors in the same vector index, optionally tagging each vector with a field identifier so that results can be reranked by field.
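
A rough sketch of the per-field approach follows. To keep it runnable without a model download, it uses `toy_embed`, a placeholder that is deterministic but not semantic; in practice a real embedding function would be passed in its place. The record layout (`entry_id`, `field`, `vector`) is an assumption made for illustration.

```python
import hashlib


def toy_embed(text, dim=8):
    """Placeholder for a real embedding model: deterministic, NOT semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def embed_fields(entry_id, entry, fields, embed=toy_embed):
    """Return one record per non-empty field, tagged with the field name."""
    records = []
    for field in fields:
        value = entry.get(field)
        if not value:
            continue  # nothing to embed for missing or empty fields
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        records.append({"entry_id": entry_id, "field": field, "vector": embed(value)})
    return records


entry = {
    "title": "AI for Healthcare",
    "description": "A course that teaches medical professionals how to use AI tools.",
    "tags": ["AI", "Healthcare", "Course"],
}
records = embed_fields("course-1", entry, ["title", "description", "tags"])
for record in records:
    print(record["entry_id"], record["field"], len(record["vector"]))
```

Because every record carries its `field` tag, a reranker can later apply different weights to a title match and a tag match on the same entry.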

3. Store Metadata for Filtering

Fields used for exact filtering, such as category, tags, or timestamps, should be stored as metadata alongside each vector (e.g., in Pinecone or Weaviate) rather than left for the embedding alone to capture.
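
The idea of filtering on metadata and ranking by vector distance can be illustrated with an in-memory stand-in. This sketch does not use any vector store's API: `filtered_search` and the record layout are hypothetical, and hosted stores typically apply such filters as part of the query itself.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_search(query_vector, records, category=None, k=3):
    """Filter on metadata first, then rank the survivors by distance."""
    candidates = [
        r for r in records
        if category is None or r["metadata"]["category"] == category
    ]
    candidates.sort(key=lambda r: squared_l2(query_vector, r["vector"]))
    return [r["id"] for r in candidates[:k]]


records = [
    {"id": "a", "vector": [0.0, 0.0], "metadata": {"category": "Education"}},
    {"id": "b", "vector": [0.1, 0.0], "metadata": {"category": "Finance"}},
    {"id": "c", "vector": [0.5, 0.5], "metadata": {"category": "Education"}},
]
education_hits = filtered_search([0.1, 0.1], records, category="Education", k=2)
all_hits = filtered_search([0.1, 0.1], records)
print(education_hits, all_hits)
```

Filtering before ranking guarantees that up to `k` results from the requested category are returned, whereas filtering after a top-`k` search can return fewer.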

Architecture Diagram

graph LR
A[JSON Source] --> B[Field Extraction]
B --> C[Prompt Formatting]
C --> D[Embedding Model e.g. OpenAI, Cohere]
D --> E[Vector Store e.g. FAISS, Pinecone]
B --> F[Metadata Extraction]
F --> E

Example Code (Python)

import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Load JSON data: a list of objects with id, title, description, category, and tags
with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

texts = []
metadata = []

for entry in data:
    # Keep the field labels so each value retains its key context
    texts.append(
        f"Title: {entry['title']}\n"
        f"Description: {entry['description']}\n"
        f"Category: {entry['category']}\n"
        f"Tags: {', '.join(entry['tags'])}"
    )
    # Row i of metadata describes vector i in the index
    metadata.append({"id": entry["id"], "category": entry["category"]})

# Encode all texts in one batch; FAISS expects a 2-D float32 array
vectors = np.asarray(model.encode(texts), dtype="float32")

# Create FAISS index
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Save index and metadata separately
faiss.write_index(index, "index.faiss")
with open("metadata.json", "w") as f:
    json.dump(metadata, f)
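
At query time, the positions returned by the index have to be mapped back to the saved metadata. The sketch below covers only that mapping step and runs on plain lists; `attach_metadata` is a hypothetical helper. It assumes the inputs are one row of the distance and position arrays returned by `index.search`, and that row i of `metadata.json` describes vector i, as in the code above.

```python
def attach_metadata(distances, indices, metadata, category=None):
    """Pair one query's search results with their metadata rows."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:
            continue  # FAISS pads with -1 when fewer than k neighbours exist
        row = metadata[idx]
        if category is not None and row["category"] != category:
            continue  # post-filter on the stored metadata
        hits.append({"id": row["id"], "category": row["category"], "distance": float(dist)})
    return hits


# With a real index this would be:
#   distances, indices = index.search(query_vectors, k)
#   hits = attach_metadata(distances[0], indices[0], metadata)
metadata = [
    {"id": 10, "category": "Education"},
    {"id": 11, "category": "Finance"},
]
hits = attach_metadata([0.2, 0.5, 3.4e38], [1, 0, -1], metadata)
education_only = attach_metadata([0.2, 0.5, 3.4e38], [1, 0, -1], metadata, category="Education")
print(hits)
print(education_only)
```

Since the filter runs after the search, it can leave fewer than k hits. Requesting a larger k and trimming afterwards is a common workaround.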

Key Considerations

Pros

JSON structure provides natural field boundaries.
Easy to extract meaningful metadata.
Supports hybrid search (vector + filter).

Cons

Off-the-shelf embedding models treat every field as plain text and do not learn which fields matter most.
Schema changes can break pipelines.
Large JSON objects may need chunking.
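
For the chunking concern, one option is to split a long text field into overlapping windows and repeat the short identifying fields in every chunk. The sketch below is a rough illustration under stated assumptions: it counts words as a stand-in for model tokens, and `chunk_words`, `chunk_entry`, and the default sizes are hypothetical choices.

```python
def chunk_words(text, max_words=50, overlap=10):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_entry(entry, max_words=50, overlap=10):
    """Return one labeled text per chunk, repeating the title in each."""
    return [
        f"Title: {entry['title']}\nDescription: {chunk}"
        for chunk in chunk_words(entry["description"], max_words, overlap)
    ]


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, max_words=5, overlap=2)
labeled = chunk_entry({"title": "Demo", "description": sample}, max_words=5, overlap=2)
print(chunks)
```

Repeating the title keeps each chunk self-describing, so a retrieved chunk can still be attributed to its parent entry.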

Summary

Using structured JSON as a source for vector indexing in RAG is not only viable but advantageous if the structure is preserved and leveraged. Extract fields meaningfully, use embedding models intelligently, and retain metadata for hybrid retrieval.

With proper design, this approach improves context relevance, retrievability, and downstream generation quality.

decagondev/json-vectorization.md