This Markdown file addresses the question of similarity metrics other than cosine similarity for use in Retrieval-Augmented Generation (RAG) systems. It complements the lecture content on RAG systems, particularly the sections on "How Similarity Search Works" and "Similarity Search in Action," which emphasize mathematical matching in vector space to retrieve relevant content. Below, we list and explain alternative similarity metrics, their applications in RAG, and their relevance to the lecture’s focus on embeddings and vector databases.
Other than cosine similarity, what similarity metrics can be used, and how do they work?
In RAG systems, similarity search is a core component, as described in the lecture’s sections on similarity search and vector databases. It involves comparing a query’s embedding vector to document vectors stored in a vector database (e.g., Pinecone, FAISS) to retrieve the most relevant content. While cosine similarity is commonly used due to its effectiveness with high-dimensional embeddings (e.g., those generated by `all-MiniLM-L6-v2`), other metrics offer different perspectives on vector similarity or distance. Below, we explore four alternatives (Euclidean distance, Manhattan distance, dot product, and Jaccard similarity), explaining their mechanics, use cases, and considerations for RAG, with ties to the lecture’s examples like the Berkshire Hathaway shareholder letters.
Euclidean distance measures the straight-line (geometric) distance between two vectors in high-dimensional space. It is the most intuitive distance metric, based on the Pythagorean theorem.
For two vectors ( \mathbf{a} = [a_1, a_2, ..., a_n] ) and ( \mathbf{b} = [b_1, b_2, ..., b_n] ), the Euclidean distance is:
[ d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2} ]
- Distance-Based: Unlike cosine similarity, which measures the angle between vectors (ignoring magnitude), Euclidean distance considers both direction and magnitude. Smaller distances indicate more similar vectors.
- In RAG: For a query vector (e.g., “What did Buffett say about risk?”), Euclidean distance ranks document vectors (e.g., chunks of shareholder letters) by how close they are in vector space. Closest vectors (smallest distances) are retrieved as the most relevant.
- Characteristics:
- Sensitive to magnitude: If embeddings have varying lengths (e.g., due to different text lengths), Euclidean distance may prioritize shorter or longer vectors, which can skew results.
- Works well in low-to-medium dimensions but can suffer from the “curse of dimensionality” in high-dimensional spaces (e.g., the 384 dimensions produced by `all-MiniLM-L6-v2`), where distances become less discriminative.
- Suitable for applications where absolute differences in vector components matter, such as when embeddings encode specific numerical features (e.g., financial metrics in the lecture’s shareholder letters).
- Often used in vector databases like FAISS, as mentioned in the lecture, which supports Euclidean distance for both exact search and approximate nearest neighbor (ANN) search.
- Pros: Intuitive; captures both direction and magnitude; widely supported in vector databases.
- Cons: Sensitive to vector magnitude and scale; less effective in high-dimensional spaces compared to cosine similarity; requires normalized embeddings for consistent results.
In the lecture’s hands-on example, if you embed a query (“risk management”) and a document chunk (“Buffett’s approach to risk”) using `all-MiniLM-L6-v2`, Euclidean distance gives the geometric distance between their 384-dimensional vectors. A smaller distance indicates higher relevance, but normalization may be needed to avoid magnitude bias.
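To make this concrete, here is a minimal sketch of the calculation, assuming the sentence-transformers and numpy packages are installed; the chunk texts are illustrative stand-ins for the repo’s shareholder-letter chunks, not actual excerpts.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model name matches the lecture's example; it produces 384-dimensional embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Buffett's approach to risk",              # illustrative chunk texts,
    "Berkshire's thoughts on market cycles",   # not actual letter excerpts
]
query_vec = model.encode("risk management")    # shape (384,)
chunk_vecs = model.encode(chunks)              # shape (2, 384)

# Euclidean (L2) distance: smaller distance means higher relevance
distances = np.linalg.norm(chunk_vecs - query_vec, axis=1)

# Normalizing to unit length first removes magnitude bias; for unit vectors,
# ranking by Euclidean distance is equivalent to ranking by cosine similarity.
q = query_vec / np.linalg.norm(query_vec)
c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
normalized_distances = np.linalg.norm(c - q, axis=1)

for text, d, nd in zip(chunks, distances, normalized_distances):
    print(f"raw={d:.4f}  normalized={nd:.4f}  {text}")
```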
Manhattan distance, also known as the L1 norm or taxicab distance, measures the sum of absolute differences between vector components, resembling travel along a grid.
For two vectors ( \mathbf{a} ) and ( \mathbf{b} ):
[ d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^n |a_i - b_i| ]
- Distance-Based: Measures the total “path length” along each dimension, ignoring diagonal paths. Smaller distances indicate more similar vectors.
- In RAG: Similar to Euclidean distance, Manhattan distance ranks document vectors by proximity to the query vector in the lecture’s similarity search process. It’s less sensitive to outliers in individual dimensions than Euclidean distance.
- Characteristics:
- Emphasizes absolute differences, making it robust to small variations in specific dimensions.
- Computationally simpler than Euclidean distance (no square roots), which can be faster for large-scale RAG systems.
- Effective when embeddings have sparse or categorical features, such as in the lecture’s discussion of chunking categorical data (e.g., “approved” status). It’s less affected by large differences in a single dimension.
- Useful in vector databases for applications where robustness to noise or outliers is critical, such as retrieving financial terms in the Berkshire Hathaway example.
- Pros: Computationally efficient; robust to outliers; suitable for sparse or high-dimensional data.
- Cons: Ignores geometric relationships (e.g., angles); less intuitive for semantic similarity; still sensitive to unnormalized vectors.
For a query like “dividend policy” and a document chunk in the lecture’s repo, Manhattan distance sums the absolute differences across the 384 dimensions of their embeddings. It may retrieve slightly different results than Euclidean distance due to its grid-based approach.
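A short sketch of that comparison, under the same assumptions (sentence-transformers, numpy, illustrative chunk texts), showing how the L1 and L2 rankings are computed side by side:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Berkshire's view on paying dividends",    # illustrative chunk texts
    "Buffett's approach to risk",
]
query_vec = model.encode("dividend policy")
chunk_vecs = model.encode(chunks)

# Manhattan (L1) distance: sum of absolute per-dimension differences
l1 = np.sum(np.abs(chunk_vecs - query_vec), axis=1)
# Euclidean (L2) distance for comparison
l2 = np.linalg.norm(chunk_vecs - query_vec, axis=1)

# argsort gives the retrieval order (closest chunk first); the two metrics
# can disagree when differences are concentrated in a few dimensions.
print("L1 order:", [chunks[i] for i in np.argsort(l1)])
print("L2 order:", [chunks[i] for i in np.argsort(l2)])
```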
The dot product measures the alignment of two vectors by summing the products of their corresponding components, often used as a similarity metric (higher values indicate greater similarity).
For two vectors ( \mathbf{a} ) and ( \mathbf{b} ):
[ \text{Dot Product}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^n a_i \cdot b_i ]
- Similarity-Based: Unlike distance metrics, a higher dot product indicates greater similarity. It accounts for both angle and magnitude and is closely related to cosine similarity (cosine similarity is the dot product divided by the product of the two vectors’ magnitudes).
- In RAG: In the lecture’s similarity search, the dot product can rank document vectors by how well they align with the query vector. For normalized vectors, it’s equivalent to cosine similarity.
- Characteristics:
- Sensitive to vector magnitude: Longer vectors (e.g., from longer text chunks) produce larger dot products, which may bias results unless vectors are normalized.
- Computationally efficient, as it’s a simple sum of products, making it fast for vector databases.
- Common in vector databases like Pinecone or Weaviate (per the lecture) when vectors are normalized, as it behaves like cosine similarity but is faster to compute.
- Useful for tasks where magnitude differences are meaningful, such as when embeddings reflect document importance or length in the shareholder letters example.
- Pros: Fast and simple; aligns with cosine similarity for normalized vectors; supported by most vector databases.
- Cons: Sensitive to magnitude; requires normalization for semantic tasks; less intuitive as a standalone metric.
In the lecture’s vector space, a query vector for “investment strategy” and a document vector for “Buffett’s long-term strategy” yield a high dot product if closely aligned. Normalization ensures the metric focuses on semantic similarity, as in the lecture’s similarity search steps.
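The sketch below, under the same assumptions (sentence-transformers, numpy, an illustrative chunk text), shows how unit-length normalization turns the raw dot product into cosine similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query_vec = model.encode("investment strategy")
doc_vec = model.encode("Buffett's long-term strategy")   # illustrative chunk text

# Raw dot product: rewards alignment but also vector magnitude
raw_dot = float(np.dot(query_vec, doc_vec))

# After unit-length normalization the dot product equals cosine similarity
q = query_vec / np.linalg.norm(query_vec)
d = doc_vec / np.linalg.norm(doc_vec)
cosine = float(np.dot(q, d))

print(f"raw dot product: {raw_dot:.4f}")
print(f"cosine (normalized dot product): {cosine:.4f}")
```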
Jaccard similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union. In RAG, it’s applied to vectorized representations after converting embeddings to binary or discrete sets.
For two sets ( A ) and ( B ):
[ \text{Jaccard Similarity}(A, B) = \frac{|A \cap B|}{|A \cup B|} ]
- Set-Based: In RAG, continuous embeddings (e.g., 384-dimensional vectors) are first binarized (e.g., by keeping only components above a threshold) or tokenized into sets of features. Jaccard similarity then compares these sets, focusing on shared versus total elements.
- In RAG: Less common for dense embeddings but useful for sparse or keyword-based representations. For example, convert document and query embeddings into sets of significant features (e.g., tokens above a threshold) and compare overlap.
- Characteristics:
- Ignores vector magnitude and focuses on presence/absence of features, making it suitable for categorical or sparse data.
- Less effective for dense, high-dimensional embeddings like those in the lecture’s example (`all-MiniLM-L6-v2`).
- Useful for hybrid RAG systems combining keyword-based and semantic search, as it can compare sets of keywords or tags extracted from documents (e.g., “dividend,” “risk” in shareholder letters).
- Applicable when embeddings are preprocessed into discrete features, such as in the lecture’s discussion of categorical data handling.
- Pros: Simple for set-based comparisons; robust for sparse or categorical data; less sensitive to magnitude.
- Cons: Requires preprocessing continuous embeddings into sets; less suitable for dense semantic embeddings; lower precision for complex RAG tasks.
For the lecture’s Berkshire Hathaway repo, convert embeddings of a query (“value investing”) and a document chunk into sets of key terms or binary features. Jaccard similarity measures the overlap (e.g., shared terms like “value,” “investing”), though it’s less precise than cosine or Euclidean for semantic tasks.
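A minimal sketch of both variants described above; the keyword preprocessing and the 0.5 binarization threshold are illustrative choices, not values from the lecture, and random vectors stand in for real embeddings:

```python
import numpy as np

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Keyword-set variant: compare lowercased token sets (illustrative preprocessing)
query_terms = set("value investing".lower().split())
chunk_terms = set("Buffett on value investing and intrinsic value".lower().split())
print(f"keyword Jaccard: {jaccard(query_terms, chunk_terms):.2f}")

# Binarized-embedding variant: treat dimensions above a threshold as "present".
# Random vectors stand in for real embeddings; 0.5 is an arbitrary threshold.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=384)
emb_b = rng.normal(size=384)
set_a = set(np.flatnonzero(emb_a > 0.5))
set_b = set(np.flatnonzero(emb_b > 0.5))
print(f"binarized Jaccard: {jaccard(set_a, set_b):.2f}")
```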
The lecture highlights cosine similarity as the default for RAG due to its focus on angular similarity, which is robust for high-dimensional embeddings and insensitive to magnitude. Here’s how alternatives compare:
- Euclidean Distance: Focuses on geometric distance, sensitive to magnitude; better for exact matches but less robust in high dimensions.
- Manhattan Distance: Emphasizes absolute differences; robust to outliers but less semantic than cosine.
- Dot Product: Similar to cosine for normalized vectors; faster but magnitude-sensitive otherwise.
- Jaccard Similarity: Set-based, less suited for dense embeddings but useful for sparse or keyword-based RAG.
- Vector Database Support: Most vector databases (e.g., FAISS, Pinecone, as in the lecture) support Euclidean distance and dot product natively, with Manhattan distance often available. Jaccard similarity may require custom preprocessing.
- Normalization: For dot product and distance metrics, normalize embeddings (e.g., unit length) to align with cosine similarity’s focus on direction, especially for semantic tasks like the lecture’s similarity search.
- Hybrid Approaches: Combine metrics (e.g., cosine for semantic search, Euclidean for fine-grained ranking) in RAG fusion, as discussed in the lecture, to improve retrieval quality.
- Testing: In the lecture’s hands-on example, experiment with these metrics in the GitHub repo’s vector database setup. For instance, use FAISS with Euclidean distance instead of cosine to compare retrieval results for queries like “Buffett’s risk strategy.”
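As a starting point for that experiment, here is a minimal FAISS sketch, assuming the faiss and sentence-transformers packages are installed; the chunk texts are placeholders for the repo’s actual shareholder-letter chunks:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder chunk texts standing in for the repo's shareholder-letter chunks
chunks = [
    "Buffett's approach to managing risk",
    "Berkshire's view on dividends",
    "Long-term value investing principles",
]
vecs = model.encode(chunks).astype("float32")                   # FAISS expects float32
query = model.encode(["Buffett's risk strategy"]).astype("float32")

# Euclidean (L2) index: exact, brute-force search
l2_index = faiss.IndexFlatL2(vecs.shape[1])
l2_index.add(vecs)
l2_dist, l2_ids = l2_index.search(query, 2)

# Cosine ranking: L2-normalize the vectors, then use an inner-product index
vecs_n, query_n = vecs.copy(), query.copy()
faiss.normalize_L2(vecs_n)
faiss.normalize_L2(query_n)
ip_index = faiss.IndexFlatIP(vecs_n.shape[1])
ip_index.add(vecs_n)
ip_sim, ip_ids = ip_index.search(query_n, 2)

print("L2 top hits:    ", [chunks[i] for i in l2_ids[0]])
print("cosine top hits:", [chunks[i] for i in ip_ids[0]])
```

Comparing the two result lists for the same query is a quick way to see whether the choice of metric actually changes retrieval on your data.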
While cosine similarity is ideal for most RAG systems due to its robustness in high-dimensional semantic spaces, alternatives like Euclidean distance, Manhattan distance, dot product, and Jaccard similarity offer unique advantages. Euclidean and Manhattan are great for distance-based ranking, especially with normalized vectors or sparse data. Dot product is a fast alternative for normalized embeddings, while Jaccard suits keyword or categorical data. For the lecture’s Berkshire Hathaway example, start with cosine or Euclidean distance in the vector database, and experiment with others based on your data’s characteristics (e.g., numerical or categorical content) and performance needs.