Embeddings and dimensionality reduction in the context of RAG (Retrieval Augmented Generation).

Vector embeddings typically have many dimensions (e.g., 768, 1024, or 1536, depending on the model), but you're correct that they can often be compressed while retaining much of their utility. Here's why and how this works:

  1. Primary Principles:
  • Embeddings encode semantic meaning across their dimensions
  • The dimensions aren't necessarily ordered by importance in their raw form
  • However, we can transform them to be ordered by importance
  2. Dimensionality Reduction Methods:

The most common approach is Principal Component Analysis (PCA), which:

  • Finds the directions (principal components) of maximum variance in the data
  • Reorients the dimensions so they ARE ordered by importance
  • After PCA, the first dimensions capture the most variance/information (a short sketch follows this list)
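
For concreteness, here's a minimal sketch of that transformation using scikit-learn; the library choice, file name, and 256-dimension target are illustrative assumptions, not something fixed above:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical matrix of document embeddings, one 1536-d vector per row.
embeddings = np.load("doc_embeddings.npy")

# Fit PCA and project onto the 256 highest-variance directions.
pca = PCA(n_components=256)
reduced = pca.fit_transform(embeddings)          # shape: (n_docs, 256)

# After the transform, dimensions are ordered by importance:
print(pca.explained_variance_ratio_[:5])         # the first components dominate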

Let me create a simple visualization to demonstrate this concept:

import React from 'react';
import { LineChart, Line, XAxis, YAxis, CartesianGrid, Tooltip, Legend, ResponsiveContainer } from 'recharts';

const DimensionalityVisualization = () => {
  // Simulate the variance explained by each principal component as an
  // exponential decay, then normalize so the per-dimension values sum to 100%.
  const raw = Array.from({ length: 50 }, (_, i) => Math.exp(-i / 10));
  const total = raw.reduce((sum, v) => sum + v, 0);

  let cumulative = 0;
  const data = raw.map((v, i) => {
    const varianceExplained = (v / total) * 100;
    cumulative += varianceExplained;
    return { dimension: i + 1, varianceExplained, cumulativeVariance: cumulative };
  });

  return (
    <div className="w-full max-w-4xl p-4">
      <h2 className="text-xl font-bold mb-4">Information Retention vs. Dimensions</h2>
      <div className="h-96">
        <ResponsiveContainer width="100%" height="100%">
          <LineChart data={data} margin={{ top: 20, right: 30, left: 20, bottom: 30 }}>
          <CartesianGrid strokeDasharray="3 3" />
          <XAxis 
            dataKey="dimension" 
            label={{ value: 'Number of Dimensions', position: 'bottom', offset: 0 }}
          />
          <YAxis 
            label={{ value: 'Percentage (%)', angle: -90, position: 'left', offset: 10 }}
          />
          <Tooltip />
          <Legend />
          <Line 
            type="monotone" 
            dataKey="varianceExplained" 
            name="Variance per Dimension"
            stroke="#8884d8" 
            dot={false}
          />
          <Line 
            type="monotone" 
            dataKey="cumulativeVariance" 
            name="Cumulative Variance"
            stroke="#82ca9d" 
            dot={false}
          />
          </LineChart>
        </ResponsiveContainer>
      </div>
      <div className="mt-4 text-sm text-gray-600">
        <p>The blue line shows the amount of information captured by each dimension after PCA transformation.</p>
        <p>The green line shows the cumulative information retained as dimensions are added.</p>
      </div>
    </div>
  );
};

export default DimensionalityVisualization;

Key insights from the visualization:

  1. Information Distribution:
  • After PCA transformation, the first few dimensions capture the most significant semantic patterns
  • There's typically an exponential decay in information content
  • The later dimensions often capture increasingly subtle nuances or noise
  2. Practical Implications:
  • You can often retain 80-90% of the information with just 20-30% of the dimensions
  • For many RAG applications, you might reduce 1536D vectors to 256D or even 128D
  • The exact reduction depends on your specific needs for precision vs. efficiency
  3. Trade-offs:
  • Fewer dimensions = faster retrieval and less storage
  • More dimensions = more nuanced similarity comparisons
  • The optimal dimension count depends on your specific use case (a rough way to pick it is sketched below)
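
A rough way to find that sweet spot, assuming scikit-learn (the 90% target and file name are illustrative):

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.load("doc_embeddings.npy")       # hypothetical (n_docs, 1536) array

# Fit a full PCA, then keep the smallest k whose cumulative variance reaches 90%.
pca = PCA().fit(embeddings)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"{k} dimensions retain ~90% of the variance")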

Let me describe what the chart was meant to show and then create a simpler visual using SVG instead:

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 500 300">
    <!-- Background -->
    <rect width="500" height="300" fill="#ffffff"/>
    
    <!-- Grid lines -->
    <g stroke="#e0e0e0" stroke-width="1">
        <line x1="50" y1="250" x2="450" y2="250"/>
        <line x1="50" y1="200" x2="450" y2="200"/>
        <line x1="50" y1="150" x2="450" y2="150"/>
        <line x1="50" y1="100" x2="450" y2="100"/>
        <line x1="50" y1="50" x2="450" y2="50"/>
    </g>
    
    <!-- Axes -->
    <line x1="50" y1="250" x2="450" y2="250" stroke="black" stroke-width="2"/>
    <line x1="50" y1="50" x2="50" y2="250" stroke="black" stroke-width="2"/>
    
    <!-- Information retention curve -->
    <path d="M50 250 Q 150 100, 450 50" 
          fill="none" 
          stroke="#4CAF50" 
          stroke-width="3"/>
    
    <!-- Individual dimension importance -->
    <path d="M50 250 C 100 80, 150 150, 450 240" 
          fill="none" 
          stroke="#2196F3" 
          stroke-width="3"
          stroke-dasharray="5,5"/>
    
    <!-- Labels -->
    <g font-family="Arial" font-size="12">
        <!-- Y-axis labels -->
        <text x="10" y="255">0%</text>
        <text x="10" y="205">25%</text>
        <text x="10" y="155">50%</text>
        <text x="10" y="105">75%</text>
        <text x="10" y="55">100%</text>
        
        <!-- X-axis labels -->
        <text x="50" y="270">0</text>
        <text x="240" y="270">Number of Dimensions</text>
        <text x="440" y="270">n</text>
        
        <!-- Title and Legend -->
        <text x="250" y="30" text-anchor="middle" font-size="14" font-weight="bold">Information Retention vs. Dimensions</text>
        <g transform="translate(320, 80)">
            <line x1="0" y1="0" x2="20" y2="0" stroke="#4CAF50" stroke-width="3"/>
            <text x="25" y="5">Cumulative Information</text>
            <line x1="0" y1="20" x2="20" y2="20" stroke="#2196F3" stroke-width="3" stroke-dasharray="5,5"/>
            <text x="25" y="25">Per-Dimension Importance</text>
        </g>
        
        <!-- Key points -->
        <text x="150" y="140" fill="#666">~80% information</text>
        <text x="150" y="155" fill="#666">retained here</text>
    </g>
    
    <!-- Annotation arrow -->
    <path d="M160 160 L 190 180" stroke="#666" stroke-width="1" marker-end="url(#arrow)"/>
    
    <!-- Arrow marker definition -->
    <defs>
        <marker id="arrow" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto" markerUnits="strokeWidth">
            <path d="M0,0 L0,6 L9,3 z" fill="#666"/>
        </marker>
    </defs>
</svg>


This visualization shows two key curves:

  1. The solid green line shows cumulative information retention:
  • It rises sharply at first, showing that the first few dimensions capture a lot of information
  • It then levels off, demonstrating diminishing returns from additional dimensions
  • You can see that around 20-30% of the dimensions capture about 80% of the information
  2. The dashed blue line shows the importance of each individual dimension:
  • It starts high (first dimensions are very important)
  • It drops off rapidly
  • Later dimensions contribute very little new information

This illustrates why dimension reduction works: after transforming your embeddings (using something like PCA), the information becomes "front-loaded" into the earlier dimensions. This means you can often cut off the later dimensions while retaining most of the semantic meaning that's useful for retrieval.

A practical example (with a sanity-check sketch after the list):

  • Original embedding: 1536 dimensions
  • After PCA, keeping first 256 dimensions (~17%)
  • Might retain 85-90% of the useful information
  • Much faster similarity computations
  • Significantly reduced storage requirements
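
One way to sanity-check a reduction like this is to compare nearest-neighbor results before and after. The sketch below uses synthetic data with a decaying spectrum purely so it runs end to end; treat it as a template to point at your own embedding matrix:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Synthetic stand-in with a decaying spectrum; real embeddings are similarly front-loaded.
spectrum = np.exp(-np.arange(1536) / 100.0)
embeddings = rng.normal(size=(5000, 1536)) * spectrum

pca = PCA(n_components=256).fit(embeddings)
reduced = pca.transform(embeddings)

queries_full, queries_reduced = embeddings[:10], reduced[:10]
top_full = np.argsort(-cosine_similarity(queries_full, embeddings), axis=1)[:, :50]
top_reduced = np.argsort(-cosine_similarity(queries_reduced, reduced), axis=1)[:, :50]

# Fraction of each query's top-50 neighbors that survive the reduction.
overlap = np.mean([len(set(a) & set(b)) / 50 for a, b in zip(top_full, top_reduced)])
print(f"Top-50 overlap between 1536-d and 256-d search: {overlap:.0%}")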

The exact number of dimensions to keep depends on your specific use case and the trade-off you want to make between:

  • Retrieval accuracy
  • Computation speed
  • Storage requirements

Would you like me to explain more about how to implement this in practice, or would you like to know more about how PCA achieves this transformation?

  1. Implementation Considerations:
  • Always fit PCA on your training data, then apply the same transformation to new vectors
  • Consider using incremental PCA for large datasets
  • Test different dimension counts to find the sweet spot for your application

The reason this works so well for RAG is that most semantic queries don't need the ultra-fine distinctions that the full dimensionality provides. The first N dimensions after PCA capture the broad semantic relationships that drive most retrievals.
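
A minimal sketch of that workflow with scikit-learn's IncrementalPCA; the batch size, target dimension, and load_training_batches() loader are hypothetical placeholders:

import numpy as np
from sklearn.decomposition import IncrementalPCA

REDUCED_DIMS = 256

ipca = IncrementalPCA(n_components=REDUCED_DIMS, batch_size=10_000)

# Fit on the training corpus in chunks; each batch is an (n, 1536) array.
for batch in load_training_batches():            # hypothetical data loader
    ipca.partial_fit(batch)

# Apply the *same* fitted transformation to every new document or query vector.
def reduce_vector(vec: np.ndarray) -> np.ndarray:
    return ipca.transform(vec.reshape(1, -1))[0]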

Save a short version (after transformation) for sharding

This is a really clever application of dimensionality reduction for search optimization. You could implement a multi-tier search strategy where:

  1. First Pass (Quick Filter):
  • Use shortened vectors (e.g., 128D) for initial broad search
  • Quickly eliminate obviously irrelevant documents
  • Can shard these smaller vectors across many machines efficiently
  • Keep these in memory for super-fast approximate nearest neighbor (ANN) search
  2. Second Pass (Refinement):
  • Only on the most promising candidates from first pass
  • Use full vectors (e.g., 1536D) for final ranking
  • More precise but more computationally expensive
  • Can be done on fewer documents thanks to first pass

Here's a rough architecture of how this might work:
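
A minimal Python sketch of the flow; quick_index, full_store, and their search/lookup calls are illustrative assumptions rather than any particular library's API, and pca is the fitted reducer from earlier:

import numpy as np

def two_pass_search(query_full, pca, quick_index, full_store,
                    top_k_quick=1000, top_k_final=50):
    # First pass: ANN search over the reduced (e.g. 128-d) in-memory shards.
    query_quick = pca.transform(query_full.reshape(1, -1))[0]
    candidate_ids = quick_index.search(query_quick, top_k_quick)

    # Second pass: rerank only those candidates with the full 1536-d vectors.
    candidates = np.stack([full_store[doc_id] for doc_id in candidate_ids])
    scores = candidates @ query_full / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query_full)
    )
    best = np.argsort(-scores)[:top_k_final]
    return [candidate_ids[i] for i in best]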

The benefits of this approach include:

  1. Performance:
  • Lower memory footprint for primary index
  • Faster initial search across shards
  • Can parallelize first pass effectively
  • Only pay full computation cost on promising candidates
  2. Scalability:
  • Easier to distribute shortened vectors across nodes
  • Can fit more documents in memory
  • Lower network bandwidth when coordinating between shards
  • More efficient caching
  3. Flexibility:
  • Can adjust precision vs. speed by tuning:
    • Dimension count for quick search
    • Number of candidates passed to refinement
    • Number of shards
  • Can even have multiple tiers with different dimension counts

Example configuration:

class SearchConfig:
    FULL_DIMS = 1536        # Original embedding size
    QUICK_DIMS = 128        # Reduced size for first pass
    SHARD_SIZE = 1_000_000  # Vectors per shard
    TOP_K_QUICK = 1000      # Candidates from first pass
    TOP_K_FINAL = 50        # Final results after reranking

The trade-off is a small potential loss in recall, but with proper tuning, this is often negligible compared to the massive gains in performance and scalability. The key is finding the sweet spot for your specific use case in terms of:

  • Quick search dimension count
  • Number of candidates to rerank
  • Acceptable latency vs. precision requirements

Would you like me to elaborate on any particular aspect of this architecture or discuss specific implementation strategies?
