Elasticsearch supports vector search, but it essentially expects all vector data to be resident in RAM (in off-heap memory). Until now, there was no way to know how much memory an index storing vector data required; starting from v9.1, however, metrics related to vector data can be obtained.
This article introduces how to obtain these metrics and their meanings. Additionally, we compare the metrics when storing vectors with four types of index options: Flat, HNSW, Int8 HNSW, and BBQ HNSW, and verify the impact of each index option on RAM.
Vector data in Elasticsearch is stored in off-heap memory. Off-heap refers to native memory areas outside of the JVM's heap memory. By using off-heap memory, Elasticsearch/Lucene can efficiently handle large amounts of vector data. However, since it is managed separately from the JVM's heap memory, it is not included in regular JVM memory usage metrics. Therefore, it is necessary to obtain off-heap memory usage through alternative methods.
Elasticsearch provides several index options when storing vector data. These options affect how vector data is stored and search performance. By referring to the following, you can check the theoretical memory usage required for vector data for each index option.
Summarized in a table, it looks like this:
| element_type | Quantization | Theoretical Memory Usage |
|---|---|---|
| float | None | num_vectors * num_dimensions * 4 |
| float | int8 | num_vectors * (num_dimensions + 4) |
| float | int4 | num_vectors * (num_dimensions / 2 + 4) |
| float | bbq | num_vectors * (num_dimensions / 8 + 14) |
| byte | None | num_vectors * num_dimensions |
| bit | None | num_vectors * (num_dimensions / 8) |
Additionally, when using HNSW, extra memory is required for the HNSW graph. The theoretical memory usage for the HNSW graph is as follows:
num_vectors * 4 * HNSW.m
Here, HNSW.m is a parameter of the HNSW algorithm, with a default value of 16.
When creating a new index to store vector data, you can use these theoretical values as a reference to estimate how much off-heap memory the vector data stored in the index will actually require.
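The formulas above can be wrapped in a small estimator to sanity-check capacity plans before creating an index. This is a minimal sketch; the function names are our own illustration, not an Elasticsearch API, and only the `float` element_type is covered.

```python
def estimate_vector_bytes(num_vectors, num_dimensions, quantization=None):
    """Theoretical raw vector storage (element_type: float), per the table above."""
    if quantization is None:
        return num_vectors * num_dimensions * 4          # 4 bytes per float dimension
    if quantization == "int8":
        return num_vectors * (num_dimensions + 4)        # 1 byte per dim + 4-byte correction
    if quantization == "int4":
        return num_vectors * (num_dimensions // 2 + 4)   # half a byte per dim + 4-byte correction
    if quantization == "bbq":
        return num_vectors * (num_dimensions // 8 + 14)  # 1 bit per dim + 14-byte overhead
    raise ValueError(f"unknown quantization: {quantization}")


def estimate_hnsw_graph_bytes(num_vectors, m=16):
    """Theoretical HNSW graph size; m defaults to 16, as in Elasticsearch."""
    return num_vectors * 4 * m


# Example: 1,000,000 vectors with 768 dimensions
print(estimate_vector_bytes(1_000_000, 768))           # 3072000000 bytes (~3 GB, float)
print(estimate_vector_bytes(1_000_000, 768, "bbq"))    # 110000000 bytes (~110 MB, bbq)
print(estimate_hnsw_graph_bytes(1_000_000))            # 64000000 bytes (~64 MB, m=16)
```

Comparing the first two numbers makes the appeal of quantization obvious: BBQ shrinks the RAM-resident vector data by roughly a factor of 32.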
So, how can we obtain the off-heap memory usage of vector data used in an Elasticsearch index that is actually in operation?
Starting from Elasticsearch v9.1, metrics related to vector memory can be obtained using the Get index statistics API. As shown below, you can use the filter_path parameter to extract only vector-related metrics.
```
GET my_vector_index/_stats?filter_path=*.primaries.dense_vector
```
Using this API, you can obtain vector-related metrics like the following:
```json
{
  "_all": {
    "primaries": {
      "dense_vector": {
        "value_count": 764,
        "off_heap": {
          "total_size_bytes": 1229092,
          "total_vec_size_bytes": 1173504,
          "total_veq_size_bytes": 0,
          "total_veb_size_bytes": 47368,
          "total_vex_size_bytes": 8220
        }
      }
    }
  }
}
```
The meaning of each element is shown in the table below.
| Metric Name | Description |
|---|---|
| value_count | Total number of vectors in the index |
| total_size_bytes | Total size of vector data used in off-heap memory |
| total_vec_size_bytes | Size of non-quantized vector data |
| total_veq_size_bytes | Size of quantized vector data (int4 or int8). The 'q' in veq stands for quantization. |
| total_veb_size_bytes | Size of binary quantized vector data (bbq). The 'b' in veb stands for binary. |
| total_vex_size_bytes | Size of HNSW graph |
In the above example, bbq quantized vectors are used, so total_veq_size_bytes is 0. If int4 or int8 is used, total_veb_size_bytes will be 0, and the size will be displayed in total_veq_size_bytes.
Among these, the items that should fit in RAM can be summarized as follows:
| Index type | Items that should fit in RAM |
|---|---|
| flat | vec |
| hnsw | vec, vex |
| int8_hnsw | veq, vex |
| bbq_hnsw | veb, vex |
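As a rough illustration, the mapping above can be turned into a small helper that sums the RAM-resident components out of the `off_heap` section of the `_stats` response. The helper and its lookup table are our own sketch, not part of any Elasticsearch API.

```python
# Which off-heap components should stay in RAM, per index type (from the table above)
RAM_RESIDENT = {
    "flat": ["total_vec_size_bytes"],
    "hnsw": ["total_vec_size_bytes", "total_vex_size_bytes"],
    "int8_hnsw": ["total_veq_size_bytes", "total_vex_size_bytes"],
    "bbq_hnsw": ["total_veb_size_bytes", "total_vex_size_bytes"],
}


def ram_resident_bytes(off_heap, index_type):
    """Sum the off-heap components that should fit in RAM for this index type."""
    return sum(off_heap.get(key, 0) for key in RAM_RESIDENT[index_type])


# Example with the sample off_heap response shown earlier (bbq quantized)
off_heap = {
    "total_size_bytes": 1229092,
    "total_vec_size_bytes": 1173504,
    "total_veq_size_bytes": 0,
    "total_veb_size_bytes": 47368,
    "total_vex_size_bytes": 8220,
}
print(ram_resident_bytes(off_heap, "bbq_hnsw"))  # 55588 (veb + vex)
```

Note how much smaller this is than `total_size_bytes`: with BBQ, the raw float vectors (`vec`) stay on disk and only the quantized vectors and the graph need to be RAM-resident.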
However, note that these metrics are theoretical values derived from the actual vector count and index settings: they state how much memory would be required if all stored vectors were fully loaded. There is no accurate way to determine, at the OS level, exactly how much memory the vector data is occupying at any given moment, not least because RAM is shared with other processes.
Nevertheless, Elasticsearch assumes that this vector data (total_size_bytes) is fully resident in RAM when performing searches. By referring to these metrics, you can understand the resources that Elasticsearch requires.
We actually loaded data into Elasticsearch and verified how closely the above metrics match the theoretical values.
We compared the metrics when registering 100 64-dimensional vectors with four types of index options: Flat, HNSW, Int8 HNSW, and BBQ HNSW. The results are as follows:
| Index type | value_count | total_size_bytes | vec | veq | veb | vex |
|---|---|---|---|---|---|---|
| flat | 100 | 25600 | **25600** | 0 | 0 | 0 |
| hnsw | 100 | 26780 | **25600** | 0 | 0 | **1180** |
| int8_hnsw | 100 | 33601 | 25600 | **6800** | 0 | **1201** |
| bbq_hnsw | 100 | 28982 | 25600 | 0 | **2200** | **1182** |
Here, the items that should fit in memory are shown in bold. The units are in bytes.
Comparing the theoretical values with the actual metrics for each index option yields the following results:
- The vector data (vec) values are all 25,600, which perfectly matches the theoretical value (num_vectors * num_dimensions * 4 = 100 * 64 * 4).
- The HNSW graph (vex) values are considerably smaller than the theoretical value (num_vectors * 4 * HNSW.m = 100 * 4 * 32 = 12,800, since the test script sets m to 32). This is likely because the number of graph connections stays low when only a few vectors are indexed, so for actual operations we recommend verifying with a realistically sized dataset. Also, since this value depends on the structure of the graph that is built, it varies with the actual vectors registered.
- The Int8 quantization (veq) value perfectly matches the theoretical value (num_vectors * (num_dimensions + 4) = 100 * (64 + 4) = 6800).
- The BBQ quantization (veb) value also perfectly matches the theoretical value (num_vectors * (num_dimensions / 8 + 14) = 100 * (64 / 8 + 14) = 2200).
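The arithmetic behind these comparisons can be double-checked in a few lines (a trivial sanity sketch; the constants come straight from the formulas in the table earlier):

```python
# Sanity-check the theoretical values cited above for 100 vectors of 64 dimensions
num_vectors, dims = 100, 64

vec_bytes = num_vectors * dims * 4           # float: 4 bytes per dimension
veq_bytes = num_vectors * (dims + 4)         # int8: 1 byte per dim + 4-byte correction
veb_bytes = num_vectors * (dims // 8 + 14)   # bbq: 1 bit per dim + 14-byte overhead

print(vec_bytes, veq_bytes, veb_bytes)  # 25600 6800 2200
```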
The above results were obtained with the following code. Set ES_URL and ES_API_KEY as environment variables or in a .env file and run it to print the table above.
```python
# Test the vector quantizations
import os

import numpy as np
from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from tqdm import tqdm

# Load environment variables from .env file (ES_URL and ES_API_KEY)
load_dotenv()

TEST_SPECS = [
    # Format: (index_name, index_options.type)
    ("vec_float_flat", "flat"),
    ("vec_float_hnsw", "hnsw"),
    ("vec_int8_hnsw", "int8_hnsw"),
    ("vec_bbq_hnsw", "bbq_hnsw"),
]

NUM_VECTORS = 100
DIM = 64
M = 32
EF_CONSTRUCTION = 100
BULK_BATCH_SIZE = 500  # Number of documents per bulk request


def create_index(es, index_name, index_type):
    """Create index with given name and type."""
    # Delete index if it exists
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name)
    body = {
        "mappings": {
            "properties": {
                "vector": {
                    "type": "dense_vector",
                    "dims": DIM,
                    "similarity": "cosine",
                    "element_type": "float",
                    "index_options": {
                        "type": index_type
                    }
                }
            }
        }
    }
    if index_type in ("hnsw", "int8_hnsw", "bbq_hnsw"):
        body["mappings"]["properties"]["vector"]["index_options"].update({
            "m": M,
            "ef_construction": EF_CONSTRUCTION
        })
    es.indices.create(index=index_name, body=body)
    print(f"Created index: {index_name} with type: {index_type}")


def ingest_vector(es, index_name, vectors):
    """Ingest sample vectors into the given index."""
    total_vectors = len(vectors)
    num_batches = (total_vectors + BULK_BATCH_SIZE - 1) // BULK_BATCH_SIZE
    print(f"Ingesting {total_vectors} vectors into {index_name}")
    # Process vectors in batches with progress bar
    with tqdm(total=total_vectors, desc=f"Bulk indexing to {index_name}", unit="docs") as pbar:
        for batch_num in range(num_batches):
            start_idx = batch_num * BULK_BATCH_SIZE
            end_idx = min(start_idx + BULK_BATCH_SIZE, total_vectors)
            bulk_body = []
            for i in range(start_idx, end_idx):
                bulk_body.append({"index": {"_index": index_name, "_id": str(i)}})
                bulk_body.append({"vector": vectors[i]})
            es.bulk(body=bulk_body)
            pbar.update(end_idx - start_idx)
    # Refresh index to make documents searchable
    es.indices.refresh(index=index_name)
    # Flush to disk
    es.indices.flush(index=index_name)
    # Force merge to combine all segments into 1
    es.indices.forcemerge(index=index_name, max_num_segments=1)
    print(f"Completed ingestion of {total_vectors} vectors into {index_name}")


def test_vector_quantizations(es):
    """Check the off-heap metrics for each index and print the results as a markdown table."""
    print("## Parameters\n")
    print(f"- Number of vectors: {NUM_VECTORS}")
    print(f"- Dimensions: {DIM}")
    print(f"- HNSW M: {M}")
    print(f"- HNSW ef_construction: {EF_CONSTRUCTION}")
    print("\n## Off-heap Memory Usage\n")
    print("| Index type | value_count | total_size_bytes | vec | veq | veb | vex |")
    print("|------------|-------------|------------------|-------------|-------------|-------------|-------------|")

    def get_formatted_off_heap_size(off_heap, key):
        size = off_heap.get(key, 0)
        return f"{size:,}"

    for index_name, index_type in TEST_SPECS:
        # Get index stats with dense_vector metrics
        stats = es.indices.stats(index=index_name)
        # Extract dense_vector information
        dense_vector = stats['indices'][index_name]['primaries'].get('dense_vector', {})
        off_heap = dense_vector.get('off_heap', {})
        # Get off-heap memory breakdown
        vec_fmt = get_formatted_off_heap_size(off_heap, 'total_vec_size_bytes')
        veq_fmt = get_formatted_off_heap_size(off_heap, 'total_veq_size_bytes')
        veb_fmt = get_formatted_off_heap_size(off_heap, 'total_veb_size_bytes')
        vex_fmt = get_formatted_off_heap_size(off_heap, 'total_vex_size_bytes')
        total_fmt = get_formatted_off_heap_size(off_heap, 'total_size_bytes')
        # Get document count
        count = dense_vector.get('value_count', 0)
        count_fmt = f"{count:,}"
        print(f"| {index_type:10} | {count_fmt:>11} | {total_fmt:>16} | {vec_fmt:>11} | {veq_fmt:>11} | {veb_fmt:>11} | {vex_fmt:>11} |")


if __name__ == "__main__":
    es_url = os.getenv("ES_URL", "http://localhost:9200")
    es_api_key = os.getenv("ES_API_KEY")
    # Connect to Elasticsearch
    if es_api_key:
        es = Elasticsearch(es_url, api_key=es_api_key)
    else:
        es = Elasticsearch(es_url)
    # Generate vectors once for all tests
    print(f"Generating {NUM_VECTORS} random vectors with {DIM} dimensions...")
    vectors = [np.random.rand(DIM).tolist() for _ in range(NUM_VECTORS)]
    for index_name, index_type in TEST_SPECS:
        create_index(es, index_name, index_type)
        ingest_vector(es, index_name, vectors)
    test_vector_quantizations(es)
```

Metrics for off-heap memory usage related to Elasticsearch vector data became available via the Get index statistics API starting from v9.1. By using these metrics, you can understand how much off-heap memory the vector data stored in an index actually requires. Comparing the theoretical values with the actual metrics for each index option confirmed that in most cases they match the theoretical values. Please leverage this information to operate Elasticsearch's vector search functionality effectively.