@glaforge
Created February 16, 2025 13:30
Create a new embedding store

Here is a breakdown of how to create a new LangChain4j embedding store module for Google Cloud Firestore, with the key steps and considerations, mirroring the structure of existing modules such as langchain4j-milvus.

Project Structure

Your project structure should follow the established pattern. I'll create a simplified version based on the most relevant parts from the provided file listing. The full structure would be much larger (like the main langchain4j project), but this captures the essentials:

langchain4j/
└── langchain4j-embedding-store-google-firestore/  (or similar name)
    ├── pom.xml                                     (Your module's Maven build file)
    └── src/
        ├── main/
        │   └── java/
        │       └── dev/
        │           └── langchain4j/
        │               └── store/
        │                   └── embedding/
        │                       └── google/
        │                           └── firestore/
        │                               ├── GoogleFirestoreEmbeddingStore.java  (Main class)
        │                               └── (Any other supporting classes, like request/response objects)
        └── test/
            └── java/
                └── dev/
                    └── langchain4j/
                        └── store/
                            └── embedding/
                                └── google/
                                    └── firestore/
                                        └── GoogleFirestoreEmbeddingStoreIT.java (Integration test)

Steps to Create the Module

  1. Project Setup (Maven Module):

    • Create a new directory named langchain4j-embedding-store-google-firestore (or a similar, descriptive name) within the langchain4j parent directory.
    • Create a pom.xml file inside this new directory. This file will define your module as a Maven project and specify its dependencies. Use langchain4j-milvus/pom.xml as a guide, but adapt it:
      • Parent: Set the parent to langchain4j-parent, as the existing modules do.
      • ArtifactId: Use langchain4j-embedding-store-google-firestore.
      • Name and Description: Update these appropriately.
      • Dependencies: Include langchain4j-core as a compile-time dependency. Add the Google Cloud Firestore Java client library as a dependency:
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-firestore</artifactId>
            <version>YOUR_VERSION_HERE</version> <!-- Look up the latest version -->
        </dependency>
        Add any other necessary dependencies, such as logging (SLF4J), Lombok (if desired, but try to avoid in new code), etc. Minimize dependencies as much as possible.
      • Test Dependencies: Include JUnit, AssertJ, and any mocking libraries (like Mockito) in the test scope.
      • Licenses: Ensure your module has the correct Apache 2.0 license information in the POM.
  2. Implement EmbeddingStore<TextSegment>:

    • Create GoogleFirestoreEmbeddingStore.java in the dev.langchain4j.store.embedding.google.firestore package.
    • Implement the EmbeddingStore<TextSegment> interface from langchain4j-core. This is the crucial part. You'll need to implement the following methods, mapping them to Firestore operations:
      • add(Embedding embedding): Adds a single embedding to the store, generating a unique ID.
      • add(String id, Embedding embedding): Adds an embedding with a specified ID.
      • add(Embedding embedding, TextSegment textSegment): Adds an embedding with associated text and metadata.
      • addAll(List<Embedding> embeddings): Adds multiple embeddings.
      • addAll(List<String> ids, List<Embedding> embeddings, List<TextSegment> textSegments): Adds multiple embeddings with associated IDs and text segments.
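      • removeAll(Collection<String> ids): Deletes the embeddings with the given IDs.
      • removeAll(): Deletes all embeddings in the store.
      • removeAll(Filter filter): Deletes the embeddings whose metadata matches the filter.
      • search(EmbeddingSearchRequest request): Finds the embeddings most similar to the query embedding, honoring maxResults, minScore, and the optional filter.
      • findRelevant(Embedding referenceEmbedding, int maxResults, double minScore): The older search signature; implement it only if the LangChain4j version you target still declares it.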
    • Considerations for Firestore Implementation:
      • Data Model: How will you store the embeddings (as float arrays) and associated data (text, metadata)? Firestore uses a NoSQL document model. You'll likely store each TextSegment and its Embedding as a document in a collection.
      • Metadata: Firestore supports storing metadata as document fields. You'll need a way to map TextSegment metadata (a Metadata object holding string keys and simple values such as strings and numbers) to Firestore document fields. You have a few options here, mirroring what existing modules do:
        • Individual Columns (Preferred): Each metadata key becomes a separate field in the document. This is efficient for querying but requires knowing the metadata keys in advance. The MariaDB integration uses this approach with MetadataColumDefinition.
        • Single JSON Field: Store all metadata as a single JSON string. This is flexible but less efficient for filtering. The Chroma integration uses this.
        • Mixed Approach: Common metadata fields (like "source" or "document_id") could be separate fields, and a catch-all "metadata" field could store the rest as JSON.
      • Vector Search: Firestore has native vector search (nearest-neighbor queries over vector fields). You'll use this for the search method (and for findRelevant, if you support the older signature). The key is mapping LangChain4j's EmbeddingSearchRequest (which carries the query embedding, maxResults, minScore, and an optional filter) to the appropriate Firestore vector search query.
      • Filtering: Implement filtering based on metadata using Firestore's query capabilities. LangChain4j's Filter interface needs to be translated into a Firestore query. Look at existing implementations like ChromaMetadataFilterMapper for inspiration, but adapt it to Firestore's query syntax.
      • Error Handling: Wrap Firestore exceptions in RuntimeException or a more specific custom exception.
      • Configuration: Use a Builder pattern (like OpenAiChatModel does) to allow users to configure:
        • Firestore project ID, database, collection name
        • Credentials (service account key or Application Default Credentials)
        • Timeout settings
        • Maximum number of results (maxResults)
        • Minimum relevance score (minScore)
        • Field names for text, embedding, and metadata (if configurable)
        • Possibly options for automatically creating the collection/indexes.
      • Indexing: You must create the necessary vector index in Firestore for vector search to work. Firestore's documentation explains how to do this with the gcloud CLI. The code should, ideally, at least check whether the index exists. It could even attempt to create it, but this would require more permissions.
      • Concurrency: The store may be shared across threads, so the implementation should be thread-safe.
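The data-model and metadata options above can be made concrete with a small mapping function. What follows is a minimal sketch of the mixed approach, assuming hypothetical field names (`text`, `embedding`, `metadata`) and a hypothetical `FirestoreDocumentMapper` class that are not part of LangChain4j or the Firestore client. It keeps the remaining metadata as a nested map rather than a JSON string, and it only builds a plain `Map`; a real store would still have to convert the vector into the Firestore client's own vector value before writing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: turns an embedding, its text, and its metadata
// into the field map of one Firestore document.
final class FirestoreDocumentMapper {

    // Assumed field names; a real store would make these configurable.
    static final String TEXT_FIELD = "text";
    static final String EMBEDDING_FIELD = "embedding";
    static final String METADATA_FIELD = "metadata";

    // Metadata keys that become top-level fields so they can be filtered on directly.
    private final Set<String> promotedKeys;

    FirestoreDocumentMapper(Set<String> promotedKeys) {
        for (String key : promotedKeys) {
            if (key.equals(TEXT_FIELD) || key.equals(EMBEDDING_FIELD) || key.equals(METADATA_FIELD)) {
                throw new IllegalArgumentException("Promoted key clashes with a reserved field: " + key);
            }
        }
        this.promotedKeys = Set.copyOf(promotedKeys);
    }

    Map<String, Object> toDocument(float[] vector, String text, Map<String, Object> metadata) {
        Map<String, Object> document = new LinkedHashMap<>();

        // Widen the float vector to doubles, the numeric type used for stored numbers here.
        List<Double> values = new ArrayList<>(vector.length);
        for (float v : vector) {
            values.add((double) v);
        }
        document.put(EMBEDDING_FIELD, values);

        // Text is optional: an embedding may be stored without a segment.
        if (text != null) {
            document.put(TEXT_FIELD, text);
        }

        // Promoted keys go to the top level, everything else into the catch-all map.
        Map<String, Object> remaining = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : metadata.entrySet()) {
            if (promotedKeys.contains(entry.getKey())) {
                document.put(entry.getKey(), entry.getValue());
            } else {
                remaining.put(entry.getKey(), entry.getValue());
            }
        }
        if (!remaining.isEmpty()) {
            document.put(METADATA_FIELD, remaining);
        }
        return document;
    }
}
```

Promoting only a few well-known keys keeps the number of indexed top-level fields small while still allowing arbitrary metadata to round-trip through the catch-all map.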
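  • Embedding model: Firestore stores and searches vectors but does not generate them, so the store works with any existing LangChain4j EmbeddingModel. A separate EmbeddingModel implementation that adapts a Google embedding API is only needed if no existing model fits.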
  3. Create SPI Builder Factory:

    • Create a GoogleFirestoreEmbeddingStoreBuilderFactory class that implements Supplier<GoogleFirestoreEmbeddingStore.Builder>.
    • Create a file named META-INF/services/dev.langchain4j.spi.store.embedding.EmbeddingStoreFactory in src/main/resources. Java's ServiceLoader requires this file name to match the fully qualified name of the SPI interface being loaded, so confirm that the interface exists in the LangChain4j version you target and adjust the name if it does not.
    • Add a single line to this file containing the fully qualified name of your factory class (e.g., dev.langchain4j.store.embedding.google.firestore.GoogleFirestoreEmbeddingStoreBuilderFactory).
  4. Write Integration Tests:

    • Create GoogleFirestoreEmbeddingStoreIT.java (and potentially ...WithFilteringIT.java, ...WithRemovalIT.java, etc., mirroring the existing modules).
    • Extend EmbeddingStoreIT (or EmbeddingStoreWithFilteringIT, etc.) to inherit a basic set of tests.
    • Implement the abstract methods (like embeddingStore()) to provide instances of your store and a compatible embedding model.
    • Add tests specific to Google Cloud Firestore features and limitations.
    • Use @EnabledIfEnvironmentVariable to conditionally run the tests only when the necessary environment variables (credentials) are set. See OllamaChatModelIT for an example.
  5. Add to BOM (Bill of Materials): Add your new module to langchain4j-bom/pom.xml to manage its version consistently.

  6. Documentation:

    • Add the new embedding store to the relevant index.md files
    • Write a guide in docs/docs/integrations/embedding-stores
    • Create a README file

Code Example (Conceptual)

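// In your GoogleFirestoreEmbeddingStore.java
package dev.langchain4j.store.embedding.google.firestore;

import com.google.cloud.firestore.Firestore;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

// Conceptual skeleton: every operation is a stub that still has to be mapped to Firestore calls.
public class GoogleFirestoreEmbeddingStore implements EmbeddingStore<TextSegment> {

    private static final Logger log = LoggerFactory.getLogger(GoogleFirestoreEmbeddingStore.class);

    private final Firestore firestore;
    private final String collectionName;
    // ... (other fields: field names, default maxResults, etc.) ...

    // Hand-written builder instead of Lombok, in line with the advice to avoid Lombok in new code.
    private GoogleFirestoreEmbeddingStore(Builder builder) {
        this.firestore = builder.firestore;
        this.collectionName = builder.collectionName;
    }

    public static Builder builder() {
        return new Builder();
    }

    @Override
    public String add(Embedding embedding) {
        // TODO: write a document with a generated ID and return that ID
        throw new UnsupportedOperationException("Not implemented yet");
    }

    @Override
    public void add(String id, Embedding embedding) {
        // TODO: write a document under the given ID
        throw new UnsupportedOperationException("Not implemented yet");
    }

    @Override
    public String add(Embedding embedding, TextSegment textSegment) {
        // TODO: store the embedding together with the segment text and metadata
        throw new UnsupportedOperationException("Not implemented yet");
    }

    @Override
    public List<String> addAll(List<Embedding> embeddings) {
        // TODO: batch-write the embeddings and return the generated IDs
        throw new UnsupportedOperationException("Not implemented yet");
    }

    @Override
    public void addAll(List<String> ids, List<Embedding> embeddings, List<TextSegment> embedded) {
        // TODO: batch-write embeddings with their IDs and segments
        throw new UnsupportedOperationException("Not implemented yet");
    }

    // search(...) is used here in place of the older findRelevant(embedding, maxResults, minScore) signature.
    @Override
    public EmbeddingSearchResult<TextSegment> search(EmbeddingSearchRequest request) {
        // TODO: run a Firestore vector search and map the results to EmbeddingMatch objects
        throw new UnsupportedOperationException("Not implemented yet");
    }

    // ... removeAll(...) and other methods ...

    public static class Builder {

        private Firestore firestore;
        private String collectionName;

        public Builder firestore(Firestore firestore) {
            this.firestore = firestore;
            return this;
        }

        public Builder collectionName(String collectionName) {
            this.collectionName = collectionName;
            return this;
        }

        public GoogleFirestoreEmbeddingStore build() {
            return new GoogleFirestoreEmbeddingStore(this);
        }
    }
}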

// In a separate file: GoogleFirestoreEmbeddingStoreBuilderFactory.java
package dev.langchain4j.store.embedding.google.firestore;

import java.util.function.Supplier;

// Supplies a store builder, matching the Supplier-based factory described in the SPI step.
public class GoogleFirestoreEmbeddingStoreBuilderFactory implements Supplier<GoogleFirestoreEmbeddingStore.Builder> {
    @Override
    public GoogleFirestoreEmbeddingStore.Builder get() {
        return GoogleFirestoreEmbeddingStore.builder();
    }
}
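The configuration options listed earlier map naturally onto a builder that validates its inputs when `build()` is called. Below is a minimal sketch under stated assumptions: `FirestoreStoreConfig` is a hypothetical holder class, the 30-second timeout default is arbitrary, and `(default)` is assumed to be the ID of the default Firestore database. LangChain4j's own `ensureNotBlank` helper is replaced by a local method so the example stays self-contained.

```java
import java.time.Duration;

// Hypothetical configuration holder, validated once at build time.
final class FirestoreStoreConfig {

    private final String projectId;
    private final String databaseId;
    private final String collectionName;
    private final Duration timeout;

    private FirestoreStoreConfig(Builder builder) {
        this.projectId = requireNotBlank(builder.projectId, "projectId");
        this.databaseId = requireNotBlank(builder.databaseId, "databaseId");
        this.collectionName = requireNotBlank(builder.collectionName, "collectionName");
        if (builder.timeout == null || builder.timeout.isZero() || builder.timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeout = builder.timeout;
    }

    static Builder builder() {
        return new Builder();
    }

    String projectId() { return projectId; }
    String databaseId() { return databaseId; }
    String collectionName() { return collectionName; }
    Duration timeout() { return timeout; }

    // Local stand-in for a validation utility such as ensureNotBlank.
    private static String requireNotBlank(String value, String name) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(name + " must not be null or blank");
        }
        return value;
    }

    static final class Builder {

        private String projectId;
        private String databaseId = "(default)";           // assumed default database ID
        private String collectionName;
        private Duration timeout = Duration.ofSeconds(30);  // arbitrary default for this sketch

        Builder projectId(String projectId) { this.projectId = projectId; return this; }
        Builder databaseId(String databaseId) { this.databaseId = databaseId; return this; }
        Builder collectionName(String collectionName) { this.collectionName = collectionName; return this; }
        Builder timeout(Duration timeout) { this.timeout = timeout; return this; }

        FirestoreStoreConfig build() {
            return new FirestoreStoreConfig(this);
        }
    }
}
```

A caller would write something like `FirestoreStoreConfig.builder().projectId("my-project").collectionName("embeddings").build()`; validating in the constructor means a misconfigured store fails at build time rather than on the first Firestore call.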

Key Improvements and Considerations from the examples:

  • Error Handling: Wrap Firestore API calls in try-catch blocks and throw RuntimeException (or a more specific custom exception) to handle potential errors.
  • Logging: Use SLF4J (log.debug(), log.info(), log.warn(), log.error()) for logging. This is crucial for debugging and monitoring.
  • Configuration: The Builder should allow configuration of all necessary parameters (endpoint, credentials, collection name, timeouts, etc.). Use ensureNotNull and ensureNotBlank to validate.
  • Dependency Management: The pom.xml should correctly specify the Google Cloud Firestore client library as a dependency.
  • Testcontainers: Consider using Testcontainers for integration tests to spin up a Firestore emulator. This will make your tests more reliable and portable. (See examples in the provided code, such as for CassandraEmbeddingStoreDockerIT).
  • Metadata: Implement proper handling of metadata, including mapping it to/from Firestore document fields. Decide on your strategy (separate fields vs. JSON).
  • Filtering: Implement filtering based on metadata using Firestore's query capabilities. You will likely need a helper class like ChromaMetadataFilterMapper to translate Filter objects into Firestore queries.
  • Service Provider Interface (SPI): The spi package and META-INF/services file let components be loaded through Java's ServiceLoader where LangChain4j defines such an extension point. Check how existing embedding store modules register their factories before relying on it; users can still instantiate the store directly through its Builder.
  • Return type: The search method should take a {@link dev.langchain4j.store.embedding.EmbeddingSearchRequest} and return a {@link dev.langchain4j.store.embedding.EmbeddingSearchResult}, rather than relying on the older findRelevant signature. This keeps the store consistent with the other embedding stores and allows advanced filtering and reranking features.
  • Interface segregation: Introduce an EmbeddingStoreWithFiltering interface that extends EmbeddingStore and adds a removeAll(Filter filter) method.
  • Metadata handling: Implement the MetadataHandler interface, or extend one of its base classes if it's a good fit.
  • Support Vector Search API in different database modes: Firestore databases can run in Native mode or in Datastore mode. Decide which modes the store supports, and confirm that vector search is available in each mode you intend to support.
  • Implement the close() method: You should close the connection to Firestore in the close() method.
  • Test your embedding store against the EmbeddingStoreWithFilteringIT and EmbeddingStoreWithRemovalIT: The provided langchain4j-core contains those abstract test classes that you can use to test your embedding store.
  • Add documentation: Create Markdown files in the docs/docs/integrations/embedding-stores directory, following the structure of existing integrations.
  • Add an example: Contribute a usage example to the examples repository.
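Translating `minScore` into a vector search query requires converting between LangChain4j relevance scores and Firestore distances. The sketch below rests on three assumptions that should be verified against the versions you target: the index uses the cosine distance measure, Firestore reports cosine distance as 1 minus cosine similarity, and relevance scores follow the (cosine similarity + 1) / 2 convention of LangChain4j's `RelevanceScore` helper. The class and method names are hypothetical.

```java
// Hypothetical helper for converting between cosine distance and relevance score.
final class CosineScoreMapper {

    private CosineScoreMapper() {
    }

    // Maps a cosine distance in [0, 2] to a relevance score in [0, 1].
    static double distanceToScore(double cosineDistance) {
        double cosineSimilarity = 1.0 - cosineDistance;   // assumed distance definition
        return (cosineSimilarity + 1.0) / 2.0;            // assumed relevance score convention
    }

    // Maps a minimum relevance score to the largest cosine distance that still satisfies it.
    static double scoreToMaxDistance(double minScore) {
        if (minScore < 0.0 || minScore > 1.0) {
            throw new IllegalArgumentException("minScore must be between 0 and 1");
        }
        double cosineSimilarity = 2.0 * minScore - 1.0;
        return 1.0 - cosineSimilarity;
    }
}
```

A store could pass the result of `scoreToMaxDistance` as a distance threshold if the Firestore client version in use offers one, or otherwise drop matches client-side whose converted score falls below `minScore`.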

By following these steps and adapting the provided example code, you can create a robust and well-integrated LangChain4j module for Google Cloud Firestore.
