Okay, here's a breakdown of how to create a new LangChain4j embedding store module for Google Cloud Firestore, along with the key steps and considerations, mirroring the structure of existing modules like `langchain4j-milvus`.
Project Structure
Your project structure should follow the established pattern. I'll create a simplified version based on the most relevant parts from the provided file listing. The full structure would be much larger (like the main `langchain4j` project), but this captures the essentials:
langchain4j/
└── langchain4j-embedding-store-google-firestore/  (or similar name)
    ├── pom.xml  (Your module's Maven build file)
    └── src/
        ├── main/
        │   └── java/
        │       └── dev/
        │           └── langchain4j/
        │               └── store/
        │                   └── embedding/
        │                       └── google/
        │                           └── firestore/
        │                               ├── GoogleFirestoreEmbeddingStore.java  (Main class)
        │                               └── (Any other supporting classes, like request/response objects)
        └── test/
            └── java/
                └── dev/
                    └── langchain4j/
                        └── store/
                            └── embedding/
                                └── google/
                                    └── firestore/
                                        └── GoogleFirestoreEmbeddingStoreIT.java  (Integration test)
Steps to Create the Module
- Project Setup (Maven Module):
  - Create a new directory named `langchain4j-embedding-store-google-firestore` (or a similar, descriptive name) within the `langchain4j` parent directory.
  - Create a `pom.xml` file inside this new directory. This file will define your module as a Maven project and specify its dependencies. Use `langchain4j-milvus/pom.xml` as a guide, but adapt it:
    - Parent: Set the parent to `langchain4j-parent`, as shown in your example.
    - ArtifactId: Use `langchain4j-embedding-store-google-firestore`.
    - Name and Description: Update these appropriately.
    - Dependencies: Include `langchain4j-core` as a compile-time dependency. Add the Google Cloud Firestore Java client library as a dependency:

          <dependency>
              <groupId>com.google.cloud</groupId>
              <artifactId>google-cloud-firestore</artifactId>
              <version>YOUR_VERSION_HERE</version> <!-- Lookup the latest version -->
          </dependency>

      Add any other necessary dependencies, such as logging (SLF4J), Lombok (if desired, but try to avoid in new code), etc. Minimize dependencies as much as possible.
    - Test Dependencies: Include JUnit, AssertJ, and any mocking libraries (like Mockito) in the `test` scope.
    - Licenses: Ensure your module has the correct Apache 2.0 license information in the POM.
- Implement `EmbeddingStore<TextSegment>`:
  - Create `GoogleFirestoreEmbeddingStore.java` in the `dev.langchain4j.store.embedding.google.firestore` package.
  - Implement the `EmbeddingStore<TextSegment>` interface from `langchain4j-core`. This is the crucial part. You'll need to implement the following methods, mapping them to Firestore operations:
    - `add(Embedding embedding)`: Adds a single embedding to the store, generating a unique ID.
    - `add(String id, Embedding embedding)`: Adds an embedding with a specified ID.
    - `add(Embedding embedding, TextSegment textSegment)`: Adds an embedding with associated text and metadata.
    - `addAll(List<Embedding> embeddings)`: Adds multiple embeddings.
    - `addAll(List<String> ids, List<Embedding> embeddings, List<TextSegment> textSegments)`: Adds multiple embeddings with associated IDs and text segments.
    - `removeAll(Collection<String> ids)`: Deletes embeddings by ID.
    - `removeAll()`: Deletes everything.
    - `removeAll(Filter filter)`: Deletes the records that match the filter.
    - `search(EmbeddingSearchRequest request)`: Finds related embeddings.
    - `findRelevant(Embedding referenceEmbedding, int maxResults, double minScore)`: Finds relevant embeddings.
  - Considerations for Firestore Implementation:
    - Data Model: How will you store the embeddings (as float arrays) and associated data (text, metadata)? Firestore uses a NoSQL document model. You'll likely store each `TextSegment` and its `Embedding` as a document in a collection.
    - Metadata: Firestore supports storing metadata as document fields. You'll need a way to map `TextSegment` metadata (which is a `Map<String, String>`) to Firestore document fields. You have a few options here, mirroring what existing modules do:
      - Individual Columns (Preferred): Each metadata key becomes a separate field in the document. This is efficient for querying but requires knowing the metadata keys in advance. The MariaDB integration uses this approach with `MetadataColumDefinition`.
      - Single JSON Field: Store all metadata as a single JSON string. This is flexible but less efficient for filtering. The Chroma integration uses this.
      - Mixed Approach: Common metadata fields (like "source" or "document_id") could be separate fields, and a catch-all "metadata" field could store the rest as JSON.
    - Vector Search: Firestore now has native Vector Search, which is excellent! You'll use this for the `findRelevant` method. The key will be understanding how to map LangChain4j's `EmbeddingSearchRequest` (which includes `maxResults` and `minScore`) to the appropriate Firestore vector search query.
    - Filtering: Implement filtering based on metadata using Firestore's query capabilities. LangChain4j's `Filter` interface needs to be translated into a Firestore query. Look at existing implementations like `ChromaMetadataFilterMapper` for inspiration, but adapt it to Firestore's query syntax.
    - Error Handling: Wrap Firestore exceptions in `RuntimeException` or a more specific custom exception.
    - Configuration: Use a `Builder` pattern (like `OpenAiChatModel` does) to allow users to configure:
      - Firestore project ID, database, collection name
      - Credentials (API key, service account)
      - Timeout settings
      - Maximum number of results (`maxResults`)
      - Minimum relevance score (`minScore`)
      - Field names for text, embedding, and metadata (if configurable)
      - Possibly options for automatically creating the collection/indexes.
    - Indexing: You must create the necessary index in Firestore for vector search to work. The documentation you provided explains how to do this with the `gcloud` CLI. The code should, ideally, at least check if the index exists. It could even attempt to create it, but this would require more permissions.
    - Concurrency: If the store may be used from several threads at once, the code should be thread-safe.
  - Implement `GoogleFirestoreEmbeddingModel`: You should also implement the `EmbeddingModel` interface, by adapting the Google API to the LangChain4j interface.
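Here's a minimal sketch of the "Mixed Approach" to metadata described in the considerations above. It uses only the JDK so it can be read on its own; `FirestoreDocumentMapper`, the field names (`text`, `embedding`, `metadata`) and the idea of a configurable set of "promoted" keys are all assumptions made for illustration, not part of LangChain4j or the Firestore client. A real store would hand the resulting map to the Firestore client and would probably wrap the vector in the client's own vector type rather than a plain list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: turns one embedding plus its text and metadata into document fields. */
final class FirestoreDocumentMapper {

    // Illustrative field names; neither Firestore nor LangChain4j requires these.
    static final String TEXT_FIELD = "text";
    static final String EMBEDDING_FIELD = "embedding";
    static final String METADATA_FIELD = "metadata";

    // Metadata keys that become top-level fields so they can be queried directly.
    private final Set<String> promotedKeys;

    FirestoreDocumentMapper(Set<String> promotedKeys) {
        this.promotedKeys = Set.copyOf(promotedKeys);
    }

    Map<String, Object> toDocument(float[] vector, String text, Map<String, String> metadata) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put(TEXT_FIELD, text);

        // Widen floats to doubles, since Firestore stores floating-point numbers as 64-bit doubles.
        List<Double> values = new ArrayList<>(vector.length);
        for (float v : vector) {
            values.add((double) v);
        }
        document.put(EMBEDDING_FIELD, values);

        // Promoted keys become separate fields; everything else goes into one catch-all map.
        Map<String, Object> remaining = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (promotedKeys.contains(entry.getKey())) {
                document.put(entry.getKey(), entry.getValue());
            } else {
                remaining.put(entry.getKey(), entry.getValue());
            }
        }
        document.put(METADATA_FIELD, remaining);
        return document;
    }
}
```

With `new FirestoreDocumentMapper(Set.of("source"))`, a segment whose metadata is `{source=a.pdf, page=3}` ends up with a top-level `source` field and a nested `metadata` map holding only `page`. Note that this sketch does not guard against a promoted key colliding with `text` or `embedding`; a real implementation should.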
- Create SPI Builder Factory:
  - Create a `GoogleFirestoreEmbeddingStoreBuilderFactory` class that implements `Supplier<GoogleFirestoreEmbeddingStore.Builder>`.
  - Create a file named `META-INF/services/dev.langchain4j.spi.store.embedding.EmbeddingStoreFactory` in `src/main/resources`. The file name must be the fully qualified name of the SPI interface that your factory implements.
  - Add a single line to this file containing the fully qualified name of your factory class (e.g., `dev.langchain4j.store.embedding.google.firestore.GoogleFirestoreEmbeddingStoreBuilderFactory`).
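To make the SPI step a bit more concrete, here's a small sketch of the discovery side using the JDK's `java.util.ServiceLoader`. The names `StoreBuilder`, `StoreBuilderFactory` and `StoreBuilders` are hypothetical stand-ins for the builder, the SPI interface and the lookup code; how LangChain4j itself performs the lookup may differ, so treat this only as an illustration of the mechanism.

```java
import java.util.Iterator;
import java.util.ServiceLoader;
import java.util.function.Supplier;

/** Hypothetical stand-in for GoogleFirestoreEmbeddingStore.Builder. */
final class StoreBuilder {

    final String origin;

    StoreBuilder(String origin) {
        this.origin = origin;
    }
}

/** Hypothetical SPI interface; the provider file under META-INF/services is named after it. */
interface StoreBuilderFactory extends Supplier<StoreBuilder> {
}

final class StoreBuilders {

    private StoreBuilders() {
    }

    /** Returns a builder from the first registered factory, or a default one if none is registered. */
    static StoreBuilder builder() {
        Iterator<StoreBuilderFactory> factories =
                ServiceLoader.load(StoreBuilderFactory.class).iterator();
        if (factories.hasNext()) {
            return factories.next().get();
        }
        return new StoreBuilder("default");
    }
}
```

When no provider file is on the classpath, `StoreBuilders.builder()` simply falls back to the default builder, so the store stays usable through direct construction even without SPI registration.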
- Write Integration Tests:
  - Create `GoogleFirestoreEmbeddingStoreIT.java` (and potentially `...WithFilteringIT.java`, `...WithRemovalIT.java`, etc., mirroring the existing modules).
  - Extend `EmbeddingStoreIT` (or `EmbeddingStoreWithFilteringIT`, etc.) to inherit a basic set of tests.
  - Implement the abstract methods (like `embeddingStore()`) to provide instances of your store and a compatible embedding model.
  - Add tests specific to Google Cloud Firestore features and limitations.
  - Use `@EnabledIfEnvironmentVariable` to conditionally run the tests only when the necessary environment variables (credentials) are set. See `OllamaChatModelIT` for an example.
- Add to BOM (Bill of Materials): Add your new module to `langchain4j-bom/pom.xml` to manage its version consistently.
- Documentation:
  - Add the new embedding store to all the relevant `index.md` files.
  - Write a guide in `docs/docs/integrations/embedding-stores`.
  - Create a README file.
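Going back to the Configuration bullet in step 2, here's a rough sketch of a hand-written builder with validation, avoiding Lombok as suggested above. `FirestoreStoreConfig`, its field set and the 60-second default timeout are assumptions for illustration; the local `ensureNotBlank` is a stand-in for the LangChain4j validation helper of the same name so that the snippet compiles on its own. The `(default)` value is the ID Firestore uses for a project's default database.

```java
import java.time.Duration;

/** Hypothetical configuration holder that a GoogleFirestoreEmbeddingStore builder could produce. */
final class FirestoreStoreConfig {

    final String projectId;
    final String databaseId;
    final String collectionName;
    final Duration timeout;

    private FirestoreStoreConfig(Builder builder) {
        // Required values are validated eagerly so that misconfiguration fails at build time.
        this.projectId = ensureNotBlank(builder.projectId, "projectId");
        this.collectionName = ensureNotBlank(builder.collectionName, "collectionName");
        // Optional values fall back to defaults (the timeout default is an arbitrary choice).
        this.databaseId = builder.databaseId == null ? "(default)" : builder.databaseId;
        this.timeout = builder.timeout == null ? Duration.ofSeconds(60) : builder.timeout;
    }

    static Builder builder() {
        return new Builder();
    }

    // Local stand-in for a validation helper; not the LangChain4j implementation.
    private static String ensureNotBlank(String value, String name) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(name + " cannot be null or blank");
        }
        return value;
    }

    static final class Builder {

        private String projectId;
        private String databaseId;
        private String collectionName;
        private Duration timeout;

        Builder projectId(String projectId) {
            this.projectId = projectId;
            return this;
        }

        Builder databaseId(String databaseId) {
            this.databaseId = databaseId;
            return this;
        }

        Builder collectionName(String collectionName) {
            this.collectionName = collectionName;
            return this;
        }

        Builder timeout(Duration timeout) {
            this.timeout = timeout;
            return this;
        }

        FirestoreStoreConfig build() {
            return new FirestoreStoreConfig(this);
        }
    }
}
```

Usage would look like `FirestoreStoreConfig.builder().projectId("my-project").collectionName("embeddings").build()`; leaving out `projectId` or passing a blank value throws an `IllegalArgumentException` instead of failing later on the first Firestore call.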
Code Example (Conceptual)
// In your GoogleFirestoreEmbeddingStore.java
package dev.langchain4j.store.embedding.google.firestore;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingStore;
import lombok.Builder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;
// ... other imports ...

public class GoogleFirestoreEmbeddingStore implements EmbeddingStore<TextSegment> {

    private static final Logger log = LoggerFactory.getLogger(GoogleFirestoreEmbeddingStore.class);

    // ... (fields for Firestore client, collection name, etc.) ...

    @Builder
    public GoogleFirestoreEmbeddingStore(...) {
        // ... (initialize Firestore client, etc.) ...
    }

    @Override
    public String add(Embedding embedding) {
        // ... implementation ...
    }

    @Override
    public void add(String id, Embedding embedding) {
        // ... implementation ...
    }

    @Override
    public String add(Embedding embedding, TextSegment textSegment) {
        // ... implementation ...
    }

    @Override
    public List<String> addAll(List<Embedding> embeddings) {
        // ... implementation ...
    }

    @Override
    public void addAll(List<String> ids, List<Embedding> embeddings, List<TextSegment> embedded) {
        // ... implementation ...
    }

    @Override
    public List<EmbeddingMatch<TextSegment>> findRelevant(Embedding referenceEmbedding, int maxResults, double minScore) {
        // ... implementation using Firestore vector search ...
        // ... map Firestore results to EmbeddingMatch objects ...
    }

    // ... other methods ...
}

// In a separate file: GoogleFirestoreEmbeddingStoreBuilderFactory.java
package dev.langchain4j.store.embedding.google.firestore;

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.spi.store.embedding.EmbeddingStoreFactory;

public class GoogleFirestoreEmbeddingStoreBuilderFactory implements EmbeddingStoreFactory {

    @Override
    public GoogleFirestoreEmbeddingStore.Builder get() {
        return GoogleFirestoreEmbeddingStore.builder();
    }
}
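The "map Firestore results to EmbeddingMatch objects" step in `findRelevant` above also has to turn distances into relevance scores and apply `maxResults` and `minScore`. Here's a sketch of that part under two assumptions: that the vector query reports a cosine distance equal to `1 - cosineSimilarity`, and that the relevance score follows the convention of rescaling cosine similarity from [-1, 1] to [0, 1]. `ScoredHit` and `RelevanceMapper` are hypothetical names standing in for `EmbeddingMatch` and your own mapping code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for EmbeddingMatch: a document ID with its relevance score. */
final class ScoredHit {

    final String id;
    final double score;

    ScoredHit(String id, double score) {
        this.id = id;
        this.score = score;
    }
}

final class RelevanceMapper {

    private RelevanceMapper() {
    }

    /** Assumes distance = 1 - cosineSimilarity, then rescales similarity from [-1, 1] to [0, 1]. */
    static double scoreFromCosineDistance(double distance) {
        double cosineSimilarity = 1.0 - distance;
        return (cosineSimilarity + 1.0) / 2.0;
    }

    /** Drops hits below minScore, sorts by descending score and keeps at most maxResults. */
    static List<ScoredHit> toMatches(List<String> ids, List<Double> distances,
                                     int maxResults, double minScore) {
        List<ScoredHit> hits = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            double score = scoreFromCosineDistance(distances.get(i));
            if (score >= minScore) {
                hits.add(new ScoredHit(ids.get(i), score));
            }
        }
        hits.sort(Comparator.comparingDouble((ScoredHit hit) -> hit.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(maxResults, hits.size())));
    }
}
```

Filtering on `minScore` in client code like this is the simplest option; if the Firestore query itself can apply a distance threshold, pushing the cutoff into the query would avoid transferring documents that are discarded anyway.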
Key Improvements and Considerations from the examples:
- Error Handling: Wrap Firestore API calls in `try-catch` blocks and throw `RuntimeException` (or a more specific custom exception) to handle potential errors.
- Logging: Use SLF4J (`log.debug()`, `log.info()`, `log.warn()`, `log.error()`) for logging. This is crucial for debugging and monitoring.
- Configuration: The `Builder` should allow configuration of all necessary parameters (endpoint, credentials, collection name, timeouts, etc.). Use `ensureNotNull` and `ensureNotBlank` to validate.
- Dependency Management: The `pom.xml` should correctly specify the Google Cloud Firestore client library as a dependency.
- Testcontainers: Consider using Testcontainers for integration tests to spin up a Firestore emulator. This will make your tests more reliable and portable. (See examples in the provided code, such as `CassandraEmbeddingStoreDockerIT`.)
- Metadata: Implement proper handling of metadata, including mapping it to/from Firestore document fields. Decide on your strategy (separate fields vs. JSON).
- Filtering: Implement filtering based on metadata using Firestore's query capabilities. You will likely need a helper class like `ChromaMetadataFilterMapper` to translate `Filter` objects into Firestore queries.
- Service Provider Interface (SPI): The `spi` package and `META-INF/services` file are critical for LangChain4j to discover your implementation. This is how `AiServices` (and other parts of the framework) dynamically load components. Without this, the framework cannot discover your module through the SPI.
- Return type: Rather than relying only on the `findRelevant(Embedding, int, double)` signature, the search entry point should take a `dev.langchain4j.store.embedding.EmbeddingSearchRequest` and return a `dev.langchain4j.store.embedding.EmbeddingSearchResult`. This would ensure that we have consistency through all embedding stores and allow advanced filtering and reranking features.
- Interface segregation: Introduce an `EmbeddingStoreWithFiltering` interface that extends `EmbeddingStore` and adds a `removeAll(Filter filter)` method.
- Metadata handling: Implement the `MetadataHandler` interface, or extend one of its base classes if it's a good fit.
- Support the Vector Search API in different database modes: If you intend to support more than one Firestore database mode (Native mode and Datastore mode), implement and test the search in each of them.
- Implement the `close()` method: You should close the connection to Firestore in the `close()` method.
- Test your embedding store against `EmbeddingStoreWithFilteringIT` and `EmbeddingStoreWithRemovalIT`: The provided `langchain4j-core` contains those abstract test classes that you can use to test your embedding store.
- Add documentation: Create Markdown files in the `docs/docs/integrations/embedding-stores` directory, following the structure of existing integrations.
- Add an example to the examples repository.
By following these steps and adapting the provided example code, you can create a robust and well-integrated LangChain4j module for Google Cloud Firestore.