Hybrid search algorithm

To modify the previous implementation to use Qdrant for hybrid search (vector + keyword search) while keeping Neo4j for Cypher queries in Clojure, we’ll integrate the Qdrant database for the hybrid search part and use Neo4j solely for graph-based Cypher retrieval. Below, I’ll adapt the code assuming you have a Qdrant instance running and a Neo4j database populated with graph data. We’ll use the Qdrant REST API via clj-http for simplicity, as there’s no official Clojure client for Qdrant (though you could use a gRPC client if preferred).


Prerequisites

  1. Qdrant Instance: Running locally (e.g., http://localhost:6333) or on a remote server, with a collection set up for hybrid search (vector embeddings + text payloads).
  2. Neo4j Database: Populated and reachable by your client. Note that neocons talks to Neo4j's HTTP REST API (e.g., http://localhost:7474); if you need Bolt (bolt://localhost:7687), use a Bolt client such as gorillalabs/neo4j-clj instead.
  3. Clojure Dependencies: Add clj-http and cheshire (for JSON parsing) to your project.clj:
:dependencies [[org.clojure/clojure "1.11.1"]
               [clj-http "3.12.3"] ;; For HTTP requests to Qdrant
               [cheshire "5.11.0"] ;; For JSON parsing
               [com.novemberain/neocons "3.3.0"]] ;; Neo4j client (HTTP REST API)

Assumptions

  • Qdrant Setup: You have a Qdrant collection (e.g., chunks) with:
    • Vector embeddings (e.g., 1536 dimensions) indexed for similarity search.
    • A payload field text for keyword search.
  • Neo4j Schema: Contains nodes like Chunk (with a text property) connected to other nodes (e.g., Entity via relationships).
  • Embedding Model: You have a way to generate embeddings for your query (e.g., via an external API like Ollama or OpenAI).

Updated Implementation

1. Set Up Connections

(ns graphrag.core
  (:require [clojurewerkz.neocons.rest :as nr]
            [clojurewerkz.neocons.rest.cypher :as cy]
            [clj-http.client :as http]
            [cheshire.core :as json]
            [clojure.string :as str]))

;; Neo4j connection (neocons uses Neo4j's HTTP REST API, not Bolt)
(def neo4j-conn
  (nr/connect "http://localhost:7474/db/data" "neo4j" "password"))

;; Qdrant base URL
(def qdrant-url "http://localhost:6333")
(def qdrant-collection "chunks") ;; Replace with your collection name

Replace the Neo4j credentials and Qdrant URL/collection as needed.
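
As an optional sanity check before wiring the search functions together, you can ping both stores. A minimal sketch, assuming the chunks collection already exists:

(defn check-connections []
  ;; Qdrant: GET /collections/{name} returns collection info and status
  (let [qdrant-info (http/get (str qdrant-url "/collections/" qdrant-collection)
                              {:as :json})
        ;; Neo4j: trivial Cypher round-trip
        neo4j-row (first (cy/tquery neo4j-conn "RETURN 1 AS ok"))]
    {:qdrant-status (get-in qdrant-info [:body :result :status])
     :neo4j-ok (get neo4j-row "ok")}))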

2. Define Hybrid Search with Qdrant

Qdrant supports hybrid search through its /search endpoint (vector search) and /scroll endpoint with filters (keyword search). We’ll combine these.

(defn hybrid-search [query top-k]
  (let [;; Placeholder embedding for the query (replace with real embedding)
        embedding (repeat 1536 0.1) ;; Generate this with an embedding model
        ;; Vector search request
        vector-req {:vector embedding
                    :limit top-k
                    :with_payload true}
        vector-resp (http/post (str qdrant-url "/collections/" qdrant-collection "/points/search")
                              {:body (json/generate-string vector-req)
                               :headers {"Content-Type" "application/json"}
                               :as :json})
        vector-results (-> vector-resp :body :result)
        ;; Keyword search request using scroll with a filter
        keyword-req {:filter {:must [{:key "text"
                                      :match {:text query}}]} ;; full-text match on an indexed field
                     :limit top-k
                     :with_payload true}
        keyword-resp (http/post (str qdrant-url "/collections/" qdrant-collection "/points/scroll")
                               {:body (json/generate-string keyword-req)
                                :headers {"Content-Type" "application/json"}
                                :as :json})
        keyword-results (-> keyword-resp :body :result)]
    {:vector (map #(hash-map :text (get-in % [:payload :text])
                            :score (:score %))
                  vector-results)
     :keyword (map #(hash-map :text (get-in % [:payload :text])
                             :score 1.0) ;; No score from scroll, assign default
                   keyword-results)}))

Notes:

  • Embedding: The embedding is a dummy vector. Replace it with an actual embedding generated for query (e.g., via an API call to Ollama).
  • Vector Search: Uses Qdrant’s /search endpoint for semantic similarity.
  • Keyword Search: Uses /scroll with a full-text match filter on the text payload field (this requires a full-text payload index on text; see the sketch after these notes). Qdrant doesn't return a relevance score from scroll, so we assign a default score of 1.0.
  • Payload: Assumes each point in Qdrant has a text field in its payload.
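
One setup step the notes above imply: the {:text ...} match only works if the collection has a full-text payload index on the text field. A one-off helper, sketched with the same clj-http setup:

(defn create-text-index []
  ;; PUT /collections/{name}/index creates a payload index;
  ;; field_schema "text" enables full-text matching on that field.
  (http/put (str qdrant-url "/collections/" qdrant-collection "/index")
            {:body (json/generate-string {:field_name "text"
                                          :field_schema "text"})
             :headers {"Content-Type" "application/json"}
             :as :json}))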

3. Define Cypher Retrieval with Neo4j

This remains largely unchanged, querying Neo4j based on texts from Qdrant:

(defn cypher-retrieval [conn texts]
  (let [cypher-query (str "MATCH (c:Chunk)-[r]->(e:Entity) "
                         "WHERE c.text IN $texts "
                         "RETURN c.text AS chunk_text, r, e.name AS entity_name")
        results (cy/tquery conn cypher-query {:texts texts})]
    results))

Adjust the Cypher query to match your Neo4j schema.

4. Combine Qdrant Hybrid Search and Neo4j Cypher Retrieval

(defn hybrid-cypher-retrieval [conn query top-k]
  (let [;; Step 1: Perform hybrid search with Qdrant
        hybrid-results (hybrid-search query top-k)
        vector-texts (map :text (:vector hybrid-results))
        keyword-texts (map :text (:keyword hybrid-results))
        all-texts (distinct (concat vector-texts keyword-texts))
        ;; Step 2: Fetch related graph data with Neo4j Cypher
        cypher-results (cypher-retrieval conn all-texts)
        ;; Step 3: Merge and rank results
        merged-results (map (fn [result]
                              ;; neocons tquery returns string-keyed maps
                              (let [text (get result "chunk_text")
                                    vector-score (or (some #(when (= (:text %) text) (:score %))
                                                           (:vector hybrid-results))
                                                     0)
                                    keyword-score (or (some #(when (= (:text %) text) (:score %))
                                                           (:keyword hybrid-results))
                                                      0)]
                                {:text text
                                 :entity (get result "entity_name")
                                 :combined-score (+ vector-score keyword-score)}))
                            cypher-results)]
    ;; Sort by combined score and take top-k
    (take top-k (sort-by :combined-score > merged-results))))

5. Example Usage

(defn -main []
  (let [query "What is the role of enzymes in biology?"
        top-k 5
        results (hybrid-cypher-retrieval neo4j-conn query top-k)]
    (doseq [result results]
      (println (str "Text: " (:text result)
                    ", Entity: " (:entity result)
                    ", Score: " (:combined-score result))))))

;; Run the main function
(-main)

Key Changes from Previous Version

  1. Qdrant for Hybrid Search:

    • Replaced Neo4j vector and full-text index queries with Qdrant’s REST API calls.
    • Vector search uses /search for similarity based on embeddings.
    • Keyword search uses /scroll with a filter to match the text payload field.
  2. Neo4j for Cypher Only:

    • Neo4j is now used solely for graph traversal and relationship queries, not for search.
  3. Result Structure:

    • Qdrant returns results with :payload (containing text) and :score (for vector search). We map these into a consistent format for merging with Cypher results.

Adapting to Your Setup

  • Qdrant Configuration:

    • Ensure your Qdrant collection (chunks) has vectors and a text payload field. Create it if needed:
      curl -X PUT http://localhost:6333/collections/chunks \
           -H "Content-Type: application/json" \
           -d '{"vectors": {"size": 1536, "distance": "Cosine"}}'
    • Populate it with points containing embeddings and text payloads.
  • Embedding Generation:

    • Replace the dummy embedding in hybrid-search with a real one. For example, if using Ollama:
      (defn generate-embedding [text]
        (let [resp (http/post "http://localhost:11434/api/embeddings" ;; Ollama API
                             {:body (json/generate-string {:model "nomic-embed-text" :prompt text})
                              :headers {"Content-Type" "application/json"}
                              :as :json})]
          (-> resp :body :embedding)))
      Then update hybrid-search to use (generate-embedding query).
  • Neo4j Schema:

    • Ensure the text values in Qdrant match the text properties in Neo4j Chunk nodes for seamless integration.
  • Scoring:

    • The keyword search score is hardcoded to 1.0. You could enhance this with a custom scoring mechanism (e.g., TF-IDF; see the sketch after this list) or use Qdrant's native hybrid/fusion query features if available in your version.
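
For the scoring point above, here is one minimal option: a hypothetical term-overlap score (a crude stand-in for TF-IDF/BM25) that could replace the hardcoded 1.0 in hybrid-search:

(require '[clojure.set :as cset])

(defn term-overlap-score
  "Fraction of query terms appearing in the chunk text.
   A rough relevance proxy; use BM25/TF-IDF for anything serious."
  [query text]
  (let [q-terms (set (str/split (str/lower-case query) #"\s+"))
        t-terms (set (str/split (str/lower-case (or text "")) #"\s+"))]
    (if (empty? q-terms)
      0.0
      (/ (count (cset/intersection q-terms t-terms))
         (double (count q-terms))))))

;; In hybrid-search, replace the fixed score:
;; :score (term-overlap-score query (get-in % [:payload :text]))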

Running the Code

  1. Start your Qdrant instance (e.g., via Docker: docker run -p 6333:6333 qdrant/qdrant).
  2. Start your Neo4j instance with Bolt enabled.
  3. Run lein repl and load the namespace.
  4. Call -main.

This setup leverages Qdrant’s strengths for hybrid search and Neo4j’s graph capabilities for Cypher-based retrieval, all within Clojure. Let me know if you need further adjustments!

Based on the slides you provided and the latest developments in semantic search, particularly with GraphRAG and hybrid search using Neo4j, here’s an overview tailored to your request. The slides highlight how GraphRAG, combined with Neo4j’s capabilities, enhances generative AI (GenAI) applications by leveraging knowledge graphs (KGs) for retrieval-augmented generation (RAG). Hybrid search, which integrates vector search with graph-based context, is a key pattern discussed, and I’ll focus on that in the context of Neo4j. Since you use Clojure, I’ll convert the relevant Python code from the slides into Clojure equivalents, assuming integration with Neo4j via a library like neocons (a popular Clojure library for Neo4j).

Latest Developments in Semantic Search with GraphRAG and Hybrid Search

  1. GraphRAG Overview:

    • GraphRAG extends traditional RAG by using knowledge graphs as the data source, providing structured relationships and context that improve accuracy and explainability over unstructured text.
    • Neo4j, a leading graph database, supports this by offering efficient indexing (e.g., vector, full-text) and traversal capabilities, enabling real-time updates and domain-specific responses.
  2. Hybrid Search with Graph Context:

    • Hybrid search combines vector search (semantic similarity using embeddings) with graph traversals (structural relationships). This is particularly powerful in Neo4j, where vector indexes support Approximate Nearest Neighbor (ANN) search, and Cypher queries allow precise filtering and context retrieval.
    • The slides (Page 14) mention "Vector & Hybrid Search with Graph Context" and "Graph Filters" as patterns, indicating that you can use graph relationships to refine vector search results, enhancing relevance.
  3. Neo4j’s Role:

    • Neo4j provides components like nodes (entities), relationships (associations), and properties (attributes, including vectors), as seen in slides 9–11.
    • It supports multiple index types (slide 12), with the vector index being critical for semantic search and hybrid approaches.
    • The GraphRAG Python package (slide 16) simplifies building GenAI-ready KGs with vectors and implementing hybrid retrievers, which can be adapted to Clojure.
  4. Latest Trends:

    • As of March 29, 2025, the focus in semantic search is on combining unstructured data (e.g., text embeddings) with structured graph data for richer context. Neo4j’s integration with tools like OpenAI embeddings and its Python package (released by 2024, per slides) reflects this trend.
    • Real-time updates, explainability, and governance (slide 13) are emphasized, making hybrid search a standard for enterprise GenAI applications.

Converting Python Code to Clojure

The slides (Page 17) provide a Python workflow for the GraphRAG package. Below, I’ll convert this to Clojure, assuming you’re using neocons for Neo4j connectivity and a hypothetical Clojure wrapper for OpenAI embeddings (since no official GraphRAG Clojure package exists). Note that some parts are speculative due to the experimental nature of the Python code and OCR errors (e.g., "ganerotion" should be "generation").

Python Code from Slide 17

import neo4j
from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings as Embeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.retrievers import VectorRetriever
from neo4j_graphrag.generation.graphrag import GraphRAG

# (Rest of the code is truncated or garbled in OCR)

Clojure Equivalent

Here’s a Clojure version using neocons for Neo4j, clj-http for OpenAI API calls, and a simplified structure:

(ns graphrag-example.core
  (:require [clojurewerkz.neocons.rest :as nr]
            [clojurewerkz.neocons.rest.cypher :as cy]
            [clj-http.client :as http]
            [clojure.data.json :as json]))

;; Connect to Neo4j (neocons targets the HTTP REST API)
(def neo4j-conn
  (nr/connect "http://localhost:7474/db/data" "neo4j" "password"))

;; OpenAI API setup (mocked for embeddings and LLM)
(defn openai-embeddings [text]
  (let [response (http/post "https://api.openai.com/v1/embeddings"
                            {:headers {"Authorization" "Bearer YOUR_API_KEY"}
                             :body (json/write-str {:input text :model "text-embedding-ada-002"})
                             :content-type :json})]
    (-> (json/read-str (:body response) :key-fn keyword)
        :data first :embedding)))

(defn openai-llm [prompt]
  (let [response (http/post "https://api.openai.com/v1/completions"
                            {:headers {"Authorization" "Bearer YOUR_API_KEY"}
                             :body (json/write-str {:prompt prompt :model "text-davinci-003"})
                             :content-type :json})]
    (-> (json/read-str (:body response) :key-fn keyword)
        :choices first :text)))

;; Simple Knowledge Graph Pipeline (mocked)
(defn build-kg [documents]
  (doseq [doc documents]
    (let [embedding (openai-embeddings doc)
          cypher-query "CREATE (d:Document {content: $content, embedding: $embedding})"]
      (cy/tquery neo4j-conn cypher-query {:content doc :embedding embedding}))))

;; Vector Retriever with Hybrid Search
(defn vector-retriever [query]
  (let [query-embedding (openai-embeddings query)
        cypher-query (str "CALL db.index.vector.queryNodes('documentEmbeddingIndex', 5, $embedding) "
                          "YIELD node, score "
                          "MATCH (node)-[:RELATED_TO]->(related) "
                          "RETURN node.content, related.content, score")]
    (cy/tquery neo4j-conn cypher-query {:embedding query-embedding})))

;; GraphRAG Workflow
(defn graphrag [query]
  (let [retrieved (vector-retriever query)
        prompt (str "Based on this data: " (pr-str retrieved) ", answer: " query)
        response (openai-llm prompt)]
    response))

;; Example Usage
(def documents ["Doc1 content" "Doc2 content"])
(build-kg documents)
(println (graphrag "What is the main topic of the documents?"))

Explanation of Clojure Code

  1. Neo4j Connection: Uses neocons to connect to a local Neo4j instance. Adjust the URL, username, and password as needed.
  2. OpenAI Embeddings and LLM: Mocked using clj-http to call OpenAI’s API. You’d need an API key and possibly a library like clojure-openai if available by 2025.
  3. KG Building: The build-kg function creates nodes with embeddings, simulating the SimpleKGPipeline.
  4. Hybrid Search: The vector-retriever uses Neo4j's vector index (assumed to be pre-created as documentEmbeddingIndex) and a Cypher query to fetch related nodes, combining vector similarity with graph relationships. Note that build-kg as written creates no RELATED_TO relationships; see the sketch after this list.
  5. GraphRAG: Combines retrieval and LLM generation in a simple workflow.
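
One gap worth flagging: build-kg above creates Document nodes but never the RELATED_TO relationships that vector-retriever traverses, so its MATCH clause would return nothing. A placeholder sketch (in practice you would derive edges from entity extraction, as SimpleKGPipeline does):

(defn link-documents []
  ;; Naively connects every pair of documents; replace with real
  ;; entity- or similarity-based linking for a useful graph.
  (cy/tquery neo4j-conn
             (str "MATCH (a:Document), (b:Document) WHERE a <> b "
                  "MERGE (a)-[:RELATED_TO]->(b)")))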

Applying This to Your Use Case

  • Setup: Ensure Neo4j is running with a vector index on your document nodes (see slide 12). In Neo4j 5 you create it with Cypher, typically specifying dimensions and a similarity function: CREATE VECTOR INDEX documentEmbeddingIndex IF NOT EXISTS FOR (d:Document) ON d.embedding OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}} (recent releases supply defaults if OPTIONS is omitted).
  • Hybrid Search: The example above retrieves documents via vector search and enriches them with related nodes, aligning with the "Vector & Hybrid Search with Graph Context" pattern (slide 14).
  • Clojure Integration: Since GraphRAG’s Python package isn’t directly available in Clojure, you’d need to replicate its functionality using Neo4j queries and OpenAI calls, as shown.

Resources for Further Exploration

From your slides (Pages 19–20):

  • Check the Neo4j blog on hybrid retrieval: neo4j.com/developer-blog/hybrid-retrieval-graphrag-python-package/.
  • Explore the free GraphRAG course: graphacademy.neo4j.com/courses/genai-workshop-graphrag/.

Let me know if you’d like me to refine this further or focus on a specific aspect!

The Reciprocal Rank Fusion (RRF) algorithm is a clever and elegant method for combining multiple ranked lists into a single, unified ranking. It's widely used in hybrid search systems (like the ones we've been exploring with Neo4j and Meilisearch) to merge results from different search strategies, such as keyword-based full-text search and semantic vector search. I'll break it down step by step, explain how it works and why it's effective, and show you a practical example.

What is Reciprocal Rank Fusion (RRF)?

RRF is a rank aggregation technique that takes multiple ranked lists (e.g., lists of search results from different methods) and produces a single ranked list by assigning scores based on the reciprocal of each item’s rank in its original list. The idea is simple yet powerful: items ranked higher in any list get a bigger boost, and the final score reflects their combined prominence across all lists.

It was originally introduced in the 2009 paper "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" by Cormack et al., and it’s become a go-to method in information retrieval because it’s:

  • Parameter-light: Requires minimal tuning.
  • Robust: Works well even when input lists are noisy or incomplete.
  • Fair: Balances contributions from all sources without needing to know their relative quality upfront.

How Does RRF Work?

Here’s the core concept:

  1. For each item in each ranked list, compute a score based on its rank using the formula:
     $$\text{RRF score} = \frac{1}{k + \text{rank}}$$
    • rank is the position of the item in a given list (1 for first, 2 for second, etc.).
    • k is a constant (typically 60, but adjustable) that smooths the curve so top ranks don't dominate and lower ranks still contribute meaningfully; items absent from a list simply contribute nothing.
  2. Sum the reciprocal scores for each item across all lists.
  3. Sort items by their total RRF score in descending order to get the final ranking.

The k parameter shapes the scoring curve: higher values give relatively more weight to lower-ranked items, while lower values emphasize the top ranks.
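
A quick calculation makes the smoothing concrete (illustrative Clojure, nothing Qdrant- or Neo4j-specific):

(defn rrf-term [k rank] (/ 1.0 (+ k rank)))

;; How much a rank-1 hit outweighs a rank-5 hit at different k:
(/ (rrf-term 10 1) (rrf-term 10 5)) ;; => ~1.36 (steep: top ranks dominate)
(/ (rrf-term 60 1) (rrf-term 60 5)) ;; => ~1.07 (flat: lower ranks still count)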

Why Use RRF for Hybrid Search?

In hybrid search, you often have results from:

  • Vector search: Ranked by semantic similarity (e.g., cosine distance of embeddings).
  • Full-text search: Ranked by keyword relevance (e.g., BM25 scores).

These lists might rank the same items differently, and RRF fuses them without needing to normalize scores or assume one method is "better." It's all about positional fairness: items consistently ranked high across lists bubble up.

Step-by-Step Example

Let’s say we’re searching for movies with “hacker virtual world” and have two ranked lists:

  • Vector Search Results (semantic similarity):

    1. "The Matrix" (rank 1)
    2. "Tron" (rank 2)
    3. "Inception" (rank 3)
  • Full-Text Search Results (keyword match):

    1. "The Matrix" (rank 1)
    2. "Hackers" (rank 2)
    3. "Inception" (rank 3)

We’ll use $k = 60$ (a common default) and calculate RRF scores.

1. Compute Reciprocal Scores

For each movie, calculate $\frac{1}{k + \text{rank}}$ for each list it appears in, then sum:

  • "The Matrix":

    • Vector: ( \frac{1}{60 + 1} = \frac{1}{61} \approx 0.01639 )
    • Full-Text: ( \frac{1}{60 + 1} = \frac{1}{61} \approx 0.01639 )
    • Total: ( 0.01639 + 0.01639 = 0.03278 )
  • "Tron":

    • Vector: ( \frac{1}{60 + 2} = \frac{1}{62} \approx 0.01613 )
    • Full-Text: Not ranked (implicit rank > 3), so ( \frac{1}{60 + \infty} = 0 )
    • Total: ( 0.01613 + 0 = 0.01613 )
  • "Inception":

    • Vector: ( \frac{1}{60 + 3} = \frac{1}{63} \approx 0.01587 )
    • Full-Text: ( \frac{1}{60 + 3} = \frac{1}{63} \approx 0.01587 )
    • Total: ( 0.01587 + 0.01587 = 0.03174 )
  • "Hackers":

    • Vector: Not ranked, so ( 0 )
    • Full-Text: ( \frac{1}{60 + 2} = \frac{1}{62} \approx 0.01613 )
    • Total: ( 0 + 0.01613 = 0.01613 )

2. Final Ranking

Sort by total RRF score:

  1. "The Matrix": 0.03278
  2. "Inception": 0.03174
  3. "Tron": 0.01613 (tied)
  4. "Hackers": 0.01613 (tied)

"The Matrix" wins because it’s top-ranked in both lists. "Inception" comes next, benefiting from consistent mid-tier placement. "Tron" and "Hackers" tie since they only appear in one list each, but at different ranks—RRF doesn’t break ties, so you could use original scores or another tiebreaker.

Implementing RRF

Here’s a simple Python snippet to compute RRF, adaptable to any hybrid search system:

def rrf_score(ranks, k=60):
    return sum(1 / (k + rank) for rank in ranks)

# Example data: {item: [ranks in each list]}
results = {
    "The Matrix": [1, 1],
    "Tron": [2],
    "Inception": [3, 3],
    "Hackers": [2]
}

# Calculate RRF scores
rrf_scores = {item: rrf_score(ranks) for item, ranks in results.items()}

# Sort by score
final_ranking = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

for item, score in final_ranking:
    print(f"{item}: {score:.5f}")

Output:

The Matrix: 0.03279
Inception: 0.03175
Tron: 0.01613
Hackers: 0.01613

Advantages of RRF

  • No Score Normalization: Works with ranks alone, not raw scores (e.g., vector distances or BM25), avoiding scaling issues.
  • Handles Missing Items: Unranked items get a score of 0, naturally falling to the bottom.
  • Simple: Easy to implement and understand.

Tuning RRF

  • Adjusting $k$:
    • Smaller $k$ (e.g., 10): heavily favors top ranks.
    • Larger $k$ (e.g., 100): gives more weight to lower ranks.
    • The default $k = 60$ is a balanced starting point, based on empirical testing in the original paper.
  • Weighting lists: multiply scores from certain lists by a weight $w$ (i.e., $w \cdot \frac{1}{k + \text{rank}}$) if you trust one method more (e.g., vector over full-text).

RRF in Hybrid Search Context

In your Neo4j or Meilisearch setups:

  • Fetch top-N results from each method (e.g., vector and full-text).
  • Assign ranks based on their order in each list.
  • Apply RRF to merge them into a final list.
  • Optionally, re-rank further (e.g., with Cohere, as we discussed).

For example, Meilisearch’s hybrid search internally blends results, but you could extract separate vector and full-text rankings via API tweaks and apply RRF manually for custom control.
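
And since the rest of this gist is Clojure, here is a hedged Clojure counterpart to the Python snippet above, fusing raw ranked lists directly and supporting the optional per-list weights mentioned under Tuning (the name rrf-merge is my own):

(defn rrf-merge
  "Reciprocal Rank Fusion over ranked lists (best first).
   Returns [[item score] ...] sorted best first."
  [ranked-lists & {:keys [k weights] :or {k 60}}]
  (let [ws (or weights (repeat (count ranked-lists) 1.0))
        scores (reduce (fn [acc [w items]]
                         (reduce (fn [acc [idx item]]
                                   ;; rank = idx + 1; weighted 1/(k + rank)
                                   (update acc item (fnil + 0.0)
                                           (* w (/ 1.0 (+ k (inc idx))))))
                                 acc
                                 (map-indexed vector items)))
                       {}
                       (map vector ws ranked-lists))]
    (sort-by second > scores)))

;; The movie example from earlier:
(rrf-merge [["The Matrix" "Tron" "Inception"]
            ["The Matrix" "Hackers" "Inception"]])
;; => (["The Matrix" 0.0327...] ["Inception" 0.0317...] ...)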

Why It’s Effective

RRF mimics human judgment: if something’s highly ranked by multiple independent methods, it’s likely relevant. Its reciprocal nature ensures top items dominate without overwhelming contributions from lower ranks, thanks to $k$.

Wrap-Up

RRF is a beautifully straightforward way to fuse rankings in hybrid search, balancing semantic and keyword relevance without complex math or training. Try tweaking $k$ or integrating it into your Meilisearch pipeline; want a code example for that? Just ask!
