The Similarity Reranker is a Python module that analyzes and ranks the similarity between documents. It combines BERT embeddings, BM25 scoring, and large language model (LLM) refinement into a multi-stage similarity analysis. The module is configurable and uses chunking and multiprocessing to handle large document sets, which makes it suitable for document retrieval, comparison, and clustering.
- Multi-stage Similarity Analysis: Combines embeddings, BM25, and LLM scoring to provide refined similarity rankings.
- Dimensionality Reduction: Uses random projection to reduce the dimensionality of BERT embeddings, improving computational efficiency.
- LLM-Based Refinement: Ranks document similarity using LLMs, with configurable models and parameters.
- Efficient Handling of Large Document Sets: Supports chunking and multiprocessing to handle large datasets.
- ArangoDB Integration: Stores similarity results and allows retrieval and upserts for efficient data management.
- Highly Configurable: Parameters for embeddings, LLM models, and other analysis techniques can be customized via configuration files.
- BERT Embeddings: Represent documents in a high-dimensional vector space.
- BM25: A keyword-based similarity algorithm commonly used in document retrieval.
- OpenAI/LLM: For refining and ranking document similarity.
- Random Projection: Dimensionality reduction technique to make embedding computation more scalable.
- ArangoDB: A NoSQL database used for storing and retrieving similarity data.
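To make the random projection step more concrete, here is a minimal sketch using only the standard library. It is not the project's `projection.py`; the function names, the output size of 64, and the toy vectors standing in for BERT embeddings are all assumptions made for illustration. The idea is that multiplying by a Gaussian random matrix shrinks the vectors while roughly preserving the cosine similarity between them.

```python
import math
import random


def make_projection_matrix(input_dim, output_dim, seed=0):
    """Build a Gaussian random projection matrix (output_dim x input_dim).

    Entries are drawn from N(0, 1/output_dim), so squared vector lengths
    are preserved in expectation after projection.
    """
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(output_dim)
    return [[rng.gauss(0.0, sigma) for _ in range(input_dim)]
            for _ in range(output_dim)]


def project(vector, matrix):
    """Project one vector into the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy stand-ins for embeddings (BERT-base produces 768-dimensional vectors).
rng = random.Random(42)
doc_a = [rng.gauss(0.0, 1.0) for _ in range(768)]
doc_b = [x + 0.3 * rng.gauss(0.0, 1.0) for x in doc_a]  # a noisy copy of doc_a

matrix = make_projection_matrix(input_dim=768, output_dim=64)
low_a, low_b = project(doc_a, matrix), project(doc_b, matrix)

print("dimensions after projection:", len(low_a))
print("cosine before:", round(cosine(doc_a, doc_b), 3))
print("cosine after: ", round(cosine(low_a, low_b), 3))
```

The same matrix must be reused for every document, otherwise the projected vectors are not comparable. A production version would more likely use a vectorized library than nested Python lists.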
similarity_reranker/
├── config_model.py # Configuration schemas for embeddings, LLM, and database settings
├── embeddings/
│ ├── __init__.py
│ ├── bert_embeddings.py # Manages BERT embedding generation
│ └── projection.py # Implements random projection for dimensionality reduction
├── llm/
│ ├── __init__.py
│ ├── llm_helpers.py # Helper functions for LLM interactions (API calls, response handling)
│ └── llm_ranking.py # LLM-based ranking logic for document similarity refinement
├── main.py # Entry point for running similarity analysis
├── requirements.txt # Project dependencies
├── similarity/
│ ├── __init__.py
│ ├── bm25_similarity.py # Implements BM25-based keyword similarity scoring
│ ├── similarity_combiner.py # Combines various similarity metrics into a final score
│ ├── similarity_ranking.py # Converts similarity scores to discrete rankings
│ └── similarity_refinement.py # Refines top-ranked similarities using embeddings
├── similarity_ranker_all.py # Comprehensive script that integrates all similarity ranking stages
└── utils/
  └── __init__.py # Placeholder for utility functions
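The tree lists `bm25_similarity.py` as the home of BM25-based keyword scoring. As a rough illustration of what such scoring involves, the sketch below implements the standard Okapi BM25 formula with the standard library only. The whitespace tokenizer, the function names, and the `k1`/`b` defaults are assumptions, not the project's actual code.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real pipeline would likely do more."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    query_terms = tokenize(query)
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF; the "+ 1.0" keeps it non-negative for common terms.
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Longer-than-average documents are penalized via b.
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "random projection reduces embedding dimensionality",
    "bm25 ranks documents by keyword overlap",
    "arangodb stores similarity results",
]
scores = bm25_scores("bm25 keyword ranking", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print("best match:", docs[best])
```

To compare two documents rather than a query and a document, one document's text can be passed as the query. Because BM25 only rewards exact term overlap, it complements the embedding-based stages rather than replacing them.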
To set up the project, follow these steps:
- Clone the repository:

      git clone https://github.com/your-username/similarity-reranker.git
      cd similarity-reranker

- Create a virtual environment:

      python3 -m venv venv
      source venv/bin/activate  # For Linux/Mac
      venv\Scripts\activate     # For Windows

- Install dependencies:

      pip install -r requirements.txt

- Set up environment variables:
  - Create a .env file in the root directory and set your OpenAI API key and ArangoDB connection settings:

        OPENAI_API_KEY=your-openai-key
        ARANGODB_HOST=http://localhost:8529
        ARANGODB_USER=root
        ARANGODB_PASSWORD=openSesame
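Once these four variables are present in the process environment, application code can read and validate them in one place. The sketch below shows one way to do that; the `Settings` class and `load_settings` function are hypothetical names, not part of the project. How the .env file gets loaded into the environment is not stated here (a package such as python-dotenv is one common option), so the example takes a plain mapping and defaults to `os.environ`.

```python
import os
from dataclasses import dataclass

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "ARANGODB_HOST",
    "ARANGODB_USER",
    "ARANGODB_PASSWORD",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the credentials the reranker needs."""
    openai_api_key: str
    arangodb_host: str
    arangodb_user: str
    arangodb_password: str


def load_settings(env=None):
    """Read the required variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of deep inside an API call.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        arangodb_host=env["ARANGODB_HOST"],
        arangodb_user=env["ARANGODB_USER"],
        arangodb_password=env["ARANGODB_PASSWORD"],
    )


# Demo with an explicit mapping; real code would call load_settings() with no arguments.
example_env = {
    "OPENAI_API_KEY": "your-openai-key",
    "ARANGODB_HOST": "http://localhost:8529",
    "ARANGODB_USER": "root",
    "ARANGODB_PASSWORD": "openSesame",
}
settings = load_settings(example_env)
print(settings.arangodb_host)
```

Keeping the .env file out of version control avoids committing the API key and database password by accident.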
config_model.py defines the configuration schemas for the project, covering embeddings, LLM, and database settings.
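The actual contents of config_model.py are not shown here, so the following is only a sketch of what a schema grouped into embeddings, LLM, and database sections could look like. Every class name, field name, and default value below is hypothetical, and the real module may use a different library or layout entirely.

```python
from dataclasses import dataclass, field


@dataclass
class EmbeddingConfig:
    model_name: str = "bert-base-uncased"  # hypothetical default
    projected_dim: int = 128               # target size after random projection

    def __post_init__(self):
        if self.projected_dim <= 0:
            raise ValueError("projected_dim must be positive")


@dataclass
class LLMConfig:
    model: str = "example-llm"  # placeholder, not a real model name
    temperature: float = 0.0
    top_k: int = 10             # how many candidates to send for LLM refinement


@dataclass
class DatabaseConfig:
    host: str = "http://localhost:8529"
    username: str = "root"
    collection: str = "similarities"  # hypothetical collection name


@dataclass
class RerankerConfig:
    embeddings: EmbeddingConfig = field(default_factory=EmbeddingConfig)
    llm: LLMConfig = field(default_factory=LLMConfig)
    database: DatabaseConfig = field(default_factory=DatabaseConfig)

    @classmethod
    def from_dict(cls, data):
        """Build a config from a nested dict, e.g. one parsed from a config file."""
        return cls(
            embeddings=EmbeddingConfig(**data.get("embeddings", {})),
            llm=LLMConfig(**data.get("llm", {})),
            database=DatabaseConfig(**data.get("database", {})),
        )


# Override one value and keep the defaults for everything else.
config = RerankerConfig.from_dict({"embeddings": {"projected_dim": 64}})
print(config.embeddings.projected_dim, config.llm.top_k)
```

Validating values when the config object is built, as `__post_init__` does here, surfaces a bad setting before any embedding or LLM work starts.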