@sergeliatko
Created August 16, 2024 11:43
SIMANTIKS API - Outline generated from structure.json
Semantic Chunking - 3 Methods for Better RAG
Preface: Introduction to Semantic Chunkers in RAG
Introduction to Semantic Chunkers for Text Modality in Retrieval-Augmented Generation (RAG).
Introduction to Three Types of Semantic Chunkers.
Introduction to the Semantic Chunkers Library and Use of the Chunkers Intro Notebook in Python via Colab.
Prerequisites
Prerequisites Installation: Semantic Chunkers and Hugging Face Datasets.
Data Testing for Chunking Methods: Impact on Latency and Quality of Results.
Data Setup
Introduction to the Dataset and Structure of AI arXiv Papers.
Limiting the Amount of Text Because Chunking Is Resource-Intensive.
Requirement of Embedding Model for Semantic Chunking.
Use of OpenAI's text-embedding-ada-002 Model and API Key Requirements.
1. Statistical Semantic Chunking
Introduction to the Statistical Chunking Method and Its Advantages.
Explanation of Statistical Chunker Functionality and Similarity Threshold Calculation.
Overview of Initial Document Chunking Results and Preliminary Assessment.
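The outline does not spell out how the statistical chunker calculates its similarity threshold, so the following is only a rough sketch of the general idea: derive the threshold from the similarity scores themselves instead of asking the user for one. The rule used here (mean minus one standard deviation) and the toy bag-of-words `embed` function are assumptions made for illustration; they are not the library's actual calculation or an OpenAI embedding.

```python
import math
import statistics
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def statistical_chunks(sentences):
    """Split where similarity drops below a threshold derived from the data."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else [], None
    vectors = [embed(s) for s in sentences]
    sims = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # Assumed rule for this sketch: mean minus one (population) standard deviation.
    threshold = statistics.mean(sims) - statistics.pstdev(sims)
    chunks = [[sentences[0]]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:
            chunks.append([sentence])   # similarity dropped: start a new chunk
        else:
            chunks[-1].append(sentence)
    return chunks, threshold

sentences = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell sharply",
    "stocks fell again",
]
chunks, threshold = statistical_chunks(sentences)
print(chunks, round(threshold, 3))
```

No threshold is passed in; it falls out of the distribution of scores for the document being chunked.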
2. Consecutive Semantic Chunking
Recommendation Order for Consecutive Chunking Method.
Score Threshold Requirements for Various Text-Embedding Models.
User Input and Performance Adjustment for Chunker Threshold.
Explanation of Consecutive Chunker Functionality.
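As a minimal sketch of the consecutive approach, assuming it compares each sentence's embedding with the previous sentence's embedding and starts a new chunk when the score falls below a user-supplied threshold: the `embed` function below is a toy bag-of-words stand-in, not a real text-embedding model, and the threshold value is chosen only to suit that toy.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def consecutive_chunks(sentences, threshold):
    """Compare each sentence only with the one directly before it."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < threshold:
            chunks.append([sentence])    # score below threshold: split here
        else:
            chunks[-1].append(sentence)
        previous = current
    return chunks

sentences = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell sharply",
    "stocks fell again",
]
chunks = consecutive_chunks(sentences, threshold=0.3)
print(chunks)
```

Because the threshold is an input, a value that works for one embedding model generally has to be re-tuned for another, which is the adjustment the items above refer to.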
3. Cumulative Semantic Chunking
Cumulative Chunker Method: Step-by-Step Embedding Process and Similarity Comparison.
Higher Time and Cost Due to Increased Embeddings Creation.
Comparison of Noise Resistance and Performance of Chunkers.
Performance Analysis and Threshold Adjustment of the Chunker.
Threshold Adjustment for Improved Performance Over Consecutive Chunker.
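The cumulative approach can be sketched in the same spirit, assuming each step embeds the text accumulated in the current chunk and compares it with the embedding of the next sentence. The toy `embed` function and the `embed_calls` counter are illustrative assumptions; the counter is there only to show why this method creates more embeddings than the consecutive one.

```python
import math
from collections import Counter

embed_calls = 0  # counts how many embeddings the sketch creates

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    global embed_calls
    embed_calls += 1
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cumulative_chunks(sentences, threshold):
    """Compare the running chunk text with the next sentence at every step."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for sentence in sentences[1:]:
        running = embed(" ".join(chunks[-1]))  # re-embed the whole chunk so far
        upcoming = embed(sentence)
        if cosine(running, upcoming) < threshold:
            chunks.append([sentence])          # score below threshold: split here
        else:
            chunks[-1].append(sentence)
    return chunks

sentences = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell sharply",
    "stocks fell again",
]
chunks = cumulative_chunks(sentences, threshold=0.3)
print(chunks, embed_calls)
```

In this sketch four sentences need six embedding calls, and the text re-embedded at each step grows with the chunk, whereas a sentence-by-sentence comparison would embed each sentence once.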
Multi-modal Chunking
Introduction to Modalities Handled by Different Chunkers.
Statistical Chunker Limitation to Text Modality.
Capabilities and Future Demonstration of the Consecutive Chunker for Video Handling.
Text-Focused Nature of the Cumulative Chunker.
Conclusion and Sign-off for the Semantic Chunkers Presentation.