An online conference for everything LLMs, part of the LLM fine-tuning course.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
Guest speaker: Ben Clavié
Date: June 11, 2024
MVPs with a twist
LLMs are powerful, but they have limitations: their knowledge is fixed in their weights, and their context window is limited. Worse, when they don't know something, they might just make it up. RAG, for Retrieval Augmented Generation, has emerged as a way to mitigate both of those problems. RAG combines Information Retrieval (IR) tools with LLMs, acting as a bridge between your documents, wherever they're stored, and the model, allowing it to answer questions based on their content. Imagine a computer: the LLM is the powerful CPU, its context window is the RAM, and RAG is the tool that retrieves information from your hard drive (the database) for the LLM to process.

However, implementing RAG effectively is more complex than it seems, and the nitty-gritty of what makes good retrieval good is rarely talked about: no, cosine similarity is, in fact, not all you need. In this workshop, we will explore what helps build a robust RAG pipeline, and how simple insights from retrieval research can greatly improve your RAG efforts. We'll cover key topics like BM25, re-ranking, indexing, domain specificity, evaluation beyond LGTM@few, and filtering. Be prepared for a whole new crowd of incredibly useful buzzwords to enter your vocabulary.
00:00 Introduction
Hamel introduces Ben Clavié, a researcher at Answer.ai with a strong background in information retrieval and the creator of the RAGatouille library.
00:48 Ben’s Background
Ben shares his journey into AI and information retrieval, his work at Answer.ai, and the open-source libraries he maintains, including rerankers.
02:20 Agenda
Ben defines Retrieval-Augmented Generation (RAG), clarifies common misconceptions, and explains that RAG is not a silver bullet or an end-to-end system.
05:01 RAG Basics and Limitations
Ben explains the basic mechanics of RAG, emphasizing that it is simply the process of stitching retrieval and generation together, and discusses common failure points.
06:29 RAG MVP Pipeline
Ben breaks down the simple RAG pipeline, including model loading, data encoding, cosine similarity search, and obtaining relevant documents.
07:54 Vector Databases
Ben explains the role of vector databases in handling large-scale document retrieval efficiently and their place in the RAG pipeline.
08:46 Bi-Encoders
Ben describes bi-encoders, their efficiency in pre-computing document representations, and their role in quick query encoding and retrieval.
11:24 Cross-Encoders and Re-Ranking
Ben introduces cross-encoders, their computational expense, and their ability to provide more accurate relevance scores by encoding query-document pairs together.
14:38 Importance of Keyword Search
Ben highlights the enduring relevance of keyword search methods like BM25 and their role in handling specific terms and acronyms effectively.
15:24 Integration of Full-Text Search
Ben discusses the integration of full-text search (TF-IDF) with vector search to handle detailed and specific queries better, especially in technical domains.
16:34 TF-IDF and BM25
Ben explains TF-IDF, BM25, and their implementation in modern retrieval systems, emphasizing their effectiveness despite being older techniques.
19:33 Combined Retrieval Approach
Ben illustrates a combined retrieval approach using both embeddings and keyword search, recommending a balanced weighting of scores.
19:22 Metadata Filtering
Ben emphasizes the importance of metadata in filtering documents, providing examples and explaining how metadata can significantly improve retrieval relevance.
22:37 Full Pipeline Overview
Ben presents a comprehensive RAG pipeline incorporating bi-encoders, cross-encoders, full-text search, and metadata filtering, showing how to implement these steps in code.
26:05 Q&A Session Introduction
26:14 Fine-Tuning Bi-Encoder and Cross-Encoder Models
Ben discusses the importance of fine-tuning bi-encoder and cross-encoder models for improved retrieval accuracy, emphasizing the need to make the bi-encoder more loose and the cross-encoder more precise.
26:59 Combining Scores from Different Retrieval Methods
A participant asks about combining scores from different retrieval methods. Ben explains the pros and cons of weighted averages versus taking top candidates from multiple rankers, emphasizing the importance of context and data specifics.
29:01 The Importance of RAG as Context Lengths Get Longer
Ben reflects on how RAG may evolve as LLM context lengths get larger, emphasizing that long context windows are not a silver bullet.
30:06 Chunking Strategies for Long Documents
Ben discusses effective chunking strategies for long documents, including overlapping chunks and ensuring chunks do not cut off sentences, while considering the importance of latency tolerance in production systems.
30:56 Fine-Tuning Encoders and Advanced Retrieval with ColBERT
Ben discusses when to fine-tune your encoders and explains ColBERT for advanced retrieval.
Summaries provided by Hamel: https://parlance-labs.com/talks/rag/ben.html
The notes below are based on the slides.
(WIP)
I do R&D at Answer.AI under Jeremy Howard, with other awesome people. Prior to joining Answer.AI, I worked in a variety of NLP/Information Retrieval roles, eventually moving to consulting. I made the RAGatouille library, which makes ColBERT friendlier to use, and I also maintain the rerankers lib (more on that in a few slides!). If you know me, it's most likely via Twitter, at @bclavie.
Overall theme: Loose presentation of the core Retrieval Basics, as they should exist in all RAG pipelines:
- Rant: Retrieval was not invented in December 2022
- The “compact MVP”: Bi-encoder single vector embeddings and cosine similarity are all you need
- What’s a cross-encoder and why do I need it?
- Tf-idf and full text search is so 2000s 1990s 1980s 1970s, there’s no way it’s still relevant, right?
- Metadata Filtering: when not all content is potentially useful, don’t make it harder than it needs to be!
- “Compact MVP++”: All of the above in 30 lines or less.
- Bonus: Yes, one vector is good, but how about many of them?
What I won’t be talking about today:
- ❌ How to systematically monitor and improve RAG systems (See Jason & Dan’s upcoming course for that!)
- ❌ Evaluations: These are far too important to be covered quickly, and Jo Bergum will be covering how to efficiently do them in his upcoming talk.
- ❌ Benchmarks/Paper references: in the interest of time & space, we’ll avoid big scary Table 3. and Figure 2. in those slides (except once).
- ❌ An overview of all the best performing models
- ❌ Synthetic data and training
- ❌ All the approaches you could actually use (sparse models, ColBERT…), which go beyond the very basics!
- RAG is not:
- A new paradigm
- A framework
- An end-to-end system
- Something created by Jason Liu in his endless quest for a Porsche
- RAG is the act of stitching together Retrieval and Generation to ground the latter
- The Retrieval part comes from Information Retrieval, a very active field of research
- The Generation part is what’s handled by LLMs
- “Good RAG” is made up of good components:
- Good retrieval pipeline
- Good generative model
- Good way of linking them up
The most compact (& most common) deep retrieval pipeline boils down to a very simple process:
{add diagram}
- The vector DB in this example is just np.array!
- A key point of using a vector DB (or an index) is to allow approximate search, so you don't have to compute too many cosine similarities.
- You don't actually need one to search through vectors at small scales: any modern CPU can search through hundreds of vectors in milliseconds.
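This compact MVP can be sketched end-to-end in a few lines. Here, a toy hashed bag-of-words function stands in for a real bi-encoder (in practice you'd use something like a sentence-transformers model), and the "vector DB" really is just an np.array:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashed bag-of-words embedding; a stand-in for a real
    bi-encoder model, just to show the pipeline's shape."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "BM25 is a strong keyword search baseline",
    "Cross-encoders rescore query-document pairs",
    "Paris is the capital of France",
]

# The "vector DB": a numpy array of pre-computed document embeddings.
doc_vectors = np.stack([embed(d) for d in documents])

def search(query: str, k: int = 2) -> list[str]:
    # Dot product of unit vectors == cosine similarity.
    scores = doc_vectors @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

print(search("what is the capital of France?"))
```

At this scale, the brute-force matrix-vector product is all the "index" you need; an approximate-search vector DB only matters once the document count grows.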
- The representation method from the previous slides is commonly referred to as using “bi-encoders”.
- Bi-encoders are (generally) used to create single-vector representations. They pre-compute document representations.
- Document and query representations are computed entirely separately; they aren't aware of each other.
- Thus, all you need to do at inference is encode your query and search for similar document vectors.
- This is very computationally efficient, but comes with retrieval performance tradeoffs.
- So if documents & query being unaware of each other is bad, how do we fix it?
- The most common approach is using cross-encoders, which encode the query and document together to produce a query-aware relevance score.
- However, it's not computationally realistic to compute query-aware document representations for every single query-document pair, every time a new query comes up (imagine doing that against every Wikipedia paragraph!)
- You might have also heard of other re-ranking approaches: RankGPT/RankLLM, T5-based rerankers, etc.
- Their methods differ, but the core idea is the same: leverage a powerful but computationally expensive model to score only a subset of your documents, previously retrieved by a more efficient model.
- There are many models for you to try out: some of them API-based (Cohere, Jina…), some you can run locally (such as mixedbread). Luckily, I have a library to make that easy: rerankers.
- With the addition of a re-ranking step, this is what your Retrieval pipeline now looks like:
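The retrieve-then-rerank shape can be sketched like so. Both scorers below are toy word-overlap stand-ins, not real models: in practice the first stage is your bi-encoder's vector search and the second stage a real cross-encoder (or a library like rerankers wrapping one).

```python
# Two-stage pipeline: a cheap first stage returns candidates from the
# whole corpus, then an expensive scorer re-ranks only that shortlist.

def first_stage(query: str, docs: list[str], k: int) -> list[str]:
    # Stand-in for bi-encoder vector search: naive token overlap.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for a real cross-encoder, which would jointly encode
    # the (query, document) pair to produce a relevance score.
    d = doc.lower().split()
    return sum(1.0 for t in query.lower().split() if t in d) / (len(d) ** 0.5)

def rerank(query: str, docs: list[str], k: int = 10, top_n: int = 3) -> list[str]:
    candidates = first_stage(query, docs, k)   # cheap, runs over all docs
    scored = [(cross_encoder_score(query, d), d) for d in candidates]  # expensive, only k docs
    return [d for _, d in sorted(scored, reverse=True)[:top_n]]
```

The key property is that the expensive model only ever sees `k` documents per query, no matter how large the corpus is.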
- Semantic search via embeddings is powerful, but compressing information from hundreds of tokens into a single vector is bound to lose information.
- Embeddings learn to represent information that is useful to their training queries.
- This training data will never be fully representative, especially when you use the model on your own data, on which it hasn't been trained.
- Additionally, humans love to use keywords. We have very strong tendencies to notice and use certain acronyms, domain-specific words, etc.
- To capture all this signal, you should ensure your pipeline uses keyword search.
- Keyword search, also called “full-text search”, is built on old technology: BM25, powered by tf-idf (a way of representing text that weighs down words that are common).
- An ongoing joke is that information retrieval has progressed slowly because BM25 is too strong a baseline.
- BM25 is especially powerful on longer documents and documents containing a lot of domain-specific jargon.
- Its inference-time compute overhead is virtually unnoticeable, making it a near free-lunch addition to any pipeline.
Results table from BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (2021), Thakur et al. This paper introduces BEIR, aka the retrieval part of MTEB.
With text search and reranking, this is what your pipeline now looks like:
{add diagram}
- An extremely important component of production retrieval is metadata filtering.
- Outside of academic benchmarks, documents do not exist in a vacuum. There's a lot of metadata around them, some of which can be very informative.
- Take a query asking for the cruise division's financial report for Q4 2022. There are many ways semantic search can fail here, the two main ones being:
- The model must accurately represent all of “financial report”, but also “cruise division”, “Q4”, and “2022”, in a single vector, otherwise it will fetch documents that look relevant but don't meet one or more of those criteria.
- If the number of documents you retrieve (“k”) is set too high, you will be passing irrelevant financial reports to your LLM, hoping it manages to figure out which set of numbers is correct.
- It's perfectly possible that vector search would succeed for this query, but it's a lot more likely that it will fail in at least one way.
- However, this is very easy to mitigate: there are entity detection models, such as GliNER, which can very easily extract zero-shot entity types from text.
- All you need to do is ensure that business/query-relevant information is stored alongside their associated documents.
- You can then use the extracted entities to pre-filter your documents, ensuring you only perform your search on documents whose attributes are related to the query.
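A minimal sketch of that pre-filtering step, assuming entities have already been extracted from the query (e.g. by GliNER). The `Doc` fields and entity names here are hypothetical examples, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    division: str   # example metadata fields: store whatever
    period: str     # business-relevant attributes you have

docs = [
    Doc("Q4 2022 cruise revenue grew 12%...", division="cruise", period="Q4-2022"),
    Doc("Q4 2022 hotel results...", division="hotels", period="Q4-2022"),
    Doc("Q3 2022 cruise results...", division="cruise", period="Q3-2022"),
]

# Entities as an entity-detection model might extract them from a query
# like "financial report for the cruise division, Q4 2022":
query_entities = {"division": "cruise", "period": "Q4-2022"}

# Pre-filter: only these candidates go on to vector/keyword search,
# so the embedding no longer has to carry "Q4" and "cruise" by itself.
candidates = [d for d in docs
              if d.division == query_entities["division"]
              and d.period == query_entities["period"]]
print([d.text for d in candidates])
```

Most vector DBs expose this same idea natively as metadata/attribute filters applied before (or alongside) the similarity search.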
GliNER demo from Tom Aarsen on Hugging Face Spaces, based on GLiNER, introduced in GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer (2023), Zaratiana et al. (Try it if you haven't already; it's a massive game-changer for any pipeline that could use robust entity detection with little overhead!)
With this final additional component, this is what your MVP Retrieval pipeline should now look like:
This does look scarier (especially if you have to fit it into a slide), but it's very simple to implement.

The Final Compact MVP++
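The slide code itself isn't reproduced in these notes, but a rough sketch of the same shape, with toy stand-ins for the embedding model, BM25, and the cross-encoder, could look like this: metadata filtering, then a weighted hybrid first stage, then re-ranking.

```python
import math
from collections import Counter

def keyword_score(query, doc):
    # Stand-in for BM25 full-text scoring.
    tf = Counter(doc.lower().split())
    return sum(tf[t] for t in query.lower().split())

def vector_score(query, doc):
    # Stand-in for bi-encoder cosine similarity: normalized token overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / math.sqrt(len(q) * len(d) or 1)

def cross_encoder_score(query, doc):
    # Stand-in for a cross-encoder re-ranker.
    d = doc.lower().split()
    return sum(1.0 for t in query.lower().split() if t in d) / math.sqrt(len(d))

def retrieve(query, docs, metadata, filters, k=10, top_n=3, alpha=0.7):
    # 1. Metadata pre-filtering: only search plausible candidates.
    pool = [d for d, m in zip(docs, metadata)
            if all(m.get(key) == val for key, val in filters.items())]
    # 2. Hybrid first stage: weighted blend of vector and keyword scores.
    shortlist = sorted(pool, key=lambda d: -(alpha * vector_score(query, d)
                                             + (1 - alpha) * keyword_score(query, d)))[:k]
    # 3. Cross-encoder re-ranking of the shortlist only.
    return sorted(shortlist, key=lambda d: -cross_encoder_score(query, d))[:top_n]
```

Swap the three stand-in scorers for a real embedding model, a real BM25 index, and a real re-ranker and you have the whole MVP++ in roughly the "30 lines" the slides promise.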
- This is the full implementation of all the tricks discussed.
- It might look slightly unfriendly, but there is actually very little to parse!
- Let's shed the data loading and see what's going on…

That's all folks
- There's a lot more to cover, but this is your ideal quick MVP!
- Most other improvements are also very valuable, but with a decreasing cost-effort ratio.
- It's definitely worth learning about sparse (like SPLADE) and multi-vector methods (like ColBERT) if you're interested – feel free to bug me on the Discord!
- You should watch Jason's talk about RAG systems and Jo's upcoming talk about retrieval evaluations!
- Any questions?
Questions?
(watch the video at 00:00:00)
(WIP)
List of links from the session:
- Ben Clavié, Twitter / X : https://x.com/bclavie
- Easily use and train state-of-the-art late-interaction retrieval methods (ColBERT) in any RAG pipeline: https://github.com/bclavie/RAGatouille
- A lightweight unified API for various reranking models: https://github.com/AnswerDotAI/rerankers
- A Hackers' Guide to Language Models: https://www.youtube.com/watch?v=jkrNMKz9pWU
- Excalidraw: https://excalidraw.com/
- GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer: https://arxiv.org/abs/2311.08526
- Fine-Tuning with Sentence Transformers: https://www.sbert.net/docs/sentence_transformer/training_overview.html
- Elastic, Dense vector field type: https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
Source: Discord (thanks to CodingWitcher)
Channel: WIP
Some highlights:
(WIP)