Brag Your RAG with the MLOps Swag - Madhav Sathe, Google & Jitender Kumar, Publicis Sapient
- Many enterprises use Large Language Models (LLMs) to build applications, but these models are not grounded in enterprise data, creating the need for mechanisms to train or ground LLMs with that data (00:01:47).
- Prompt engineering is a skill that will be widely adopted across industries, as it is the basis for working with LLMs: crafting inputs so the model produces the desired results (00:02:06); see the sketch below.
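As a concrete illustration of the point above, here is a minimal prompt-engineering sketch in Python; the template, domain, and wording are illustrative assumptions, not taken from the talk:

```python
# A minimal sketch of prompt engineering: the same model call, steered by
# instructions, grounding context, and output-format constraints in the
# prompt itself. The template below is hypothetical.

PROMPT_TEMPLATE = """You are a support assistant for an insurance company.
Answer ONLY from the context below; if the answer is not there, say "I don't know."

Context:
{context}

Question: {question}
Answer in at most three sentences."""

def build_prompt(context: str, question: str) -> str:
    """Fill the template; this string is what gets sent to the LLM."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

print(build_prompt("Policy X covers water damage up to $5,000.",
                   "Does policy X cover flood damage?"))
```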
- Most enterprises will use prompt engineering and RAG, but those with the skills and resources may also invest in fine-tuning efforts such as parameter-efficient fine-tuning and full fine-tuning (00:03:44).
- Retrieval-Augmented Generation (RAG) is a technique that grounds LLMs in enterprise data, preventing hallucination; it involves three phases: pre-processing, retrieval, and model inference (00:04:31), sketched below.
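A minimal sketch of the three RAG phases named above, with a stand-in embedding model (`all-MiniLM-L6-v2` via sentence-transformers) and toy documents; the actual demo used different components:

```python
# The three RAG phases: pre-processing, retrieval, model inference.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Phase 1: pre-processing -- chunk enterprise documents and index embeddings.
chunks = ["Policy X covers water damage up to $5,000.",
          "Claims must be filed within 30 days of the incident."]
index = embedder.encode(chunks)                      # shape (n_chunks, dim)

# Phase 2: retrieval -- embed the query, take the nearest chunk by cosine score.
query = "How long do I have to file a claim?"
q = embedder.encode([query])[0]
scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
context = chunks[int(scores.argmax())]

# Phase 3: model inference -- ground the LLM with the retrieved context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# response = llm.generate(prompt)  # placeholder: any LLM client goes here
print(prompt)
```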
- A demo was presented using the open-source product Kopi and a Pinecone vector database to show how RAG can be used to build custom applications (00:07:00).
- A Wikipedia page from 2024 was used to test whether Pinecone and Kopi could communicate with each other and retrieve data (00:10:55).
- That dataset served as the grounding source for context-based answers, and a solution was also built using LangChain and Pinecone without Kopi (00:11:31); a hedged sketch follows.
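One plausible wiring of that LangChain-and-Pinecone path: the index name, API key handling, embedding model, and chunk text below are placeholders (the demo's grounding source was the 2024 Wikipedia page mentioned above), and LangChain's retriever/chain abstractions wrap this same upsert-and-query flow:

```python
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_API_KEY")      # assumption: real key from env/secret
index = pc.Index("rag-demo")               # assumed index name; dim must match model
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in model, dim 384

# Pre-processing: embed chunks of the grounding dataset and upsert them.
chunks = ["<chunked text from the 2024 Wikipedia grounding page>"]
embs = embedder.encode(chunks)
index.upsert(vectors=[
    {"id": f"chunk-{i}", "values": e.tolist(), "metadata": {"text": t}}
    for i, (t, e) in enumerate(zip(chunks, embs))
])

# Retrieval: the top-k chunks become the context an LLM answers from.
hits = index.query(vector=embedder.encode(["your question"])[0].tolist(),
                   top_k=3, include_metadata=True)
context = "\n".join(m.metadata["text"] for m in hits.matches)
```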
- An MLOps pipeline for RAG was created with a serving subsystem, an embedding subsystem, and a data ingestion pipeline, and was tested on Ray and GKE (00:13:30).
- The embedding subsystem runs a Ray cluster on GKE, with a head node and worker pods, and reads data from a GCS bucket via Cloud Storage FUSE (00:16:02); see the sketch below.
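A hedged sketch of that embedding subsystem, assuming Ray Data, a sentence-transformers model, and a bucket mounted at /data by the Cloud Storage FUSE CSI driver; paths, model, and batch sizes are placeholders, not the demo code:

```python
import ray
from sentence_transformers import SentenceTransformer

ray.init()  # connect to the Ray cluster: head node plus worker pods on GKE

# The bucket is mounted at /data, so object storage reads look like local
# file reads on every pod.
docs = ray.data.read_text("/data/corpus/")

class Embed:
    def __init__(self):
        # One model instance per actor; use device="cuda" on GPU worker pods.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

# Fan the embedding work out across an actor pool of Ray workers.
embeddings = docs.map_batches(Embed, concurrency=4, batch_size=64)
embeddings.write_parquet("local:///data/embeddings/")  # lands back in the bucket
```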
- Cloud Storage FUSE enables optimal utilization of GPUs during training or embedding by streaming data directly from object storage (00:17:42).
- Cloud Storage FUSE provides a seamless experience for developers, letting them access buckets like file systems (00:18:23), as the snippet below shows.
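In code, that seamlessness is just ordinary file I/O against the mount point; the /data path and object name below are assumptions:

```python
# Stream an object from the mounted bucket with plain file I/O -- no
# explicit download step and no GCS client library involved.
n_bytes = 0
with open("/data/corpus/sample.txt", "rb") as f:      # /data = mounted bucket
    for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
        n_bytes += len(chunk)
print(f"streamed {n_bytes} bytes straight from object storage")
```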
- A high-performance GPU fabric is required for distributed training, allowing GPUs to communicate with each other without network hops (00:19:33).
- Google Kubernetes Engine (GKE) offers native support for multi-networking, giving training jobs massive bandwidth and access to all available network interface cards (NICs) on a machine (00:21:00).
- GKE's A3 machine type features 8 GPUs, 208 vCPUs, and a NIC arrangement of 8 + 1, suitable for massive distributed computing systems (00:21:55); the sketch below shows the traffic pattern this serves.
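The traffic that this fabric and the 8 + 1 NIC layout carry is, above all, the per-step gradient all-reduce of distributed training. A minimal torch.distributed sketch (not from the talk), launched with e.g. `torchrun --nproc_per_node=8` on an 8-GPU A3 node:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL traffic rides the GPU fabric/NICs
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Stand-in gradient shard: 64M floats = 256 MB all-reduced every step.
grad = torch.randn(64 * 1024 * 1024, device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # the bandwidth-bound step
torch.cuda.synchronize()
if rank == 0:
    print("all-reduce done across", dist.get_world_size(), "GPUs")
dist.destroy_process_group()
```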
- Secondary boot disks in GKE enable faster loading of large container images by serving them from a preloaded cache (00:22:27).
- Service extensions in GKE help reduce unwanted requests and prevent potential data leaks, with app-level custom metrics reported back to the load balancer (00:23:05).
- Retrieval-augmented generation provides the grounding aspect, while fine-tuning or supervised fine-tuning requires additional work and a cost-benefit analysis (00:26:22).
- GPUs are necessary even for RAG and prompt-based approaches: they serve the models, run the embedding models, and translate text into vectors (00:28:25); a brief sketch follows.
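To make that concrete, a small hedged sketch: even a pure RAG setup embeds every incoming query before retrieval, and the embedding model (a stand-in model name below) runs far faster on a GPU:

```python
import torch
from sentence_transformers import SentenceTransformer

# Place the embedding model on a GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)  # stand-in model

# Every incoming question is embedded before retrieval, so this runs per query.
vec = embedder.encode(["Which GKE machine type has 8 GPUs?"])[0]
print(device, vec.shape)
```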
- Reference: https://github.com/ray-project/llm-applications