Raphael Costa (raphaelcosta)

Question: Should I avoid using RAG for my AI application after reading that "RAG is dead" for coding agents?

Many developers are confused about when and how to use RAG after reading articles claiming "RAG is dead." Understanding what RAG actually means versus the narrow marketing definitions will help you make better architectural decisions for your AI applications.

Answer: The viral article claiming RAG is dead specifically argues against using naive vector database retrieval for autonomous coding agents, not RAG as a whole. This is a crucial distinction that many developers miss due to misleading marketing.

RAG simply means Retrieval-Augmented Generation - using retrieval to provide relevant context that improves your model's output. The core principle remains essential: your LLM needs the right context to generate accurate answers. The question isn't whether to use retrieval, but how to retrieve effectively.

For coding
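To make the distinction concrete, here is a minimal sketch of the core RAG loop. `embed`, `vector_search`, and `llm_complete` are hypothetical helpers standing in for your embedding model, vector store, and LLM client:

def answer_with_rag(question: str, k: int = 5) -> str:
    # Embed the question and retrieve the k most similar chunks
    query_vector = embed(question)
    docs = vector_search(query_vector, top_k=k)
    # Assemble the retrieved text into the prompt so the LLM answers in context
    context = "\n\n".join(doc.text for doc in docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_complete(prompt)

Whether `vector_search` is a naive embedding lookup, grep over a repo, or an agentic tool call is exactly the design question the article debates; the retrieval-then-generate loop itself is unchanged.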

@Olshansk
Olshansk / llm.sh
Last active September 28, 2024 00:47
A bash wrapper around Python's mlx_whisper to leverage the GPU on a Mac for transcription
# A one-liner to leverage the GPU on a Mac to transcribe audio files
# Inspired by https://simonwillison.net/2024/Aug/13/mlx-whisper/
llm_transcribe_recording () {
  local file_path="$1"
  python3 -c "
import mlx_whisper
result = mlx_whisper.transcribe('$file_path', path_or_hf_repo='mlx-community/distil-whisper-large-v3')
print(result['text'])
"
}
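Example usage (the file path is an assumption; requires `pip install mlx-whisper` on an Apple Silicon Mac):

# Transcribe a recording and save the text
llm_transcribe_recording ~/Downloads/meeting.m4a > transcript.txt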
@veekaybee
veekaybee / normcore-llm.md
Last active June 27, 2025 19:34
Normcore LLM Reads

Anti-hype LLM reading list

Goals: Add links that are reasonable and good explanations of how stuff works. No hype and no vendor content if possible. Practical first-hand accounts of models in prod eagerly sought.

Foundational Concepts


Pre-Transformer Models

@adrienbrault
adrienbrault / llama2-mac-gpu.sh
Last active April 8, 2025 13:49
Run Llama-2-13B-chat locally on your M1/M2 Mac with GPU inference. Uses 10GB RAM. UPDATE: see https://twitter.com/simonw/status/1691495807319674880?s=20
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it
make clean
LLAMA_METAL=1 make
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
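# The gist preview cuts off here. A hedged completion: fetch the quantized
# weights (the TheBloke Hugging Face repo below is an assumption about the
# source) and run with Metal GPU offload enabled.
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"
# Run interactively; -ngl 1 offloads work to the GPU via Metal
./main -m "${MODEL}" -ngl 1 --color -p "Hello, world"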
@oneryalcin
oneryalcin / sse_fast_api.py
Last active August 2, 2024 16:40
Server-Sent Events (SSE) with FastAPI and (partially) Langchain
# I couldn't get generators returned from chains, so I had to do a bit of low-level SSE. Hope this is useful.
# You'll probably use another vector store instead of OpenSearch, but if you want to mimic what I did here,
# please use the fork of `OpenSearchVectorSearch` at https://github.com/oneryalcin/langchain
import json
import os
import logging
from typing import List, Generator
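The preview ends at the imports. A minimal sketch of the low-level SSE pattern the comments describe, with a plain token source standing in for the LangChain chain (the route and helper names are made up):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sse_format(payload: dict) -> str:
    # Each SSE frame is "data: <payload>\n\n"
    return f"data: {json.dumps(payload)}\n\n"

@app.get("/stream")
def stream() -> StreamingResponse:
    def event_source() -> Generator[str, None, None]:
        for token in ["Hello", ",", " world"]:  # stand-in for chain output
            yield sse_format({"token": token})
    return StreamingResponse(event_source(), media_type="text/event-stream")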
@python273
python273 / app.py
Last active December 29, 2024 23:37
Flask Streaming Langchain Example
import os
os.environ["OPENAI_API_KEY"] = ""
from flask import Flask, Response, request
import threading
import queue
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.schema import AIMessage, HumanMessage, SystemMessage
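The preview stops after the imports. A minimal sketch of the queue-based streaming pattern these imports suggest (the handler class, route, and sentinel value are assumptions, not the gist's exact code):

from langchain.callbacks.base import BaseCallbackHandler

class QueueCallback(BaseCallbackHandler):
    # Push each freshly generated token onto a queue
    def __init__(self, q):
        self.q = q
    def on_llm_new_token(self, token, **kwargs):
        self.q.put(token)

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    q = queue.Queue()
    llm = ChatOpenAI(streaming=True, callbacks=[QueueCallback(q)])

    def worker():
        # Run generation in a background thread so the response can stream
        llm([HumanMessage(content=request.json["message"])])
        q.put(None)  # sentinel: generation done

    threading.Thread(target=worker).start()

    def generate():
        while (token := q.get()) is not None:
            yield token

    return Response(generate(), mimetype="text/plain")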

Hydra Performance Microbenchmark

Important: This microbenchmark is not intended to represent any real workload. Compression ratios, and therefore performance, will depend heavily on the specific workload. It exists only to illustrate a contrived, "columnar-friendly" workload that showcases the benefits of columnar storage.

Schema

@Mr0grog
Mr0grog / constant-pooled-data.mjs
Created April 26, 2022 03:23
Parse Airtable’s ConstantPooledData format.
/**
 * Parse Airtable's "ConstantPooledData" format. They recently started using
 * this format to compress some API responses, and it appears to be a
 * home-grown format.
 *
 * Call `parseData()` if you have an object with data (e.g. a JSON-parsed API
 * response body).
 *
 * Call `parseString()` if you have a raw string of data (e.g. an API response
 * body).
 */
@tabishiqbal
tabishiqbal / _form.html.erb
Last active January 15, 2025 21:39
Ruby on Rails Tom-Select Example with Stimulus controller
<%= form_with(model: team) do |form| %>
  <div>
    <%= form.label :name %>
    <%= form.text_field :name, class: "input" %>
  </div>
  <div>
    <%= form.select :user_id, {}, {placeholder: "Select user"}, {class: "w-full", data: { controller: "select", select_url_value: users_path }} %>
  </div>
<% end %>
@coco98
coco98 / track_all_tables.py
Last active September 16, 2021 19:24
Track tables in python (Hasura)
import requests

# Fetch existing tables from the information schema
tables = requests.post('http://localhost:8080/v1/query', json={
    "type": "select",
    "args": {
        "table": {"schema": "information_schema", "name": "tables"},
        "columns": ["table_name"],
        "where": {"table_schema": {"$eq": "public"}}
    }
})
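The preview cuts off before the tracking step. A hedged sketch of the likely follow-up, assuming Hasura's v1/query `track_table` metadata call and that the select above returns a list of row dicts:

# Track each public table so it becomes queryable through Hasura
for row in tables.json():
    requests.post('http://localhost:8080/v1/query', json={
        "type": "track_table",
        "args": {"schema": "public", "name": row["table_name"]}
    })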