- How AI Agents Work
- Core Architecture
- Tokenization & Embeddings
- Transformer Architecture
- RAG System
- Vector Databases
- Tool Use & Function Calling
- Memory Systems
- Multi-Agent Systems
- Running Models Locally
- Spring Boot RAG Application
- Learning Roadmap
- Production Considerations
- Additional Topics
- Quick Reference
User Input
↓
Context Assembly (System Prompt + History + Tools)
↓
Tokenization → Embeddings
↓
Transformer Neural Network
↓
Next Token Prediction (Probability Distribution)
↓
Tool Decision? (Function Calling)
↓ Yes
Execute Tool → Inject Results → Continue Generation
↓ No
Token-by-Token Generation
↓
Final Response
AI agents don't "think" or "understand" - they predict the next most probable token based on:
- Training data patterns
- Current context window
- System instructions
- Available tools
- Tool Use: Can call external functions (search, calculator, API)
- Planning: Can break down tasks into steps
- Memory: Can reference conversation history
- Reflection: Can critique and improve their output
- Multi-turn reasoning: Can engage in back-and-forth problem solving
"Explain binary search in Python"
Step 1: TOKENIZATION
→ ["Explain", "Ġbinary", "Ġsearch", "Ġin", "ĠPython"]
→ [1234, 5678, 9101, 1121, 3141]
Step 2: EMBEDDINGS
Each token ID → High-dimensional vector
1234 → [0.12, -0.88, 1.04, ..., 0.45] // 768-4096 dimensions
Step 3: TRANSFORMER LAYERS
- Self-Attention (which tokens relate to each other?)
- Feed Forward Networks
- Layer Normalization
- Residual Connections
Repeated 12-96 times depending on model size
Step 4: OUTPUT HEAD
→ Logits for all possible next tokens (~50k vocabulary)
→ Softmax → Probability distribution
→ {"Sure": 0.15, "Binary": 0.12, "Here": 0.08, ...}
Step 5: SAMPLING
Pick next token using:
- Temperature (randomness)
- Top-k (consider only top k tokens)
- Top-p (nucleus sampling)
Step 6: LOOP
Append chosen token, repeat until:
- <|endoftext|> token
- Max length reached
- Stop sequence detected
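For intuition, here is a minimal, self-contained Java sketch of steps 4-6: softmax over toy logits, temperature scaling, and top-k sampling. The vocabulary and numbers are invented for illustration; a real model does this over a ~50k-entry vocabulary.

import java.util.*;

public class SamplingDemo {
    public static void main(String[] args) {
        String[] vocab = {"Sure", "Binary", "Here", "The"};   // toy vocabulary
        double[] logits = {2.1, 1.9, 1.5, 0.3};               // raw scores from the output head
        double temperature = 0.8;
        int topK = 3;

        // Temperature scaling: lower values sharpen the distribution, higher values flatten it
        double[] scaled = Arrays.stream(logits).map(l -> l / temperature).toArray();

        // Softmax turns scores into a probability distribution
        double max = Arrays.stream(scaled).max().orElse(0);
        double[] exp = Arrays.stream(scaled).map(l -> Math.exp(l - max)).toArray();
        double sum = Arrays.stream(exp).sum();
        double[] probs = Arrays.stream(exp).map(e -> e / sum).toArray();

        // Top-k: keep only the k most probable tokens and renormalize their mass
        Integer[] idx = new Integer[probs.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(probs[b], probs[a]));
        double keptMass = 0;
        for (int i = 0; i < topK; i++) keptMass += probs[idx[i]];

        // Sample one token from the truncated distribution
        int chosen = idx[topK - 1];                            // fallback for floating-point edge cases
        double r = new Random().nextDouble() * keptMass;
        for (int i = 0; i < topK; i++) {
            r -= probs[idx[i]];
            if (r <= 0) { chosen = idx[i]; break; }
        }
        System.out.println("Next token: " + vocab[chosen]);
    }
}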
Why not just split by spaces?
Problem:
"running" ≠ "run" (different words in vocab)
Solution: Subword tokenization
"running" → ["run", "ning"]
"unhappiness" → ["un", "happiness"]
Common algorithms:
- BPE (Byte Pair Encoding) - GPT models
- WordPiece - BERT
- SentencePiece - T5, LLaMA
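These algorithms learn their vocabularies and merge rules from data. A toy greedy longest-match tokenizer with a hand-written vocabulary (closer in spirit to WordPiece inference than to real BPE merges) shows the basic fall-back-to-smaller-pieces idea:

import java.util.*;

public class ToyTokenizer {
    // Hand-picked vocabulary for illustration; real vocabularies are learned from data
    static final Set<String> VOCAB = Set.of("un", "run", "ning", "happiness", "happy", "ness");

    static List<String> tokenize(String word) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            // Greedily take the longest vocabulary entry that matches at this position
            int end = word.length();
            while (end > start && !VOCAB.contains(word.substring(start, end))) {
                end--;
            }
            if (end == start) {                      // nothing matched: emit a single character
                tokens.add(word.substring(start, start + 1));
                start++;
            } else {
                tokens.add(word.substring(start, end));
                start = end;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("running"));      // [run, ning]
        System.out.println(tokenize("unhappiness"));  // [un, happiness]
    }
}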
Converting tokens to meaning:
Token: "king" → [0.2, 0.8, -0.3, 0.6, ...]
Token: "queen" → [0.3, 0.7, -0.2, 0.5, ...]
Token: "man" → [0.1, 0.2, -0.1, 0.9, ...]
Token: "woman" → [0.2, 0.1, 0.0, 0.8, ...]
Famous example:
king - man + woman ≈ queen
Why embeddings matter:
- Enable semantic search
- Power RAG systems
- Allow similarity comparisons
- Compress meaning into math
Popular embedding models:
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Open source: BAAI/bge-large, sentence-transformers/all-MiniLM-L6-v2
- Cohere: embed-english-v3.0
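To make "compress meaning into math" concrete, here is a toy sketch of the king - man + woman arithmetic using the made-up 4-dimensional vectors from above. Real embeddings have hundreds to thousands of dimensions, and the analogy only holds approximately.

public class EmbeddingAnalogy {
    // Toy vectors copied from the example above; purely illustrative
    static final double[] KING  = {0.2, 0.8, -0.3, 0.6};
    static final double[] QUEEN = {0.3, 0.7, -0.2, 0.5};
    static final double[] MAN   = {0.1, 0.2, -0.1, 0.9};
    static final double[] WOMAN = {0.2, 0.1,  0.0, 0.8};

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // king - man + woman, component-wise
        double[] result = new double[KING.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = KING[i] - MAN[i] + WOMAN[i];
        }
        // In a good embedding space, QUEEN is the nearest neighbour of this vector
        System.out.printf("cosine(result, queen) = %.3f%n", cosine(result, QUEEN));
        System.out.printf("cosine(result, man)   = %.3f%n", cosine(result, MAN));
    }
}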
Before transformers (RNNs/LSTMs):
- Sequential processing (slow)
- Gradient vanishing
- Limited context
After transformers:
- Parallel processing
- Attention mechanism
- Much longer context windows (in principle; in practice bounded by memory and the quadratic cost of attention)
Question: "The animal didn't cross the street because it was too tired"
What does "it" refer to?
Attention weights:
"it" → "animal": 0.87 (high)
"it" → "street": 0.05 (low)
"it" → "tired": 0.42 (medium)
The model learns: "it" refers to "animal"
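Underneath, this is scaled dot-product attention. A minimal single-query, single-head sketch with toy 2-dimensional vectors follows; the weights above are illustrative, and real models repeat this for every token, in every head, in every layer.

public class AttentionSketch {
    public static void main(String[] args) {
        // Toy key/value vectors for three context tokens and one query ("it")
        double[]   query  = {1.0, 0.0};
        double[][] keys   = {{0.9, 0.1}, {0.1, 0.9}, {0.5, 0.5}};   // "animal", "street", "tired"
        double[][] values = keys;                                    // reuse keys as values for brevity
        String[]   names  = {"animal", "street", "tired"};
        int d = query.length;

        // 1. Scores: dot(query, key_i) / sqrt(d)
        double[] scores = new double[keys.length];
        for (int i = 0; i < keys.length; i++) {
            for (int j = 0; j < d; j++) scores[i] += query[j] * keys[i][j];
            scores[i] /= Math.sqrt(d);
        }

        // 2. Softmax -> attention weights
        double sum = 0;
        double[] weights = new double[scores.length];
        for (int i = 0; i < scores.length; i++) { weights[i] = Math.exp(scores[i]); sum += weights[i]; }
        for (int i = 0; i < weights.length; i++) weights[i] /= sum;

        // 3. Output = weighted sum of value vectors
        double[] out = new double[d];
        for (int i = 0; i < values.length; i++)
            for (int j = 0; j < d; j++) out[j] += weights[i] * values[i][j];

        for (int i = 0; i < names.length; i++)
            System.out.printf("attention(it -> %s) = %.2f%n", names[i], weights[i]);
    }
}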
Input Embeddings
↓
+ Positional Encodings (order matters)
↓
┌─────────────────────────┐
│ Transformer Block x N │
│ │
│ 1. Multi-Head │
│ Self-Attention │
│ │
│ 2. Layer Norm │
│ │
│ 3. Feed Forward │
│ │
│ 4. Layer Norm │
│ │
│ 5. Residual Connection │
└─────────────────────────┘
↓
Output Layer (Logits)
↓
Softmax (Probabilities)
| Model | Parameters | Layers | Hidden Size |
|---|---|---|---|
| GPT-2 Small | 117M | 12 | 768 |
| GPT-3 | 175B | 96 | 12,288 |
| GPT-4 | ~1.7T (rumored) | ? | ? |
| LLaMA 2 7B | 7B | 32 | 4,096 |
| Claude Sonnet 4.5 | Unknown | ? | ? |
Pure LLM limitations:
- Knowledge cutoff (no recent data)
- Hallucinations (makes up facts)
- No private/proprietary data
- Limited context window
┌─────────────────────────────────────────────┐
│ 1. INDEXING (One-time) │
├─────────────────────────────────────────────┤
│ Documents │
│ ↓ │
│ Chunk into pieces │
│ ↓ │
│ Generate embeddings │
│ ↓ │
│ Store in Vector DB │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 2. RETRIEVAL (Per query) │
├─────────────────────────────────────────────┤
│ User Query │
│ ↓ │
│ Convert to embedding │
│ ↓ │
│ Similarity search in Vector DB │
│ ↓ │
│ Retrieve top-k relevant chunks │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 3. GENERATION (Augmented) │
├─────────────────────────────────────────────┤
│ System: "Answer ONLY from context" │
│ Context: [Retrieved chunks] │
│ User: [Original query] │
│ ↓ │
│ LLM generates grounded answer │
└─────────────────────────────────────────────┘
Fixed-size chunking:
Split every 512 tokens
Overlap: 50 tokens
Semantic chunking:
Split on paragraphs
Or section headers
Or sentence boundaries
Recursive chunking:
Try to split on:
1. "\n\n" (paragraphs)
2. "\n" (lines)
3. ". " (sentences)
4. " " (words)
5. "" (characters)
1. Hybrid Search
Combine:
- Dense vectors (semantic)
- Sparse vectors (keyword/BM25)
- Rerank results (see the RRF sketch after this list)
2. Query Transformation
Original: "What's the capital?"
Expanded: "What is the capital city of France?"
3. Self-Query
LLM extracts:
- Query text
- Metadata filters
4. Parent-Child Chunking
Index: Small chunks (better retrieval)
Return: Larger parent chunks (better context)
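For hybrid search (item 1), a common way to combine the dense and sparse result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming each search already returns ranked document IDs:

import java.util.*;

public class ReciprocalRankFusion {
    // Fuse two ranked lists of document IDs; k = 60 is the conventional RRF constant
    static List<String> fuse(List<String> denseRanked, List<String> sparseRanked, int k) {
        Map<String, Double> scores = new HashMap<>();
        addScores(scores, denseRanked, k);
        addScores(scores, sparseRanked, k);
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    private static void addScores(Map<String, Double> scores, List<String> ranked, int k) {
        for (int rank = 0; rank < ranked.size(); rank++) {
            // RRF: score += 1 / (k + rank), so appearing high in either list helps
            scores.merge(ranked.get(rank), 1.0 / (k + rank + 1), Double::sum);
        }
    }

    public static void main(String[] args) {
        List<String> dense  = List.of("doc3", "doc1", "doc7");   // semantic search results
        List<String> sparse = List.of("doc1", "doc9", "doc3");   // BM25 keyword results
        System.out.println(fuse(dense, sparse, 60));             // doc1 and doc3 rise to the top
    }
}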
SQL/NoSQL:
SELECT * FROM docs WHERE content = 'exact match'
❌ Only exact matches
Vector DB:
query_embedding = [0.1, 0.8, -0.3, ...]
results = db.search(query_embedding, top_k=5)
✅ Semantic similarity
1. Cosine Similarity (most common)
similarity = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
2. Euclidean Distance
distance = sqrt(Σ(ai - bi)²)
Range: [0, ∞]
3. Dot Product
score = Σ(ai × bi)
Range: [-∞, ∞]
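The same three metrics as plain Java helpers (vectors as double arrays, no library dependencies):

public class SimilarityMetrics {

    // Cosine similarity: (A · B) / (||A|| × ||B||), range [-1, 1]
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Euclidean distance: sqrt(Σ(ai - bi)²), range [0, ∞)
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Dot product: Σ(ai × bi), unbounded
    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {0.1, 0.8, -0.3};
        double[] b = {0.2, 0.7, -0.1};
        System.out.printf("cosine=%.3f euclidean=%.3f dot=%.3f%n",
                cosine(a, b), euclidean(a, b), dot(a, b));
    }
}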
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed | Production, scale |
| Weaviate | Self-hosted | Open source, flexibility |
| Qdrant | Self-hosted | Performance, Rust |
| Chroma | Embedded | Prototyping, simple |
| FAISS | Library | Research, custom |
| Milvus | Self-hosted | Enterprise |
| pgvector | Postgres ext | Existing Postgres |
HNSW (Hierarchical Navigable Small World)
- Fast approximate search
- Good recall
- Most popular
IVF (Inverted File Index)
- Clusters vectors
- Good for large datasets
- Trade recall for speed
The agent can:
- Decide when to call a tool
- Choose which tool to call
- Generate correct parameters
- Process tool results
- Continue reasoning
User: "What's the weather in NYC?"
Agent thinks:
1. I need current weather data
2. I have a weather_tool available
3. Generate function call
Output:
{
"tool": "get_weather",
"parameters": {
"location": "New York City",
"units": "fahrenheit"
}
}
System executes tool → Returns result
Agent receives:
{
"temperature": 72,
"condition": "sunny",
"humidity": 45
}
Agent continues:
"The weather in NYC is currently 72°F and sunny..."
{
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name or coordinates"
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
1. Information Retrieval
- Web search
- Database query
- API calls
2. Actions
- Send email
- Create calendar event
- Update database
3. Computation
- Calculator
- Code execution
- Data analysis
4. Memory
- Save to memory
- Recall past interactions
- Update user profile
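Tying the weather example together, here is a minimal sketch of the host-side loop that executes a requested tool and hands the result back to the model. The tool registry, stubbed results, and parameter map are illustrative; a real system would parse the model's JSON output and validate it against the schemas shown above.

import java.util.*;
import java.util.function.Function;

public class ToolDispatchSketch {
    // Registry of available tools; a real system would also keep the JSON Schemas shown above
    static final Map<String, Function<Map<String, Object>, String>> TOOLS = new HashMap<>();
    static {
        // Stubbed implementations; real tools would call the weather API, run the math, etc.
        TOOLS.put("get_weather", params -> "{\"temperature\":72,\"condition\":\"sunny\",\"humidity\":45}");
        TOOLS.put("calculator", params -> String.valueOf(
                ((Number) params.get("a")).doubleValue() + ((Number) params.get("b")).doubleValue()));
    }

    // One turn of the agent loop: if the model asked for a tool, run it and return the result
    // so it can be appended to the conversation before generation continues
    static String handleToolCall(String toolName, Map<String, Object> parameters) {
        Function<Map<String, Object>, String> tool = TOOLS.get(toolName);
        if (tool == null) {
            return "Unknown tool: " + toolName;   // surfaced back to the model as an error message
        }
        return tool.apply(parameters);
    }

    public static void main(String[] args) {
        // Pretend the model emitted: {"tool": "get_weather", "parameters": {"location": "New York City"}}
        String result = handleToolCall("get_weather", Map.of("location", "New York City"));
        System.out.println("Result injected back into the context: " + result);
    }
}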
1. Short-term (Context Window)
Current conversation
Limited by model (4k - 200k tokens)
2. Long-term (External Storage)
Stored in database/vector store
Unlimited capacity
Retrieved when relevant
┌────────────────────────────────────┐
│ Conversation Buffer │
│ (Last N messages) │
└────────────────────────────────────┘
↓
┌────────────────────────────────────┐
│ Memory Extraction │
│ - Key facts │
│ - User preferences │
│ - Important context │
└────────────────────────────────────┘
↓
┌────────────────────────────────────┐
│ Vector Store │
│ (Semantic search) │
└────────────────────────────────────┘
↓
┌────────────────────────────────────┐
│ Retrieval on New Query │
│ (Relevant past context) │
└────────────────────────────────────┘
1. Summary Memory
# Summarize after every N turns
if turn_count % 10 == 0:
summary = llm.summarize(conversation_history)
save_to_memory(summary)
2. Entity Memory
# Extract and track entities
entities = {
"user_name": "Alex",
"preferences": ["Python", "system design"],
"projects": ["RAG chatbot"]
}
3. Semantic Memory
# Store embeddings of important messages
for message in conversation:
if is_important(message):
embedding = embed(message)
vector_db.insert(embedding, message)
1. Sequential Chain
Agent 1 (Researcher)
↓
Agent 2 (Analyzer)
↓
Agent 3 (Writer)
2. Hierarchical
Supervisor Agent
/ | \
Researcher Coder Reviewer
3. Collaborative
All agents in shared workspace
Self-organize around tasks
LangChain Agents
from langchain.agents import create_react_agent
agent = create_react_agent(
llm=llm,
tools=[search_tool, calculator_tool],
prompt=prompt_template
)
AutoGPT Pattern
1. Receive goal
2. Break into sub-tasks
3. Execute tasks with tools
4. Self-critique and iterate
5. Return final result
CrewAI
from crewai import Agent, Task, Crew
researcher = Agent(
role="Researcher",
goal="Find relevant information",
tools=[search_tool]
)
writer = Agent(
role="Writer",
goal="Write comprehensive article",
tools=[write_tool]
)
crew = Crew(agents=[researcher, writer])
Ollama = Docker for LLMs
Run models locally:
- No API costs
- Full privacy
- Offline capability
- Customization
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Download a model
ollama pull llama3.2
# Run interactively
ollama run llama3.2
# API server (runs on localhost:11434)
ollama serve
import requests
response = requests.post(
'http://localhost:11434/api/generate',
json={
'model': 'llama3.2',
'prompt': 'Explain RAG',
'stream': False
}
)
print(response.json()['response'])
Why quantize?
- Original: 16-bit floats (70B model = 140GB)
- Quantized Q4: 4-bit (70B model = 35GB)
Common formats:
- Q2: Fastest, lowest quality
- Q4: Good balance (most common)
- Q5: Better quality
- Q8: Near-original quality
- F16: Original precision
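The sizes above follow directly from parameters × bits per weight. A quick sanity check (ignoring metadata and the layers some formats keep at higher precision):

public class QuantSize {
    // Approximate on-disk size: parameters × bits per weight / 8, expressed in gigabytes
    static double sizeGb(double paramsBillions, double bitsPerWeight) {
        return paramsBillions * 1e9 * bitsPerWeight / 8 / 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("70B @ F16: ~%.0f GB%n", sizeGb(70, 16));  // ~140 GB
        System.out.printf("70B @ Q4 : ~%.0f GB%n", sizeGb(70, 4));   // ~35 GB
        System.out.printf(" 7B @ Q4 : ~%.1f GB%n", sizeGb(7, 4));    // ~3.5 GB
    }
}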
spring-rag-app/
├── src/main/java/com/example/rag/
│ ├── RagApplication.java
│ ├── config/
│ │ ├── OpenAIConfig.java
│ │ └── VectorStoreConfig.java
│ ├── controller/
│ │ └── ChatController.java
│ ├── service/
│ │ ├── EmbeddingService.java
│ │ ├── VectorStoreService.java
│ │ ├── DocumentIngestionService.java
│ │ └── RagService.java
│ ├── model/
│ │ ├── ChatRequest.java
│ │ ├── ChatResponse.java
│ │ └── Document.java
│ └── repository/
│ └── DocumentRepository.java
├── pom.xml
└── application.yml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.2.0</version>
</parent>
<groupId>com.example</groupId>
<artifactId>spring-rag-app</artifactId>
<version>1.0.0</version>
<dependencies>
<!-- Spring Boot Web -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Spring Boot Data JPA -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<!-- PostgreSQL with pgvector -->
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
</dependency>
<!-- Pgvector for vector operations -->
<dependency>
<groupId>com.pgvector</groupId>
<artifactId>pgvector</artifactId>
<version>0.1.2</version>
</dependency>
<!-- OpenAI Java Client -->
<dependency>
<groupId>com.theokanning.openai-gpt3-java</groupId>
<artifactId>service</artifactId>
<version>0.18.2</version>
</dependency>
<!-- Apache PDFBox for PDF parsing -->
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>3.0.0</version>
</dependency>
<!-- Lombok -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
</dependencies>
</project>
spring:
application:
name: spring-rag-app
datasource:
url: jdbc:postgresql://localhost:5432/ragdb
username: postgres
password: postgres
driver-class-name: org.postgresql.Driver
jpa:
hibernate:
ddl-auto: update
show-sql: true
properties:
hibernate:
dialect: org.hibernate.dialect.PostgreSQLDialect
openai:
api-key: ${OPENAI_API_KEY}
model: gpt-4-turbo-preview
embedding-model: text-embedding-3-small
max-tokens: 2000
temperature: 0.7
rag:
chunk-size: 500
chunk-overlap: 50
top-k-results: 5
// Document.java
package com.example.rag.model;
import jakarta.persistence.*;
import lombok.Data;
import org.hibernate.annotations.Type;
@Data
@Entity
@Table(name = "documents")
public class Document {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@Column(columnDefinition = "TEXT")
private String content;
@Column(name = "embedding", columnDefinition = "vector(1536)")
private float[] embedding;
private String source;
@Column(name = "chunk_index")
private Integer chunkIndex;
private String metadata;
}
// ChatRequest.java
package com.example.rag.model;
import lombok.Data;
@Data
public class ChatRequest {
private String query;
private boolean useRag = true;
}
// ChatResponse.java
package com.example.rag.model;
import lombok.Data;
import java.util.List;
@Data
public class ChatResponse {
private String response;
private List<String> sources;
private boolean usedRag;
}
// DocumentRepository.java
package com.example.rag.repository;
import com.example.rag.model.Document;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import java.util.List;
public interface DocumentRepository extends JpaRepository<Document, Long> {
@Query(value = """
SELECT * FROM documents
-- <=> is pgvector's cosine-distance operator; it matches the ivfflat vector_cosine_ops index in the setup SQL
ORDER BY embedding <=> CAST(:queryEmbedding AS vector)
LIMIT :limit
""", nativeQuery = true)
List<Document> findSimilarDocuments(
@Param("queryEmbedding") String queryEmbedding,
@Param("limit") int limit
);
}
// EmbeddingService.java
package com.example.rag.service;
import com.theokanning.openai.service.OpenAiService;
import com.theokanning.openai.embedding.EmbeddingRequest;
import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.util.List;
@Service
@RequiredArgsConstructor
public class EmbeddingService {
private final OpenAiService openAiService;
@Value("${openai.embedding-model}")
private String embeddingModel;
public float[] generateEmbedding(String text) {
var request = EmbeddingRequest.builder()
.model(embeddingModel)
.input(List.of(text))
.build();
var response = openAiService.createEmbeddings(request);
// Convert Double[] to float[]
List<Double> embedding = response.getData().get(0).getEmbedding();
float[] result = new float[embedding.size()];
for (int i = 0; i < embedding.size(); i++) {
result[i] = embedding.get(i).floatValue();
}
return result;
}
}
// VectorStoreService.java
package com.example.rag.service;
import com.example.rag.model.Document;
import com.example.rag.repository.DocumentRepository;
import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.util.Arrays;
import java.util.List;
@Service
@RequiredArgsConstructor
public class VectorStoreService {
private final DocumentRepository documentRepository;
private final EmbeddingService embeddingService;
@Value("${rag.top-k-results}")
private int topK;
public void storeDocument(String content, String source, int chunkIndex) {
Document doc = new Document();
doc.setContent(content);
doc.setSource(source);
doc.setChunkIndex(chunkIndex);
doc.setEmbedding(embeddingService.generateEmbedding(content));
documentRepository.save(doc);
}
public List<Document> searchSimilar(String query) {
float[] queryEmbedding = embeddingService.generateEmbedding(query);
String embeddingStr = Arrays.toString(queryEmbedding);
return documentRepository.findSimilarDocuments(embeddingStr, topK);
}
}
// DocumentIngestionService.java
package com.example.rag.service;
import lombok.RequiredArgsConstructor;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
@Service
@RequiredArgsConstructor
public class DocumentIngestionService {
private final VectorStoreService vectorStoreService;
@Value("${rag.chunk-size}")
private int chunkSize;
@Value("${rag.chunk-overlap}")
private int chunkOverlap;
public void ingestPdf(MultipartFile file) throws IOException {
String text = extractTextFromPdf(file);
List<String> chunks = chunkText(text);
for (int i = 0; i < chunks.size(); i++) {
vectorStoreService.storeDocument(
chunks.get(i),
file.getOriginalFilename(),
i
);
}
}
private String extractTextFromPdf(MultipartFile file) throws IOException {
try (PDDocument document = Loader.loadPDF(file.getBytes())) { // PDFBox 3.x: PDDocument.load() was removed in favor of Loader
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(document);
}
}
private List<String> chunkText(String text) {
List<String> chunks = new ArrayList<>();
int start = 0;
while (start < text.length()) {
int end = Math.min(start + chunkSize, text.length());
chunks.add(text.substring(start, end));
start += chunkSize - chunkOverlap;
}
return chunks;
}
}
// RagService.java
package com.example.rag.service;
import com.example.rag.model.ChatResponse;
import com.example.rag.model.Document;
import com.theokanning.openai.completion.chat.ChatCompletionRequest;
import com.theokanning.openai.completion.chat.ChatMessage;
import com.theokanning.openai.completion.chat.ChatMessageRole;
import com.theokanning.openai.service.OpenAiService;
import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
@Service
@RequiredArgsConstructor
public class RagService {
private final OpenAiService openAiService;
private final VectorStoreService vectorStoreService;
@Value("${openai.model}")
private String model;
@Value("${openai.max-tokens}")
private int maxTokens;
@Value("${openai.temperature}")
private double temperature;
public ChatResponse chat(String query, boolean useRag) {
List<ChatMessage> messages = new ArrayList<>();
List<String> sources = new ArrayList<>();
if (useRag) {
// Retrieve relevant documents
List<Document> relevantDocs = vectorStoreService.searchSimilar(query);
if (!relevantDocs.isEmpty()) {
// Build context from retrieved documents
String context = relevantDocs.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n\n"));
sources = relevantDocs.stream()
.map(doc -> doc.getSource() + " (chunk " + doc.getChunkIndex() + ")")
.distinct()
.collect(Collectors.toList());
// System message with RAG context
messages.add(new ChatMessage(
ChatMessageRole.SYSTEM.value(),
"You are a helpful assistant. Answer the user's question based ONLY on the following context. " +
"If the answer cannot be found in the context, say so.\n\nContext:\n" + context
));
}
}
// User message
messages.add(new ChatMessage(ChatMessageRole.USER.value(), query));
// Create completion request
ChatCompletionRequest request = ChatCompletionRequest.builder()
.model(model)
.messages(messages)
.maxTokens(maxTokens)
.temperature(temperature)
.build();
String response = openAiService.createChatCompletion(request)
.getChoices()
.get(0)
.getMessage()
.getContent();
ChatResponse chatResponse = new ChatResponse();
chatResponse.setResponse(response);
chatResponse.setSources(sources);
chatResponse.setUsedRag(useRag && !sources.isEmpty());
return chatResponse;
}
}
// ChatController.java
package com.example.rag.controller;
import com.example.rag.model.ChatRequest;
import com.example.rag.model.ChatResponse;
import com.example.rag.service.DocumentIngestionService;
import com.example.rag.service.RagService;
import lombok.RequiredArgsConstructor;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
@RestController
@RequestMapping("/api")
@RequiredArgsConstructor
public class ChatController {
private final RagService ragService;
private final DocumentIngestionService documentIngestionService;
@PostMapping("/chat")
public ResponseEntity<ChatResponse> chat(@RequestBody ChatRequest request) {
ChatResponse response = ragService.chat(
request.getQuery(),
request.isUseRag()
);
return ResponseEntity.ok(response);
}
@PostMapping("/ingest")
public ResponseEntity<String> ingestDocument(
@RequestParam("file") MultipartFile file
) {
try {
documentIngestionService.ingestPdf(file);
return ResponseEntity.ok("Document ingested successfully");
} catch (Exception e) {
return ResponseEntity.badRequest()
.body("Error ingesting document: " + e.getMessage());
}
}
@GetMapping("/health")
public ResponseEntity<String> health() {
return ResponseEntity.ok("RAG service is running");
}
}
// OpenAIConfig.java
package com.example.rag.config;
import com.theokanning.openai.service.OpenAiService;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.time.Duration;
@Configuration
public class OpenAIConfig {
@Value("${openai.api-key}")
private String apiKey;
@Bean
public OpenAiService openAiService() {
return new OpenAiService(apiKey, Duration.ofSeconds(60));
}
}
-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create documents table
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536),
source VARCHAR(255),
chunk_index INTEGER,
metadata TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create index for vector similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
# docker-compose.yml
version: '3.8'
services:
postgres:
image: ankane/pgvector:latest
environment:
POSTGRES_DB: ragdb
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
volumes:
postgres_data:
# Start PostgreSQL
docker-compose up -d
# Run the application
./mvnw spring-boot:run
# Test document ingestion
curl -X POST http://localhost:8080/api/ingest \
-F "file=@/path/to/document.pdf"
# Test chat without RAG
curl -X POST http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-d '{"query": "What is machine learning?", "useRag": false}'
# Test chat with RAG
curl -X POST http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-d '{"query": "What does the document say about embeddings?", "useRag": true}'Mathematics (Lightweight)
- Linear algebra basics: vectors, dot product, matrix multiplication
- Probability: distributions, conditional probability
- Cosine similarity and distance metrics
Machine Learning Basics
- Supervised vs unsupervised learning
- Training vs inference
- Overfitting and regularization
- Loss functions and optimization
Resources:
- 3Blue1Brown: Neural Networks series
- StatQuest: Machine Learning basics
- Fast.ai: Practical Deep Learning
Core Concepts
- Tokenization (BPE, WordPiece, SentencePiece)
- Word embeddings (Word2Vec, GloVe)
- Contextual embeddings (BERT, GPT)
- Attention mechanism
- Transformer architecture
Hands-on
- Implement tokenization from scratch
- Use HuggingFace Transformers library
- Fine-tune a small model
- Understand encoder vs decoder architectures
Resources:
- "Attention is All You Need" paper
- Jay Alammar's Illustrated Transformer
- HuggingFace Course
- Andrej Karpathy's Neural Networks: Zero to Hero
Techniques
- Zero-shot prompting
- Few-shot prompting
- Chain-of-thought (CoT)
- ReAct prompting
- System vs user vs assistant roles
Security
- Prompt injection attacks
- Jailbreaking
- Defense strategies
Resources:
- OpenAI Prompt Engineering Guide
- Anthropic's Prompt Engineering docs
- PromptingGuide.ai
Core Components
- Document chunking strategies
- Embedding models
- Vector databases
- Retrieval algorithms
- Reranking techniques
Advanced Topics
- Hybrid search (dense + sparse)
- Query transformation
- Context compression
- Hallucination detection
- Evaluation metrics
Hands-on Projects
- Build a document Q&A system
- Create a code documentation assistant
- Implement customer support bot
Resources:
- LangChain documentation
- LlamaIndex guides
- Pinecone learning center
Concepts
- Function calling
- Tool use
- Agent architectures (ReAct, Plan & Execute)
- Multi-agent systems
- Memory management
Frameworks
- LangChain Agents
- AutoGPT
- BabyAGI
- CrewAI
Projects
- Build a research assistant
- Create a coding agent
- Implement task automation system
Tools
- Ollama
- LM Studio
- llama.cpp
- vLLM (for serving)
Concepts
- Model quantization
- GGUF format
- GPU vs CPU inference
- Model fine-tuning (LoRA, QLoRA)
Projects
- Run LLaMA locally
- Build offline RAG system
- Fine-tune model on custom data
Engineering
- API design for LLM applications
- Streaming responses
- Caching strategies
- Rate limiting
- Cost optimization
Monitoring
- Latency tracking
- Token usage
- Error rates
- User feedback loops
Evaluation
- Unit tests for prompts
- Regression testing
- A/B testing
- Human evaluation
MLOps
- Model versioning
- Prompt versioning
- Feature flags
- Gradual rollouts
1. Caching
@Service
public class CachingService {
private final Cache<String, float[]> embeddingCache;
private final EmbeddingService embeddingService;
public float[] getEmbedding(String text) {
return embeddingCache.get(text,
key -> embeddingService.generateEmbedding(key)
);
}
}
2. Batching
public List<float[]> generateEmbeddingsBatch(List<String> texts) {
// Process multiple texts in one API call
return openAiService.createEmbeddings(
EmbeddingRequest.builder()
.model(embeddingModel)
.input(texts)
.build()
).getData().stream()
.map(this::convertToFloatArray)
.collect(Collectors.toList());
}
3. Streaming Responses
@GetMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> chatStream(@RequestParam String query) {
return Flux.create(sink -> {
openAiService.streamChatCompletion(request)
.doOnNext(chunk -> sink.next(chunk.getChoices().get(0).getMessage().getContent()))
.doOnComplete(sink::complete)
.subscribe();
});
}
Token Management
public class TokenCounter {
// Estimate tokens (rough approximation)
public int estimateTokens(String text) {
return text.split("\\s+").length * 4 / 3;
}
public String truncateToTokenLimit(String text, int maxTokens) {
int estimatedTokens = estimateTokens(text);
if (estimatedTokens <= maxTokens) {
return text;
}
// Use floating-point math: integer division would truncate the ratio to 0
double ratio = (double) maxTokens / estimatedTokens;
int targetLength = (int) (text.length() * ratio);
return text.substring(0, targetLength);
}
}
Smart Context Selection
public List<Document> selectBestContext(
List<Document> candidates,
int maxTokens
) {
List<Document> selected = new ArrayList<>();
int totalTokens = 0;
for (Document doc : candidates) {
int docTokens = estimateTokens(doc.getContent());
if (totalTokens + docTokens <= maxTokens) {
selected.add(doc);
totalTokens += docTokens;
} else {
break;
}
}
return selected;
}
1. Input Validation
public void validateInput(String query) {
if (query.length() > 10000) {
throw new IllegalArgumentException("Query too long");
}
if (containsSuspiciousPatterns(query)) {
throw new SecurityException("Potential prompt injection detected");
}
}
2. Output Filtering
public String filterOutput(String response) {
// Remove PII
response = removePII(response);
// Check for harmful content
if (containsHarmfulContent(response)) {
return "I cannot provide that information.";
}
return response;
}
3. Rate Limiting
@Component
public class RateLimitInterceptor implements HandlerInterceptor {
private final RateLimiter rateLimiter;
@Override
public boolean preHandle(HttpServletRequest request,
HttpServletResponse response,
Object handler) {
String userId = getUserId(request);
if (!rateLimiter.tryAcquire(userId)) {
response.setStatus(429);
return false;
}
return true;
}
}
Metrics to Track
@Service
public class MetricsService {
private final MeterRegistry registry;
public void recordLatency(String operation, long milliseconds) {
registry.timer("llm.latency", "operation", operation)
.record(Duration.ofMillis(milliseconds));
}
public void recordTokenUsage(int promptTokens, int completionTokens) {
registry.counter("llm.tokens.prompt").increment(promptTokens);
registry.counter("llm.tokens.completion").increment(completionTokens);
}
public void recordCost(double cost) {
registry.counter("llm.cost").increment(cost);
}
}
Automated Testing
@Test
public void testRagAccuracy() {
// Test dataset
List<QAPair> testCases = loadTestCases();
for (QAPair qa : testCases) {
ChatResponse response = ragService.chat(qa.getQuestion(), true);
// Evaluate
double similarity = calculateSimilarity(
response.getResponse(),
qa.getExpectedAnswer()
);
assertTrue(similarity > 0.8, "Response not similar enough");
}
}
A/B Testing
@Service
public class ExperimentService {
public ChatResponse chat(String query, String userId) {
boolean useNewAlgorithm = isInExperimentGroup(userId);
if (useNewAlgorithm) {
return ragService.chatV2(query);
} else {
return ragService.chatV1(query);
}
}
}
Vision + Language
- Image understanding (GPT-4V, Claude 3)
- Image generation (DALL-E, Stable Diffusion)
- OCR and document understanding
Audio
- Speech-to-text (Whisper)
- Text-to-speech (ElevenLabs, Bark)
- Audio embeddings
When to Fine-tune
- Domain-specific terminology
- Consistent output format
- Specialized tasks
Methods
- Full fine-tuning (expensive)
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
- Prompt tuning
- Prefix tuning
Techniques
- RLHF (Reinforcement Learning from Human Feedback)
- Constitutional AI (Anthropic)
- Red teaming
- Adversarial testing
Query Routing
Simple question → Direct LLM
Complex question → RAG pipeline
Calculation → Tool use
Hypothetical Document Embeddings (HyDE)
User query → LLM generates hypothetical answer
→ Embed hypothetical answer → Search for similar docs
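A hedged sketch of HyDE wired into the services from the Spring Boot example earlier: it reuses searchSimilar, which embeds whatever text it is given, and the prompt wording, token limit, and model name are assumptions.

// HydeRetrievalService.java (illustrative sketch, not part of the application above)
package com.example.rag.service;
import com.example.rag.model.Document;
import com.theokanning.openai.completion.chat.ChatCompletionRequest;
import com.theokanning.openai.completion.chat.ChatMessage;
import com.theokanning.openai.completion.chat.ChatMessageRole;
import com.theokanning.openai.service.OpenAiService;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;
import java.util.List;

@Service
@RequiredArgsConstructor
public class HydeRetrievalService {
    private final OpenAiService openAiService;
    private final VectorStoreService vectorStoreService;

    public List<Document> retrieve(String query) {
        // 1. Ask the model for a short hypothetical answer (it may be wrong, which is fine;
        //    it only needs to "sound like" the documents we want to find)
        ChatCompletionRequest request = ChatCompletionRequest.builder()
                .model("gpt-4-turbo-preview")
                .messages(List.of(new ChatMessage(ChatMessageRole.USER.value(),
                        "Write a short passage that would answer this question: " + query)))
                .maxTokens(200)
                .build();
        String hypothetical = openAiService.createChatCompletion(request)
                .getChoices().get(0).getMessage().getContent();

        // 2. Embed and search with the hypothetical answer instead of the raw query;
        //    searchSimilar() already performs the embedding step
        return vectorStoreService.searchSimilar(hypothetical);
    }
}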
Self-RAG
LLM generates → Self-critique → Retrieval if needed
→ Regenerate with context
JSON Mode
ChatCompletionRequest request = ChatCompletionRequest.builder()
.model("gpt-4-turbo-preview")
.messages(messages)
.responseFormat(ResponseFormat.builder()
.type("json_object")
.build())
.build();
Function Calling for Structured Data
FunctionDefinition extractSchema = FunctionDefinition.builder()
.name("extract_entities")
.description("Extract entities from text")
.parameters(/* JSON Schema */)
.build();
Bias & Fairness
- Training data bias
- Output fairness testing
- Demographic parity
Privacy
- Data retention policies
- PII removal
- Differential privacy
Environmental Impact
- Carbon footprint of training
- Inference efficiency
- Model compression
| Model | Context Window | Max Output |
|---|---|---|
| GPT-4 Turbo | 128K | 4K |
| GPT-3.5 Turbo | 16K | 4K |
| Claude 3.5 Sonnet | 200K | 8K |
| Gemini 1.5 Pro | 2M | 8K |
| LLaMA 3.1 70B | 128K | - |
| Model | Dimensions | Cost |
|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens |
| text-embedding-3-large | 3072 | $0.13/1M tokens |
| BGE-large | 1024 | Free (open source) |
❌ Don't:
- Store raw API keys in code
- Skip input validation
- Ignore token limits
- Forget to handle rate limits
- Cache indefinitely
✅ Do:
- Use environment variables
- Validate and sanitize inputs
- Monitor token usage
- Implement exponential backoff (see the sketch below)
- Set cache expiration
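A minimal retry-with-exponential-backoff helper to go with the last two points (plain Java, no libraries; the attempt count, base delay, and jitter range are arbitrary choices):

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithBackoff {
    // Retry a call up to maxAttempts times, doubling the wait (plus jitter) after each failure
    static <T> T call(Callable<T> action, int maxAttempts, long baseDelayMs) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;                  // out of retries
                long delay = baseDelayMs * (1L << (attempt - 1))      // 1x, 2x, 4x, ...
                        + ThreadLocalRandom.current().nextLong(100);  // jitter to avoid thundering herds
                Thread.sleep(delay);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Example: wrap a flaky call (simulated here) that might hit a rate limit
        String result = call(() -> {
            if (Math.random() < 0.5) throw new RuntimeException("429 Too Many Requests");
            return "ok";
        }, 5, 500);
        System.out.println(result);
    }
}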
For Immediate Practice:
- Build the Spring Boot RAG app above
- Ingest your own documents
- Experiment with different chunking strategies
- Try different embedding models
- Add tool use (web search, calculator)
For Deep Learning:
- Implement a mini-transformer from scratch
- Fine-tune a small model (LLaMA 7B)
- Build a multi-agent system
- Create an evaluation framework
- Deploy to production
Community Resources:
- HuggingFace Discord
- LangChain GitHub discussions
- r/LocalLLaMA subreddit
- AI Tinkerers meetups
- Papers with Code
Stay Updated:
- Anthropic's research blog
- OpenAI's blog
- Google AI blog
- arXiv cs.CL and cs.AI
- Twitter/X: Follow AI researchers
You now understand:
✅ How LLMs work internally (transformers, attention, embeddings)
✅ How agents use tools and memory
✅ How RAG systems retrieve and inject context
✅ How to build production-ready AI apps
✅ Complete Spring Boot implementation
✅ Best practices for cost, performance, and safety
The key insight: Modern AI applications are 70% engineering (RAG, tools, orchestration) and 30% ML. Focus on building robust systems that leverage LLMs effectively rather than training models from scratch.