Complete Guide: How AI Agents & GenAI Work

Table of Contents

  1. How AI Agents Work
  2. Core Architecture
  3. Tokenization & Embeddings
  4. Transformer Architecture
  5. RAG System
  6. Vector Databases
  7. Tool Use & Function Calling
  8. Memory Systems
  9. Multi-Agent Systems
  10. Running Models Locally
  11. Spring Boot RAG Application
  12. Learning Roadmap
  13. Production Considerations
  14. Additional Topics
  15. Quick Reference

How AI Agents Work {#how-ai-agents-work}

Mental Model

User Input
    ↓
Context Assembly (System Prompt + History + Tools)
    ↓
Tokenization → Embeddings
    ↓
Transformer Neural Network
    ↓
Next Token Prediction (Probability Distribution)
    ↓
Tool Decision? (Function Calling)
    ↓  Yes
Execute Tool → Inject Results → Continue Generation
    ↓  No
Token-by-Token Generation
    ↓
Final Response
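
A toy sketch of this loop in Python. The "LLM" below is a hypothetical stand-in that requests one calculator call and then answers, just to make the tool branch concrete:

# Fake "LLM": requests a tool on the first pass, then answers.
def fake_llm(context):
    if "TOOL_RESULT" not in context:
        return {"tool": "calculator", "args": {"expr": "2+2"}}
    return {"text": "The answer is " + context.split("TOOL_RESULT:")[-1].strip()}

TOOLS = {"calculator": lambda expr: str(eval(expr))}  # toy tool registry

def run_agent(user_input):
    context = "SYSTEM: be helpful\nUSER: " + user_input + "\n"
    while True:
        out = fake_llm(context)
        if "tool" in out:                                # tool decision: yes
            result = TOOLS[out["tool"]](**out["args"])   # execute tool
            context += "TOOL_RESULT: " + result + "\n"   # inject results
        else:
            return out["text"]                           # final response

print(run_agent("What is 2+2?"))  # -> The answer is 4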

Key Insight

AI agents don't "think" or "understand" - they predict the next most probable token based on:

  • Training data patterns
  • Current context window
  • System instructions
  • Available tools

What Makes Them "Agentic"

  1. Tool Use: Can call external functions (search, calculator, API)
  2. Planning: Can break down tasks into steps
  3. Memory: Can reference conversation history
  4. Reflection: Can critique and improve their output
  5. Multi-turn reasoning: Can engage in back-and-forth problem solving

Core Architecture {#core-architecture}

The Complete Flow

"Explain binary search in Python"

Step 1: TOKENIZATION
→ ["Explain", "Ġbinary", "Ġsearch", "Ġin", "ĠPython"]
→ [1234, 5678, 9101, 1121, 3141]

Step 2: EMBEDDINGS
Each token ID → High-dimensional vector
1234 → [0.12, -0.88, 1.04, ..., 0.45]  // 768-4096 dimensions

Step 3: TRANSFORMER LAYERS
- Self-Attention (which tokens relate to each other?)
- Feed Forward Networks
- Layer Normalization
- Residual Connections
Repeated 12-96 times depending on model size

Step 4: OUTPUT HEAD
→ Logits for all possible next tokens (~50k vocabulary)
→ Softmax → Probability distribution
→ {"Sure": 0.15, "Binary": 0.12, "Here": 0.08, ...}

Step 5: SAMPLING
Pick next token using:
- Temperature (randomness)
- Top-k (consider only top k tokens)
- Top-p (nucleus sampling)

Step 6: LOOP
Append chosen token, repeat until:
- <|endoftext|> token
- Max length reached
- Stop sequence detected
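
A minimal NumPy sketch of Step 5, combining temperature, top-k, and top-p in one sampling function (toy logits here; a real vocabulary has ~50k entries):

import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax

    top_idx = np.argsort(probs)[::-1][:top_k]       # top-k filter
    top_probs = probs[top_idx]

    cum = np.cumsum(top_probs / top_probs.sum())    # top-p (nucleus) filter
    keep = top_idx[:np.searchsorted(cum, top_p) + 1]

    p = probs[keep] / probs[keep].sum()
    return np.random.choice(keep, p=p)              # sampled token id

next_id = sample_next_token([2.0, 1.5, 0.3, -1.0])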

Tokenization & Embeddings {#tokenization-embeddings}

Tokenization

Why not just split by spaces?

Problem:
"running" ≠ "run" (different words in vocab)

Solution: Subword tokenization
"running" → ["run", "ning"]
"unhappiness" → ["un", "happiness"]

Common algorithms:

  • BPE (Byte Pair Encoding) - GPT models
  • WordPiece - BERT
  • SentencePiece - T5, LLaMA

Embeddings: The Magic

Converting tokens to meaning:

Token: "king"    → [0.2, 0.8, -0.3, 0.6, ...]
Token: "queen"   → [0.3, 0.7, -0.2, 0.5, ...]
Token: "man"     → [0.1, 0.2, -0.1, 0.9, ...]
Token: "woman"   → [0.2, 0.1,  0.0, 0.8, ...]

Famous example:
king - man + woman ≈ queen
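
With the toy 4-dimensional vectors above (contrived so the arithmetic works out exactly), this is easy to verify:

import numpy as np

king  = np.array([0.2, 0.8, -0.3, 0.6])
queen = np.array([0.3, 0.7, -0.2, 0.5])
man   = np.array([0.1, 0.2, -0.1, 0.9])
woman = np.array([0.2, 0.1,  0.0, 0.8])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(king - man + woman, queen))  # 1.0: the analogy holds exactly here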

Why embeddings matter:

  • Enable semantic search
  • Power RAG systems
  • Allow similarity comparisons
  • Compress meaning into math

Popular embedding models:

  • OpenAI: text-embedding-3-small, text-embedding-3-large
  • Open source: BAAI/bge-large, sentence-transformers/all-MiniLM-L6-v2
  • Cohere: embed-english-v3.0

Transformer Architecture {#transformer-architecture}

The Revolution

Before transformers (RNNs/LSTMs):

  • Sequential processing (slow)
  • Vanishing gradients
  • Limited context

After transformers:

  • Parallel processing
  • Attention mechanism
  • Much longer context windows (though attention cost grows quadratically with sequence length)

Self-Attention Explained

Question: "The animal didn't cross the street because it was too tired"

What does "it" refer to?

Attention weights:
"it" → "animal": 0.87 (high)
"it" → "street": 0.05 (low)
"it" → "tired": 0.42 (medium)

The model learns: "it" refers to "animal"
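
A minimal single-head self-attention in NumPy (random toy weights; real models learn Wq, Wk, Wv during training):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # scaled dot-product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                              # mix values by attention

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # 5 tokens, 8-d embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # shape (5, 8)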

Architecture Components

Input Embeddings
    ↓
+ Positional Encodings (order matters)
    ↓
┌─────────────────────────┐
│  Transformer Block x N  │
│                         │
│  1. Multi-Head          │
│     Self-Attention      │
│                         │
│  2. Layer Norm          │
│                         │
│  3. Feed Forward        │
│                         │
│  4. Layer Norm          │
│                         │
│  5. Residual Connection │
└─────────────────────────┘
    ↓
Output Layer (Logits)
    ↓
Softmax (Probabilities)

Model Sizes

Model               Parameters        Layers   Hidden Size
GPT-2 Small         117M              12       768
GPT-3               175B              96       12,288
GPT-4               ~1.7T (rumored)   ?        ?
LLaMA 2 7B          7B                32       4,096
Claude Sonnet 4.5   Unknown           ?        ?

RAG System {#rag}

The Problem RAG Solves

Pure LLM limitations:

  • Knowledge cutoff (no recent data)
  • Hallucinations (makes up facts)
  • No private/proprietary data
  • Limited context window

RAG Architecture

┌─────────────────────────────────────────────┐
│ 1. INDEXING (One-time)                      │
├─────────────────────────────────────────────┤
│   Documents                                  │
│      ↓                                      │
│   Chunk into pieces                         │
│      ↓                                      │
│   Generate embeddings                       │
│      ↓                                      │
│   Store in Vector DB                        │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 2. RETRIEVAL (Per query)                    │
├─────────────────────────────────────────────┤
│   User Query                                 │
│      ↓                                      │
│   Convert to embedding                      │
│      ↓                                      │
│   Similarity search in Vector DB            │
│      ↓                                      │
│   Retrieve top-k relevant chunks            │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 3. GENERATION (Augmented)                   │
├─────────────────────────────────────────────┤
│   System: "Answer ONLY from context"        │
│   Context: [Retrieved chunks]               │
│   User: [Original query]                    │
│      ↓                                      │
│   LLM generates grounded answer             │
└─────────────────────────────────────────────┘
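
The three boxes, end to end, as a toy in-memory pipeline. Here `embed` is a random stand-in for a real embedding model, so retrieval only demonstrates the mechanics, not actual semantic matching:

import numpy as np

def embed(text):   # hypothetical stand-in for an embedding model
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# 1. INDEXING (one-time): chunk -> embed -> store
docs = ["RAG retrieves context before generating.",
        "pgvector adds vector search to Postgres."]
index = [(embed(d), d) for d in docs]

# 2. RETRIEVAL (per query): embed query, rank by similarity
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda pair: -(q @ pair[0]))
    return [text for _, text in ranked[:k]]

# 3. GENERATION (augmented): ground the LLM in retrieved chunks
query = "What is RAG?"
prompt = ("Answer ONLY from the context.\n"
          f"Context: {retrieve(query)}\n"
          f"User: {query}")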

Chunking Strategies

Fixed-size chunking:

Split every 512 tokens
Overlap: 50 tokens

Semantic chunking:

Split on paragraphs
Or section headers
Or sentence boundaries

Recursive chunking:

Try to split on:
1. "\n\n" (paragraphs)
2. "\n" (lines)
3. ". " (sentences)
4. " " (words)
5. "" (characters)

Advanced RAG Techniques

1. Hybrid Search

Combine:
- Dense vectors (semantic)
- Sparse vectors (keyword/BM25)
- Rerank results

2. Query Transformation

Original: "What's the capital?"
Expanded: "What is the capital city of France?"

3. Self-Query

LLM extracts:
- Query text
- Metadata filters

4. Parent-Child Chunking

Index: Small chunks (better retrieval)
Return: Larger parent chunks (better context)

Vector Databases {#vector-databases}

Why Traditional DBs Don't Work

SQL/NoSQL:

SELECT * FROM docs WHERE content = 'exact match'

❌ Only exact matches

Vector DB:

query_embedding = [0.1, 0.8, -0.3, ...]
results = db.search(query_embedding, top_k=5)

✅ Semantic similarity

Similarity Metrics

1. Cosine Similarity (most common)

similarity = (A · B) / (||A|| × ||B||)
Range: [-1, 1]

2. Euclidean Distance

distance = sqrt(Σ(ai - bi)²)
Range: [0, ∞]

3. Dot Product

score = Σ(ai × bi)
Range: [-∞, ∞]
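
All three metrics in a few lines of NumPy:

import numpy as np

a = np.array([0.1, 0.8, -0.3])
b = np.array([0.2, 0.7, -0.1])

cosine    = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # range [-1, 1]
euclidean = np.linalg.norm(a - b)                            # range [0, ∞)
dot       = a @ b                                            # range (-∞, ∞)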

Popular Vector Databases

Database   Type                Best For
Pinecone   Managed             Production, scale
Weaviate   Self-hosted         Open source, flexibility
Qdrant     Self-hosted         Performance, Rust
Chroma     Embedded            Prototyping, simple
FAISS      Library             Research, custom
Milvus     Self-hosted         Enterprise
pgvector   Postgres extension  Existing Postgres

Indexing Strategies

HNSW (Hierarchical Navigable Small World)

  • Fast approximate search
  • Good recall
  • Most popular

IVF (Inverted File Index)

  • Clusters vectors
  • Good for large datasets
  • Trade recall for speed
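
As an illustration, a small HNSW index with FAISS (assumes faiss-cpu is installed; data is random):

import faiss                # assumes `pip install faiss-cpu`
import numpy as np

d = 128
xb = np.random.random((10_000, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")        # query vectors

# HNSW: graph-based, no training step; 32 = neighbors per node (M)
index = faiss.IndexHNSWFlat(d, 32)
index.add(xb)
D, I = index.search(xq, 5)   # top-5 distances and ids per query

# IVF clusters first: faiss.IndexIVFFlat(quantizer, d, nlist),
# and needs index.train(xb) before add().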

Tool Use & Function Calling {#tool-use}

How Agents Use Tools

The agent can:

  1. Decide when to call a tool
  2. Choose which tool to call
  3. Generate correct parameters
  4. Process tool results
  5. Continue reasoning

Function Calling Flow

User: "What's the weather in NYC?"

Agent thinks:
1. I need current weather data
2. I have a weather_tool available
3. Generate function call

Output:
{
  "tool": "get_weather",
  "parameters": {
    "location": "New York City",
    "units": "fahrenheit"
  }
}

System executes tool → Returns result

Agent receives:
{
  "temperature": 72,
  "condition": "sunny",
  "humidity": 45
}

Agent continues:
"The weather in NYC is currently 72°F and sunny..."

Tool Definition Format

{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name or coordinates"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}
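
A sketch of the full round trip with the OpenAI Python SDK (v1-style API; the model name, weather result, and trimmed schema below are placeholders):

import json
from openai import OpenAI  # assumes the v1+ OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]}}}]

messages = [{"role": "user", "content": "What's the weather in NYC?"}]
first = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools)  # model name is a placeholder

call = first.choices[0].message.tool_calls[0]   # the model requested a tool
args = json.loads(call.function.arguments)      # e.g. {"location": "New York City"}

result = {"temperature": 72, "condition": "sunny"}  # run your real tool here

messages.append(first.choices[0].message)       # keep the tool request in history
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": json.dumps(result)})
final = client.chat.completions.create(model="gpt-4o",
                                       messages=messages, tools=tools)
print(final.choices[0].message.content)         # "...72°F and sunny..."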

Common Tool Categories

1. Information Retrieval

  • Web search
  • Database query
  • API calls

2. Actions

  • Send email
  • Create calendar event
  • Update database

3. Computation

  • Calculator
  • Code execution
  • Data analysis

4. Memory

  • Save to memory
  • Recall past interactions
  • Update user profile

Memory Systems {#memory-systems}

Types of Memory

1. Short-term (Context Window)

Current conversation
Limited by model (4k - 200k tokens)

2. Long-term (External Storage)

Stored in database/vector store
Unlimited capacity
Retrieved when relevant

Memory Architecture

┌────────────────────────────────────┐
│  Conversation Buffer               │
│  (Last N messages)                 │
└────────────────────────────────────┘
         ↓
┌────────────────────────────────────┐
│  Memory Extraction                 │
│  - Key facts                       │
│  - User preferences                │
│  - Important context               │
└────────────────────────────────────┘
         ↓
┌────────────────────────────────────┐
│  Vector Store                      │
│  (Semantic search)                 │
└────────────────────────────────────┘
         ↓
┌────────────────────────────────────┐
│  Retrieval on New Query            │
│  (Relevant past context)           │
└────────────────────────────────────┘

Implementation Strategies

1. Summary Memory

# Summarize after every N turns
if turn_count % 10 == 0:
    summary = llm.summarize(conversation_history)
    save_to_memory(summary)

2. Entity Memory

# Extract and track entities
entities = {
    "user_name": "Alex",
    "preferences": ["Python", "system design"],
    "projects": ["RAG chatbot"]
}

3. Semantic Memory

# Store embeddings of important messages
for message in conversation:
    if is_important(message):
        embedding = embed(message)
        vector_db.insert(embedding, message)

Multi-Agent Systems {#multi-agent-systems}

Architecture Patterns

1. Sequential Chain

Agent 1 (Researcher)
    ↓
Agent 2 (Analyzer)
    ↓
Agent 3 (Writer)

2. Hierarchical

       Supervisor Agent
       /      |      \
Researcher  Coder  Reviewer

3. Collaborative

All agents in shared workspace
Self-organize around tasks

Agent Frameworks

LangChain Agents

from langchain.agents import create_react_agent

agent = create_react_agent(
    llm=llm,
    tools=[search_tool, calculator_tool],
    prompt=prompt_template
)

AutoGPT Pattern

1. Receive goal
2. Break into sub-tasks
3. Execute tasks with tools
4. Self-critique and iterate
5. Return final result

CrewAI

from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Find relevant information",
    backstory="Expert web researcher",   # recent CrewAI versions require a backstory
    tools=[search_tool]
)

writer = Agent(
    role="Writer",
    goal="Write comprehensive article",
    backstory="Experienced technical writer",
    tools=[write_tool]
)

task = Task(
    description="Research the topic and write an article",
    expected_output="A comprehensive article",   # required by recent versions
    agent=writer
)

crew = Crew(agents=[researcher, writer], tasks=[task])
result = crew.kickoff()

Running Models Locally {#local-models}

What is Ollama?

Ollama = Docker for LLMs

Run models locally:

  • No API costs
  • Full privacy
  • Offline capability
  • Customization

Installation & Usage

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Download a model
ollama pull llama3.2

# Run interactively
ollama run llama3.2

# API server (runs on localhost:11434)
ollama serve

Using Ollama API

import requests

response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.2',
        'prompt': 'Explain RAG',
        'stream': False
    }
)

print(response.json()['response'])

Model Quantization

Why quantize?

  • Original: 16-bit floats (70B model = 140GB)
  • Quantized Q4: 4-bit (70B model = 35GB)

Common formats:

  • Q2: Fastest, lowest quality
  • Q4: Good balance (most common)
  • Q5: Better quality
  • Q8: Near-original quality
  • F16: Original precision
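
Back-of-the-envelope weight memory is just parameters × bits ÷ 8 (ignoring KV cache and runtime overhead):

# billions of params × bits ÷ 8 gives GB of weights
def weight_memory_gb(params_billions, bits):
    return params_billions * bits / 8

for fmt, bits in [("F16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4), ("Q2", 2)]:
    print(f"70B @ {fmt}: ~{weight_memory_gb(70, bits):.0f} GB")
# 70B @ F16: ~140 GB ... 70B @ Q4: ~35 GB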

Spring Boot RAG Application {#spring-boot-app}

Project Structure

spring-rag-app/
├── src/main/java/com/example/rag/
│   ├── RagApplication.java
│   ├── config/
│   │   ├── OpenAIConfig.java
│   │   └── VectorStoreConfig.java
│   ├── controller/
│   │   └── ChatController.java
│   ├── service/
│   │   ├── EmbeddingService.java
│   │   ├── VectorStoreService.java
│   │   ├── DocumentIngestionService.java
│   │   └── RagService.java
│   ├── model/
│   │   ├── ChatRequest.java
│   │   ├── ChatResponse.java
│   │   └── Document.java
│   └── repository/
│       └── DocumentRepository.java
├── pom.xml
└── application.yml

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.2.0</version>
    </parent>
    
    <groupId>com.example</groupId>
    <artifactId>spring-rag-app</artifactId>
    <version>1.0.0</version>
    
    <dependencies>
        <!-- Spring Boot Web -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        
        <!-- Spring Boot Data JPA -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        
        <!-- PostgreSQL with pgvector -->
        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
        </dependency>
        
        <!-- Pgvector for vector operations -->
        <dependency>
            <groupId>com.pgvector</groupId>
            <artifactId>pgvector</artifactId>
            <version>0.1.2</version>
        </dependency>
        
        <!-- OpenAI Java Client -->
        <dependency>
            <groupId>com.theokanning.openai-gpt3-java</groupId>
            <artifactId>service</artifactId>
            <version>0.18.2</version>
        </dependency>
        
        <!-- Apache PDFBox for PDF parsing -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>3.0.0</version>
        </dependency>
        
        <!-- Lombok -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
    </dependencies>
</project>

application.yml

spring:
  application:
    name: spring-rag-app
  
  datasource:
    url: jdbc:postgresql://localhost:5432/ragdb
    username: postgres
    password: postgres
    driver-class-name: org.postgresql.Driver
  
  jpa:
    hibernate:
      ddl-auto: update
    show-sql: true
    properties:
      hibernate:
        dialect: org.hibernate.dialect.PostgreSQLDialect

openai:
  api-key: ${OPENAI_API_KEY}
  model: gpt-4-turbo-preview
  embedding-model: text-embedding-3-small
  max-tokens: 2000
  temperature: 0.7

rag:
  chunk-size: 500
  chunk-overlap: 50
  top-k-results: 5

Model Classes

// Document.java
package com.example.rag.model;

import jakarta.persistence.*;
import lombok.Data;

@Data
@Entity
@Table(name = "documents")
public class Document {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    
    @Column(columnDefinition = "TEXT")
    private String content;
    
    @Column(name = "embedding", columnDefinition = "vector(1536)")
    private float[] embedding;
    
    private String source;
    
    @Column(name = "chunk_index")
    private Integer chunkIndex;
    
    private String metadata;
}

// ChatRequest.java
package com.example.rag.model;

import lombok.Data;

@Data
public class ChatRequest {
    private String query;
    private boolean useRag = true;
}

// ChatResponse.java
package com.example.rag.model;

import lombok.Data;
import java.util.List;

@Data
public class ChatResponse {
    private String response;
    private List<String> sources;
    private boolean usedRag;
}

Repository

// DocumentRepository.java
package com.example.rag.repository;

import com.example.rag.model.Document;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import java.util.List;

public interface DocumentRepository extends JpaRepository<Document, Long> {
    
    @Query(value = """
        SELECT * FROM documents
        ORDER BY embedding <-> CAST(:queryEmbedding AS vector)
        LIMIT :limit
        """, nativeQuery = true)
    List<Document> findSimilarDocuments(
        @Param("queryEmbedding") String queryEmbedding,
        @Param("limit") int limit
    );
}

Service Classes

// EmbeddingService.java
package com.example.rag.service;

import com.theokanning.openai.service.OpenAiService;
import com.theokanning.openai.embedding.EmbeddingRequest;
import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.util.List;

@Service
@RequiredArgsConstructor
public class EmbeddingService {
    
    private final OpenAiService openAiService;
    
    @Value("${openai.embedding-model}")
    private String embeddingModel;
    
    public float[] generateEmbedding(String text) {
        var request = EmbeddingRequest.builder()
            .model(embeddingModel)
            .input(List.of(text))
            .build();
        
        var response = openAiService.createEmbeddings(request);
        
        // Convert Double[] to float[]
        List<Double> embedding = response.getData().get(0).getEmbedding();
        float[] result = new float[embedding.size()];
        for (int i = 0; i < embedding.size(); i++) {
            result[i] = embedding.get(i).floatValue();
        }
        return result;
    }
}

// VectorStoreService.java
package com.example.rag.service;

import com.example.rag.model.Document;
import com.example.rag.repository.DocumentRepository;
import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.util.Arrays;
import java.util.List;

@Service
@RequiredArgsConstructor
public class VectorStoreService {
    
    private final DocumentRepository documentRepository;
    private final EmbeddingService embeddingService;
    
    @Value("${rag.top-k-results}")
    private int topK;
    
    public void storeDocument(String content, String source, int chunkIndex) {
        Document doc = new Document();
        doc.setContent(content);
        doc.setSource(source);
        doc.setChunkIndex(chunkIndex);
        doc.setEmbedding(embeddingService.generateEmbedding(content));
        
        documentRepository.save(doc);
    }
    
    public List<Document> searchSimilar(String query) {
        float[] queryEmbedding = embeddingService.generateEmbedding(query);
        String embeddingStr = Arrays.toString(queryEmbedding);
        
        return documentRepository.findSimilarDocuments(embeddingStr, topK);
    }
}

// DocumentIngestionService.java
package com.example.rag.service;

import lombok.RequiredArgsConstructor;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

@Service
@RequiredArgsConstructor
public class DocumentIngestionService {
    
    private final VectorStoreService vectorStoreService;
    
    @Value("${rag.chunk-size}")
    private int chunkSize;
    
    @Value("${rag.chunk-overlap}")
    private int chunkOverlap;
    
    public void ingestPdf(MultipartFile file) throws IOException {
        String text = extractTextFromPdf(file);
        List<String> chunks = chunkText(text);
        
        for (int i = 0; i < chunks.size(); i++) {
            vectorStoreService.storeDocument(
                chunks.get(i),
                file.getOriginalFilename(),
                i
            );
        }
    }
    
    private String extractTextFromPdf(MultipartFile file) throws IOException {
        // PDFBox 3.x removed PDDocument.load(...); use Loader.loadPDF instead
        try (PDDocument document = Loader.loadPDF(file.getBytes())) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }
    
    private List<String> chunkText(String text) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        
        while (start < text.length()) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            start += chunkSize - chunkOverlap;
        }
        
        return chunks;
    }
}

// RagService.java
package com.example.rag.service;

import com.example.rag.model.ChatResponse;
import com.example.rag.model.Document;
import com.theokanning.openai.completion.chat.ChatCompletionRequest;
import com.theokanning.openai.completion.chat.ChatMessage;
import com.theokanning.openai.completion.chat.ChatMessageRole;
import com.theokanning.openai.service.OpenAiService;
import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

@Service
@RequiredArgsConstructor
public class RagService {
    
    private final OpenAiService openAiService;
    private final VectorStoreService vectorStoreService;
    
    @Value("${openai.model}")
    private String model;
    
    @Value("${openai.max-tokens}")
    private int maxTokens;
    
    @Value("${openai.temperature}")
    private double temperature;
    
    public ChatResponse chat(String query, boolean useRag) {
        List<ChatMessage> messages = new ArrayList<>();
        List<String> sources = new ArrayList<>();
        
        if (useRag) {
            // Retrieve relevant documents
            List<Document> relevantDocs = vectorStoreService.searchSimilar(query);
            
            if (!relevantDocs.isEmpty()) {
                // Build context from retrieved documents
                String context = relevantDocs.stream()
                    .map(Document::getContent)
                    .collect(Collectors.joining("\n\n"));
                
                sources = relevantDocs.stream()
                    .map(doc -> doc.getSource() + " (chunk " + doc.getChunkIndex() + ")")
                    .distinct()
                    .collect(Collectors.toList());
                
                // System message with RAG context
                messages.add(new ChatMessage(
                    ChatMessageRole.SYSTEM.value(),
                    "You are a helpful assistant. Answer the user's question based ONLY on the following context. " +
                    "If the answer cannot be found in the context, say so.\n\nContext:\n" + context
                ));
            }
        }
        
        // User message
        messages.add(new ChatMessage(ChatMessageRole.USER.value(), query));
        
        // Create completion request
        ChatCompletionRequest request = ChatCompletionRequest.builder()
            .model(model)
            .messages(messages)
            .maxTokens(maxTokens)
            .temperature(temperature)
            .build();
        
        String response = openAiService.createChatCompletion(request)
            .getChoices()
            .get(0)
            .getMessage()
            .getContent();
        
        ChatResponse chatResponse = new ChatResponse();
        chatResponse.setResponse(response);
        chatResponse.setSources(sources);
        chatResponse.setUsedRag(useRag && !sources.isEmpty());
        
        return chatResponse;
    }
}

Controller

// ChatController.java
package com.example.rag.controller;

import com.example.rag.model.ChatRequest;
import com.example.rag.model.ChatResponse;
import com.example.rag.service.DocumentIngestionService;
import com.example.rag.service.RagService;
import lombok.RequiredArgsConstructor;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

@RestController
@RequestMapping("/api")
@RequiredArgsConstructor
public class ChatController {
    
    private final RagService ragService;
    private final DocumentIngestionService documentIngestionService;
    
    @PostMapping("/chat")
    public ResponseEntity<ChatResponse> chat(@RequestBody ChatRequest request) {
        ChatResponse response = ragService.chat(
            request.getQuery(),
            request.isUseRag()
        );
        return ResponseEntity.ok(response);
    }
    
    @PostMapping("/ingest")
    public ResponseEntity<String> ingestDocument(
        @RequestParam("file") MultipartFile file
    ) {
        try {
            documentIngestionService.ingestPdf(file);
            return ResponseEntity.ok("Document ingested successfully");
        } catch (Exception e) {
            return ResponseEntity.badRequest()
                .body("Error ingesting document: " + e.getMessage());
        }
    }
    
    @GetMapping("/health")
    public ResponseEntity<String> health() {
        return ResponseEntity.ok("RAG service is running");
    }
}

Configuration Classes

// OpenAIConfig.java
package com.example.rag.config;

import com.theokanning.openai.service.OpenAiService;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.time.Duration;

@Configuration
public class OpenAIConfig {
    
    @Value("${openai.api-key}")
    private String apiKey;
    
    @Bean
    public OpenAiService openAiService() {
        return new OpenAiService(apiKey, Duration.ofSeconds(60));
    }
}

Database Setup (PostgreSQL with pgvector)

-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create documents table
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1536),
    source VARCHAR(255),
    chunk_index INTEGER,
    metadata TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create index for vector similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  postgres:
    image: ankane/pgvector:latest
    environment:
      POSTGRES_DB: ragdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

Testing the Application

# Start PostgreSQL
docker-compose up -d

# Run the application
./mvnw spring-boot:run

# Test document ingestion
curl -X POST http://localhost:8080/api/ingest \
  -F "file=@/path/to/document.pdf"

# Test chat without RAG
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is machine learning?", "useRag": false}'

# Test chat with RAG
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What does the document say about embeddings?", "useRag": true}'

Learning Roadmap {#learning-roadmap}

Phase 1: Foundations (2-3 weeks)

Mathematics (Lightweight)

  • Linear algebra basics: vectors, dot product, matrix multiplication
  • Probability: distributions, conditional probability
  • Cosine similarity and distance metrics

Machine Learning Basics

  • Supervised vs unsupervised learning
  • Training vs inference
  • Overfitting and regularization
  • Loss functions and optimization

Resources:

  • 3Blue1Brown: Neural Networks series
  • StatQuest: Machine Learning basics
  • Fast.ai: Practical Deep Learning

Phase 2: NLP & Transformers (3-4 weeks)

Core Concepts

  • Tokenization (BPE, WordPiece, SentencePiece)
  • Word embeddings (Word2Vec, GloVe)
  • Contextual embeddings (BERT, GPT)
  • Attention mechanism
  • Transformer architecture

Hands-on

  • Implement tokenization from scratch
  • Use HuggingFace Transformers library
  • Fine-tune a small model
  • Understand encoder vs decoder architectures

Resources:

  • "Attention is All You Need" paper
  • Jay Alammar's Illustrated Transformer
  • HuggingFace Course
  • Andrej Karpathy's Neural Networks: Zero to Hero

Phase 3: Prompt Engineering (1 week)

Techniques

  • Zero-shot prompting
  • Few-shot prompting
  • Chain-of-thought (CoT)
  • ReAct prompting
  • System vs user vs assistant roles

Security

  • Prompt injection attacks
  • Jailbreaking
  • Defense strategies

Resources:

  • OpenAI Prompt Engineering Guide
  • Anthropic's Prompt Engineering docs
  • PromptingGuide.ai

Phase 4: RAG Systems (3-4 weeks) ⭐ CRITICAL

Core Components

  • Document chunking strategies
  • Embedding models
  • Vector databases
  • Retrieval algorithms
  • Reranking techniques

Advanced Topics

  • Hybrid search (dense + sparse)
  • Query transformation
  • Context compression
  • Hallucination detection
  • Evaluation metrics

Hands-on Projects

  • Build a document Q&A system
  • Create a code documentation assistant
  • Implement customer support bot

Resources:

  • LangChain documentation
  • LlamaIndex guides
  • Pinecone learning center

Phase 5: Agents & Tools (2-3 weeks)

Concepts

  • Function calling
  • Tool use
  • Agent architectures (ReAct, Plan & Execute)
  • Multi-agent systems
  • Memory management

Frameworks

  • LangChain Agents
  • AutoGPT
  • BabyAGI
  • CrewAI

Projects

  • Build a research assistant
  • Create a coding agent
  • Implement task automation system

Phase 6: Local & Open Source (2 weeks)

Tools

  • Ollama
  • LM Studio
  • llama.cpp
  • vLLM (for serving)

Concepts

  • Model quantization
  • GGUF format
  • GPU vs CPU inference
  • Model fine-tuning (LoRA, QLoRA)

Projects

  • Run LLaMA locally
  • Build offline RAG system
  • Fine-tune model on custom data

Phase 7: Production Systems (Ongoing)

Engineering

  • API design for LLM applications
  • Streaming responses
  • Caching strategies
  • Rate limiting
  • Cost optimization

Monitoring

  • Latency tracking
  • Token usage
  • Error rates
  • User feedback loops

Evaluation

  • Unit tests for prompts
  • Regression testing
  • A/B testing
  • Human evaluation

MLOps

  • Model versioning
  • Prompt versioning
  • Feature flags
  • Gradual rollouts

Production Considerations {#production}

Performance Optimization

1. Caching

@Service
@RequiredArgsConstructor
public class CachingService {
    private final EmbeddingService embeddingService;
    private final Cache<String, float[]> embeddingCache;   // e.g. a Caffeine cache
    
    public float[] getEmbedding(String text) {
        // compute each distinct text's embedding once, then serve from cache
        return embeddingCache.get(text, 
            key -> embeddingService.generateEmbedding(key)
        );
    }
}

2. Batching

public List<float[]> generateEmbeddingsBatch(List<String> texts) {
    // Process multiple texts in one API call
    return openAiService.createEmbeddings(
        EmbeddingRequest.builder()
            .model(embeddingModel)
            .input(texts)
            .build()
    ).getData().stream()
        .map(this::convertToFloatArray)
        .collect(Collectors.toList());
}

3. Streaming Responses

@GetMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> chatStream(@RequestParam String query) {
    return Flux.create(sink -> {
        openAiService.streamChatCompletion(request)
            .doOnNext(chunk -> sink.next(chunk.getChoices().get(0).getText()))
            .doOnComplete(sink::complete)
            .subscribe();
    });
}

Cost Optimization

Token Management

public class TokenCounter {
    // Rough heuristic: English text averages ~4 tokens per 3 words
    public int estimateTokens(String text) {
        return text.split("\\s+").length * 4 / 3;
    }
    
    public String truncateToTokenLimit(String text, int maxTokens) {
        int estimatedTokens = estimateTokens(text);
        if (estimatedTokens <= maxTokens) {
            return text;
        }
        
        // use a floating-point ratio: integer division would always yield 0 here
        double ratio = (double) maxTokens / estimatedTokens;
        int targetLength = (int) (text.length() * ratio);
        return text.substring(0, targetLength);
    }
}

Smart Context Selection

public List<Document> selectBestContext(
    List<Document> candidates,
    int maxTokens
) {
    List<Document> selected = new ArrayList<>();
    int totalTokens = 0;
    
    for (Document doc : candidates) {
        int docTokens = estimateTokens(doc.getContent());
        if (totalTokens + docTokens <= maxTokens) {
            selected.add(doc);
            totalTokens += docTokens;
        } else {
            break;
        }
    }
    
    return selected;
}

Security & Safety

1. Input Validation

public void validateInput(String query) {
    if (query.length() > 10000) {
        throw new IllegalArgumentException("Query too long");
    }
    
    if (containsSuspiciousPatterns(query)) {
        throw new SecurityException("Potential prompt injection detected");
    }
}

2. Output Filtering

public String filterOutput(String response) {
    // Remove PII
    response = removePII(response);
    
    // Check for harmful content
    if (containsHarmfulContent(response)) {
        return "I cannot provide that information.";
    }
    
    return response;
}

3. Rate Limiting

@Component
public class RateLimitInterceptor implements HandlerInterceptor {
    private final RateLimiter rateLimiter;
    
    @Override
    public boolean preHandle(HttpServletRequest request, 
                           HttpServletResponse response, 
                           Object handler) {
        String userId = getUserId(request);
        if (!rateLimiter.tryAcquire(userId)) {
            response.setStatus(429);
            return false;
        }
        return true;
    }
}

Monitoring & Observability

Metrics to Track

@Service
public class MetricsService {
    private final MeterRegistry registry;
    
    public void recordLatency(String operation, long milliseconds) {
        registry.timer("llm.latency", "operation", operation)
            .record(Duration.ofMillis(milliseconds));
    }
    
    public void recordTokenUsage(int promptTokens, int completionTokens) {
        registry.counter("llm.tokens.prompt").increment(promptTokens);
        registry.counter("llm.tokens.completion").increment(completionTokens);
    }
    
    public void recordCost(double cost) {
        registry.counter("llm.cost").increment(cost);
    }
}

Evaluation Framework

Automated Testing

@Test
public void testRagAccuracy() {
    // Test dataset
    List<QAPair> testCases = loadTestCases();
    
    for (QAPair qa : testCases) {
        ChatResponse response = ragService.chat(qa.getQuestion(), true);
        
        // Evaluate
        double similarity = calculateSimilarity(
            response.getResponse(),
            qa.getExpectedAnswer()
        );
        
        assertTrue(similarity > 0.8, "Response not similar enough");
    }
}

A/B Testing

@Service
public class ExperimentService {
    public ChatResponse chat(String query, String userId) {
        boolean useNewAlgorithm = isInExperimentGroup(userId);
        
        if (useNewAlgorithm) {
            return ragService.chatV2(query);
        } else {
            return ragService.chatV1(query);
        }
    }
}

Additional Topics {#additional-topics}

1. Multimodal Models

Vision + Language

  • Image understanding (GPT-4V, Claude 3)
  • Image generation (DALL-E, Stable Diffusion)
  • OCR and document understanding

Audio

  • Speech-to-text (Whisper)
  • Text-to-speech (ElevenLabs, Bark)
  • Audio embeddings

2. Fine-tuning Strategies

When to Fine-tune

  • Domain-specific terminology
  • Consistent output format
  • Specialized tasks

Methods

  • Full fine-tuning (expensive)
  • LoRA (Low-Rank Adaptation)
  • QLoRA (Quantized LoRA)
  • Prompt tuning
  • Prefix tuning

3. Constitutional AI

Techniques

  • RLHF (Reinforcement Learning from Human Feedback)
  • Constitutional AI (Anthropic)
  • Red teaming
  • Adversarial testing

4. Advanced RAG Patterns

Query Routing

Simple question → Direct LLM
Complex question → RAG pipeline
Calculation → Tool use
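
A toy router for the table above (real systems often use a cheap LLM call or a trained classifier instead of keyword heuristics):

def route(query: str) -> str:
    if any(op in query for op in "+-*/") and any(c.isdigit() for c in query):
        return "tool"     # calculation -> calculator tool
    if "document" in query.lower() or len(query.split()) > 12:
        return "rag"      # complex / doc-grounded -> RAG pipeline
    return "llm"          # simple -> direct LLM

print(route("What is 42 * 17?"))      # tool
print(route("What's the capital?"))   # llm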

Hypothetical Document Embeddings (HyDE)

User query → LLM generates hypothetical answer
→ Embed hypothetical answer → Search for similar docs
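
As a sketch (llm, embed, and vector_db are placeholders for your own generation, embedding, and vector-search calls):

# HyDE: embed a *hypothetical answer* instead of the raw query
def hyde_retrieve(query, llm, embed, vector_db, k=5):
    hypothetical = llm(f"Write a short passage that answers: {query}")
    return vector_db.search(embed(hypothetical), top_k=k)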

Self-RAG

LLM generates → Self-critique → Retrieval if needed
→ Regenerate with context

5. Structured Output

JSON Mode

ChatCompletionRequest request = ChatCompletionRequest.builder()
    .model("gpt-4-turbo-preview")
    .messages(messages)
    .responseFormat(ResponseFormat.builder()
        .type("json_object")
        .build())
    .build();

Function Calling for Structured Data

FunctionDefinition extractSchema = FunctionDefinition.builder()
    .name("extract_entities")
    .description("Extract entities from text")
    .parameters(/* JSON Schema */)
    .build();

6. Ethical Considerations

Bias & Fairness

  • Training data bias
  • Output fairness testing
  • Demographic parity

Privacy

  • Data retention policies
  • PII removal
  • Differential privacy

Environmental Impact

  • Carbon footprint of training
  • Inference efficiency
  • Model compression

Quick Reference

Token Limits by Model

Model              Context Window   Max Output
GPT-4 Turbo        128K             4K
GPT-3.5 Turbo      16K              4K
Claude 3.5 Sonnet  200K             8K
Gemini 1.5 Pro     2M               8K
LLaMA 3.1 70B      128K             -

Embedding Dimensions

Model                   Dimensions   Cost
text-embedding-3-small  1536         $0.02 / 1M tokens
text-embedding-3-large  3072         $0.13 / 1M tokens
BGE-large               1024         Free (open source)

Common Pitfalls

Don't:

  • Store raw API keys in code
  • Skip input validation
  • Ignore token limits
  • Forget to handle rate limits
  • Cache indefinitely

Do:

  • Use environment variables
  • Validate and sanitize inputs
  • Monitor token usage
  • Implement exponential backoff (sketch below)
  • Set cache expiration
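
A minimal retry wrapper for the exponential-backoff item above (catching all exceptions here for brevity; in practice catch only rate-limit and transient errors):

import random
import time

def with_backoff(fn, max_retries=5, base_seconds=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:   # narrow this to rate-limit / transient errors
            if attempt == max_retries - 1:
                raise
            # exponential delay with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(base_seconds * 2 ** attempt + random.uniform(0, 1))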

Next Steps

For Immediate Practice:

  1. Build the Spring Boot RAG app above
  2. Ingest your own documents
  3. Experiment with different chunking strategies
  4. Try different embedding models
  5. Add tool use (web search, calculator)

For Deep Learning:

  1. Implement a mini-transformer from scratch
  2. Fine-tune a small model (LLaMA 7B)
  3. Build a multi-agent system
  4. Create an evaluation framework
  5. Deploy to production

Community Resources:

  • HuggingFace Discord
  • LangChain GitHub discussions
  • r/LocalLLaMA subreddit
  • AI Tinkerers meetups
  • Papers with Code

Stay Updated:

  • Anthropic's research blog
  • OpenAI's blog
  • Google AI blog
  • arXiv cs.CL and cs.AI
  • Twitter/X: Follow AI researchers

Summary

You now understand:

  ✅ How LLMs work internally (transformers, attention, embeddings)
  ✅ How agents use tools and memory
  ✅ How RAG systems retrieve and inject context
  ✅ How to build production-ready AI apps
  ✅ Complete Spring Boot implementation
  ✅ Best practices for cost, performance, and safety

The key insight: Modern AI applications are 70% engineering (RAG, tools, orchestration) and 30% ML. Focus on building robust systems that leverage LLMs effectively rather than training models from scratch.
