
Generated via GitHub Copilot + Claude 4.5 Sonnet on 1 Nov 2025 against https://github.com/defog-ai/defog/ with the prompt: "Summarize the defog approach to text-to-SQL clearly with examples and links to source."

Defog's Approach to Text-to-SQL

Defog provides a comprehensive toolkit for converting natural language questions into SQL queries and executing them against databases. Here's how their approach works:

Core Architecture

1. Multi-Provider LLM Integration

Defog generates SQL through a unified interface spanning multiple LLM providers (OpenAI, Anthropic, Gemini, Together AI, xAI), so any supported provider can be used interchangeably.
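For illustration, a minimal sketch of swapping providers: the same call works across providers by changing only the model/provider pair (sql_answer_tool is shown in full under Basic Usage below; the question string here is hypothetical):

from defog.llm.sql import sql_answer_tool
from defog.llm.llm_providers import LLMProvider

# The same question answered by two different providers; only the
# model/provider arguments change between iterations.
for model, provider in [
    ("gpt-4o", LLMProvider.OPENAI),
    ("claude-sonnet-4-20250514", LLMProvider.ANTHROPIC),
]:
    result = await sql_answer_tool(
        question="How many orders were placed yesterday?",
        db_type="postgres",
        db_creds=db_creds,  # same credentials dict as in the examples below
        model=model,
        provider=provider,
    )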

Source: defog/llm/sql_generator.py

2. Schema-Aware Prompt Construction

The system builds context-rich prompts by:

  • Extracting database metadata (tables, columns, data types, descriptions)
  • Formatting schema as DDL with inline comments for context
  • Including business glossaries and domain-specific terminology
  • Maintaining conversation history for follow-up questions

Key function: build_sql_generation_prompt
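As an illustration, the DDL-with-inline-comments schema context might look roughly like this for a hypothetical orders table (a sketch of the general shape, not the exact output of build_sql_generation_prompt):

# A rough sketch of the kind of schema context the prompt builder produces;
# the table and comments here are hypothetical, not defog's exact format.
schema_context = """
CREATE TABLE orders (
    order_id INTEGER,      -- unique order identifier
    customer_id INTEGER,   -- references customers.customer_id
    amount NUMERIC,        -- order total in USD
    created_at TIMESTAMP   -- when the order was placed
);
"""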

3. Automatic Table Filtering for Large Databases

For databases with many tables (>5) or columns (>1000), Defog automatically identifies the most relevant tables before generating SQL:

# Automatic relevance filtering; identify_relevant_tables_tool lives in
# defog/llm/sql.py (see the source link below)
from defog.llm.sql import identify_relevant_tables_tool
from defog.llm.llm_providers import LLMProvider

relevance_result = await identify_relevant_tables_tool(
    question="Show me customer orders from last month",
    db_type="postgres",
    db_creds=db_creds,  # connection dict as in the usage example below
    model="gpt-4o",
    provider=LLMProvider.OPENAI,
    max_tables=10,
)

Source: defog/llm/sql.py

Example: Basic Usage

from defog.llm.sql import sql_answer_tool
from defog.llm.llm_providers import LLMProvider

# Ask questions in natural language
result = await sql_answer_tool(
    question="What are the top 10 customers by total sales?",
    db_type="postgres",
    db_creds={
        "host": "localhost",
        "database": "mydb",
        "user": "postgres",
        "password": "password",
        "port": 5432,
    },
    model="claude-sonnet-4-20250514",
    provider=LLMProvider.ANTHROPIC,
)

print(f"SQL: {result['query']}")
print(f"Results: {result['results']}")

Source: README.md

Advanced Features

Conversational Context

Pass previous questions, generated SQL, and results via previous_context so follow-up questions can resolve references like "the same customers":

# With conversation history
result = await sql_answer_tool(
    question="Show me the trend for the same customers",
    db_type="postgres",
    db_creds=db_creds,
    model="claude-sonnet-4-20250514",
    provider=LLMProvider.ANTHROPIC,
    previous_context=[
        {
            "question": "What are the top 10 customers?",
            "sql_query": "SELECT customer_id, SUM(amount)...",
            "results": "[(1, 50000), (2, 45000), ...]",
        }
    ],
)

Business Glossaries

Include domain-specific terminology and business rules:

result = await sql_answer_tool(
    question="Show me CLV for active customers",
    db_type="postgres",
    db_creds=db_creds,
    model="claude-sonnet-4-20250514",
    provider=LLMProvider.ANTHROPIC,
    glossary="""
    Total Sales: Sum of all order amounts for a customer
    Active Customer: Customer with purchase in last 90 days
    CLV: Customer Lifetime Value - total spend since first purchase
    """,
)

Documentation: Database Operations Guide

The Generation Process

  1. Metadata Extraction - Retrieve database schema information
  2. Table Filtering (if needed) - Identify relevant tables for large schemas
  3. Prompt Building - Construct context-aware prompts with schema, glossary, and conversation history
  4. SQL Generation - Use LLM to convert natural language to SQL
  5. Query Execution - Run generated SQL against the database
  6. Result Return - Provide results with column names and data

Implementation: generate_sql_query_local
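Read as pseudocode, the six steps might be wired together roughly like this (a sketch in which every helper name is a hypothetical stand-in for defog's internals, not its actual function names):

# A hypothetical sketch of the six-step pipeline; helpers are illustrative.
async def text_to_sql(question, db_type, db_creds, model, provider,
                      glossary=None, previous_context=None):
    metadata = await extract_metadata(db_type, db_creds)        # 1. metadata extraction
    if is_large_schema(metadata):                               # 2. table filtering (if needed)
        metadata = await filter_tables(question, metadata)
    prompt = build_prompt(question, metadata, glossary,         # 3. prompt building
                          previous_context)
    sql = await generate_sql(prompt, model, provider)           # 4. SQL generation
    columns, rows = await run_query(sql, db_type, db_creds)     # 5. query execution
    return {"query": sql, "results": rows, "columns": columns}  # 6. result return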

Supported Databases

Works with 11+ database types including PostgreSQL, MySQL, BigQuery, Snowflake, Databricks, SQL Server, Redshift, SQLite, and DuckDB.

Full list: Database Operations - Supported Databases

Working Example

A complete example is available at examples/sql_agent_example.py; it sets up a SQLite database with sample e-commerce data and queries it in natural language.
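A rough sketch of the same flow (the db_type string and credentials shape for SQLite are assumptions here; see the example file for the actual setup):

import sqlite3
from defog.llm.sql import sql_answer_tool
from defog.llm.llm_providers import LLMProvider

# Create a tiny sample database (hypothetical schema, for illustration only)
conn = sqlite3.connect("shop.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'alice', 120.0), (2, 'bob', 80.5)")
conn.commit()
conn.close()

# Query it in natural language; the SQLite db_creds shape is assumed
result = await sql_answer_tool(
    question="Which customer spent the most?",
    db_type="sqlite",
    db_creds={"database": "shop.db"},
    model="claude-sonnet-4-20250514",
    provider=LLMProvider.ANTHROPIC,
)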
