
@ochafik
Created September 23, 2025 00:06
GPT Researcher Deep Research Algo Analysis

In-Depth Analysis: Deep Research Algorithm

Executive Summary

The Deep Research algorithm implements a recursive breadth-first tree search with intelligent learning accumulation, creating a multi-layered research system that progressively deepens understanding through iterative exploration.

Core Algorithm Architecture

1. Recursive Tree Structure

deep_research(query, breadth=4, depth=2)
    ├── Level 1: Generate `breadth` search queries
    │   ├── Query 1 → Research → Extract learnings & follow-ups
    │   ├── Query 2 → Research → Extract learnings & follow-ups
    │   ├── Query 3 → Research → Extract learnings & follow-ups
    │   └── Query 4 → Research → Extract learnings & follow-ups
    │
    └── Level 2: For each result (if depth > 1)
        └── Recursively call deep_research with:
            - Combined follow-up questions as new query
            - Reduced breadth (max(2, breadth // 2))
            - Reduced depth (depth - 1)
            - Accumulated learnings passed forward
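The control flow above can be sketched as a minimal, runnable recursion. Here `deep_research` is a simplified stand-in for the real method: query generation and research are replaced by string stubs, but the breadth halving, depth decrement, and learning accumulation mirror the structure described.

```python
# Minimal sketch of the recursive control flow; the sub-query and
# learning strings are stand-ins for the LLM/search pipeline, not
# the actual gpt-researcher API.
def deep_research(query, breadth=4, depth=2, learnings=None):
    learnings = list(learnings or [])
    # Stand-in for LLM query generation (Phase 1)
    sub_queries = [f"{query} / sub-query {i}" for i in range(breadth)]
    for sq in sub_queries:
        # Stand-in for research + learning extraction (Phase 2)
        learnings.append(f"learning from {sq}")
        if depth > 1:
            # Recursive deepening with reduced breadth and depth (Phase 4)
            learnings = deep_research(sq, max(2, breadth // 2), depth - 1, learnings)
    return learnings

print(len(deep_research("AI in healthcare")))  # 12 with defaults (4 + 4*2)
```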

2. Key Data Structures

Learnings Accumulation (deep_research.py:191-200)

learnings: List[str]      # Insights extracted from research
citations: Dict[str, str]  # Maps learning → source URL
visited_urls: Set[str]     # Deduplication of visited sources
context: List[str]         # Raw research content
sources: List[dict]        # Metadata about sources

Progress Tracking (deep_research.py:37-46)

class ResearchProgress:
    current_depth: int     # 1 to total_depth
    total_depth: int       # Maximum recursion depth
    current_breadth: int   # Completed queries at level
    total_breadth: int     # Total queries at level
    current_query: str     # Active query being processed
    completed_queries: int # Global counter

Algorithm Flow Analysis

Phase 1: Query Generation (deep_research.py:209-211)

  1. Input: Original research query
  2. Process: LLM generates `breadth` search queries, each paired with a research goal
  3. Output: List of structured queries with goals
    serp_queries = [
        {'query': 'AI safety regulations 2024',
         'researchGoal': 'Understand current regulatory framework'},
        {'query': 'AI alignment research progress',
         'researchGoal': 'Review technical advances'},
        ...
    ]

Phase 2: Concurrent Research Execution (deep_research.py:219-276)

Concurrency Control

semaphore = asyncio.Semaphore(self.concurrency_limit)  # Default: 2

async def process_query(serp_query):
    async with semaphore:  # Limits parallel executions
        # Create new GPTResearcher instance
        # Conduct research
        # Extract learnings
        # Return structured results

Per-Query Processing (deep_research.py:222-273)

  1. Create Researcher Instance: Fresh GPTResearcher for each query
  2. Conduct Research: Full research pipeline (search, scrape, summarize)
  3. Extract Learnings: LLM processes results to extract:
    • Key insights with citations
    • Follow-up questions
    • Context preservation
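The semaphore-bounded fan-out described above can be demonstrated end to end with a toy coroutine. Here `fake_research` is a placeholder for a full `GPTResearcher` run; only the concurrency pattern matches the source.

```python
import asyncio

# Stand-in for a full GPTResearcher run on one sub-query.
async def fake_research(query: str) -> dict:
    await asyncio.sleep(0.01)  # simulate network-bound research work
    return {"query": query, "learnings": [f"insight about {query}"]}

async def run_all(queries, concurrency_limit=2):
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def process_query(q):
        async with semaphore:  # at most `concurrency_limit` researchers run at once
            return await fake_research(q)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(process_query(q) for q in queries))

results = asyncio.run(run_all(["q1", "q2", "q3", "q4"]))
print(len(results))  # 4
```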

Phase 3: Result Aggregation (deep_research.py:284-293)

for result in results:
    all_learnings.extend(result['learnings'])      # Accumulate insights
    all_visited_urls.update(result['visited_urls']) # Track all sources
    all_citations.update(result['citations'])       # Map learnings→URLs
    all_context.append(result['context'])           # Preserve raw data
    all_sources.extend(result['sources'])           # Metadata tracking

Phase 4: Recursive Deepening (deep_research.py:294-324)

Recursion Logic

if depth > 1:
    new_breadth = max(2, breadth // 2)  # Reduce breadth
    new_depth = depth - 1                # Decrement depth

    # Combine follow-up questions into new query
    next_query = f"""
    Previous research goal: {result['researchGoal']}
    Follow-up questions: {' '.join(result['followUpQuestions'])}
    """

    # Recursive call with accumulated state
    deeper_results = await self.deep_research(
        query=next_query,
        breadth=new_breadth,
        depth=new_depth,
        learnings=all_learnings,      # Pass accumulated learnings
        citations=all_citations,      # Pass citation mappings
        visited_urls=all_visited_urls # Avoid revisiting URLs
    )

Critical Mechanisms

1. Learning Propagation

  • Forward Passing: Learnings from parent nodes passed to children
  • Accumulation: Each level adds to global learning pool
  • Deduplication: list(set(all_learnings)) at return (deep_research.py:334)

2. Citation Tracking

  • Source Attribution: Every learning linked to source URL
  • Format: learning_text → source_url mapping
  • Preservation: Citations carried through entire tree

3. Context Management

  • Word Limit: 25,000 words maximum (deep_research.py:15)
  • Trimming Strategy: Keep most recent content when limit exceeded
  • Preservation: Raw context maintained for final report
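A "keep most recent" trim can be sketched as below; this is an assumed implementation of the strategy described, and the real word-counting logic in `deep_research.py` may differ in detail.

```python
# Hedged sketch of the context-trimming strategy: walk chunks from
# newest to oldest, keep whole chunks until the word budget is spent.
def trim_context(context_chunks, max_words=25_000):
    kept, total = [], 0
    for chunk in reversed(context_chunks):  # newest chunks first
        words = len(chunk.split())
        if total + words > max_words:
            break  # budget exhausted; drop this and all older chunks
        kept.append(chunk)
        total += words
    return list(reversed(kept))  # restore chronological order
```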

4. URL Deduplication

  • Purpose: Avoid re-scraping same sources
  • Implementation: Set-based tracking across tree
  • Inheritance: Child nodes inherit parent's visited URLs
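The set-based check amounts to a test-and-insert on a shared `visited_urls` set; the helper name below is illustrative, not from the source.

```python
# Set-based URL deduplication: because child calls inherit the parent's
# visited set (by reference), a URL scraped anywhere in the tree is
# skipped everywhere else.
visited_urls = {"https://example.com/a"}

def should_scrape(url: str, visited: set) -> bool:
    if url in visited:
        return False  # already scraped somewhere in the tree
    visited.add(url)
    return True

print(should_scrape("https://example.com/a", visited_urls))  # False
print(should_scrape("https://example.com/b", visited_urls))  # True
```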

Execution Example

Query: "Impact of AI on healthcare"

Level 1 (Breadth=4, Depth=2)

1. "AI diagnostic tools in radiology 2024"
   → Learnings: [FDA approvals, accuracy rates]
   → Follow-ups: ["regulatory challenges", "patient acceptance"]

2. "Machine learning drug discovery breakthroughs"
   → Learnings: [New compounds, trial results]
   → Follow-ups: ["cost reduction", "timeline acceleration"]

3. "AI-powered patient monitoring systems"
   → Learnings: [Hospital implementations, outcomes]
   → Follow-ups: ["privacy concerns", "integration challenges"]

4. "Healthcare AI ethics and bias"
   → Learnings: [Bias cases, mitigation strategies]
   → Follow-ups: ["regulatory frameworks", "accountability"]

Level 2 (Breadth=2, Depth=1)

For each Level 1 result:

1.1 "FDA regulatory challenges for AI diagnostics"
1.2 "Patient acceptance of AI medical tools"

2.1 "AI drug discovery cost analysis"
2.2 "Accelerated clinical trial timelines with ML"

... (continues for all branches)

Performance Characteristics

Complexity Analysis

  • Time Complexity: O(breadth^depth) queries in the worst case; breadth halving at each level keeps the actual count well below this bound
  • Space Complexity: O(breadth × depth × content_size)
  • Concurrency: Limited by semaphore (default: 2 parallel)

Default Configuration

  • Breadth: 4 queries per level
  • Depth: 2 recursive levels
  • Total Queries: 4 + 4×2 = 12 (with breadth halving)
  • Concurrency: 2 simultaneous research operations
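The query-count arithmetic above can be verified with a short recurrence that mirrors the breadth-halving rule:

```python
# Count total queries issued across the tree: `breadth` queries at this
# level, each of which spawns a subtree with halved breadth (floored at 2).
def total_queries(breadth, depth):
    if depth == 0:
        return 0
    return breadth + breadth * total_queries(max(2, breadth // 2), depth - 1)

print(total_queries(4, 2))  # 4 + 4*2 = 12, matching the default configuration
```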

Key Insights

1. Not a Traditional Graph Search

  • No visited node tracking (allows revisiting topics from different angles)
  • No frontier/queue management
  • No backtracking or pruning

2. Learning-Centric Design

  • Primary goal: Accumulate diverse learnings
  • Secondary: Build comprehensive context
  • Result: Multi-perspective understanding

3. Adaptive Exploration

  • Follow-up questions guide deeper research
  • Breadth reduction prevents exponential explosion
  • Context trimming maintains feasibility

4. Stateful Recursion

  • Each recursive call inherits parent's knowledge
  • Learnings accumulate across entire tree
  • Citations preserved through all levels

Conclusion

The Deep Research algorithm is a sophisticated knowledge accumulation system that:

  1. Explores broadly at each level (breadth-first)
  2. Dives deeply through recursion (controlled depth)
  3. Accumulates learnings progressively
  4. Maintains attribution through citation tracking
  5. Manages resources through concurrency control and context trimming

This creates a comprehensive research system that builds layered understanding through iterative, intelligent exploration—more akin to human research patterns than traditional graph algorithms.
