
@ochafik
Created September 23, 2025 00:06
GPT Researcher Deep Research Algo Analysis

In-Depth Analysis: Deep Research Algorithm

Executive Summary

The Deep Research algorithm implements a recursive breadth-first tree search with intelligent learning accumulation, creating a multi-layered research system that progressively deepens understanding through iterative exploration.

Core Algorithm Architecture

1. Recursive Tree Structure

deep_research(query, breadth=4, depth=2)
    ├── Level 1: Generate `breadth` search queries
    │   ├── Query 1 → Research → Extract learnings & follow-ups
    │   ├── Query 2 → Research → Extract learnings & follow-ups
    │   ├── Query 3 → Research → Extract learnings & follow-ups
    │   └── Query 4 → Research → Extract learnings & follow-ups
    │
    └── Level 2: For each result (if depth > 1)
        └── Recursively call deep_research with:
            - Combined follow-up questions as new query
            - Reduced breadth (max(2, breadth // 2))
            - Reduced depth (depth - 1)
            - Accumulated learnings passed forward
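The control flow above can be sketched as a minimal, runnable recursion. Here `deep_research` is a simplified stand-in for the real method: query generation and research are replaced by string stubs, but the breadth halving, depth decrement, and learning accumulation mirror the structure described.

```python
# Minimal sketch of the recursive control flow; the sub-query and
# learning strings are stand-ins for the LLM/search pipeline, not
# the actual gpt-researcher API.
def deep_research(query, breadth=4, depth=2, learnings=None):
    learnings = list(learnings or [])
    # Stand-in for LLM query generation (Phase 1)
    sub_queries = [f"{query} / sub-query {i}" for i in range(breadth)]
    for sq in sub_queries:
        # Stand-in for research + learning extraction (Phase 2)
        learnings.append(f"learning from {sq}")
        if depth > 1:
            # Recursive deepening with reduced breadth and depth (Phase 4)
            learnings = deep_research(sq, max(2, breadth // 2), depth - 1, learnings)
    return learnings

print(len(deep_research("AI in healthcare")))  # 12 with defaults (4 + 4*2)
```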

2. Key Data Structures

Learnings Accumulation (deep_research.py:191-200)

learnings: List[str]      # Insights extracted from research
citations: Dict[str, str]  # Maps learning → source URL
visited_urls: Set[str]     # Deduplication of visited sources
context: List[str]         # Raw research content
sources: List[dict]        # Metadata about sources

Progress Tracking (deep_research.py:37-46)

class ResearchProgress:
    current_depth: int     # 1 to total_depth
    total_depth: int       # Maximum recursion depth
    current_breadth: int   # Completed queries at level
    total_breadth: int     # Total queries at level
    current_query: str     # Active query being processed
    completed_queries: int # Global counter

Algorithm Flow Analysis

Phase 1: Query Generation (deep_research.py:209-211)

  1. Input: Original research query
  2. Process: LLM generates `breadth` search queries, each paired with a research goal
  3. Output: List of structured queries with goals
    serp_queries = [
        {'query': 'AI safety regulations 2024',
         'researchGoal': 'Understand current regulatory framework'},
        {'query': 'AI alignment research progress',
         'researchGoal': 'Review technical advances'},
        ...
    ]

Phase 2: Concurrent Research Execution (deep_research.py:219-276)

Concurrency Control

semaphore = asyncio.Semaphore(self.concurrency_limit)  # Default: 2

async def process_query(serp_query):
    async with semaphore:  # Limits parallel executions
        # Create new GPTResearcher instance
        # Conduct research
        # Extract learnings
        # Return structured results

Per-Query Processing (deep_research.py:222-273)

  1. Create Researcher Instance: Fresh GPTResearcher for each query
  2. Conduct Research: Full research pipeline (search, scrape, summarize)
  3. Extract Learnings: LLM processes results to extract:
    • Key insights with citations
    • Follow-up questions
    • Context preservation
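The semaphore-bounded fan-out described above can be demonstrated end to end with a toy coroutine. Here `fake_research` is a placeholder for a full `GPTResearcher` run; only the concurrency pattern matches the source.

```python
import asyncio

# Stand-in for a full GPTResearcher run on one sub-query.
async def fake_research(query: str) -> dict:
    await asyncio.sleep(0.01)  # simulate network-bound research work
    return {"query": query, "learnings": [f"insight about {query}"]}

async def run_all(queries, concurrency_limit=2):
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def process_query(q):
        async with semaphore:  # at most `concurrency_limit` researchers run at once
            return await fake_research(q)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(process_query(q) for q in queries))

results = asyncio.run(run_all(["q1", "q2", "q3", "q4"]))
print(len(results))  # 4
```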

Phase 3: Result Aggregation (deep_research.py:284-293)

for result in results:
    all_learnings.extend(result['learnings'])      # Accumulate insights
    all_visited_urls.update(result['visited_urls']) # Track all sources
    all_citations.update(result['citations'])       # Map learnings→URLs
    all_context.append(result['context'])           # Preserve raw data
    all_sources.extend(result['sources'])           # Metadata tracking

Phase 4: Recursive Deepening (deep_research.py:294-324)

Recursion Logic

if depth > 1:
    new_breadth = max(2, breadth // 2)  # Reduce breadth
    new_depth = depth - 1                # Decrement depth

    # Combine follow-up questions into new query
    next_query = f"""
    Previous research goal: {result['researchGoal']}
    Follow-up questions: {' '.join(result['followUpQuestions'])}
    """

    # Recursive call with accumulated state
    deeper_results = await self.deep_research(
        query=next_query,
        breadth=new_breadth,
        depth=new_depth,
        learnings=all_learnings,      # Pass accumulated learnings
        citations=all_citations,      # Pass citation mappings
        visited_urls=all_visited_urls # Avoid revisiting URLs
    )

Critical Mechanisms

1. Learning Propagation

  • Forward Passing: Learnings from parent nodes passed to children
  • Accumulation: Each level adds to global learning pool
  • Deduplication: list(set(all_learnings)) at return (deep_research.py:334)

2. Citation Tracking

  • Source Attribution: Every learning linked to source URL
  • Format: learning_text → source_url mapping
  • Preservation: Citations carried through entire tree

3. Context Management

  • Word Limit: 25,000 words maximum (deep_research.py:15)
  • Trimming Strategy: Keep most recent content when limit exceeded
  • Preservation: Raw context maintained for final report
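A "keep most recent" trim can be sketched as below; this is an assumed implementation of the strategy described, and the real word-counting logic in `deep_research.py` may differ in detail.

```python
# Hedged sketch of the context-trimming strategy: walk chunks from
# newest to oldest, keep whole chunks until the word budget is spent.
def trim_context(context_chunks, max_words=25_000):
    kept, total = [], 0
    for chunk in reversed(context_chunks):  # newest chunks first
        words = len(chunk.split())
        if total + words > max_words:
            break  # budget exhausted; drop this and all older chunks
        kept.append(chunk)
        total += words
    return list(reversed(kept))  # restore chronological order
```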

4. URL Deduplication

  • Purpose: Avoid re-scraping same sources
  • Implementation: Set-based tracking across tree
  • Inheritance: Child nodes inherit parent's visited URLs
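The set-based check amounts to a test-and-insert on a shared `visited_urls` set; the helper name below is illustrative, not from the source.

```python
# Set-based URL deduplication: because child calls inherit the parent's
# visited set (by reference), a URL scraped anywhere in the tree is
# skipped everywhere else.
visited_urls = {"https://example.com/a"}

def should_scrape(url: str, visited: set) -> bool:
    if url in visited:
        return False  # already scraped somewhere in the tree
    visited.add(url)
    return True

print(should_scrape("https://example.com/a", visited_urls))  # False
print(should_scrape("https://example.com/b", visited_urls))  # True
```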

Execution Example

Query: "Impact of AI on healthcare"

Level 1 (Breadth=4, Depth=2)

1. "AI diagnostic tools in radiology 2024"
   → Learnings: [FDA approvals, accuracy rates]
   → Follow-ups: ["regulatory challenges", "patient acceptance"]

2. "Machine learning drug discovery breakthroughs"
   → Learnings: [New compounds, trial results]
   → Follow-ups: ["cost reduction", "timeline acceleration"]

3. "AI-powered patient monitoring systems"
   → Learnings: [Hospital implementations, outcomes]
   → Follow-ups: ["privacy concerns", "integration challenges"]

4. "Healthcare AI ethics and bias"
   → Learnings: [Bias cases, mitigation strategies]
   → Follow-ups: ["regulatory frameworks", "accountability"]

Level 2 (Breadth=2, Depth=1)

For each Level 1 result:

1.1 "FDA regulatory challenges for AI diagnostics"
1.2 "Patient acceptance of AI medical tools"

2.1 "AI drug discovery cost analysis"
2.2 "Accelerated clinical trial timelines with ML"

... (continues for all branches)

Performance Characteristics

Complexity Analysis

  • Time Complexity: O(breadth^depth) queries in the worst case; breadth halving at each level keeps the actual count well below this bound
  • Space Complexity: O(breadth × depth × content_size)
  • Concurrency: Limited by semaphore (default: 2 parallel)

Default Configuration

  • Breadth: 4 queries per level
  • Depth: 2 recursive levels
  • Total Queries: 4 + 4×2 = 12 (with breadth halving)
  • Concurrency: 2 simultaneous research operations
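The query-count arithmetic above can be verified with a short recurrence that mirrors the breadth-halving rule:

```python
# Count total queries issued across the tree: `breadth` queries at this
# level, each of which spawns a subtree with halved breadth (floored at 2).
def total_queries(breadth, depth):
    if depth == 0:
        return 0
    return breadth + breadth * total_queries(max(2, breadth // 2), depth - 1)

print(total_queries(4, 2))  # 4 + 4*2 = 12, matching the default configuration
```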

Key Insights

1. Not a Traditional Graph Search

  • No visited node tracking (allows revisiting topics from different angles)
  • No frontier/queue management
  • No backtracking or pruning

2. Learning-Centric Design

  • Primary goal: Accumulate diverse learnings
  • Secondary: Build comprehensive context
  • Result: Multi-perspective understanding

3. Adaptive Exploration

  • Follow-up questions guide deeper research
  • Breadth reduction prevents exponential explosion
  • Context trimming maintains feasibility

4. Stateful Recursion

  • Each recursive call inherits parent's knowledge
  • Learnings accumulate across entire tree
  • Citations preserved through all levels

Conclusion

The Deep Research algorithm is a sophisticated knowledge accumulation system that:

  1. Explores broadly at each level (breadth-first)
  2. Dives deeply through recursion (controlled depth)
  3. Accumulates learnings progressively
  4. Maintains attribution through citation tracking
  5. Manages resources through concurrency control and context trimming

This creates a comprehensive research system that builds layered understanding through iterative, intelligent exploration—more akin to human research patterns than traditional graph algorithms.
