The Deep Research algorithm implements a recursive breadth-first tree search with intelligent learning accumulation, creating a multi-layered research system that progressively deepens understanding through iterative exploration.
```
deep_research(query, breadth=4, depth=2)
├── Level 1: Generate `breadth` search queries
│   ├── Query 1 → Research → Extract learnings & follow-ups
│   ├── Query 2 → Research → Extract learnings & follow-ups
│   ├── Query 3 → Research → Extract learnings & follow-ups
│   └── Query 4 → Research → Extract learnings & follow-ups
│
└── Level 2: For each result (if depth > 1)
    └── Recursively call deep_research with:
        - Combined follow-up questions as new query
        - Reduced breadth (breadth // 2)
        - Reduced depth (depth - 1)
        - Accumulated learnings passed forward
```
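The recursion pattern above can be sketched as a minimal skeleton. This is an illustration, not the actual implementation: the real research calls are replaced by a stub that fabricates one learning and one follow-up per query, and `deep_research_skeleton` is a name chosen here for clarity.

```python
import asyncio
from typing import List, Optional

async def deep_research_skeleton(query: str, breadth: int, depth: int,
                                 learnings: Optional[List[str]] = None) -> List[str]:
    """Minimal sketch of the recursion; real research calls are stubbed out."""
    learnings = list(learnings or [])
    # Stub: pretend each of `breadth` queries yields one learning and one follow-up
    results = [(f"learning from '{query}' #{i}", f"follow-up to '{query}' #{i}")
               for i in range(breadth)]
    for learning, follow_up in results:
        learnings.append(learning)
        if depth > 1:
            learnings = await deep_research_skeleton(
                follow_up,
                breadth=max(2, breadth // 2),  # halve breadth, floor of 2
                depth=depth - 1,               # decrement depth
                learnings=learnings)           # forward accumulated learnings
    return learnings
```

Running `asyncio.run(deep_research_skeleton("AI safety", breadth=4, depth=2))` yields 12 learnings, matching the query count worked out in the configuration analysis below.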
```python
learnings: List[str]        # Insights extracted from research
citations: Dict[str, str]   # Maps learning → source URL
visited_urls: Set[str]      # Deduplication of visited sources
context: List[str]          # Raw research content
sources: List[dict]         # Metadata about sources
```

```python
class ResearchProgress:
    current_depth: int      # 1 to total_depth
    total_depth: int        # Maximum recursion depth
    current_breadth: int    # Completed queries at level
    total_breadth: int      # Total queries at level
    current_query: str      # Active query being processed
    completed_queries: int  # Global counter
```

- Input: Original research query
- Process: LLM generates `breadth` search queries with research goals
- Output: List of structured queries with goals
```python
serp_queries = [
    {'query': 'AI safety regulations 2024', 'researchGoal': 'Understand current regulatory framework'},
    {'query': 'AI alignment research progress', 'researchGoal': 'Review technical advances'},
    ...
]
```
```python
semaphore = asyncio.Semaphore(self.concurrency_limit)  # Default: 2

async def process_query(serp_query):
    async with semaphore:  # Limits parallel executions
        # Create new GPTResearcher instance
        # Conduct research
        # Extract learnings
        # Return structured results
        ...
```

- Create Researcher Instance: Fresh GPTResearcher for each query
- Conduct Research: Full research pipeline (search, scrape, summarize)
- Extract Learnings: LLM processes results to extract:
  - Key insights with citations
  - Follow-up questions
  - Context preservation
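The fan-out above can be made concrete with a self-contained sketch: one task per query, gated by the semaphore so at most `concurrency_limit` run at once. The research pipeline itself is replaced here by a stub delay, and `process_all` is an illustrative name, not the library's API.

```python
import asyncio

async def process_all(serp_queries, concurrency_limit=2):
    """Run one research task per query, at most `concurrency_limit` at a time."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def process_query(serp_query):
        async with semaphore:          # blocks while the limit is saturated
            await asyncio.sleep(0.01)  # stand-in for search/scrape/summarize
            return {'query': serp_query['query'],
                    'learnings': [], 'visited_urls': set()}

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(process_query(q) for q in serp_queries))

results = asyncio.run(process_all([{'query': 'a'}, {'query': 'b'}, {'query': 'c'}]))
```

Note that `asyncio.gather` returns results in input order even though the semaphore staggers execution, so downstream accumulation stays deterministic.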
```python
for result in results:
    all_learnings.extend(result['learnings'])        # Accumulate insights
    all_visited_urls.update(result['visited_urls'])  # Track all sources
    all_citations.update(result['citations'])        # Map learnings → URLs
    all_context.append(result['context'])            # Preserve raw data
    all_sources.extend(result['sources'])            # Metadata tracking
```

```python
if depth > 1:
    new_breadth = max(2, breadth // 2)  # Reduce breadth
    new_depth = depth - 1               # Decrement depth

    # Combine follow-up questions into new query
    next_query = f"""
    Previous research goal: {result['researchGoal']}
    Follow-up questions: {' '.join(result['followUpQuestions'])}
    """

    # Recursive call with accumulated state
    deeper_results = await self.deep_research(
        query=next_query,
        breadth=new_breadth,
        depth=new_depth,
        learnings=all_learnings,        # Pass accumulated learnings
        citations=all_citations,        # Pass citation mappings
        visited_urls=all_visited_urls,  # Avoid revisiting URLs
    )
```

- Forward Passing: Learnings from parent nodes passed to children
- Accumulation: Each level adds to global learning pool
- Deduplication: `list(set(all_learnings))` at return (`deep_research.py:334`)
- Source Attribution: Every learning linked to source URL
- Format: `learning_text → source_url` mapping
- Preservation: Citations carried through entire tree
- Word Limit: 25,000 words maximum (`deep_research.py:15`)
- Trimming Strategy: Keep most recent content when limit exceeded
- Preservation: Raw context maintained for final report
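The trimming strategy can be sketched at chunk granularity: walk backwards from the newest chunk, keep chunks while they fit, then restore order. This is a simplification under assumptions (`trim_context` is a hypothetical helper, words counted with `str.split`, and the real implementation may trim within a chunk rather than dropping it whole).

```python
from typing import List

def trim_context(context_chunks: List[str], max_words: int = 25_000) -> List[str]:
    """Keep the most recent chunks whose combined word count fits the limit."""
    kept, total = [], 0
    for chunk in reversed(context_chunks):  # newest chunks first
        words = len(chunk.split())
        if total + words > max_words:
            break                           # oldest remaining chunks are dropped
        kept.append(chunk)
        total += words
    return list(reversed(kept))             # restore chronological order
```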
- Purpose: Avoid re-scraping same sources
- Implementation: Set-based tracking across tree
- Inheritance: Child nodes inherit parent's visited URLs
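Because the visited set is shared by reference down the tree, dedup is one membership check plus an update. A sketch with a hypothetical helper name:

```python
from typing import List, Set

def filter_new_urls(candidate_urls: List[str], visited_urls: Set[str]) -> List[str]:
    """Drop URLs already scraped anywhere in the tree, then mark the rest visited."""
    new_urls = [u for u in candidate_urls if u not in visited_urls]
    visited_urls.update(new_urls)  # mutates the shared set: siblings and children see it
    return new_urls

visited = {'https://a.example'}
fresh = filter_new_urls(['https://a.example', 'https://b.example'], visited)
```

Passing the *same* set object into each recursive call (rather than a copy) is what makes inheritance work in both directions: children skip parents' sources, and later siblings skip anything earlier branches scraped.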
```
1. "AI diagnostic tools in radiology 2024"
   → Learnings: [FDA approvals, accuracy rates]
   → Follow-ups: ["regulatory challenges", "patient acceptance"]
2. "Machine learning drug discovery breakthroughs"
   → Learnings: [New compounds, trial results]
   → Follow-ups: ["cost reduction", "timeline acceleration"]
3. "AI-powered patient monitoring systems"
   → Learnings: [Hospital implementations, outcomes]
   → Follow-ups: ["privacy concerns", "integration challenges"]
4. "Healthcare AI ethics and bias"
   → Learnings: [Bias cases, mitigation strategies]
   → Follow-ups: ["regulatory frameworks", "accountability"]
```
For each Level 1 result:

```
1.1 "FDA regulatory challenges for AI diagnostics"
1.2 "Patient acceptance of AI medical tools"
2.1 "AI drug discovery cost analysis"
2.2 "Accelerated clinical trial timelines with ML"
... (continues for all branches)
```
- Time Complexity: O(breadth^depth) queries in the worst case; breadth halving keeps the actual count well below this bound
- Space Complexity: O(breadth × depth × content_size)
- Concurrency: Limited by semaphore (default: 2 parallel)
- Breadth: 4 queries per level
- Depth: 2 recursive levels
- Total Queries: 4 + 4×2 = 12 (with breadth halving)
- Concurrency: 2 simultaneous research operations
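The query count under breadth halving can be computed with a short recurrence (`total_queries` is a helper written for this article, not part of the codebase): each level issues `breadth` queries, and each of those spawns a subtree with `max(2, breadth // 2)` breadth and one less depth.

```python
def total_queries(breadth: int, depth: int) -> int:
    """Count queries issued across the tree, with breadth halving (floor of 2)."""
    if depth < 1:
        return 0
    # This level's queries, plus one subtree per query at reduced breadth/depth
    return breadth + breadth * total_queries(max(2, breadth // 2), depth - 1)

total_queries(4, 2)  # 4 + 4×2 = 12, matching the default configuration above
```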
- No visited node tracking (allows revisiting topics from different angles)
- No frontier/queue management
- No backtracking or pruning
- Primary goal: Accumulate diverse learnings
- Secondary: Build comprehensive context
- Result: Multi-perspective understanding
- Follow-up questions guide deeper research
- Breadth reduction prevents exponential explosion
- Context trimming maintains feasibility
- Each recursive call inherits parent's knowledge
- Learnings accumulate across entire tree
- Citations preserved through all levels
The Deep Research algorithm is a sophisticated knowledge accumulation system that:
- Explores broadly at each level (breadth-first)
- Dives deeply through recursion (controlled depth)
- Accumulates learnings progressively
- Maintains attribution through citation tracking
- Manages resources through concurrency control and context trimming
This creates a comprehensive research system that builds layered understanding through iterative, intelligent exploration—more akin to human research patterns than traditional graph algorithms.