LangSmith's Insights Agent is an LLM-orchestrated data analysis system that automatically discovers usage patterns and failure modes in production agent traces. Unlike traditional clustering tools, it accepts natural language questions from users and dynamically adapts its entire analysis pipeline—from feature extraction to taxonomy generation—based on those questions.
Key Innovation: It's a conversational interface to trace analysis, where users describe what they want to learn, and the system translates that into a custom analytical workflow.
Core Capabilities:
- Analyzes up to 1,000 sampled traces in up to 30 minutes (typically 15-20 minutes)
- Generates hierarchical taxonomies (categories → subcategories) with auto-generated or predefined top-level categories
- Supports both usage pattern discovery and failure mode analysis, with configurable attributes and filters
- Integrates with Multi-turn Evals for thread-level conversation scoring
- Enables drill-down to individual traces and export to datasets/annotation queues
- Cost: $1-2 (OpenAI) or $3-4 (Anthropic) per 1,000 threads analyzed
Target Users: Plus and Enterprise tier LangSmith customers monitoring production agents
Traditional challenges in agent monitoring:
- Manual trace review is time-consuming and impossible at scale
- Predefined classifiers require knowing what to look for upfront
- Single-trace evaluations miss conversation-level patterns
- No systematic way to discover unknown usage patterns or failure modes
The gap: Teams need to understand "what's happening in production" to prioritize improvements, but agents generate millions (soon billions) of traces.
LangSmith now treats "threads" (multi-turn agent interactions) as first-class concepts, with two complementary features:
Insights Agent (this document)
- Purpose: Discover and categorize patterns across many threads
- Use case: "What are users asking?" / "Where is my agent failing?"
- Output: Hierarchical taxonomy of usage patterns or failure modes
Multi-turn Evals
- Purpose: Score individual thread quality
- Use case: Measure sentiment, goal completion, agent trajectory
- Output: Automated scoring on complete conversations
- Limits: Runs <1 week old; max 500 threads at once; max 10 evaluators per workspace
Together, they address: "What patterns exist?" (Insights) + "How well did this conversation go?" (Evals)
Paradigm shift from traditional BI/observability:
| Traditional Approach | Insights Agent Approach |
|---|---|
| You define features → Algorithm clusters → You interpret | You ask questions → LLM defines features → LLM clusters → LLM explains |
| Requires analytical expertise upfront | Conversational, accessible to non-technical users |
| Fixed clustering dimensions | Dynamic, question-guided analysis |
| Generic categories | Semantically coherent, domain-aware categories |
As noted in the X post: "The categorization is surprisingly good. Clusters are semantically coherent." This stems from LLMs understanding both trace semantics and user intent.
Trace Structure Requirements:
- Traces must be organized into threads (multi-turn conversations)
- Top-level input/output for each trace must contain a list of messages (LangChain, OpenAI Chat Completions, or Anthropic Messages formats)
- LangSmith automatically combines message histories if passed incrementally
- Idle time must be set for the project (defines when a conversation is "complete"); set during first multi-turn eval configuration
- Permissions: Create rules (for new reports), view tracing projects (for existing reports)
- API Keys: OpenAI or Anthropic workspace secrets required
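A minimal instrumentation sketch that satisfies these requirements, assuming the langsmith Python SDK's @traceable decorator and a "thread_id" metadata key (the chat_turn function below is a hypothetical stand-in for your agent's turn handler):
from langsmith import traceable

# Assumption: the project groups traces into threads via a metadata key
# such as "thread_id" (session_id / conversation_id are also accepted).
@traceable(name="chat_turn", run_type="chain")
def chat_turn(messages: list) -> dict:
    reply = {"role": "ai", "content": "..."}   # call your model here
    # Top-level output contains a message list, as Insights expects.
    return {"messages": messages + [reply]}

# Each call logs one trace; reusing the same thread_id ties turns together,
# and LangSmith combines incrementally passed histories.
chat_turn(
    [{"role": "human", "content": "How do LangChain and LangGraph differ?"}],
    langsmith_extra={"metadata": {"thread_id": "thread_123"}},
)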
Example thread structure:
{
  "thread_id": "thread_123",
  "traces": [
    {
      "trace_id": "trace_1",
      "input": {"messages": [{"role": "human", "content": "..."}]},
      "output": {"messages": [{"role": "ai", "content": "..."}]},
      "tool_calls": [...],
      "metadata": {...}
    },
    // More traces in thread...
  ]
}
Step 1: Navigate to Project
- Go to your LangSmith tracing project (e.g., "chat-langchain")
- Click "+New" > "New Insights Report"
Step 2: Configuration via Natural Language (Auto Config)
- Toggle "Auto" on
- Answer three free-form questions:
  - Q1: "What does the agent in this tracing project do?"
    - Example: "Answers questions about LangChain products"
    - Purpose: Provides domain context for semantic understanding
  - Q2: "What would you like to learn about this agent?"
    - Example: "What questions are users asking about what products and features?"
    - Alternative: "Where's my chatbot returning bad results, where is it making mistakes?"
    - Purpose: Defines the analytical focus and clustering strategy
  - Q3: "How are the traces in this project structured?"
    - Example: "We use threads and the messages field has the full chat history"
    - Purpose: Guides data extraction and parsing
- Insights translates the answers into a draft config (job name, summary prompt, attributes, sampling)
Alternative Config Methods:
- Prebuilt Config: Load presets like "Usage Patterns" or "Error Analysis" from dropdown; run or customize
- From Scratch:
- Select traces: Sample size (max 1,000), time range, filters (preview matching count)
- Categories: Auto bottom-up or predefined top-level (with descriptions; subcategories auto-generated)
- Summary Prompt: Editable instructions + mustache templates (e.g., "{{run.inputs}}", "{{run.outputs}}") to include specific thread parts (focuses on last run by default)
- Attributes: Define extra categorical/numerical/boolean attributes (e.g., "user_satisfied: boolean")
- Filter Attributes: For booleans, set "filter_by: true" to include only traces where the attribute is true (evaluated during summarization)
- Model Provider: OpenAI or Anthropic (Anthropic ~3x costlier)
Step 3: Submit & Wait
- Click "Generate config" to preview or "Run job" to launch
- Processing time: Up to 30 minutes (background job)
- System samples up to 1,000 traces
- Can view similar pre-generated reports while waiting
Report Structure:
📊 Insights Report: "Question Topics"
│
├─ 📁 Product Orientation (35% of traces)
│ ├─ Clarifying differences between LangChain products
│ ├─ Understanding LangGraph platform features
│ └─ General product capabilities questions
│
├─ 📁 Agent Orchestration (28% of traces)
│ ├─ Tool selection and sequencing
│ ├─ Memory management questions
│ └─ Multi-step planning patterns
│
├─ 📁 Retrieval (22% of traces)
│ ├─ Vector store questions
│ ├─ General retrieval design
│ └─ Document loading strategies
│
└─ 📁 [Other categories...]
For each category, the UI shows:
- Category name and description
- Relative frequency (% of total sampled traces)
- Aggregated metrics:
- Error rates
- Latency distributions
- Cost
- Token usage statistics
- Feedback scores
- Extracted attributes (e.g., avg user satisfaction)
Interactive Actions:
- Expand category → View subcategories with finer granularity
- Click subcategory → See trace table with actual conversations
- Select traces → Bulk actions:
- Add to annotation queue (for human review)
- Add to dataset (for offline evaluation)
- Create dashboard monitor
- Configure new multi-turn eval
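These bulk actions live in the UI; for teams that prefer scripting the export step, a rough equivalent with the LangSmith Python client might look like the sketch below (the project name, tag filter, and dataset name are hypothetical, and the Insights-specific SDK surface itself is still a roadmap item):
from langsmith import Client

client = Client()

# Hypothetical: pull the runs you flagged while reviewing a subcategory,
# e.g. via a tag applied during triage.
runs = list(client.list_runs(
    project_name="chat-langchain",              # hypothetical project name
    filter='has(tags, "insights-retrieval")',   # hypothetical tag filter
    is_root=True,
))

# Export the selected traces into a dataset for offline evaluation.
dataset = client.create_dataset(dataset_name="retrieval-questions-review")
client.create_examples(
    inputs=[run.inputs for run in runs],
    outputs=[run.outputs for run in runs],
    dataset_id=dataset.id,
)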
Example from Chat LangChain case study:
Finding: "Product Orientation" category has high frequency with subcategory "Clarifying differences between LangChain products"
Insight: "We're getting a lot of basic questions just about how the different products relate to one another"
Action: "Maybe we need to add some more documentation on the differences between our different offerings"
This demonstrates the discovery → insight → action loop the tool enables.
┌─────────────────────────────────────────────────────────────────┐
│ Phase 0: NL Configuration → Analysis Spec Translation │
│ ─────────────────────────────────────────────────────────────── │
│ Input: Free-form user answers (domain, questions, structure) │
│ Output: Structured analysis specification (including summary prompt, attributes) │
│ Duration: <1 second │
└────────────────┬────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Phase 1: Thread Readiness & Intelligent Sampling │
│ ─────────────────────────────────────────────────────────────── │
│ • Apply user filters (time, length, keywords) │
│ • Validate thread structure (messages field present) │
│ • Stratified sampling (up to 1,000 traces) │
│ • Normalize message formats (All/Human-AI/First-Last via templates) │
│ Duration: 1-3 minutes │
└────────────────┬────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Phase 2: Question-Guided Feature Extraction │
│ ─────────────────────────────────────────────────────────────── │
│ For each sampled thread: │
│ • Semantic extraction (LLM-powered): │
│ - Intent/topic synopsis via summary prompt │
│ - Question-relevant features (products, errors, etc.) │
│ - Extract user-defined attributes (categorical/numerical/boolean) │
│ - Generate embeddings │
│ • Behavioral extraction (rule-based): │
│ - tool_chain_length, retry_rate, context_switches │
│ - Error signals, frustration heuristics │
│ • Filter by boolean attributes (exclude false/missing) │
│ Duration: 5-8 minutes (parallel LLM calls) │
│ Cost: Scales with threads; $1-4 per 1,000 │
└────────────────┬────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Phase 3: Question-Aware Semantic Clustering │
│ ─────────────────────────────────────────────────────────────── │
│ • Weight features based on user question: │
│ - Topics mode: Heavy semantic weight (0.7 semantic, 0.3 behavioral) │
│ - Failure mode: Heavy behavioral weight (0.3 semantic, 0.7 behavioral) │
│ • Build similarity graph on weighted features + attributes │
│ • Apply clustering algorithm (likely HDBSCAN or hierarchical) │
│ • Generate coarse clusters → subclusters (2-level hierarchy) │
│ - Top-level: Auto bottom-up or predefined │
│ - Subcategories: Always auto-generated │
│ Duration: 2-4 minutes │
└────────────────┬────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Phase 4: LLM-as-Judge Hierarchical Labeling │
│ ─────────────────────────────────────────────────────────────── │
│ For each cluster: │
│ • Analyze representative samples (5-10 threads per cluster) │
│ • Generate high-level category name + description │
│ • Create subcategory labels with rationales │
│ • Assign category type (usage pattern / failure mode) │
│ • Validate semantic coherence │
│ Duration: 3-5 minutes (sequential LLM calls) │
└────────────────┬────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Phase 5: Metrics Aggregation & Report Assembly │
│ ─────────────────────────────────────────────────────────────── │
│ For each (sub)category: │
│ • Calculate relative frequency (% of sampled traces) │
│ • Aggregate run statistics (latency, tokens, cost) │
│ • Join feedback scores and eval results │
│ • Compute trends vs previous periods │
│ • Generate visualizations (distribution bars) │
│ Duration: 1-2 minutes │
└────────────────┬────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Phase 6: Interactive Report Generation │
│ ─────────────────────────────────────────────────────────────── │
│ • Render hierarchical UI with collapsible categories │
│ • Link to underlying trace tables │
│ • Enable bulk actions (dataset, annotation queue) │
│ • Cache report for future viewing │
│ • Save configuration for reruns │
└─────────────────────────────────────────────────────────────────┘
Total Duration: Up to 30 minutes
Purpose: Convert natural language inputs into a structured analysis specification.
Input Example:
What does your agent do?
"Answers questions about LangChain products"
What would you like to learn?
"What questions are users asking about what products and features"
How are traces structured?
"We use threads and the messages field has the full chat history"
Output Specification (JSON):
{
"domain_context": "LangChain product support chatbot",
"analysis_mode": "usage_patterns",
"clustering_focus": "question_topics",
"extract_features": [
"product_mentions",
"feature_references",
"question_intent_type"
],
"attributes": [
{"name": "user_satisfied", "type": "boolean", "filter_by": false}
],
"summary_prompt": "Summarize this conversation: {{run.inputs}} {{run.outputs}}",
"message_view": "human_ai_pairs",
"feature_weights": {
"semantic": 0.75,
"behavioral": 0.25
},
"filters": {
"min_turns": 1,
"time_range": null
},
"sample_strategy": {
"target_size": 1000,
"stratify_by": ["thread_length", "recency"],
"oversample_long_threads": true
},
"taxonomy_depth": 2,
"metrics_to_include": [
"frequency",
"feedback_scores",
"latency",
"token_usage"
]
}
Prompt Template (Inferred):
You are a data analysis configuration agent. Convert the user's natural language
inputs into a structured analysis specification.
User Domain Context: {agent_description}
Analysis Question: {what_to_learn}
Trace Structure: {structure_info}
Generate a JSON specification that includes:
1. analysis_mode: "usage_patterns" or "failure_modes"
2. clustering_focus: the primary dimension to cluster by
3. extract_features: list of specific features to extract from traces
4. attributes: list of user-defined attributes with types and filter_by
5. summary_prompt: template for LLM summarization
6. feature_weights: semantic vs behavioral importance (sum to 1.0)
7. message_view: "all_messages", "human_ai_pairs", or "first_last"
8. sample_strategy: how to sample traces
9. metrics_to_include: which metrics to aggregate
Respond with valid JSON only.
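A minimal sketch of how this translation step could be invoked, assuming an OpenAI chat model with JSON-mode output; the model choice and the CONFIG_TRANSLATION_PROMPT constant (holding the system prompt above) are illustrative, not LangSmith's actual implementation:
import json
from openai import OpenAI

client = OpenAI()

def translate_to_config(agent_description: str, what_to_learn: str, structure_info: str) -> dict:
    """Turn the three free-form answers into a structured analysis spec."""
    user_block = (
        f"User Domain Context: {agent_description}\n"
        f"Analysis Question: {what_to_learn}\n"
        f"Trace Structure: {structure_info}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": CONFIG_TRANSLATION_PROMPT},  # the prompt template above
            {"role": "user", "content": user_block},
        ],
        response_format={"type": "json_object"},  # force valid JSON output
    )
    return json.loads(resp.choices[0].message.content)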
Alternative Configuration for Failure Analysis:
What would you like to learn?
"Where's my chatbot returning bad results, where is it making mistakes"
{
"analysis_mode": "failure_modes",
"clustering_focus": "error_signatures",
"extract_features": [
"error_types",
"user_frustration_signals",
"incomplete_responses",
"tool_call_failures"
],
"attributes": [
{"name": "has_error", "type": "boolean", "filter_by": true}
],
"feature_weights": {
"semantic": 0.3,
"behavioral": 0.7
},
"prefilter": {
"negative_feedback_only": true,
"min_retry_rate": 0.2
}
}
Why Sampling?
- Production agents generate millions of traces
- Full dataset processing would be prohibitively slow and expensive
- Representative sampling maintains statistical validity
- Cap at 1,000 traces
Sampling Algorithm (Inferred):
def intelligent_sample(threads, config, target_size=1000):
    """
    Stratified sampling with intelligent heuristics
    """
    # Apply hard filters first
    filtered = apply_filters(threads, config.filters)
    # Stratify by key dimensions
    strata = {
        'recency': bucket_by_time(filtered, bins=5),
        'length': bucket_by_turns(filtered, bins=['1', '2-5', '6-10', '11+']),
        'tool_usage': bucket_by_tool_calls(filtered, bins=['0', '1-3', '4+'])
    }
    # Oversample longer threads (richer signal)
    weights = {
        'length': {
            '1': 0.5,     # Downsample single-turn
            '2-5': 1.0,   # Normal weight
            '6-10': 1.5,  # Oversample
            '11+': 2.0    # Heavy oversample
        }
    }
    # Draw samples proportionally with weights
    samples = stratified_sample(
        strata,
        target_size,
        weights,
        seed=config.seed  # Reproducible
    )
    return samples
Practical Numbers (from sources):
- Target: Up to 1,000 traces
- Processing time: Up to 30 minutes
- Scales to projects with millions of traces
Dual Extraction Pipeline:
A. Semantic Features (LLM-Powered)
def extract_semantic_features(thread, config):
"""
Use LLM to extract question-guided semantic features
"""
prompt = f"""
Analyze this agent conversation thread:
{format_thread(thread, config.message_view)}
Agent Domain: {config.domain_context}
Analysis Focus: {config.clustering_focus}
Extract the following:
1. Primary user intent (what they're trying to accomplish)
2. Specific features mentioned: {config.extract_features}
3. Conversation outcome (satisfied/unsatisfied/unclear)
4. Key topics discussed
5. Attributes: {config.attributes} (e.g., user_satisfied: true/false)
Respond in JSON format.
"""
response = llm.complete(prompt)
features = parse_json(response)
# Apply filter attributes
if any(attr['filter_by'] and not features[attr['name']] for attr in config.attributes if attr['type'] == 'boolean'):
return None # Exclude trace
# Generate embedding for clustering
embedding = embed_model.encode(
features['intent'] + ' ' + ' '.join(features['topics'])
)
return {
'semantic': features,
'embedding': embedding
}
B. Behavioral Features (Rule-Based)
def extract_behavioral_features(thread):
"""
Extract quantitative behavioral signals
"""
messages = thread['messages']
tool_calls = thread['tool_calls']
features = {
# Conversation dynamics
'turn_count': len(thread['traces']),
'context_switches': count_topic_changes(messages),
'user_message_length': avg_length(messages, role='human'),
# Tool usage patterns
'tool_chain_length': len(tool_calls),
'unique_tools_used': len(set(tc['tool_name'] for tc in tool_calls)),
'tool_loops': detect_repeated_sequences(tool_calls),
# Quality signals
'retry_rate': count_retries(messages) / len(messages),
'error_count': count_errors(thread),
'frustration_signals': detect_frustration(messages),
# Performance
'total_latency': sum(trace['latency'] for trace in thread['traces']),
'token_usage': sum(trace['tokens'] for trace in thread['traces'])
}
return features
def detect_frustration(messages):
"""Heuristics for user frustration"""
frustration_patterns = [
r"(?i)(you don't understand|not working|doesn't work|useless)",
r"(?i)(frustrat|annoying|stupid|wrong|incorrect)",
r"(?i)(tried already|told you|said before|repeating)"
]
human_messages = [m for m in messages if m['role'] == 'human']
count = 0
for msg in human_messages:
for pattern in frustration_patterns:
if re.search(pattern, msg['content']):
count += 1
break
return count / len(human_messages) if human_messages else 0
Adaptive Feature Weighting:
def cluster_threads(threads_with_features, config):
"""
Cluster based on question-adapted feature weights
"""
# Extract weighted feature vectors
vectors = []
for thread in threads_with_features:
semantic_vec = thread['embedding'] # Dense embedding
behavioral_vec = normalize(
list(thread['behavioral'].values())
)
attribute_vec = encode_attributes(thread['semantic']['attributes']) # Encode user-defined attributes
# Weight based on analysis mode
if config.analysis_mode == 'usage_patterns':
# Emphasize semantic similarity
weighted = (
config.feature_weights['semantic'] * semantic_vec +
config.feature_weights['behavioral'] * behavioral_vec +
0.1 * attribute_vec # Inferred weight for attributes
)
elif config.analysis_mode == 'failure_modes':
# Emphasize behavioral signals
weighted = (
config.feature_weights['semantic'] * semantic_vec +
config.feature_weights['behavioral'] * behavioral_vec +
0.4 * attribute_vec # Higher for failure filters
)
vectors.append(weighted)
# Build similarity graph
similarity_matrix = cosine_similarity(vectors)
# Apply hierarchical clustering for taxonomy
linkage = hierarchical_cluster(similarity_matrix, method='ward')
# Cut tree at two levels for hierarchy
if config.predefined_categories:
    high_level_labels = assign_to_predefined(similarity_matrix, config.categories)
else:
    high_level_labels = cut_tree(linkage, n_clusters=6)  # typically 5-8 top-level clusters
subclusters = {}
for cluster_id in unique(high_level_labels):
    cluster_members = vectors[high_level_labels == cluster_id]
    sub_linkage = hierarchical_cluster(cluster_members)
    subclusters[cluster_id] = cut_tree(sub_linkage, n_clusters=4)  # typically 3-5 subclusters
return {
    'high_level': high_level_labels,
    'subclusters': subclusters,
    'similarity_matrix': similarity_matrix
}
Why Hierarchical Taxonomy Works:
From the X post: "Each subcategory represents a distinct agent system and context engineering pattern. RAG retrieval, for example, has different failure modes than web scraping, even though both are 'information sourcing.'"
The two-level hierarchy captures:
- High-level operational modes (e.g., "Information Sourcing") – auto or predefined
- Specific implementation patterns (e.g., "RAG Retrieval" vs "Web Scraping") – always auto
Category Generation:
def label_clusters(clusters, threads, config):
"""
Use LLM to generate human-readable labels and taxonomy
"""
taxonomy = {}
for cluster_id, thread_indices in clusters['high_level'].items():
# Sample representative threads
representatives = sample_representatives(
threads[thread_indices],
n=8  # 5-10 representative threads per cluster
)
prompt = f"""
You are analyzing a cluster of similar agent conversations.
Domain: {config.domain_context}
Analysis Goal: {config.clustering_focus}
Here are representative conversations from this cluster:
{format_threads(representatives, config.message_view)}
Generate:
1. A concise category name (2-4 words)
2. A description explaining what these conversations have in common
3. The category type (usage_pattern, error_mode, or user_intent)
Respond in JSON format.
"""
category = llm.complete(prompt)
# Now label subclusters
subclusters = clusters['subclusters'][cluster_id]
subcategories = []
for subcluster_id in subclusters:
sub_representatives = sample_representatives(
threads[subcluster_id],
n=4  # 3-5 representative threads per subcluster
)
sub_prompt = f"""
This is a subset of conversations within the "{category['name']}" category.
Conversations:
{format_threads(sub_representatives)}
Generate a more specific subcategory name and description that
distinguishes this group from other conversations in the parent category.
"""
subcategory = llm.complete(sub_prompt)
subcategories.append(subcategory)
taxonomy[cluster_id] = {
'category': category,
'subcategories': subcategories,
'member_count': len(thread_indices)
}
return taxonomy
Example Output:
{
"cluster_0": {
"category": {
"name": "Product Orientation",
"description": "Users seeking to understand LangChain product ecosystem and how different offerings relate to each other",
"type": "usage_pattern"
},
"subcategories": [
{
"name": "LangChain vs LangGraph Clarification",
"description": "Questions specifically about differences between LangChain and LangGraph libraries"
},
{
"name": "Platform Feature Discovery",
"description": "Inquiries about LangSmith platform capabilities and integrations"
},
{
"name": "General Product Capabilities",
"description": "Broad questions about what can be built with LangChain tools"
}
],
"member_count": 342
}
}
Per-Category Metrics:
def aggregate_metrics(taxonomy, threads, config):
"""
Calculate metrics for each category and subcategory
"""
total_threads = len(threads)
for cluster_id, cluster_data in taxonomy.items():
member_threads = threads[cluster_data['member_indices']]
# Frequency metrics
cluster_data['metrics'] = {
'count': len(member_threads),
'relative_frequency': len(member_threads) / total_threads,
# Performance metrics
'avg_latency': mean([t['total_latency'] for t in member_threads]),
'p95_latency': percentile([t['total_latency'] for t in member_threads], 95),
'avg_tokens': mean([t['token_usage'] for t in member_threads]),
'avg_cost': mean([t['cost'] for t in member_threads]),
# Quality metrics
'error_rate': sum(bool(t['error']) for t in member_threads) / len(member_threads),
'avg_feedback': mean([t['feedback_score'] for t in member_threads if t.get('feedback_score')]),
'positive_feedback_rate': sum(t['feedback_score'] > 0 for t in member_threads) / len(member_threads),
# Eval scores (if multi-turn evals configured)
'avg_sentiment': mean([t['sentiment_score'] for t in member_threads if 'sentiment_score' in t]),
'goal_completion_rate': mean([t['goal_completed'] for t in member_threads if 'goal_completed' in t]),
# Behavioral metrics
'avg_turns': mean([t['turn_count'] for t in member_threads]),
'avg_tool_calls': mean([t['tool_chain_length'] for t in member_threads]),
# Attribute aggregations
'avg_user_satisfied': mean([t['attributes']['user_satisfied'] for t in member_threads if 'user_satisfied' in t['attributes']])
}
# Repeat for subcategories
for subcat in cluster_data['subcategories']:
subcat_threads = threads[subcat['member_indices']]
subcat['metrics'] = calculate_same_metrics(subcat_threads)
return taxonomy
Report Structure:
class InsightsReport:
def __init__(self, taxonomy, config, timestamp):
self.taxonomy = taxonomy
self.config = config
self.timestamp = timestamp
self.report_id = generate_id()
def render_ui(self):
"""Generate interactive UI"""
return {
'header': {
'title': f"Insights Report: {self.config.clustering_focus}",
'subtitle': self.config.domain_context,
'generated_at': self.timestamp,
'sample_size': self.config.sample_strategy['target_size']
},
'categories': [
self.render_category(cat_id, cat_data)
for cat_id, cat_data in self.taxonomy.items()
],
'actions': {
'export_to_dataset': True,
'add_to_annotation_queue': True,
'configure_eval': True,
'create_dashboard': True
}
}
def render_category(self, cat_id, cat_data):
"""Render single category with drill-down"""
return {
'id': cat_id,
'name': cat_data['category']['name'],
'description': cat_data['category']['description'],
'frequency': f"{cat_data['metrics']['relative_frequency']*100:.1f}%",
'metrics_summary': {
'avg_latency': f"{cat_data['metrics']['avg_latency']:.2f}s",
'feedback': f"{cat_data['metrics']['avg_feedback']:.2f}/5",
'completion_rate': f"{cat_data['metrics']['goal_completion_rate']*100:.0f}%",
'error_rate': f"{cat_data['metrics']['error_rate']*100:.0f}%"
},
'subcategories': [
self.render_subcategory(sub)
for sub in cat_data['subcategories']
],
'trace_link': f"/traces?report_id={self.report_id}&category={cat_id}"
}
def save(self):
"""Cache report for future access"""
cache_key = f"insights_report:{self.report_id}"
cache.set(cache_key, self.to_dict(), ttl=30*24*3600) # 30 days
# Save configuration for reruns
config_key = f"insights_config:{self.config.project_id}:latest"
cache.set(config_key, self.config.to_dict())
interface InsightsConfig {
// Core configuration
domain_context: string;
analysis_mode: 'usage_patterns' | 'failure_modes' | 'custom';
clustering_focus: string;
// Feature extraction
extract_features: string[];
message_view: 'all_messages' | 'human_ai_pairs' | 'first_last';
attributes: {name: string, type: 'categorical' | 'numerical' | 'boolean', filter_by?: boolean}[];
summary_prompt: string; // Mustache template e.g. "Summarize: {{run.inputs}} {{run.outputs}}"
// Clustering parameters
feature_weights: {
semantic: number; // 0-1, must sum to 1.0 with behavioral
behavioral: number; // 0-1
};
taxonomy_depth: 2 | 3; // Number of hierarchy levels
categories?: {name: string, description: string}[]; // Predefined top-level
// Sampling
sample_strategy: {
target_size: number; // <=1000
stratify_by: string[]; // ['thread_length', 'recency', 'tool_usage']
oversample_long_threads: boolean;
seed?: number; // For reproducibility
};
// Filters
filters: {
time_range?: [Date, Date];
min_turns?: number;
max_turns?: number;
keywords?: string[];
metadata_filters?: Record<string, any>;
};
// Prefilters (for failure mode)
prefilter?: {
negative_feedback_only?: boolean;
min_retry_rate?: number;
error_signals_required?: boolean;
};
// Metrics
metrics_to_include: string[]; // ['frequency', 'feedback', 'latency', 'cost', 'error_rate', ...]
// Metadata
project_id: string;
created_by: string;
timestamp: Date;
model_provider: 'openai' | 'anthropic';
}
A. Configuration Translation Prompt
System: You are an expert at translating natural language into structured
data analysis configurations for agent trace analysis.
User Domain: {agent_description}
Analysis Question: {what_to_learn}
Trace Structure: {structure_info}
Task: Generate a complete InsightsConfig JSON object that will guide the
analysis pipeline to answer the user's question.
Guidelines:
1. If the question asks "what are users doing/asking", use analysis_mode: "usage_patterns"
2. If the question asks "where are failures/mistakes/problems", use analysis_mode: "failure_modes"
3. Set feature_weights to favor semantic (0.6-0.8) for usage, behavioral (0.6-0.8) for failures
4. Extract 3-5 specific features relevant to the question
5. Suggest 1-2 attributes (e.g., boolean for filtering)
6. Default summary_prompt to include {{run.inputs}} and {{run.outputs}}
7. Default to target_size: 1000 for sampling
8. Include standard metrics: frequency, latency, feedback, token_usage, cost, error_rate
Output valid JSON matching the InsightsConfig schema.
B. Summary Prompt (User-Editable Example)
Summarize this conversation thread focusing on key user intents and agent behaviors:
{{run.inputs}} // Inputs from last run
{{run.outputs}} // Outputs from last run
{{run.error}} // Error if any
Extract:
- Main topic
- Tools used
- Outcome
Ignore irrelevant metadata.
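For reference, the {{run.*}} variables are standard mustache names that resolve against fields of the run; a minimal rendering sketch with the chevron library (LangSmith's actual template engine may differ, and the run payload below is hypothetical):
import chevron  # mustache templating library, used here purely for illustration

summary_prompt = "Summarize this conversation: {{run.inputs}} {{run.outputs}}"

# Hypothetical run payload; dotted names resolve into the nested dict,
# and non-string values are stringified into the prompt.
run = {
    "inputs": {"messages": [{"role": "human", "content": "How is LangGraph different?"}]},
    "outputs": {"messages": [{"role": "ai", "content": "LangGraph adds stateful graphs..."}]},
}

print(chevron.render(summary_prompt, {"run": run}))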
C. Cluster Labeling Prompt
System: You are an expert at analyzing and categorizing agent conversation patterns.
Context:
- Agent Purpose: {domain_context}
- Analysis Focus: {clustering_focus}
- Category Type: {usage_pattern | failure_mode}
Representative Conversations:
{formatted_thread_samples}
Task: Analyze these conversations and generate:
1. category_name: A concise, descriptive name (2-4 words)
- Use domain-appropriate terminology
- Make it specific and actionable
- Examples: "Product Orientation", "RAG Retrieval Errors", "Tool Selection Issues"
2. description: 2-3 sentences explaining:
- What these conversations have in common
- What users are trying to accomplish (or what's going wrong)
- Why this category is distinct from others
3. category_type: One of:
- "usage_pattern": Normal user behavior or intent
- "error_mode": Something going wrong
- "user_intent": Specific goal or question type
Output format:
{
"category_name": "...",
"description": "...",
"category_type": "..."
}
D. Failure Signal Extraction Prompt (for Attributes)
System: You are analyzing an agent conversation to detect failure signals.
Conversation Thread:
{formatted_thread}
Task: Identify if this conversation contains signs of poor agent performance:
1. User Frustration Signals:
- Repeated questions
- Negative language
- Explicit complaints
2. Agent Errors:
- Tool call failures
- Inconsistent responses
- Hallucinations or inaccuracies
3. Conversation Failures:
- Unresolved user requests
- Premature conversation end
- Circular reasoning loops
For each signal found, provide:
- signal_type: The category of failure
- evidence: Specific text or behavior demonstrating it
- severity: low | medium | high
Output JSON array of detected signals.
From the X post and docs, users can define custom attributes for specialized clustering:
class CustomAttribute:
"""User-defined feature for clustering"""
def __init__(self, name, type_, prompt_instructions, filter_by=False):
self.name = name
self.type_ = type_ # 'categorical' | 'numerical' | 'boolean'
self.filter_by = filter_by
self.prompt_instructions = prompt_instructions # Instructions for LLM extraction
def extract(self, thread):
# Integrate into summary prompt
extended_prompt = f"{base_summary_prompt}\nExtract {self.name} ({self.type_}): {self.prompt_instructions}"
response = llm.complete(extended_prompt)
value = parse_response(response, self.name)
return value
# Example custom attributes
custom_attributes = [
CustomAttribute(
name='context_switches',
type_='numerical',
prompt_instructions='Count the number of topic changes in the conversation.'
),
CustomAttribute(
name='tool_chain_length',
type_='numerical',
prompt_instructions='Count the number of tool calls in sequence.'
),
CustomAttribute(
name='retry_rate',
type_='numerical',
prompt_instructions='Calculate how often the agent needs multiple attempts.'
),
CustomAttribute(
name='has_error',
type_='boolean',
filter_by=True,
prompt_instructions='True if the conversation contains any errors or failures.'
)
]
From Analysis #2 (Engineering Perspective):
- Sampler optimizations:
  - Cap by time (recent traces more relevant)
  - Oversample longer threads (richer signal)
  - Keep reproducible seeds for debugging
  - Stratify to avoid bias
  - Max 1,000 to control cost/time
- Dual pathway weighting:
  if analysis_mode == 'topics':
      weights = {'semantic': 0.75, 'behavioral': 0.25}
  elif analysis_mode == 'failures':
      weights = {'semantic': 0.25, 'behavioral': 0.75}
- Reporting enhancements:
  - Show % share of each category
  - Trend vs last period (if historical data)
  - Direct links to "Create Eval" and "Queue for Annotation"
  - Export configuration for reproducibility
  - Distribution bars for frequencies
- Caching strategy:
  - Cache report for 30 days
  - Store last config per project
  - Incremental recompute for new traces (future feature)
Statistical Validity:
- Central Limit Theorem: 500-1,000 samples sufficient for pattern discovery
- Stratified sampling ensures representation across key dimensions
- Oversampling long threads counters bias toward simple queries
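As a rough sanity check on those sample sizes (illustrative arithmetic, not from the source), a category observed in 35% of a 1,000-thread sample carries a 95% confidence interval of roughly ±3 percentage points:
import math

n, p = 1000, 0.35                           # sample size, observed category share
margin = 1.96 * math.sqrt(p * (1 - p) / n)  # normal-approximation 95% CI half-width
print(f"{p:.0%} +/- {margin:.1%}")          # -> 35% +/- 3.0%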
Practical Benefits:
- Handles millions of traces without infrastructure strain
- Consistent runtime regardless of dataset size
- Cost-effective (fewer LLM calls; $1-4 per report)
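A back-of-envelope model that lands in the quoted $1-4 range; every number below is an assumption for illustration, not published pricing:
threads = 1000
tokens_per_summary = 2500                  # assumed avg prompt + completion tokens per thread summary
label_calls, tokens_per_label = 40, 3000   # assumed clustering/labeling calls and their sizes
price_per_mtok = 0.60                      # assumed blended $ per 1M tokens for a small model

total_tokens = threads * tokens_per_summary + label_calls * tokens_per_label
print(f"~${total_tokens * price_per_mtok / 1e6:.2f} per 1,000 threads")  # ~$1.57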
From Analysis #3: The key innovation is treating this as a conversational interface to data analysis.
Traditional BI:
- User must know SQL/query language
- Requires upfront analytical expertise
- Fixed dashboards and reports
Insights Agent:
- Natural language questions
- System translates to analytical pipeline (auto config)
- Dynamic, question-specific insights
This makes sophisticated trace analysis accessible to non-technical users (PMs, designers, support teams).
From X post and docs: "Each subcategory represents a distinct agent system and context engineering pattern."
The two-level hierarchy reveals:
- Operational modes your agent has (often unknown to developers) – auto or predefined
- Specific implementation patterns within each mode – auto-generated
- Distinct failure modes per pattern
Example from sources:
- Category: Information Sourcing (operational mode)
  - Subcategory 1: RAG Retrieval (pattern)
    - Failure mode: Irrelevant chunks retrieved
  - Subcategory 2: Web Scraping (pattern)
    - Failure mode: Timeout errors
This lets you "fix categories of failures, not just individual bugs."
Complementary Relationship:
| Feature | Insights Agent | Multi-turn Evals |
|---|---|---|
| Scope | Patterns across many threads | Individual thread quality |
| Question | "What's happening?" | "How well did this go?" |
| Output | Categories & taxonomy | Scores per thread (e.g., intent, outcome, trajectory) |
| Use Case | Discovery & prioritization | Quality monitoring |
| Timing | On-demand (up to 30 min) | Automatic per thread (after idle time) |
Workflow:
- Run Insights Agent → Discover "Retrieval Questions" category has low satisfaction
- Configure Multi-turn Eval → Score all "Retrieval Questions" threads for relevance (using high-context model, filtered messages)
- Filter for low-scoring threads → Add to annotation queue
- Manually review → Identify specific improvement (e.g., better chunk selection)
- Deploy fix → Monitor with Multi-turn Eval scores
Traditional clustering labels:
- "Cluster 0", "Cluster 1", "Cluster 2"
- Requires human interpretation
- No semantic meaning
LLM-generated labels:
- "Product Orientation", "Agent Orchestration", "Retrieval"
- Immediately actionable
- Semantically coherent (understands domain context)
From X post: "The categorization is surprisingly good. Clusters are semantically coherent."
This is because the LLM understands:
- The domain (from user's agent description)
- The traces (full conversation context via summary prompt)
- The analytical goal (from user's question)
- Additional attributes (extracted during summarization)
It generates labels that make sense to humans working in that domain.
Embedding & Clustering:
- Which embedding model? (Likely OpenAI text-embedding-3 or similar)
- Exact clustering algorithm? (Probably HDBSCAN or Leiden on KNN graph)
- How are embeddings combined with behavioral features and attributes?
Heuristics:
- Specific patterns for detecting user frustration?
- How are "subpar responses" identified?
- Tool loop detection algorithm?
Infrastructure:
- Caching strategy for incremental updates?
- How much of the pipeline is cached/recomputed?
- Rate limiting for LLM calls?
- Exact cost calculation formula?
Mentioned but not detailed:
- Thread-level metrics and dashboards (coming soon per blog)
- Automations to add threads to annotation queues (mentioned, not shown)
- SDK support for programmatic access (roadmap item)
- Trend analysis across multiple reports (% change over time)
Potential future features:
- Multi-turn conversation to refine reports ("dig deeper into Retrieval category")
- Automatic eval generation (AI suggests multi-turn evals based on insights)
- Anomaly detection (flag new, unusual patterns automatically)
- Cross-project comparison (how does Agent A compare to Agent B?)
For teams reverse-engineering or building similar systems:
- Optimal sampling ratios: How does sample size affect pattern discovery accuracy?
- Feature weight tuning: Can weights be learned rather than heuristic?
- Dynamic taxonomy depth: When should it be 2 vs 3 levels?
- Cold start problem: How does it perform on new projects with few traces?
- Prompt stability: How sensitive are results to small changes in user questions?
Ideal Scenarios:
- ✅ Agent has been in production for days/weeks (sufficient data)
- ✅ You want to discover unknown patterns (not test specific hypotheses)
- ✅ You need to prioritize what to improve next
- ✅ Manual trace review is overwhelming (100+ traces/day)
- ✅ You want to understand user behavior or common failures
Not Ideal For:
- ❌ Pre-production testing (use offline evals instead)
- ❌ Real-time monitoring (too slow, use dashboards + multi-turn evals)
- ❌ Debugging specific known issues (use trace search)
- ❌ Very low-traffic agents (<50 traces) (insufficient data)
Configuration:
- Start with broad questions, narrow down in subsequent runs
- First run: "What are users asking?" → Discover top categories
- Second run: "Where is [specific category] failing?" → Debug specific patterns (use filter attributes)
- Use filters to focus on recent data (last 7-14 days typically most relevant)
- Define attributes for deeper splits (e.g., boolean for satisfaction)
- Test summary prompt with mustache variables to reduce noise/cost
Iteration:
- Run Insights Agent → Identify top 3 categories by frequency or failure rate
- Drill into each → Review representative traces
- Add problematic traces to annotation queue
- Configure Multi-turn Evals for ongoing monitoring (set idle time, high-context model)
- Implement fixes → Re-run Insights Agent to validate improvement
Team Workflows:
- PMs: Weekly insights runs to track usage trends
- Engineers: On-demand runs after deployments to check for regressions
- Support: Filter for negative sentiment categories, add to training data
Discovery Questions:
- "What questions are users asking?"
- "What are the most common topics?"
- "How are users trying to use this agent?"
Quality Questions:
- "Where is my agent making mistakes?"
- "What causes user frustration?"
- "Where do conversations fail to resolve?"
Performance Questions:
- "Which interactions take longest?"
- "Where does the agent use the most tools?"
- "What causes retry loops?"
Product Questions:
- "What features are users asking about most?"
- "What documentation gaps exist?"
- "What unexpected use cases appear?"
From the ChatLangChain example:
Before Insights Agent:
- Manual review: 4 hours/week reviewing 500 traces
- Limited to obvious patterns
- Reactive to user complaints
After Insights Agent:
- 20-30 minutes to generate report
- Discovered: 35% of questions are product clarification
- Action: Added comparison docs → 22% reduction in basic questions
- Time saved: 3.5 hours/week + improved user experience
Value Equation:
ROI = (Time Saved + Quality Improvements) - (Report time + $1-4 cost + implementation time)
For most production agents with >100 traces/day, ROI is positive within first week.
LangSmith's Insights Agent represents a paradigm shift in production agent monitoring:
Old Paradigm: Passive observability → Manual analysis → Reactive fixes
New Paradigm: Question-driven discovery → AI-powered categorization → Proactive improvements
The key insight: By making the entire analytical pipeline LLM-orchestrated and question-aware, it becomes accessible to non-experts while remaining sophisticated enough for deep analysis.
For Individual Teams:
- Faster iteration cycles (discover → fix → validate in days, not weeks)
- Strategic prioritization (fix categories affecting most users)
- Democratized insights (non-technical team members can analyze)
For the Industry:
- Shows how LLMs can power analytical tools, not just generate content
- Demonstrates value of conversational interfaces to complex systems
- Proves sampling + smart clustering can handle massive scale
This architecture could extend to:
- Customer support ticket analysis
- Bug report categorization
- User feedback synthesis
- Code review pattern detection
- Security incident clustering
The pattern: Conversational interface → LLM-guided pipeline → Human-readable insights is broadly applicable.
This document synthesizes:
- LangChain Blog Post (Oct 23, 2025)
  - Strategic positioning
  - Feature announcement
  - Integration with Multi-turn Evals
  - Grouping by poor interactions (new detail from changelog tie-in)
- "Get Started with Multi-turn Evals" Video Transcript
  - Thread structure requirements
  - Evaluation configuration
  - Measurement categories (intent, outcomes, trajectory)
- "Get Started with Insights Agent" Video Transcript
  - UI workflow
  - Configuration questions
  - ChatLangChain example
  - Hierarchical drill-down
- X Post (User Experience)
  - Real-world validation (500 production traces)
  - Custom attributes examples
  - Quality assessment
  - Strategic value
- Official Docs (Insights Agent)
  - Detailed config options (summary prompt, attributes, filters, model providers, cost)
  - Sampling cap (1,000), runtime (up to 30 min)
  - Predefined categories, filter_by booleans
- Official Docs (Online Evaluations/Multi-turn Evals)
  - Idle time, model selection, message filtering
  - Limits, troubleshooting
  - No direct Insights integration mentioned
- Changelog Announcement
  - Grouping by usage/poor interactions
  - Video tutorial link
- Three Analytical Perspectives:
  - Conceptual (paradigm shift, strategic value)
  - Engineering (implementation heuristics, code patterns)
  - Documentation (systematic integration, user workflow)