A guide to implementing per-step tool filtering using ensemble scoring. This approach solves the problem of LLMs having too many tools to choose from effectively.
When you give an LLM agent 40+ tools, several issues emerge:
- Context bloat: Tool definitions consume tokens, leaving less room for actual conversation
- Selection confusion: LLMs struggle to pick the right tool when faced with many similar options
- Hallucination risk: More tools = more chances to pick the wrong one or hallucinate parameters
The solution: dynamically select 8-12 relevant tools per step based on conversation context.
┌─────────────────────────────────────────────────────────────┐
│                        AI SDK Agent                         │
│                                                             │
│  ┌─────────────┐    prepareStep()     ┌─────────────────┐   │
│  │  Messages   │ ───────────────────► │  ToolSelector   │   │
│  │  + Steps    │                      │                 │   │
│  └─────────────┘                      │  ┌───────────┐  │   │
│                                       │  │ToolScorer │  │   │
│                                       │  │           │  │   │
│  ┌─────────────┐    activeTools[]     │  │ BM25F     │  │   │
│  │  LLM Call   │ ◄─────────────────── │  │ Focus     │  │   │
│  │ (filtered)  │                      │  │ Transition│  │   │
│  └─────────────┘                      │  │ Anchors   │  │   │
│                                       │  └───────────┘  │   │
│                                       └─────────────────┘   │
└─────────────────────────────────────────────────────────────┘
The key is the prepareStep callback in the AI SDK's ToolLoopAgent:
import { ToolLoopAgent } from 'ai';
import { ToolSelector } from './search';
const selector = new ToolSelector();
const toolKeys = Object.keys(tools);
const agent = new ToolLoopAgent({
  model,
  instructions,
  tools,
  prepareStep: ({ messages, steps, stepNumber }) => {
    // Select tools based on current context
    const activeTools = selector.selectActiveTools(messages, steps);
    return { activeTools };
  }
});
Pure BM25 text matching isn't enough. "Add a market role" might score roleUpdate higher than roleAdd because "role" appears more often in roleUpdate's description.
The solution is ensemble scoring - combining multiple signals:
score = (0.40 × BM25F)
      + (0.15 × focus)
      + (0.15 × transition)
      + (0.10 × sticky)
      + microAnchor            // Additive, not weighted
      - (0.20 × avoidPenalty)
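As a rough illustration of how the formula might be wired up, here is a minimal sketch. The function names, the shape of the signal object, and the max-based normalization of BM25F are assumptions made for this example, and the raw BM25F values 3.5 and 5.0 are invented purely to show the effect.

```javascript
// Weights copied from the formula above.
const WEIGHTS = { bm25f: 0.40, focus: 0.15, transition: 0.15, sticky: 0.10, avoid: 0.20 };

// Assumption: raw BM25F is unbounded, so scale it by the best score in the
// candidate set to bring it onto the same 0..1 range as the other signals.
function normalizeBm25f(rawScores) {
  const max = Math.max(0, ...Object.values(rawScores));
  const normalized = {};
  for (const [tool, raw] of Object.entries(rawScores)) {
    normalized[tool] = max > 0 ? raw / max : 0;
  }
  return normalized;
}

// Missing signals default to 0, so a tool without an anchor or penalty still scores.
function combineScores({ bm25f = 0, focus = 0, transition = 0, sticky = 0, microAnchor = 0, avoidPenalty = 0 }) {
  const weighted =
    WEIGHTS.bm25f * bm25f +
    WEIGHTS.focus * focus +
    WEIGHTS.transition * transition +
    WEIGHTS.sticky * sticky;
  // The anchor is added as-is, bypassing the weights; the penalty is subtracted.
  return weighted + microAnchor - WEIGHTS.avoid * avoidPenalty;
}

// "add a market role": roleUpdate wins on text match alone, roleAdd wins overall.
const bm25f = normalizeBm25f({ roleAdd: 3.5, roleUpdate: 5.0 }); // illustrative raw scores
const roleAddScore = combineScores({ bm25f: bm25f.roleAdd, focus: 1.0, transition: 0.6, microAnchor: 2.0 });
const roleUpdateScore = combineScores({ bm25f: bm25f.roleUpdate, focus: 1.0, transition: 0.4 });
console.log(roleAddScore > roleUpdateScore); // true
```

With these inputs roleAdd lands at roughly 2.52 against roughly 0.61 for roleUpdate, which shows why the anchor is left unweighted: a single confident pattern match outweighs every weighted signal combined.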
BM25F is BM25 with field weights: some parts of the tool metadata matter more than others.
const FIELD_WEIGHTS = {
  name: 3.0,        // Tool name is highly relevant
  title: 2.5,       // Human-readable title
  keywords: 3.0,    // Explicit keywords we define
  examples: 2.0,    // Example phrases
  description: 1.0, // General description
  category: 0.5,    // Less important
  avoidWhen: 0.3,   // Low weight (mainly for penalty)
};
Key implementation details:
- Split camelCase tool names: roleAdd → role add
- Compute per-field IDF (inverse document frequency)
- Tokenize and score each field separately, then combine with weights
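The three steps above can be sketched as follows. This is not the original bm25f.js: the function names, the trimmed three-field weight table, and the k1 and b values (common BM25 defaults) are assumptions. It also follows the description here literally, scoring each field on its own and summing with weights, which is simpler than the classic BM25F formulation that merges weighted term frequencies before saturation.

```javascript
// Trimmed copy of the weights so the sketch runs on its own.
const SKETCH_FIELD_WEIGHTS = { name: 3.0, keywords: 3.0, description: 1.0 };
const K1 = 1.2; // term-frequency saturation (assumed default)
const B = 0.75; // length-normalization strength (assumed default)

// Split camelCase (roleAdd -> role add), lowercase, break on non-alphanumerics.
function tokenize(text) {
  return String(text)
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
}

// Precompute tokens, per-field document frequencies, and per-field total lengths.
function buildIndex(docs) {
  const fields = Object.keys(SKETCH_FIELD_WEIGHTS);
  const tokens = {};   // tool -> field -> token list
  const docFreq = {};  // field -> term -> number of tools whose field contains it
  const totalLen = {}; // field -> summed token count, used for the average length
  for (const field of fields) {
    docFreq[field] = {};
    totalLen[field] = 0;
  }
  for (const [tool, doc] of Object.entries(docs)) {
    tokens[tool] = {};
    for (const field of fields) {
      const list = tokenize(doc[field] ?? '');
      tokens[tool][field] = list;
      totalLen[field] += list.length;
      for (const term of new Set(list)) {
        docFreq[field][term] = (docFreq[field][term] ?? 0) + 1;
      }
    }
  }
  return { fields, tokens, docFreq, totalLen, docCount: Object.keys(docs).length };
}

// Score one tool: BM25 per field with that field's own IDF, then a weighted sum.
function scoreTool(index, tool, query) {
  const queryTerms = new Set(tokenize(query));
  let score = 0;
  for (const field of index.fields) {
    const list = index.tokens[tool][field];
    const avgLen = index.totalLen[field] / index.docCount || 1;
    for (const term of queryTerms) {
      const tf = list.filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = index.docFreq[field][term];
      const idf = Math.log(1 + (index.docCount - df + 0.5) / (df + 0.5));
      const norm = tf + K1 * (1 - B + B * (list.length / avgLen));
      score += SKETCH_FIELD_WEIGHTS[field] * idf * ((tf * (K1 + 1)) / norm);
    }
  }
  return score;
}

const index = buildIndex({
  roleAdd: { name: 'roleAdd', keywords: ['create', 'new', 'add'], description: 'Create new roles' },
  roleUpdate: { name: 'roleUpdate', keywords: ['edit', 'change'], description: 'Update an existing role' },
});
console.log(scoreTool(index, 'roleAdd', 'add a manager role'));
```

Because there is no stemming, "roles" and "role" are different terms in this sketch, which is one more reason explicit keywords carry a high weight.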
Track what entity type the agent is currently working with:
const TOOL_ENTITY_TYPE = {
  groupAdd: 'group',
  groupUpdate: 'group',
  roleAdd: 'role',
  roleUpdate: 'role',
  activityAdd: 'activity',
  queryData: null, // Doesn't change focus
  navigateTo: null,
};
After each tool call, update the focus. Then boost tools matching current focus:
getFocusBoost(toolName) {
  const toolEntityType = TOOL_ENTITY_TYPE[toolName];
  if (toolEntityType === this.currentFocus) {
    return 1.0; // Exact match
  }
  // Related types get partial boost
  // (group → role is common flow)
  const RELATED_BOOSTS = {
    group: { role: 0.6, activity: 0.3 },
    role: { group: 0.5, activity: 0.7 },
    activity: { role: 0.6, group: 0.3 },
  };
  return RELATED_BOOSTS[this.currentFocus]?.[toolEntityType] || 0.2;
}
Hand-seed likelihood scores that approximate P(nextTool | lastTool). The values are independent weights, not a normalized distribution, so a row does not need to sum to 1:
const TRANSITIONS = {
  groupAdd: {
    roleAdd: 0.8, // Very likely after creating a group
    groupUpdate: 0.5,
    groupAssignLead: 0.6,
    activityAdd: 0.3,
  },
  roleAdd: {
    activityAdd: 0.8, // Very likely after creating a role
    roleAdd: 0.6, // Might add more roles
    roleUpdate: 0.4,
  },
  // ... etc
};
This captures natural workflows without being rigid.
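One way the focus update and the transition lookup might fit together is a small state object that is told about every tool call. The class name, its method names, and the 0.1 fallback for unseeded pairs are assumptions for this sketch; the two tables are trimmed copies of the ones above.

```javascript
// Trimmed copies of the tables above so the sketch runs on its own.
const ENTITY_TYPES = { groupAdd: 'group', roleAdd: 'role', activityAdd: 'activity', queryData: null };
const TRANSITION_TABLE = {
  groupAdd: { roleAdd: 0.8, groupUpdate: 0.5 },
  roleAdd: { activityAdd: 0.8, roleAdd: 0.6 },
};
const DEFAULT_TRANSITION = 0.1; // assumed fallback for pairs that were never seeded

class SelectionState {
  currentFocus = null;
  lastTool = null;

  // Call once per executed tool call, in order.
  recordToolCall(toolName) {
    this.lastTool = toolName;
    const entityType = ENTITY_TYPES[toolName];
    // Tools mapped to null (queryData, navigateTo) leave the focus unchanged.
    if (entityType) this.currentFocus = entityType;
  }

  // Score for `toolName` being the next call, given the last recorded tool.
  getTransitionScore(toolName) {
    if (this.lastTool === null) return 0; // nothing has run yet
    return TRANSITION_TABLE[this.lastTool]?.[toolName] ?? DEFAULT_TRANSITION;
  }
}

const state = new SelectionState();
state.recordToolCall('groupAdd');
console.log(state.currentFocus, state.getTransitionScore('roleAdd')); // group 0.8
```

Using `??` rather than `||` for the fallback keeps an explicitly seeded score of 0 meaningful, should you want to mark a transition as implausible.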
Boost tools that were recently used, with decay:
#getStickyBoost(toolName) {
  const index = this.recentTools.indexOf(toolName);
  if (index === -1) return 0;
  // Decay by recency: most recent = 1.0, then 0.7, 0.4, 0.2, 0.1
  const decays = [1.0, 0.7, 0.4, 0.2, 0.1];
  return decays[index] || 0;
}
Micro-anchors are the most important signal for preventing catastrophic misses. When the query clearly indicates intent, apply a hard additive boost:
const ANCHOR_PATTERNS = [
  {
    pattern: /\b(add|create|new|make)\b.*\b(role|position|job)\b/i,
    tools: ['roleAdd'],
    boost: 2.0,
  },
  {
    pattern: /\b(role|position|job)\b.*\b(add|create|new)\b/i,
    tools: ['roleAdd'],
    boost: 2.0, // Handles reversed word order
  },
  {
    pattern: /\b(move|transfer|relocate)\b.*\b(role|position)\b/i,
    tools: ['roleMove'],
    boost: 1.8,
  },
  // ... more patterns
];
Micro-anchors are additive, not weighted, because they represent high-confidence intent detection that should override other signals.
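A minimal sketch of turning such a pattern list into per-tool boosts is shown below. Returning a Map matches how getBoosts is consumed with Object.fromEntries in the debug helper later on, but the function body, and in particular the choice to keep the largest boost rather than sum overlapping matches, is an assumption.

```javascript
// Two of the patterns from above, repeated so the sketch runs on its own.
const SKETCH_PATTERNS = [
  { pattern: /\b(add|create|new|make)\b.*\b(role|position|job)\b/i, tools: ['roleAdd'], boost: 2.0 },
  { pattern: /\b(move|transfer|relocate)\b.*\b(role|position)\b/i, tools: ['roleMove'], boost: 1.8 },
];

// Returns Map<toolName, boost> for every tool whose pattern matches the query.
function getBoosts(query, patterns = SKETCH_PATTERNS) {
  const boosts = new Map();
  for (const { pattern, tools, boost } of patterns) {
    if (!pattern.test(query)) continue;
    for (const tool of tools) {
      // Keep the strongest boost so forward and reversed patterns do not stack.
      boosts.set(tool, Math.max(boosts.get(tool) ?? 0, boost));
    }
  }
  return boosts;
}

console.log(Object.fromEntries(getBoosts('Add a market role'))); // { roleAdd: 2 }
```

The patterns deliberately avoid the `g` flag: a global regex keeps a `lastIndex` between `test()` calls, which would make repeated checks against the same pattern object unreliable.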
Tools can define when they shouldn't be used:
// In tool config:
{
  name: 'roleAdd',
  avoidWhen: 'not for updating existing roles, use roleUpdate',
}
If the query contains terms from avoidWhen (like "update"), apply a penalty.
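A naive term match needs some care here: the avoidWhen text for roleAdd also contains "role" (via roleUpdate), which would penalize every role query. The sketch below works around that by ignoring stopwords and any term the tool already claims through its own name or keywords. The helper names, the stopword list, and the all-or-nothing 0/1 penalty are assumptions, not the original implementation.

```javascript
// Small assumed stopword list; extend it to suit your avoidWhen phrasing.
const AVOID_STOPWORDS = new Set(['not', 'for', 'use', 'the', 'a', 'an', 'to', 'when', 'instead']);

// Same tokenization idea as for BM25F: split camelCase, lowercase, strip punctuation.
function toTerms(text) {
  return String(text)
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
}

// Returns 1 when the query hits a distinctive avoidWhen term, otherwise 0.
function getAvoidPenalty(query, toolName, config) {
  const ownTerms = new Set([...toTerms(toolName), ...toTerms(config.keywords ?? [])]);
  const avoidTerms = toTerms(config.avoidWhen ?? '').filter(
    (term) => !AVOID_STOPWORDS.has(term) && !ownTerms.has(term),
  );
  const queryTerms = new Set(toTerms(query));
  return avoidTerms.some((term) => queryTerms.has(term)) ? 1 : 0;
}

const roleAddConfig = {
  keywords: ['create', 'new', 'role', 'position', 'job', 'hire', 'add'],
  avoidWhen: 'not for updating existing roles, use roleUpdate',
};
console.log(getAvoidPenalty('update the manager role', 'roleAdd', roleAddConfig)); // 1
console.log(getAvoidPenalty('add a market role', 'roleAdd', roleAddConfig)); // 0
```

This remains crude: a query such as "add roles to an existing group" would still be penalized because of "existing" and "roles", which is one reason the penalty carries a modest 0.20 weight.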
Each tool needs rich metadata for the scoring to work:
export const toolsConfig = {
  roleAdd: {
    title: 'Add Roles',
    description: 'Create new roles with associated activities',
    category: 'Roles',
    examples: [
      'create a new position',
      'add a manager role',
      'create engineering team structure',
    ],
    keywords: ['create', 'new', 'role', 'position', 'job', 'hire', 'add'],
    avoidWhen: 'not for updating existing roles, use roleUpdate',
    tool: roleAddTool, // The actual AI SDK tool
  },
  // ... more tools
};

| Field | Purpose | Weight |
|---|---|---|
| name | Tool key (camelCase split) | 3.0 |
| title | Human-readable name | 2.5 |
| keywords | Explicit trigger words | 3.0 |
| examples | Example user phrases | 2.0 |
| description | What the tool does | 1.0 |
| category | Grouping (Roles, Groups, etc.) | 0.5 |
| avoidWhen | When NOT to use this tool | 0.3 |

Reserve one slot for exploration - randomly sample a tool from ranks 9-20:
selectTools(stepQuery, limit = 9, withExploration = true) {
  // ... score and sort tools
  const topCount = withExploration ? limit - 1 : limit;
  const topTools = sorted.slice(0, topCount);
  if (withExploration) {
    const explorationPool = sorted.slice(topCount, 20);
    // The pool is empty when there are fewer tools than slots, so guard the push
    if (explorationPool.length > 0) {
      topTools.push(this.#sampleFromPool(explorationPool));
    }
  }
  return [...CORE_TOOLS, ...topTools];
}
This prevents the system from getting stuck in local optima.
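The sampling helper itself is not shown. A minimal stand-in, assuming uniform sampling and written as a free function rather than a private method, might look like this. The injectable `random` parameter is an addition for this sketch so that the exploration slot can be made deterministic in tests.

```javascript
// Uniformly pick one entry from the pool; returns undefined for an empty pool.
function sampleFromPool(pool, random = Math.random) {
  if (pool.length === 0) return undefined;
  // Math.random() is always below 1, but clamp anyway in case an injected RNG returns 1.
  const index = Math.min(pool.length - 1, Math.floor(random() * pool.length));
  return pool[index];
}

console.log(sampleFromPool(['roleMove', 'groupUpdate', 'activityAdd'], () => 0)); // roleMove
```

Since an empty pool yields undefined, callers should check the result before adding it to the active tool list. A score-weighted draw is a natural variation if uniform sampling surfaces too many irrelevant tools.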
Some tools should always be available:
const CORE_TOOLS = ['queryData']; // Always need to query data
These are prepended to the selection regardless of scoring.
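Because a core tool can also rank highly on its own, plain concatenation may list it twice. A small merge helper, hypothetical and not part of the original code, can deduplicate while keeping core tools first:

```javascript
// Core tools first, then scored tools, with duplicates and empty entries removed.
function mergeWithCoreTools(coreTools, selected) {
  // A Set preserves insertion order, so core tools keep their leading position.
  return [...new Set([...coreTools, ...selected])].filter(Boolean);
}

console.log(mergeWithCoreTools(['queryData'], ['roleAdd', 'queryData', 'roleUpdate']));
// [ 'queryData', 'roleAdd', 'roleUpdate' ]
```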
Add comprehensive logging:
getDebugInfo(stepQuery) {
  // ... score and sort tools, as in selectTools
  return {
    focus: this.focusTracker.getSummary(),
    lastTool: this.transitionMatrix.getLastTool(),
    recentTools: this.recentTools,
    topScores: sorted.slice(0, 15),
    anchorBoosts: Object.fromEntries(this.microAnchors.getBoosts(stepQuery)),
  };
}
Log on every step:
console.log('[ToolSelector] stepQuery:', stepQuery.substring(0, 150));
console.log('[ToolSelector] activeTools:', activeTools);
console.log('[ToolSelector] top 5 scores:', debug.topScores.slice(0, 5));
If the wrong tool is being selected, check your micro-anchors first. If "add a role" isn't selecting roleAdd, add a pattern:
{
  pattern: /\b(add|create)\b.*\brole\b/i,
  tools: ['roleAdd'],
  boost: 2.0,
}
If tools aren't being attached, verify that tool key naming matches between:
- Your tools object keys
- The activeTools array returned
- Your tool config keys
Use the AI SDK debugger to see what's actually being attached.
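Since key mismatches fail silently, a check at startup or inside prepareStep can surface them early. The helper below is a sketch: findKeyMismatches and the shape of its report are made up for this example.

```javascript
// Compare the three places a tool key appears and report anything out of sync.
function findKeyMismatches(tools, toolsConfig, activeTools) {
  const toolKeys = new Set(Object.keys(tools));
  const configKeys = new Set(Object.keys(toolsConfig));
  return {
    // Registered with the agent but missing scoring metadata
    missingConfig: [...toolKeys].filter((key) => !configKeys.has(key)),
    // Has scoring metadata but was never registered with the agent
    missingTool: [...configKeys].filter((key) => !toolKeys.has(key)),
    // Selected for this step but unknown to the agent
    unknownActive: activeTools.filter((key) => !toolKeys.has(key)),
  };
}

const report = findKeyMismatches(
  { roleAdd: {}, queryData: {} },
  { roleAdd: {}, roleMove: {} },
  ['roleAdd', 'role_add'],
);
console.log(report);
```

In this example the report flags queryData as lacking config, roleMove as lacking a registered tool, and role_add as an active key the agent does not know.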
BM25 alone will fail on cases like "add a market role" because:
- "role" appears more in roleUpdate's description
- "market" might match something unrelated
That's why ensemble scoring exists. Don't try to fix BM25 alone - rely on micro-anchors for clear intent.
search/
├── index.js # Exports
├── toolScorer.js # Main ensemble orchestration
├── bm25f.js # BM25 with field weights
├── focusTracker.js # Entity focus tracking
├── transitionMatrix.js # Tool-to-tool probabilities
├── microAnchors.js # Pattern-based hard boosts
└── ToolSelector.js # Integration layer (builds query from messages)
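The integration layer's job of building a query from messages could look roughly like the following. This is a guess at the shape rather than the actual ToolSelector.js: it assumes messages are objects with a role and a content that is either a string or an array of parts carrying type and text, and the choice to join the last three user messages is arbitrary.

```javascript
// Build the scoring query from the most recent user messages (oldest first).
function buildStepQuery(messages, maxMessages = 3) {
  const texts = [];
  for (let i = messages.length - 1; i >= 0 && texts.length < maxMessages; i--) {
    const { role, content } = messages[i];
    if (role !== 'user') continue;
    // Content may be a plain string or a list of parts; keep only the text parts.
    const text = typeof content === 'string'
      ? content
      : (content ?? [])
          .filter((part) => part.type === 'text')
          .map((part) => part.text)
          .join(' ');
    if (text.trim()) texts.unshift(text.trim());
  }
  return texts.join(' ');
}

console.log(buildStepQuery([
  { role: 'user', content: 'Create a sales group' },
  { role: 'assistant', content: 'Done.' },
  { role: 'user', content: [{ type: 'text', text: 'Now add a market role' }] },
]));
// Create a sales group Now add a market role
```

Including a little history helps follow-ups such as "add another one" retain enough terms to score against, at the cost of older intent lingering in the query.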
- Don't rely on BM25 alone - it will fail on ambiguous queries
- Micro-anchors are essential - they catch obvious intent that BM25 misses
- Validate tool keys - mismatches are silent failures
- Log everything - you need visibility into what's being selected
- Use exploration - prevents getting stuck
- Focus tracking helps continuity - boosts related tools in a workflow
- Transition matrix captures workflows - "after groupAdd, roleAdd is likely"