Your agent isn’t slow. Your telemetry is a hairball. Most teams bolt “analytics” onto agents like fairy lights. Cute in staging, smoke in production.
Here’s the boring truth that works: instrument the seams, not the guts. Put hooks at the agent step, the MCP client, and the MCP server adapters. Keep tools clean. No vendor SDKs sprinkled through business code.
My AI research agent pulled real patterns from MCP stacks, and the signal is clear: teams that trace the boundaries fix issues in minutes. The rest drown in logs and guesswork.
The blueprint:
- One tiny TelemetryPort. Two adapters behind it: OpenTelemetry for observability, your analytics sink for product events. Separate pipes, shared IDs.
- Interceptors do the work. Client side wraps every JSON-RPC call as a span. Server adapters wrap tools/resources/prompts. Agent host wraps each step.
- Name things like you mean it: spans as agent.step, mcp.rpc, mcp.tools.call. Events as agent_step_completed, tool_call_completed, outcome_emitted. Correlate with tr