AI systems built with Large Language Models (LLMs) present unique challenges that traditional observability tools weren't designed to handle:
- Non-deterministic behavior - The same input can produce different outputs
- Complex reasoning chains - Multi-step processes with branching decision paths
- Unpredictable execution - Agents may take different approaches each time
- Tool usage patterns - Interactions with external systems that impact results
- Agent collaboration - Sub-agents working together with complex delegation
Without proper observability, debugging becomes nearly impossible:
User: "Why did my agent give the wrong answer?"
Developer without observability: "Let me dig through 500 pages of LLM output..."
Developer with observability: "I can see it used the wrong tool here, then misinterpreted the result."
We'll build this in stages, with value at each step:
- Quick Win: Basic collector setup with TraceZ visualization
- Level Up: Custom configuration for better debugging
- Pro Level: Advanced visualization with Jaeger
Let's get started!
Run this single command to get a collector up and running:
docker run \
-p 127.0.0.1:4317:4317 \
-p 127.0.0.1:4318:4318 \
-p 127.0.0.1:55679:55679 \
otel/opentelemetry-collector-contrib:0.121.0
This starts a collector that:
- Listens for gRPC data on port 4317
- Listens for HTTP data on port 4318
- Provides TraceZ visualization on port 55679
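Before wiring up your application, you can optionally confirm the collector is reachable. This is a quick sanity check of my own (not part of the collector setup): POST an empty OTLP payload to the HTTP endpoint and expect a 200 response:
import urllib.request

# Send an empty OTLP/HTTP trace payload to the collector;
# a 200 response means the HTTP receiver on port 4318 is up.
request = urllib.request.Request(
    "http://localhost:4318/v1/traces",
    data=b'{"resourceSpans": []}',
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status)  # expect 200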
pip install 'smolagents[telemetry]' opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-smolagents
import os
# Make sure your Hugging Face API token is available (assumes HF_TOKEN is already set in your environment)
os.environ["HF_TOKEN"] = os.getenv("HF_TOKEN", "")
# Configure environment variables for OpenTelemetry Endpoint
OTEL_COLLECTOR_HOST='localhost'
OTEL_COLLECTOR_PORT_GRPC=4317
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = f"http://{OTEL_COLLECTOR_HOST}:{OTEL_COLLECTOR_PORT_GRPC}"
# Other environment variables remain the same
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "service.namespace=smolagents-demo,service.name=smolagent"
os.environ["OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE"] = "cumulative"
# Import OpenTelemetry modules
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.smolagents import SmolagentsInstrumentor
# Configure OpenTelemetry: send spans to the collector over OTLP/gRPC
trace_provider = TracerProvider()
# The exporter reads the endpoint from OTEL_EXPORTER_OTLP_ENDPOINT set above
processor = BatchSpanProcessor(OTLPSpanExporter(insecure=True))
trace_provider.add_span_processor(processor)
# Instrument SmolAgents so every agent run, step, and tool call emits spans
SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)
Run your SmolAgents application:
from smolagents import CodeAgent, HfApiModel
model = HfApiModel()
agent = CodeAgent(tools=[], model=model, add_base_tools=True)
agent.run(
"Could you give me the 118th number in the Fibonacci sequence?",
)
Open your browser and go to: http://localhost:55679/debug/tracez
You'll see your agent runs visualized! Click on any trace to see high-level trace information; the full details of what happened are captured in the collector's console output.
Congratulations! You now have basic observability for your AI system.
Now that you have basic tracing working, let's improve our setup with a custom configuration.
Create a file named base-config.yaml with the following content:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
docker run -v $(pwd)/base-config.yaml:/etc/otelcol-contrib/config.yaml \
-p 127.0.0.1:4317:4317 \
-p 127.0.0.1:4318:4318 \
-p 127.0.0.1:55679:55679 \
otel/opentelemetry-collector-contrib:0.121.0
Now you have a clearer picture of, and more control over, how traces, metrics, and logs are received, processed, and exported.
NOTE: this custom configuration does not enable the zpages extension, so TraceZ is no longer available; use the collector's console output (the debug exporter) instead.
Let's take your observability to the next level by adding Jaeger, a powerful distributed tracing platform.
Create a new file named jaeger-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  jaeger:
    endpoint: "http://localhost:16686/api/traces"
    tls:
      insecure: true
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
docker run -v $(pwd)/jaeger-config.yaml:/etc/otelcol-contrib/config.yaml \
-d --name jaeger \
-e COLLECTOR_OTLP_ENABLED=true \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Open your browser and go to: http://localhost:16686
Select your service and click "Find Traces" to see detailed visualizations of your agent runs.
With Jaeger, you get:
- Timeline views of all trace operations
- Detailed flame graphs showing nested operations
- Powerful filtering and search capabilities
- The ability to compare different traces side by side
Once you have the basics working, you can explore these advanced techniques:
- Track business metrics
- Correlate with application logs
- Create span events for key moments (a sketch follows this list)
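As an example of the last point, here's a minimal sketch of wrapping an agent run in your own span with events. It reuses the trace_provider configured earlier; the span name, event names, and attributes are illustrative choices, not anything required by smolagents or OpenTelemetry:
# Reuse the provider configured earlier to emit custom spans alongside
# the ones created automatically by SmolagentsInstrumentor
tracer = trace_provider.get_tracer("smolagents-demo")

question = "Could you give me the 118th number in the Fibonacci sequence?"

with tracer.start_as_current_span("answer-question") as span:
    # Attributes make this trace easy to filter and analyze later
    span.set_attribute("user.question", question)
    span.add_event("agent.run.started")
    result = agent.run(question)
    span.add_event("agent.run.finished", {"answer.preview": str(result)[:100]})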
As you build out your observability, a few practices pay off:
- Start simple - Begin with basic tracing and add complexity as needed
- Focus on reasoning paths - Make agent thinking visible to understand failures
- Capture context - Include relevant attributes that help interpret the data
- Use attributes wisely - Add key-value pairs that help filter and analyze
- Correlate with user feedback - Link user satisfaction metrics to traces (see the sketch after this list)
- Establish baselines - Know what "good" looks like for your AI system
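For the feedback correlation point, one sketch (assuming you keep your own feedback store; feedback_log below is just a stand-in) is to record the active trace ID alongside each rating so the two can be joined later in Jaeger:
from opentelemetry import trace

feedback_log = []  # stand-in for whatever feedback storage you actually use

def record_feedback(rating: int) -> None:
    # Call this while a span is active (e.g. inside the "answer-question"
    # span above) so the rating can be joined to its trace in Jaeger.
    span_context = trace.get_current_span().get_span_context()
    trace_id = format(span_context.trace_id, "032x")
    feedback_log.append({"trace_id": trace_id, "rating": rating})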
| Stage | Implementation | Business Value |
|---|---|---|
| Stage 1 | Basic collector with TraceZ | Basic debugging; visibility into agent steps; quick setup for dev environments |
| Stage 2 | Custom configuration | Enhanced debugging; more detailed insights; better root cause analysis |
| Stage 3 | Jaeger integration | Professional-grade visualization; advanced filtering and search; comparing different runs; production-ready observability |
You've now seen how to implement OpenTelemetry for your AI systems in progressive stages, with each stage delivering real value. Start with the quick win, then level up as your needs grow.
Remember, in the world of AI systems, you can't improve what you can't observe. With OpenTelemetry, you've given yourself the visibility needed to build reliable, efficient, and trustworthy AI applications.