
@donbr
Last active March 14, 2025 07:24
OpenTelemetry for AI Systems: A Practical Guide

Why Your AI Systems Need Observability

AI systems built with Large Language Models (LLMs) present unique challenges that traditional observability tools weren't designed to handle:

  1. Non-deterministic behavior - The same input can produce different outputs
  2. Complex reasoning chains - Multi-step processes with branching decision paths
  3. Unpredictable execution - Agents may take different approaches each time
  4. Tool usage patterns - Interactions with external systems that impact results
  5. Agent collaboration - Sub-agents working together with complex delegation

Without proper observability, debugging becomes nearly impossible:

User: "Why did my agent give the wrong answer?"
Developer without observability: "Let me dig through 500 pages of LLM output..."
Developer with observability: "I can see it used the wrong tool here, then misinterpreted the result."

The Journey: From Zero to Hero with OpenTelemetry

We'll build this in stages, with value at each step:

  1. Quick Win: Basic collector setup with TraceZ visualization
  2. Level Up: Custom configuration for better debugging
  3. Pro Level: Advanced visualization with Jaeger

Let's get started!

Stage 1: Quick Win - Basic Setup with TraceZ

Step 1: Start the OpenTelemetry Collector

Run this single command to get a collector up and running:

docker run \
  -p 127.0.0.1:4317:4317 \
  -p 127.0.0.1:4318:4318 \
  -p 127.0.0.1:55679:55679 \
  otel/opentelemetry-collector-contrib:0.121.0

This starts a collector that:

  • Listens for gRPC data on port 4317
  • Listens for HTTP data on port 4318
  • Provides TraceZ visualization on port 55679
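Before wiring up your application, it's worth confirming the collector is actually listening. A minimal sketch using only the standard library (the host and ports are the defaults from the command above):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 4317 = OTLP gRPC, 4318 = OTLP HTTP, 55679 = zpages (TraceZ)
    for port in (4317, 4318, 55679):
        status = "open" if port_open("127.0.0.1", port) else "closed"
        print(f"port {port}: {status}")
```

If any port reports closed, check that the `docker run` command above is still running and that its `-p` mappings match.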

Step 2: Instrument Your SmolAgents Application

Install Python Libraries

pip install 'smolagents[telemetry]' opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-smolagents

Setup Environment Variables

import os

# Your Hugging Face API token must already be set in the environment
assert os.getenv("HF_TOKEN"), "Set the HF_TOKEN environment variable first"

# Configure environment variables for OpenTelemetry Endpoint
OTEL_COLLECTOR_HOST='localhost'
OTEL_COLLECTOR_PORT_GRPC=4317

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = f"http://{OTEL_COLLECTOR_HOST}:{OTEL_COLLECTOR_PORT_GRPC}"

# Identify the service and set metrics temporality
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "service.namespace=smolagents-demo,service.name=smolagent"
os.environ["OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE"] = "cumulative"

Instrumentation

# Import OpenTelemetry modules
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

# Configure OpenTelemetry
trace_provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(insecure=True))
trace_provider.add_span_processor(processor)

# Instrument SmolAgents
SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)

Step 3: Run a Test and See Results

Run your SmolAgents application:

from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
agent = CodeAgent(tools=[], model=model, add_base_tools=True)

agent.run(
    "Could you give me the 118th number in the Fibonacci sequence?",
)

Step 4: View the Results in TraceZ

Open your browser and go to: http://localhost:55679/debug/tracez

You'll see your agent runs visualized! Click any trace for high-level span information; the full detail of what happened is captured in the collector's console output.

Congratulations! You now have basic observability for your AI system.

Stage 2: Level Up - Custom Configuration for Better Debugging

Now that you have basic tracing working, let's improve our setup with a custom configuration.

Step 1: Create a Custom Configuration File

Create a file named base-config.yaml with the following content:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
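Conceptually, each pipeline is a typed route: a signal arrives at a receiver, passes through the processors in order, and fans out to every exporter. A toy model of that routing in plain Python (no collector APIs; the names mirror the YAML above):

```python
# Toy model of collector pipelines: each signal type flows
# receivers -> processors -> exporters, as declared in config.
PIPELINES = {
    "traces":  {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["debug"]},
    "metrics": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["debug"]},
    "logs":    {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["debug"]},
}

def route(signal_type: str) -> list[str]:
    """Return the ordered hops a signal of this type takes through the collector."""
    p = PIPELINES[signal_type]
    return p["receivers"] + p["processors"] + p["exporters"]

print(route("traces"))  # ['otlp', 'batch', 'debug']
```

Swapping `debug` for another exporter in the YAML (as we do with Jaeger in Stage 3) only changes the last hop; the receiver and processor stages stay the same.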

Step 2: Restart the Collector with Your Configuration

docker run -v $(pwd)/base-config.yaml:/etc/otelcol-contrib/config.yaml \
  -p 127.0.0.1:4317:4317 \
  -p 127.0.0.1:4318:4318 \
  -p 127.0.0.1:55679:55679 \
  otel/opentelemetry-collector-contrib:0.121.0

Now you have more insight into, and more control over, how traces, metrics, and logs flow through the collector.

NOTE: with this custom configuration the zpages extension is no longer enabled, so TraceZ is unavailable; use the collector's console output (the debug exporter) instead.
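If you want TraceZ back, it is provided by the collector's zpages extension, which the default image config enables but this custom file omits. A sketch of the lines to add to base-config.yaml (same port as before):

```yaml
extensions:
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [zpages]  # add alongside the existing pipelines section
  # pipelines: ... (unchanged)
```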

Stage 3: Pro Level - Advanced Visualization with Jaeger

Let's take your observability to the next level by adding Jaeger, a powerful distributed tracing platform.

Step 1: Update the configuration

Create a new file named jaeger-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  # The legacy `jaeger` exporter was removed from collector-contrib, and port
  # 16686 is Jaeger's UI, not an ingest endpoint. Export to Jaeger over OTLP
  # gRPC instead (adjust the host to wherever Jaeger's OTLP port is reachable,
  # e.g. localhost:4317 outside Docker).
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

Step 2: Run Jaeger with OTLP Enabled

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Jaeger all-in-one ships its own OTLP receiver, so for a local setup your application can send spans straight to it on port 4317. To keep the collector in front instead, start the collector with jaeger-config.yaml mounted (as in Stage 2) and let it forward traces to Jaeger; in that case remap one side's OTLP ports so the two don't clash.

Step 3: View Your Traces in Jaeger

Open your browser and go to: http://localhost:16686

Select your service and click "Find Traces" to see detailed visualizations of your agent runs.

With Jaeger, you get:

  • Timeline views of all trace operations
  • Detailed flame graphs showing nested operations
  • Powerful filtering and search capabilities
  • The ability to compare different traces side by side

Advanced Techniques: Getting More from Your Telemetry

Once you have the basics working, you can explore these advanced techniques:

  • track business metrics
  • correlate with application logs
  • create span events for key moments

Best Practices for AI Telemetry

  1. Start simple - Begin with basic tracing and add complexity as needed
  2. Focus on reasoning paths - Make agent thinking visible to understand failures
  3. Capture context - Include relevant attributes that help interpret the data
  4. Use attributes wisely - Add key-value pairs that help filter and analyze
  5. Correlate with user feedback - Link user satisfaction metrics to traces
  6. Establish baselines - Know what "good" looks like for your AI system
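For the last point, a baseline can be as simple as a latency percentile computed from past runs, against which new runs are flagged. A minimal sketch using the standard library (the sample durations are made up):

```python
import statistics

def p95(samples: list[float]) -> float:
    """95th percentile via statistics.quantiles (inclusive method)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]

# Hypothetical agent-run durations in seconds, pulled from previous traces.
baseline_runs = [1.2, 1.4, 1.3, 2.8, 1.1, 1.5, 1.6, 1.2, 1.9, 1.4]
threshold = p95(baseline_runs)

def is_regression(duration_s: float) -> bool:
    """Flag a run noticeably slower than the historical p95."""
    return duration_s > threshold

print(f"p95 baseline: {threshold:.2f}s")
```

In practice you would pull the durations from your tracing backend (e.g., Jaeger's API) rather than hard-coding them, but the comparison logic is the same.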

Business Value Summary

| Stage | Implementation | Business Value |
|-------|----------------|----------------|
| Stage 1 | Basic collector with TraceZ | Basic debugging; visibility into agent steps; quick setup for dev environments |
| Stage 2 | Custom configuration | Enhanced debugging; more detailed insights; better root cause analysis |
| Stage 3 | Jaeger integration | Professional-grade visualization; advanced filtering and search; comparing different runs; production-ready observability |

Conclusion

You've now seen how to implement OpenTelemetry for your AI systems in progressive stages, with each stage delivering real value. Start with the quick win, then level up as your needs grow.

Remember, in the world of AI systems, you can't improve what you can't observe. With OpenTelemetry, you've given yourself the visibility needed to build reliable, efficient, and trustworthy AI applications.

Running OpenTelemetry Demo in Telemetry-Only Mode

I wrote a separate guide that explains how to run only the telemetry components of the official OpenTelemetry Demo, without the extra demo services.

The Demo is great for understanding the broader ecosystem in which OpenTelemetry is already used; the practical guide above is geared more toward prototyping solutions.

Architecture Overview

This setup is based on the OpenTelemetry Demo Architecture.

graph TB
subgraph tdf[Telemetry Data Flow]
    subgraph subgraph_padding [ ]
        style subgraph_padding fill:none,stroke:none;
        subgraph od[OpenTelemetry Demo]
        ms(Microservice)
        end

        ms -.->|"OTLP<br/>gRPC"| oc-grpc
        ms -.->|"OTLP<br/>HTTP POST"| oc-http

        subgraph oc[OTel Collector]
            style oc fill:#97aef3,color:black;
            oc-grpc[/"OTLP Receiver<br/>listening on<br/>grpc://localhost:4317"/]
            oc-http[/"OTLP Receiver<br/>listening on <br/>localhost:4318<br/>"/]
            oc-proc(Processors)
            oc-prom[/"OTLP HTTP Exporter"/]
            oc-otlp[/"OTLP Exporter"/]

            oc-grpc --> oc-proc
            oc-http --> oc-proc

            oc-proc --> oc-prom
            oc-proc --> oc-otlp
        end

        oc-prom -->|"localhost:9090/api/v1/otlp"| pr-sc
        oc-otlp -->|gRPC| ja-col

        subgraph pr[Prometheus]
            style pr fill:#e75128,color:black;
            pr-sc[/"Prometheus OTLP Write Receiver"/]
            pr-tsdb[(Prometheus TSDB)]
            pr-http[/"Prometheus HTTP<br/>listening on<br/>localhost:9090"/]

            pr-sc --> pr-tsdb
            pr-tsdb --> pr-http
        end

        pr-b{{"Browser<br/>Prometheus UI"}}
        pr-http ---->|"localhost:9090/graph"| pr-b

        subgraph ja[Jaeger]
            style ja fill:#60d0e4,color:black;
            ja-col[/"Jaeger Collector<br/>listening on<br/>grpc://jaeger:4317"/]
            ja-db[(Jaeger DB)]
            ja-http[/"Jaeger HTTP<br/>listening on<br/>localhost:16686"/]

            ja-col --> ja-db
            ja-db --> ja-http
        end

        subgraph gr[Grafana]
            style gr fill:#f8b91e,color:black;
            gr-srv["Grafana Server"]
            gr-http[/"Grafana HTTP<br/>listening on<br/>localhost:3000"/]

            gr-srv --> gr-http
        end

        pr-http --> |"localhost:9090/api"| gr-srv
        ja-http --> |"localhost:16686/api"| gr-srv

        ja-b{{"Browser<br/>Jaeger UI"}}
        ja-http ---->|"localhost:16686/search"| ja-b

        gr-b{{"Browser<br/>Grafana UI"}}
        gr-http -->|"localhost:3000/dashboard"| gr-b
    end
end

OpenTelemetry Transport and Format Summary

Visualization and Debugging Tools

| Tool | Purpose | URL/Port | Strengths |
|------|---------|----------|-----------|
| TraceZ | Debug interface for all signal types | http://localhost:55679/debug/tracez | Comprehensive view of traces, metrics, logs, and events; excellent for debugging |
| Jaeger | Distributed tracing visualization | http://localhost:16686 | Excellent trace visualization and analysis with query capabilities |
| Prometheus | Metrics collection and visualization | Typically http://localhost:9090 | Time-series metrics visualization with powerful query language |
| Grafana | Multi-source dashboard creation | Typically http://localhost:3000 | Custom dashboards combining multiple data sources |

Transport Protocols and Data Formats

| Feature | gRPC/Protobuf | HTTP/JSON |
|---------|---------------|-----------|
| Transport protocol | gRPC | HTTP |
| Data format | Binary Protobuf | JSON |
| Default port | 4317 | 4318 |
| Endpoint path pattern | N/A (service-based) | /v1/{signal} (e.g., /v1/traces) |
| Performance | Better performance (binary) | Slightly higher overhead |
| Human readability | Not human-readable | Human-readable |
| Tool support | telemetrygen (default) | telemetrygen (with --use-http) or curl |
| Use cases | Production systems | Debugging, manual testing |
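The HTTP/JSON path is as easy to exercise from Python as from curl. A sketch that builds (but, here, does not send) a POST to an OTLP/HTTP endpoint; the empty payload is a placeholder:

```python
import json
import urllib.request

OTLP_HTTP = "http://localhost:4318"  # collector's OTLP/HTTP port

def build_otlp_request(signal: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for an OTLP/HTTP JSON endpoint (/v1/traces, etc.)."""
    return urllib.request.Request(
        url=f"{OTLP_HTTP}/v1/{signal}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_otlp_request("traces", {"resourceSpans": []})
print(req.full_url, req.get_method())  # http://localhost:4318/v1/traces POST
# To actually send (collector must be running):
# urllib.request.urlopen(req)
```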

Signal Types and Structure

| Signal Type | Root Element | Collection Element | Data Element | Endpoint |
|-------------|--------------|--------------------|--------------|----------|
| Traces | resourceSpans | scopeSpans | spans | /v1/traces |
| Metrics | resourceMetrics | scopeMetrics | metrics | /v1/metrics |
| Logs | resourceLogs | scopeLogs | logRecords | /v1/logs |
| Events | resourceLogs | scopeLogs | logRecords (with event.name attribute) | /v1/logs |

Key Testing Tools

| Tool | Purpose | Transport | Example Usage |
|------|---------|-----------|---------------|
| telemetrygen | Generate test data for all signal types | gRPC (default) or HTTP | `telemetrygen traces --otlp-insecure --duration 5s` |
| curl | Manual testing with HTTP/JSON | HTTP only | `curl -X POST -H "Content-Type: application/json" -d @logs.json -i localhost:4318/v1/logs` |

Common JSON Format Patterns

All OpenTelemetry signal types in JSON format follow a similar structure:

  1. Resource Level: Contains service and environment information

    • resource{SignalType} with resource.attributes including service.name
  2. Scope Level: Contains instrumentation information

    • scope{SignalType} with scope.name, scope.version, and optional scope.attributes
  3. Data Level: Contains the actual telemetry data

    • Signal-specific data elements (spans, metrics, or logRecords)
    • Each with their own type-specific attributes and properties
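Putting the three levels together, a minimal traces payload can be built by hand. A sketch for the `/v1/traces` endpoint (the trace/span IDs are dummy hex strings and the timestamps are illustrative):

```python
import json
import time

def make_trace_payload(service_name: str, span_name: str) -> dict:
    """Build a minimal OTLP/JSON traces payload: resource -> scope -> span."""
    now_ns = time.time_ns()
    return {
        "resourceSpans": [{                      # resource level
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": service_name},
            }]},
            "scopeSpans": [{                     # scope level
                "scope": {"name": "manual-test", "version": "0.1.0"},
                "spans": [{                      # data level
                    "traceId": "5b8efff798038103d269b633813fc60c",  # dummy 16-byte hex
                    "spanId": "eee19b7ec3c1b174",                   # dummy 8-byte hex
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now_ns - 1_000_000),
                    "endTimeUnixNano": str(now_ns),
                }],
            }],
        }]
    }

payload = make_trace_payload("smolagent", "demo-span")
print(json.dumps(payload)[:80], "...")
```

Saved to a file, this is exactly the kind of document the curl commands below POST to the collector.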

Quick Start Testing Setup

# Start OpenTelemetry Collector with TraceZ debugging
docker run \
  -p 127.0.0.1:4317:4317 \
  -p 127.0.0.1:4318:4318 \
  -p 127.0.0.1:55679:55679 \
  otel/opentelemetry-collector-contrib:0.121.0

# Generate sample traces with telemetrygen
telemetrygen traces --otlp-insecure --traces 3

# Generate sample logs
telemetrygen logs --duration 5s --otlp-insecure

# Generate sample metrics
telemetrygen metrics --duration 5s --otlp-insecure

# Test HTTP/JSON endpoints
curl -X POST -H "Content-Type: application/json" -d @trace.json -i localhost:4318/v1/traces
curl -X POST -H "Content-Type: application/json" -d @metrics.json -i localhost:4318/v1/metrics
curl -X POST -H "Content-Type: application/json" -d @logs.json -i localhost:4318/v1/logs
curl -X POST -H "Content-Type: application/json" -d @events.json -i localhost:4318/v1/logs  # Note: events use logs endpoint