
@donbr
Last active March 14, 2025 07:24
OpenTelemetry for AI Systems: A Practical Guide

Why Your AI Systems Need Observability

AI systems built with Large Language Models (LLMs) present unique challenges that traditional observability tools weren't designed to handle:

  1. Non-deterministic behavior - The same input can produce different outputs
  2. Complex reasoning chains - Multi-step processes with branching decision paths
  3. Unpredictable execution - Agents may take different approaches each time
  4. Tool usage patterns - Interactions with external systems that impact results
  5. Agent collaboration - Sub-agents working together with complex delegation

Without proper observability, debugging becomes nearly impossible:

User: "Why did my agent give the wrong answer?"
Developer without observability: "Let me dig through 500 pages of LLM output..."
Developer with observability: "I can see it used the wrong tool here, then misinterpreted the result."

The Journey: From Zero to Hero with OpenTelemetry

We'll build this in stages, with value at each step:

  1. Quick Win: Basic collector setup with TraceZ visualization
  2. Level Up: Custom configuration for better debugging
  3. Pro Level: Advanced visualization with Jaeger

Let's get started!

Stage 1: Quick Win - Basic Setup with TraceZ

Step 1: Start the OpenTelemetry Collector

Run this single command to get a collector up and running:

docker run \
  -p 127.0.0.1:4317:4317 \
  -p 127.0.0.1:4318:4318 \
  -p 127.0.0.1:55679:55679 \
  otel/opentelemetry-collector-contrib:0.121.0

This starts a collector that:

  • Listens for gRPC data on port 4317
  • Listens for HTTP data on port 4318
  • Provides TraceZ visualization on port 55679
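Before wiring up your application, it's worth confirming the collector is actually listening. A minimal sketch using only the standard library (the host and ports are the defaults from the command above):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 4317 = OTLP gRPC, 4318 = OTLP HTTP, 55679 = zpages (TraceZ)
    for port in (4317, 4318, 55679):
        status = "open" if port_open("127.0.0.1", port) else "closed"
        print(f"port {port}: {status}")
```

If any port reports closed, check that the `docker run` command above is still running and that its `-p` mappings match.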

Step 2: Instrument Your SmolAgents Application

Install Python Libraries

pip install 'smolagents[telemetry]' opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-smolagents

Setup Environment Variables

import os

# Your Hugging Face API token must already be set in the environment
assert os.getenv("HF_TOKEN"), "Set the HF_TOKEN environment variable first"

# Configure environment variables for OpenTelemetry Endpoint
OTEL_COLLECTOR_HOST='localhost'
OTEL_COLLECTOR_PORT_GRPC=4317

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = f"http://{OTEL_COLLECTOR_HOST}:{OTEL_COLLECTOR_PORT_GRPC}"

# Identify the service and set metrics temporality
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "service.namespace=smolagents-demo,service.name=smolagent"
os.environ["OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE"] = "cumulative"

Instrumentation

# Import OpenTelemetry modules
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

# Configure OpenTelemetry
trace_provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(insecure=True))
trace_provider.add_span_processor(processor)

# Instrument SmolAgents
SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)

Step 3: Run a Test and See Results

Run your SmolAgents application:

from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
agent = CodeAgent(tools=[], model=model, add_base_tools=True)

agent.run(
    "Could you give me the 118th number in the Fibonacci sequence?",
)

Step 4: View the Results in TraceZ

Open your browser and go to: http://localhost:55679/debug/tracez

You'll see your agent runs visualized! Click any trace for high-level span information; the full detail of what happened is captured in the collector's console output.

Congratulations! You now have basic observability for your AI system.

Stage 2: Level Up - Custom Configuration for Better Debugging

Now that you have basic tracing working, let's improve our setup with a custom configuration.

Step 1: Create a Custom Configuration File

Create a file named base-config.yaml with the following content:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
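Conceptually, each pipeline is a typed route: a signal arrives at a receiver, passes through the processors in order, and fans out to every exporter. A toy model of that routing in plain Python (no collector APIs; the names mirror the YAML above):

```python
# Toy model of collector pipelines: each signal type flows
# receivers -> processors -> exporters, as declared in config.
PIPELINES = {
    "traces":  {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["debug"]},
    "metrics": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["debug"]},
    "logs":    {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["debug"]},
}

def route(signal_type: str) -> list[str]:
    """Return the ordered hops a signal of this type takes through the collector."""
    p = PIPELINES[signal_type]
    return p["receivers"] + p["processors"] + p["exporters"]

print(route("traces"))  # ['otlp', 'batch', 'debug']
```

Swapping `debug` for another exporter in the YAML (as we do with Jaeger in Stage 3) only changes the last hop; the receiver and processor stages stay the same.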

Step 2: Restart the Collector with Your Configuration

docker run -v $(pwd)/base-config.yaml:/etc/otelcol-contrib/config.yaml \
  -p 127.0.0.1:4317:4317 \
  -p 127.0.0.1:4318:4318 \
  -p 127.0.0.1:55679:55679 \
  otel/opentelemetry-collector-contrib:0.121.0

Now you have more insight into, and more control over, how traces, metrics, and logs flow through the collector.

NOTE: with this custom configuration the zpages extension is no longer enabled, so TraceZ is unavailable; use the collector's console output (the debug exporter) instead.
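If you want TraceZ back, it is provided by the collector's zpages extension, which the default image config enables but this custom file omits. A sketch of the lines to add to base-config.yaml (same port as before):

```yaml
extensions:
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [zpages]  # add alongside the existing pipelines section
  # pipelines: ... (unchanged)
```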

Stage 3: Pro Level - Advanced Visualization with Jaeger

Let's take your observability to the next level by adding Jaeger, a powerful distributed tracing platform.

Step 1: Update the configuration

Create a new file named jaeger-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  # The legacy `jaeger` exporter was removed from collector-contrib, and port
  # 16686 is Jaeger's UI, not an ingest endpoint. Export to Jaeger over OTLP
  # gRPC instead (adjust the host to wherever Jaeger's OTLP port is reachable,
  # e.g. localhost:4317 outside Docker).
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

Step 2: Run Jaeger with OTLP Enabled

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Jaeger all-in-one ships its own OTLP receiver, so for a local setup your application can send spans straight to it on port 4317. To keep the collector in front instead, start the collector with jaeger-config.yaml mounted (as in Stage 2) and let it forward traces to Jaeger; in that case remap one side's OTLP ports so the two don't clash.

Step 3: View Your Traces in Jaeger

Open your browser and go to: http://localhost:16686

Select your service and click "Find Traces" to see detailed visualizations of your agent runs.

With Jaeger, you get:

  • Timeline views of all trace operations
  • Detailed flame graphs showing nested operations
  • Powerful filtering and search capabilities
  • The ability to compare different traces side by side

Advanced Techniques: Getting More from Your Telemetry

Once you have the basics working, you can explore these advanced techniques:

  • track business metrics
  • correlate with application logs
  • create span events for key moments

Best Practices for AI Telemetry

  1. Start simple - Begin with basic tracing and add complexity as needed
  2. Focus on reasoning paths - Make agent thinking visible to understand failures
  3. Capture context - Include relevant attributes that help interpret the data
  4. Use attributes wisely - Add key-value pairs that help filter and analyze
  5. Correlate with user feedback - Link user satisfaction metrics to traces
  6. Establish baselines - Know what "good" looks like for your AI system
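For the last point, a baseline can be as simple as a latency percentile computed from past runs, against which new runs are flagged. A minimal sketch using the standard library (the sample durations are made up):

```python
import statistics

def p95(samples: list[float]) -> float:
    """95th percentile via statistics.quantiles (inclusive method)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]

# Hypothetical agent-run durations in seconds, pulled from previous traces.
baseline_runs = [1.2, 1.4, 1.3, 2.8, 1.1, 1.5, 1.6, 1.2, 1.9, 1.4]
threshold = p95(baseline_runs)

def is_regression(duration_s: float) -> bool:
    """Flag a run noticeably slower than the historical p95."""
    return duration_s > threshold

print(f"p95 baseline: {threshold:.2f}s")
```

In practice you would pull the durations from your tracing backend (e.g., Jaeger's API) rather than hard-coding them, but the comparison logic is the same.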

Business Value Summary

| Stage | Implementation | Business Value |
|-------|----------------|----------------|
| Stage 1 | Basic collector with TraceZ | Basic debugging; visibility into agent steps; quick setup for dev environments |
| Stage 2 | Custom configuration | Enhanced debugging; more detailed insights; better root cause analysis |
| Stage 3 | Jaeger integration | Professional-grade visualization; advanced filtering and search; comparing different runs; production-ready observability |

Conclusion

You've now seen how to implement OpenTelemetry for your AI systems in progressive stages, with each stage delivering real value. Start with the quick win, then level up as your needs grow.

Remember, in the world of AI systems, you can't improve what you can't observe. With OpenTelemetry, you've given yourself the visibility needed to build reliable, efficient, and trustworthy AI applications.

Running OpenTelemetry Demo in Telemetry-Only Mode

I wrote a separate guide that explains how to run only the telemetry components of the official OpenTelemetry Demo, without the extra demo services.

The Demo is great for understanding the broader ecosystem in which OpenTelemetry is already used; the practical guide above is geared more toward prototyping solutions.

Architecture Overview

This setup is based on the OpenTelemetry Demo Architecture.

graph TB
subgraph tdf[Telemetry Data Flow]
    subgraph subgraph_padding [ ]
        style subgraph_padding fill:none,stroke:none;
        subgraph od[OpenTelemetry Demo]
        ms(Microservice)
        end

        ms -.->|"OTLP<br/>gRPC"| oc-grpc
        ms -.->|"OTLP<br/>HTTP POST"| oc-http

        subgraph oc[OTel Collector]
            style oc fill:#97aef3,color:black;
            oc-grpc[/"OTLP Receiver<br/>listening on<br/>grpc://localhost:4317"/]
            oc-http[/"OTLP Receiver<br/>listening on <br/>localhost:4318<br/>"/]
            oc-proc(Processors)
            oc-prom[/"OTLP HTTP Exporter"/]
            oc-otlp[/"OTLP Exporter"/]

            oc-grpc --> oc-proc
            oc-http --> oc-proc

            oc-proc --> oc-prom
            oc-proc --> oc-otlp
        end

        oc-prom -->|"localhost:9090/api/v1/otlp"| pr-sc
        oc-otlp -->|gRPC| ja-col

        subgraph pr[Prometheus]
            style pr fill:#e75128,color:black;
            pr-sc[/"Prometheus OTLP Write Receiver"/]
            pr-tsdb[(Prometheus TSDB)]
            pr-http[/"Prometheus HTTP<br/>listening on<br/>localhost:9090"/]

            pr-sc --> pr-tsdb
            pr-tsdb --> pr-http
        end

        pr-b{{"Browser<br/>Prometheus UI"}}
        pr-http ---->|"localhost:9090/graph"| pr-b

        subgraph ja[Jaeger]
            style ja fill:#60d0e4,color:black;
            ja-col[/"Jaeger Collector<br/>listening on<br/>grpc://jaeger:4317"/]
            ja-db[(Jaeger DB)]
            ja-http[/"Jaeger HTTP<br/>listening on<br/>localhost:16686"/]

            ja-col --> ja-db
            ja-db --> ja-http
        end

        subgraph gr[Grafana]
            style gr fill:#f8b91e,color:black;
            gr-srv["Grafana Server"]
            gr-http[/"Grafana HTTP<br/>listening on<br/>localhost:3000"/]

            gr-srv --> gr-http
        end

        pr-http --> |"localhost:9090/api"| gr-srv
        ja-http --> |"localhost:16686/api"| gr-srv

        ja-b{{"Browser<br/>Jaeger UI"}}
        ja-http ---->|"localhost:16686/search"| ja-b

        gr-b{{"Browser<br/>Grafana UI"}}
        gr-http -->|"localhost:3000/dashboard"| gr-b
    end
end

OpenTelemetry Transport and Format Summary

Visualization and Debugging Tools

| Tool | Purpose | URL/Port | Strengths |
|------|---------|----------|-----------|
| TraceZ | Debug interface for all signal types | http://localhost:55679/debug/tracez | Comprehensive view of traces, metrics, logs, and events; excellent for debugging |
| Jaeger | Distributed tracing visualization | http://localhost:16686 | Excellent trace visualization and analysis with query capabilities |
| Prometheus | Metrics collection and visualization | Typically http://localhost:9090 | Time-series metrics visualization with powerful query language |
| Grafana | Multi-source dashboard creation | Typically http://localhost:3000 | Custom dashboards combining multiple data sources |

Transport Protocols and Data Formats

| Feature | gRPC/Protobuf | HTTP/JSON |
|---------|---------------|-----------|
| Transport protocol | gRPC | HTTP |
| Data format | Binary Protobuf | JSON |
| Default port | 4317 | 4318 |
| Endpoint path pattern | N/A (service-based) | /v1/{signal} (e.g., /v1/traces) |
| Performance | Better performance (binary) | Slightly higher overhead |
| Human readability | Not human-readable | Human-readable |
| Tool support | telemetrygen (default) | telemetrygen (with --use-http) or curl |
| Use cases | Production systems | Debugging, manual testing |
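The HTTP/JSON path is as easy to exercise from Python as from curl. A sketch that builds (but, here, does not send) a POST to an OTLP/HTTP endpoint; the empty payload is a placeholder:

```python
import json
import urllib.request

OTLP_HTTP = "http://localhost:4318"  # collector's OTLP/HTTP port

def build_otlp_request(signal: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for an OTLP/HTTP JSON endpoint (/v1/traces, etc.)."""
    return urllib.request.Request(
        url=f"{OTLP_HTTP}/v1/{signal}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_otlp_request("traces", {"resourceSpans": []})
print(req.full_url, req.get_method())  # http://localhost:4318/v1/traces POST
# To actually send (collector must be running):
# urllib.request.urlopen(req)
```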

Signal Types and Structure

| Signal Type | Root Element | Collection Element | Data Element | Endpoint |
|-------------|--------------|--------------------|--------------|----------|
| Traces | resourceSpans | scopeSpans | spans | /v1/traces |
| Metrics | resourceMetrics | scopeMetrics | metrics | /v1/metrics |
| Logs | resourceLogs | scopeLogs | logRecords | /v1/logs |
| Events | resourceLogs | scopeLogs | logRecords (with event.name attribute) | /v1/logs |

Key Testing Tools

| Tool | Purpose | Transport | Example Usage |
|------|---------|-----------|---------------|
| telemetrygen | Generate test data for all signal types | gRPC (default) or HTTP | `telemetrygen traces --otlp-insecure --duration 5s` |
| curl | Manual testing with HTTP/JSON | HTTP only | `curl -X POST -H "Content-Type: application/json" -d @logs.json -i localhost:4318/v1/logs` |

Common JSON Format Patterns

All OpenTelemetry signal types in JSON format follow a similar structure:

  1. Resource Level: Contains service and environment information

    • resource{SignalType} with resource.attributes including service.name
  2. Scope Level: Contains instrumentation information

    • scope{SignalType} with scope.name, scope.version, and optional scope.attributes
  3. Data Level: Contains the actual telemetry data

    • Signal-specific data elements (spans, metrics, or logRecords)
    • Each with their own type-specific attributes and properties
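Putting the three levels together, a minimal traces payload can be built by hand. A sketch for the `/v1/traces` endpoint (the trace/span IDs are dummy hex strings and the timestamps are illustrative):

```python
import json
import time

def make_trace_payload(service_name: str, span_name: str) -> dict:
    """Build a minimal OTLP/JSON traces payload: resource -> scope -> span."""
    now_ns = time.time_ns()
    return {
        "resourceSpans": [{                      # resource level
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": service_name},
            }]},
            "scopeSpans": [{                     # scope level
                "scope": {"name": "manual-test", "version": "0.1.0"},
                "spans": [{                      # data level
                    "traceId": "5b8efff798038103d269b633813fc60c",  # dummy 16-byte hex
                    "spanId": "eee19b7ec3c1b174",                   # dummy 8-byte hex
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now_ns - 1_000_000),
                    "endTimeUnixNano": str(now_ns),
                }],
            }],
        }]
    }

payload = make_trace_payload("smolagent", "demo-span")
print(json.dumps(payload)[:80], "...")
```

Saved to a file, this is exactly the kind of document the curl commands below POST to the collector.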

Quick Start Testing Setup

# Start OpenTelemetry Collector with TraceZ debugging
docker run \
  -p 127.0.0.1:4317:4317 \
  -p 127.0.0.1:4318:4318 \
  -p 127.0.0.1:55679:55679 \
  otel/opentelemetry-collector-contrib:0.121.0

# Generate sample traces with telemetrygen
telemetrygen traces --otlp-insecure --traces 3

# Generate sample logs
telemetrygen logs --duration 5s --otlp-insecure

# Generate sample metrics
telemetrygen metrics --duration 5s --otlp-insecure

# Test HTTP/JSON endpoints
curl -X POST -H "Content-Type: application/json" -d @trace.json -i localhost:4318/v1/traces
curl -X POST -H "Content-Type: application/json" -d @metrics.json -i localhost:4318/v1/metrics
curl -X POST -H "Content-Type: application/json" -d @logs.json -i localhost:4318/v1/logs
curl -X POST -H "Content-Type: application/json" -d @events.json -i localhost:4318/v1/logs  # Note: events use logs endpoint