Incident Report: 2025-05-25

Executive Summary

During the past 24 hours, multiple anomalies were detected in the OpenTelemetry Demo application. Analysis of the telemetry data reveals three primary incidents:

  1. Product Catalog Service Failures: Feature flag-induced errors in the product-catalog service
  2. Load Generator Connection Issues: Multiple errors in the load-generator service, including export failures and browser context errors
  3. Performance Degradation: Significant latency spikes detected across multiple services

Incident Timeline

```mermaid
timeline
    title Incident Timeline (with correlation)
    section load-generator
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        07#58;42 PM : Error POST [8768d8ad...][4bee997e...] : critical
        07#58;42 PM : Error POST [a930663a...][27f0d495...] : critical
        07#58;42 PM : Error GET [73465730...][be2732b2...] : critical
        07#58;43 PM : Error GET [0a2da977...][1fe25afa...] : critical
        07#58;43 PM : Error GET [51c8a395...][d52f6d2b...] : critical
    section product-catalog
        07#58;43 PM : Error oteldemo.ProductCatalogService/GetProduct [0a2da977...][9cef039b...] : critical
        07#58;43 PM : Error oteldemo.ProductCatalogService/GetProduct [51c8a395...][f86b60f4...] : critical
```

Error Distribution

```mermaid
pie
    title "Error Distribution"
    "load-generator" : 5
    "product-catalog" : 2
```

Service Dependencies

The service dependency graph below shows the key relationships between affected services:

```mermaid
graph TD
  B["frontend"]
  G["frontend-proxy"]
  J["load-generator"]
  L["product-catalog"]
  M["recommendation"]

  B --> |20 calls| G
  B --> |27 calls| J
  B --> |9 calls| L
  B --> |2 calls| M

  G --> |19 calls| J
  G --> |7 calls| L
  G --> |3 calls| M

  J --> |19 calls| G
  J --> |14 calls| L
  J --> |9 calls| M

  L --> |9 calls| B
  L --> |7 calls| M
  L --> |7 calls| G
  L --> |14 calls| J

  M --> |7 calls| L
  M --> |9 calls| J
```

Span Duration Anomalies

```mermaid
xychart-beta
    title "Span Duration Anomalies"
    x-axis "Operation" ["GET", "ProductCatalogService", "GET /api/products", "API route /products", "API route /cart"]
    y-axis "Duration (ms)" 0 --> 100
    bar "frontend" [27, 24, 13, 13, 12]
    bar "load-generator" [91, 0, 0, 0, 0]
```

Detailed Analysis

Incident 1: Product Catalog Feature Flag Failures

Symptoms

  • Errors in the product-catalog service with message: "Error: Product Catalog Fail Feature Flag Enabled"
  • HTTP 500 status codes returned from product catalog API
  • Errors propagated to dependent services (recommendation, frontend)

Root Cause Analysis

The errors in the product catalog service were triggered by an activated feature flag named "productCatalogFailure". This feature flag was set to the "on" variant, deliberately causing failures in the GetProduct method of the ProductCatalogService.

From the span data:

"Events.feature_flag.feature_flag.key": "productCatalogFailure",
"Events.feature_flag.feature_flag.provider_name": "flagd",
"Events.feature_flag.feature_flag.variant": "on"

This appears to be an intentional failure mode, likely for chaos engineering or testing purposes. The feature flag is managed by the "flagd" service, which is part of the OpenTelemetry Demo application.
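
For illustration, a failure-injection check of this kind typically looks like the sketch below. This is a hedged Python sketch using the OpenFeature SDK with the flagd provider; the demo's actual product-catalog service is written in Go, and the `lookup_product` helper and connection details here are assumptions, not the demo's code.

```python
# Minimal sketch (not the demo's Go implementation) of how a
# failure-injecting feature flag is usually evaluated per request.
from openfeature import api
from openfeature.contrib.provider.flagd import FlagdProvider

api.set_provider(FlagdProvider(host="flagd", port=8013))  # assumed address
client = api.get_client()

def lookup_product(product_id: str) -> dict:
    """Placeholder for the real catalog lookup."""
    return {"id": product_id}

def get_product(product_id: str) -> dict:
    # When the chaos flag evaluates to the "on" variant, fail deliberately,
    # mirroring the "Product Catalog Fail Feature Flag Enabled" error
    # observed in the span data.
    if client.get_boolean_value("productCatalogFailure", False):
        raise RuntimeError("Error: Product Catalog Fail Feature Flag Enabled")
    return lookup_product(product_id)
```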

Impact

  • Failed product lookups for users
  • Degraded user experience when browsing products
  • Cascading errors to the recommendation service, which depends on product data
  • 2 documented error occurrences; the true user-facing impact was likely broader than the logged errors capture

Mitigation

  1. Short-term: Disable the "productCatalogFailure" feature flag in the flagd service
  2. Medium-term: Implement more robust fallback mechanisms in dependent services (see the sketch after this list)
  3. Long-term: Add better monitoring and alerting for feature flag-induced failures
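
To make item 2 concrete, a dependent service such as recommendation could serve the last known product data when the catalog call fails instead of propagating an HTTP 500. A minimal sketch, assuming a hypothetical `fetch_product` client for the catalog RPC:

```python
# Hedged sketch: fall back to cached product data on catalog failure.
_product_cache: dict[str, dict] = {}

def fetch_product(product_id: str) -> dict:
    """Hypothetical stand-in for the gRPC call to
    oteldemo.ProductCatalogService/GetProduct."""
    raise NotImplementedError

def get_product_with_fallback(product_id: str) -> dict | None:
    try:
        product = fetch_product(product_id)
        _product_cache[product_id] = product  # refresh cache on success
        return product
    except Exception:
        # Serve the last known value; None means "omit this item" rather
        # than failing the entire recommendation response.
        return _product_cache.get(product_id)
```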

Incident 2: Load Generator Connection Issues

Symptoms

  • Multiple errors in the load-generator service
  • Connection issues with the OpenTelemetry collector
  • Browser context errors in the load testing framework
  • HTTP errors during simulated user traffic

Root Cause Analysis

The load-generator service experienced two distinct issues:

  1. Telemetry Export Failures: The service was unable to export logs and traces to the OpenTelemetry collector, as evidenced by the "StatusCode.UNAVAILABLE" errors. This suggests network connectivity issues or resource constraints on the collector (an exporter-tuning sketch follows the log excerpt below).

  2. Browser Automation Failures: The load-generator uses Playwright for browser automation to simulate user traffic. The "Browser.new_context: Target page, context or browser has been closed" errors indicate that browser contexts were being closed unexpectedly, likely due to resource constraints or timing issues.

From the error logs:

"message": "Browser.new_context: Target page, context or browser has been closed\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.12/site-packages/locust/user/task.py\", line 340, in run\n    self.execute_next_task()\n..."

Impact

  • Reduced effectiveness of load testing
  • Potential gaps in monitoring coverage due to incomplete test scenarios
  • 5 documented error occurrences
  • No direct impact on end-user experience as this is a testing component

Mitigation

  1. Short-term: Restart the load-generator service and monitor for recurrence
  2. Medium-term:
    • Increase resource allocation for the load-generator pods
    • Optimize the OpenTelemetry collector configuration to handle export spikes
  3. Long-term:
    • Implement better error handling and retry logic in the browser automation code (see the sketch after this list)
    • Consider implementing a more graceful shutdown procedure for browser contexts
    • Add dedicated monitoring for the load testing infrastructure
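
A minimal sketch of the retry logic suggested in the long-term items, using Playwright's sync API (the load generator drives browsers through Playwright); the attempt counts and delays are illustrative:

```python
import time
from playwright.sync_api import Error as PlaywrightError, sync_playwright

def new_context_with_retry(browser, attempts: int = 3, delay: float = 1.0):
    """Retry context creation so one closed browser does not kill the task."""
    for attempt in range(1, attempts + 1):
        try:
            return browser.new_context()
        except PlaywrightError:
            # "Target page, context or browser has been closed" lands here;
            # back off briefly before trying again.
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = new_context_with_retry(browser)
    context.close()
    browser.close()
```

A fuller version would relaunch the browser once retries are exhausted, since a browser that has itself been closed can never create new contexts.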

Incident 3: Performance Degradation

Symptoms

  • Significant latency spikes in several API operations
  • GET operations showing high durations, particularly in the load-generator service
  • Multiple operations exceeding their normal performance thresholds

Root Cause Analysis

The performance degradation appears to be correlated with the other incidents, particularly the product catalog failures. When the product catalog service fails, dependent services fall back on retry logic and timeout handling, which drives up end-to-end latency.

The span duration anomalies visualization shows that:

  • The load-generator service experienced the highest latency spikes (91ms for GET operations)
  • The frontend service also showed elevated latencies across multiple operations

The timing of these performance issues aligns with the product catalog errors, suggesting a causal relationship.

Impact

  • Degraded user experience due to slow page loads and API responses
  • Increased resource utilization from retries and timeouts
  • Potential for cascading failures if timeouts are not properly handled

Mitigation

  1. Short-term: Address the root cause (product catalog feature flag)
  2. Medium-term:
    • Optimize API response times, particularly for critical paths
    • Review and adjust timeout settings across services
  3. Long-term:
    • Implement circuit breakers to prevent cascading failures (see the sketch after this list)
    • Add better caching strategies for frequently accessed data
    • Enhance performance monitoring with more granular thresholds
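
To make the circuit-breaker suggestion concrete, here is a minimal illustrative sketch; a production deployment would more likely rely on a library or a service-mesh feature than hand-rolled code:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures instead of piling on retries."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```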

Metrics Analysis

CPU Utilization

CPU metrics showed some anomalies during the incident period, with values ranging from 1% to 20%, averaging around 10.5%. The sporadic nature of the data points suggests potential resource contention issues.

Recommendation Counter

The app_recommendations_counter metric for the recommendation service showed a value of 340,330 at the end of the monitoring period. This metric was flagged as anomalous, potentially indicating an unusual number of recommendation requests or failures.

Recommendations

Immediate Actions

  1. Disable Feature Flags: Turn off the "productCatalogFailure" feature flag in the flagd service
  2. Restart Services: Restart the load-generator service to clear any lingering issues
  3. Increase Monitoring: Temporarily increase the frequency of health checks on affected services

Short-term Improvements (1-2 weeks)

  1. Feature Flag Governance: Establish clearer protocols for enabling failure-inducing feature flags, including:

    • Notification systems for when failure flags are enabled
    • Automatic time-based rollbacks for testing flags (see the rollback sketch after this list)
    • Documentation requirements for chaos engineering experiments
  2. Resource Optimization:

    • Review resource allocation across the cluster, particularly for the load-generator
    • Optimize the OpenTelemetry collector configuration to handle export spikes
  3. Error Handling Enhancements:

    • Implement more robust error handling in services that depend on the product catalog
    • Add fallback mechanisms for recommendation service when product data is unavailable
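
As an illustration of the time-based rollback idea in item 1, a small watchdog could rewrite flagd's file-based flag configuration once a TTL expires; flagd watches its flag file and re-evaluates on change. The file path, flag schema details, and TTL below are assumptions for the sketch, not the demo's actual setup:

```python
import json
import time

FLAG_FILE = "/etc/flagd/demo.flagd.json"  # assumed path to flagd's flag file
FLAG_KEY = "productCatalogFailure"
TTL_SECONDS = 15 * 60  # auto-disable chaos flags after 15 minutes

def rollback_after_ttl() -> None:
    time.sleep(TTL_SECONDS)
    with open(FLAG_FILE) as f:
        config = json.load(f)
    # Flipping the defaultVariant back to "off" ends the experiment;
    # flagd picks up the file change without a restart.
    config["flags"][FLAG_KEY]["defaultVariant"] = "off"
    with open(FLAG_FILE, "w") as f:
        json.dump(config, f, indent=2)
```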

Long-term Initiatives (1-3 months)

  1. Resilience Testing Framework:

    • Formalize chaos engineering practices with proper monitoring and automatic rollbacks
    • Develop a comprehensive test plan for failure scenarios
  2. Circuit Breaker Implementation:

    • Add circuit breakers to prevent cascading failures when one service experiences issues
    • Implement retry budgets and backoff strategies (see the backoff sketch after this list)
  3. Enhanced Monitoring:

    • Implement more granular monitoring for key service interactions
    • Add anomaly detection with automated alerting
    • Create dashboards specifically for tracking feature flag impacts
  4. Performance Optimization:

    • Conduct a thorough performance review of critical API paths
    • Implement caching strategies for frequently accessed data
    • Optimize database queries and connection pooling
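
For item 2's retry budgets and backoff strategies, a minimal sketch of capped exponential backoff with full jitter; the bounds are illustrative:

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base: float = 0.1, cap: float = 2.0):
    """Retry with capped, jittered exponential backoff so a struggling
    downstream service is not hammered by synchronized retries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```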

Conclusion

The observed incidents appear to be a combination of intentional testing (via feature flags) and resource constraints. While the impact on actual user experience may have been limited due to the testing nature of some components, these incidents highlight areas for improvement in the system's resilience and error handling capabilities.

The interconnected nature of the microservices architecture is evident in how failures propagate between services, underscoring the importance of robust fault tolerance mechanisms and comprehensive monitoring.

Appendix: Key Trace Information

Error Trace: 51c8a3955f26289b206907abbd2e8de4

This trace shows how a request from the load-generator travelled through the recommendation service to the product-catalog service, where the error originated and propagated back up the call chain:

  • Root span: GET request from load-generator to frontend-proxy for recommendations
  • Error in product-catalog service: "Error: Product Catalog Fail Feature Flag Enabled"
  • Feature flag "productCatalogFailure" set to "on" variant
  • HTTP 500 status code returned to the client

This trace provides clear evidence that the product catalog errors were triggered by an intentional feature flag configuration.

Appendix B: Reproducing the Analysis

The following tools and commands can be used to recreate the visualizations and analysis in this report:

Getting Service Information

```
mcp0_servicesGet
```

Finding Top Errors

```
mcp0_errorsGetTop
{
  "limit": 20,
  "timeRange": {"start": "2025-05-24T10:44:27-04:00", "end": "2025-05-25T10:44:27-04:00"}
}
```

Analyzing Specific Traces

```
mcp0_traceAnalyze
{
  "traceId": "51c8a3955f26289b206907abbd2e8de4"
}
```

Getting Span Details

```
mcp0_spanGet
{
  "spanId": "f86b60f460866b66"
}
```

Detecting Metric Anomalies

```
mcp0_detectMetricAnomalies
{
  "startTime": "2025-05-24T10:44:27-04:00",
  "endTime": "2025-05-25T10:44:27-04:00"
}
```

Generating Metrics Aggregations

```
mcp0_generateMetricsRangeAggregation
{
  "startTime": "2025-05-24T10:44:27-04:00",
  "endTime": "2025-05-25T10:44:27-04:00",
  "metricField": "app_recommendations_counter",
  "service": "recommendation"
}
```

Creating Incident Timeline Visualization

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "incident-timeline",
      "services": ["load-generator", "product-catalog", "recommendation"],
      "maxEvents": 20,
      "correlateEvents": true
    }
  }
}
```

Creating Service Dependency Visualization

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "service-dependency"
    }
  }
}
```

Creating Error Distribution Pie Chart

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "error-pie",
      "services": ["product-catalog", "load-generator", "recommendation"],
      "maxResults": 10,
      "showData": true
    }
  }
}
```

Creating Span Duration Anomalies Chart

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "xy-chart",
      "chartType": "bar",
      "dataType": "traces",
      "xField": "Name",
      "yField": "Duration",
      "title": "Span Duration Anomalies",
      "multiSeries": true,
      "seriesField": "Resource.service.name"
    }
  }
}
```

Finding Specific Logs

```
mcp0_findLogs
{
  "level": "error",
  "timeRange": {
    "start": "2025-05-24T10:44:27-04:00",
    "end": "2025-05-25T10:44:27-04:00"
  }
}
```