Incident Report: 2025-05-25

Executive Summary

During the past 24 hours, multiple anomalies were detected in the OpenTelemetry Demo application. Analysis of the telemetry data reveals three primary incidents:

  1. Product Catalog Service Failures: Feature flag-induced errors in the product-catalog service
  2. Load Generator Connection Issues: Multiple errors in the load-generator service, including export failures and browser context errors
  3. Performance Degradation: Significant latency spikes detected across multiple services

Incident Timeline

```mermaid
timeline
    title Incident Timeline (with correlation)
    section load-generator
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
        01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
        01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
        07#58;42 PM : Error POST [8768d8ad...][4bee997e...] : critical
        07#58;42 PM : Error POST [a930663a...][27f0d495...] : critical
        07#58;42 PM : Error GET [73465730...][be2732b2...] : critical
        07#58;43 PM : Error GET [0a2da977...][1fe25afa...] : critical
        07#58;43 PM : Error GET [51c8a395...][d52f6d2b...] : critical
    section product-catalog
        07#58;43 PM : Error oteldemo.ProductCatalogService/GetProduct [0a2da977...][9cef039b...] : critical
        07#58;43 PM : Error oteldemo.ProductCatalogService/GetProduct [51c8a395...][f86b60f4...] : critical
```

Error Distribution

```mermaid
pie
    title "Error Distribution"
    "load-generator" : 5
    "product-catalog" : 2
```

Service Dependencies

The service dependency graph below shows the key relationships between affected services:

```mermaid
graph TD
  B["frontend"]
  G["frontend-proxy"]
  J["load-generator"]
  L["product-catalog"]
  M["recommendation"]

  B --> |20 calls| G
  B --> |27 calls| J
  B --> |9 calls| L
  B --> |2 calls| M

  G --> |19 calls| J
  G --> |7 calls| L
  G --> |3 calls| M

  J --> |19 calls| G
  J --> |14 calls| L
  J --> |9 calls| M

  L --> |9 calls| B
  L --> |7 calls| M
  L --> |7 calls| G
  L --> |14 calls| J

  M --> |7 calls| L
  M --> |9 calls| J
```

Span Duration Anomalies

```mermaid
xychart-beta
    title "Span Duration Anomalies"
    x-axis "Operation" ["GET", "ProductCatalogService", "GET /api/products", "API route /products", "API route /cart"]
    y-axis "Duration (ms)" 0 --> 100
    bar "frontend" [27, 24, 13, 13, 12]
    bar "load-generator" [91, 0, 0, 0, 0]
```

Detailed Analysis

Incident 1: Product Catalog Feature Flag Failures

Symptoms

  • Errors in the product-catalog service with message: "Error: Product Catalog Fail Feature Flag Enabled"
  • HTTP 500 status codes returned from product catalog API
  • Errors propagated to dependent services (recommendation, frontend)

Root Cause Analysis

The errors in the product catalog service were triggered by an activated feature flag named "productCatalogFailure". This feature flag was set to the "on" variant, deliberately causing failures in the GetProduct method of the ProductCatalogService.

From the span data:

"Events.feature_flag.feature_flag.key": "productCatalogFailure",
"Events.feature_flag.feature_flag.provider_name": "flagd",
"Events.feature_flag.feature_flag.variant": "on"

This appears to be an intentional failure mode, likely for chaos engineering or testing purposes. The feature flag is managed by the "flagd" service, which is part of the OpenTelemetry Demo application.
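
For illustration, a failure-injection check of this kind typically looks like the sketch below. This is a hedged Python sketch using the OpenFeature SDK with the flagd provider; the demo's actual product-catalog service is written in Go, and the `lookup_product` helper and connection details here are assumptions, not the demo's code.

```python
# Minimal sketch (not the demo's Go implementation) of how a
# failure-injecting feature flag is usually evaluated per request.
from openfeature import api
from openfeature.contrib.provider.flagd import FlagdProvider

api.set_provider(FlagdProvider(host="flagd", port=8013))  # assumed address
client = api.get_client()

def lookup_product(product_id: str) -> dict:
    """Placeholder for the real catalog lookup."""
    return {"id": product_id}

def get_product(product_id: str) -> dict:
    # When the chaos flag evaluates to the "on" variant, fail deliberately,
    # mirroring the "Product Catalog Fail Feature Flag Enabled" error
    # observed in the span data.
    if client.get_boolean_value("productCatalogFailure", False):
        raise RuntimeError("Error: Product Catalog Fail Feature Flag Enabled")
    return lookup_product(product_id)
```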

Impact

  • Failed product lookups for users
  • Degraded user experience when browsing products
  • Cascading errors to the recommendation service, which depends on product data
  • 2 documented error occurrences; the true user-facing impact was likely broader than the logged errors capture

Mitigation

  1. Short-term: Disable the "productCatalogFailure" feature flag in the flagd service
  2. Medium-term: Implement more robust fallback mechanisms in dependent services (see the sketch after this list)
  3. Long-term: Add better monitoring and alerting for feature flag-induced failures
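
To make item 2 concrete, a dependent service such as recommendation could serve the last known product data when the catalog call fails instead of propagating an HTTP 500. A minimal sketch, assuming a hypothetical `fetch_product` client for the catalog RPC:

```python
# Hedged sketch: fall back to cached product data on catalog failure.
_product_cache: dict[str, dict] = {}

def fetch_product(product_id: str) -> dict:
    """Hypothetical stand-in for the gRPC call to
    oteldemo.ProductCatalogService/GetProduct."""
    raise NotImplementedError

def get_product_with_fallback(product_id: str) -> dict | None:
    try:
        product = fetch_product(product_id)
        _product_cache[product_id] = product  # refresh cache on success
        return product
    except Exception:
        # Serve the last known value; None means "omit this item" rather
        # than failing the entire recommendation response.
        return _product_cache.get(product_id)
```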

Incident 2: Load Generator Connection Issues

Symptoms

  • Multiple errors in the load-generator service
  • Connection issues with the OpenTelemetry collector
  • Browser context errors in the load testing framework
  • HTTP errors during simulated user traffic

Root Cause Analysis

The load-generator service experienced two distinct issues:

  1. Telemetry Export Failures: The service was unable to export logs and traces to the OpenTelemetry collector, as evidenced by the "StatusCode.UNAVAILABLE" errors. This suggests network connectivity issues or resource constraints on the collector (an exporter-tuning sketch follows the log excerpt below).

  2. Browser Automation Failures: The load-generator uses Playwright for browser automation to simulate user traffic. The "Browser.new_context: Target page, context or browser has been closed" errors indicate that browser contexts were being closed unexpectedly, likely due to resource constraints or timing issues.

From the error logs:

"message": "Browser.new_context: Target page, context or browser has been closed\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.12/site-packages/locust/user/task.py\", line 340, in run\n    self.execute_next_task()\n..."

Impact

  • Reduced effectiveness of load testing
  • Potential gaps in monitoring coverage due to incomplete test scenarios
  • 5 documented error occurrences
  • No direct impact on end-user experience as this is a testing component

Mitigation

  1. Short-term: Restart the load-generator service and monitor for recurrence
  2. Medium-term:
    • Increase resource allocation for the load-generator pods
    • Optimize the OpenTelemetry collector configuration to handle export spikes
  3. Long-term:
    • Implement better error handling and retry logic in the browser automation code (see the sketch after this list)
    • Consider implementing a more graceful shutdown procedure for browser contexts
    • Add dedicated monitoring for the load testing infrastructure
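
A minimal sketch of the retry logic suggested in the long-term items, using Playwright's sync API (the load generator drives browsers through Playwright); the attempt counts and delays are illustrative:

```python
import time
from playwright.sync_api import Error as PlaywrightError, sync_playwright

def new_context_with_retry(browser, attempts: int = 3, delay: float = 1.0):
    """Retry context creation so one closed browser does not kill the task."""
    for attempt in range(1, attempts + 1):
        try:
            return browser.new_context()
        except PlaywrightError:
            # "Target page, context or browser has been closed" lands here;
            # back off briefly before trying again.
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = new_context_with_retry(browser)
    context.close()
    browser.close()
```

A fuller version would relaunch the browser once retries are exhausted, since a browser that has itself been closed can never create new contexts.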

Incident 3: Performance Degradation

Symptoms

  • Significant latency spikes in several API operations
  • GET operations showing high durations, particularly in the load-generator service
  • Multiple operations exceeding their normal performance thresholds

Root Cause Analysis

The performance degradation appears to be correlated with the other incidents, particularly the product catalog failures. When the product catalog service fails, dependent services fall back on retry logic and timeout handling, which drives up end-to-end latency.

The span duration anomalies visualization shows that:

  • The load-generator service experienced the highest latency spikes (91ms for GET operations)
  • The frontend service also showed elevated latencies across multiple operations

The timing of these performance issues aligns with the product catalog errors, suggesting a causal relationship.

Impact

  • Degraded user experience due to slow page loads and API responses
  • Increased resource utilization from retries and timeouts
  • Potential for cascading failures if timeouts are not properly handled

Mitigation

  1. Short-term: Address the root cause (product catalog feature flag)
  2. Medium-term:
    • Optimize API response times, particularly for critical paths
    • Review and adjust timeout settings across services
  3. Long-term:
    • Implement circuit breakers to prevent cascading failures (see the sketch after this list)
    • Add better caching strategies for frequently accessed data
    • Enhance performance monitoring with more granular thresholds
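
To make the circuit-breaker suggestion concrete, here is a minimal illustrative sketch; a production deployment would more likely rely on a library or a service-mesh feature than hand-rolled code:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures instead of piling on retries."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```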

Metrics Analysis

CPU Utilization

CPU metrics showed some anomalies during the incident period, with values ranging from 1% to 20%, averaging around 10.5%. The sporadic nature of the data points suggests potential resource contention issues.

Recommendation Counter

The app_recommendations_counter metric for the recommendation service showed a value of 340,330 at the end of the monitoring period. This metric was flagged as anomalous, potentially indicating an unusual number of recommendation requests or failures.

Recommendations

Immediate Actions

  1. Disable Feature Flags: Turn off the "productCatalogFailure" feature flag in the flagd service
  2. Restart Services: Restart the load-generator service to clear any lingering issues
  3. Increase Monitoring: Temporarily increase the frequency of health checks on affected services

Short-term Improvements (1-2 weeks)

  1. Feature Flag Governance: Establish clearer protocols for enabling failure-inducing feature flags, including:

    • Notification systems for when failure flags are enabled
    • Automatic time-based rollbacks for testing flags (see the rollback sketch after this list)
    • Documentation requirements for chaos engineering experiments
  2. Resource Optimization:

    • Review resource allocation across the cluster, particularly for the load-generator
    • Optimize the OpenTelemetry collector configuration to handle export spikes
  3. Error Handling Enhancements:

    • Implement more robust error handling in services that depend on the product catalog
    • Add fallback mechanisms for recommendation service when product data is unavailable
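
As an illustration of the time-based rollback idea in item 1, a small watchdog could rewrite flagd's file-based flag configuration once a TTL expires; flagd watches its flag file and re-evaluates on change. The file path, flag schema details, and TTL below are assumptions for the sketch, not the demo's actual setup:

```python
import json
import time

FLAG_FILE = "/etc/flagd/demo.flagd.json"  # assumed path to flagd's flag file
FLAG_KEY = "productCatalogFailure"
TTL_SECONDS = 15 * 60  # auto-disable chaos flags after 15 minutes

def rollback_after_ttl() -> None:
    time.sleep(TTL_SECONDS)
    with open(FLAG_FILE) as f:
        config = json.load(f)
    # Flipping the defaultVariant back to "off" ends the experiment;
    # flagd picks up the file change without a restart.
    config["flags"][FLAG_KEY]["defaultVariant"] = "off"
    with open(FLAG_FILE, "w") as f:
        json.dump(config, f, indent=2)
```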

Long-term Initiatives (1-3 months)

  1. Resilience Testing Framework:

    • Formalize chaos engineering practices with proper monitoring and automatic rollbacks
    • Develop a comprehensive test plan for failure scenarios
  2. Circuit Breaker Implementation:

    • Add circuit breakers to prevent cascading failures when one service experiences issues
    • Implement retry budgets and backoff strategies (see the backoff sketch after this list)
  3. Enhanced Monitoring:

    • Implement more granular monitoring for key service interactions
    • Add anomaly detection with automated alerting
    • Create dashboards specifically for tracking feature flag impacts
  4. Performance Optimization:

    • Conduct a thorough performance review of critical API paths
    • Implement caching strategies for frequently accessed data
    • Optimize database queries and connection pooling
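
For item 2's retry budgets and backoff strategies, a minimal sketch of capped exponential backoff with full jitter; the bounds are illustrative:

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base: float = 0.1, cap: float = 2.0):
    """Retry with capped, jittered exponential backoff so a struggling
    downstream service is not hammered by synchronized retries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```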

Conclusion

The observed incidents appear to be a combination of intentional testing (via feature flags) and resource constraints. While the impact on actual user experience may have been limited due to the testing nature of some components, these incidents highlight areas for improvement in the system's resilience and error handling capabilities.

The interconnected nature of the microservices architecture is evident in how failures propagate between services, underscoring the importance of robust fault tolerance mechanisms and comprehensive monitoring.

Appendix: Key Trace Information

Error Trace: 51c8a3955f26289b206907abbd2e8de4

This trace shows how a request from the load-generator travelled through the recommendation service to the product-catalog service, where the error originated and propagated back up the call chain:

  • Root span: GET request from load-generator to frontend-proxy for recommendations
  • Error in product-catalog service: "Error: Product Catalog Fail Feature Flag Enabled"
  • Feature flag "productCatalogFailure" set to "on" variant
  • HTTP 500 status code returned to the client

This trace provides clear evidence that the product catalog errors were triggered by an intentional feature flag configuration.

Appendix B: Reproducing the Analysis

The following tools and commands can be used to recreate the visualizations and analysis in this report:

Getting Service Information

```
mcp0_servicesGet
```

Finding Top Errors

```
mcp0_errorsGetTop
{
  "limit": 20,
  "timeRange": {"start": "2025-05-24T10:44:27-04:00", "end": "2025-05-25T10:44:27-04:00"}
}
```

Analyzing Specific Traces

```
mcp0_traceAnalyze
{
  "traceId": "51c8a3955f26289b206907abbd2e8de4"
}
```

Getting Span Details

```
mcp0_spanGet
{
  "spanId": "f86b60f460866b66"
}
```

Detecting Metric Anomalies

```
mcp0_detectMetricAnomalies
{
  "startTime": "2025-05-24T10:44:27-04:00",
  "endTime": "2025-05-25T10:44:27-04:00"
}
```

Generating Metrics Aggregations

```
mcp0_generateMetricsRangeAggregation
{
  "startTime": "2025-05-24T10:44:27-04:00",
  "endTime": "2025-05-25T10:44:27-04:00",
  "metricField": "app_recommendations_counter",
  "service": "recommendation"
}
```

Creating Incident Timeline Visualization

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "incident-timeline",
      "services": ["load-generator", "product-catalog", "recommendation"],
      "maxEvents": 20,
      "correlateEvents": true
    }
  }
}
```

Creating Service Dependency Visualization

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "service-dependency"
    }
  }
}
```

Creating Error Distribution Pie Chart

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "error-pie",
      "services": ["product-catalog", "load-generator", "recommendation"],
      "maxResults": 10,
      "showData": true
    }
  }
}
```

Creating Span Duration Anomalies Chart

```
mcp0_generateMarkdownVisualizations
{
  "config": {
    "timeRange": {
      "start": "2025-05-24T10:44:27-04:00",
      "end": "2025-05-25T10:44:27-04:00"
    },
    "config": {
      "type": "xy-chart",
      "chartType": "bar",
      "dataType": "traces",
      "xField": "Name",
      "yField": "Duration",
      "title": "Span Duration Anomalies",
      "multiSeries": true,
      "seriesField": "Resource.service.name"
    }
  }
}
```

Finding Specific Logs

```
mcp0_findLogs
{
  "level": "error",
  "timeRange": {
    "start": "2025-05-24T10:44:27-04:00",
    "end": "2025-05-25T10:44:27-04:00"
  }
}
```