Over the past 24 hours, analysis of telemetry data from the OpenTelemetry Demo application revealed three primary incidents:
- Product Catalog Service Failures: Feature flag-induced errors in the product-catalog service
- Load Generator Connection Issues: Multiple errors in the load-generator service, including export failures and browser context errors
- Performance Degradation: Significant latency spikes detected across multiple services
The correlated incident timeline below shows the sequence of warnings and errors across the affected services (Mermaid timeline source):

    timeline
        title Incident Timeline (with correlation)
        section load-generator
            01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
            01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
            01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
            01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
            01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
            01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
            01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
            01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otel-collector#58;4317... : warning
            01#58;59 PM : Warning Transient error StatusCode.UNAVAILABLE encountered while exporting traces to otel-collector#58;43... : warning
            01#58;59 PM : Error Browser.new_context#58; Target page, context or browser has been closed Traceback (most recent ca... : critical
            07#58;42 PM : Error POST [8768d8ad...][4bee997e...] : critical
            07#58;42 PM : Error POST [a930663a...][27f0d495...] : critical
            07#58;42 PM : Error GET [73465730...][be2732b2...] : critical
            07#58;43 PM : Error GET [0a2da977...][1fe25afa...] : critical
            07#58;43 PM : Error GET [51c8a395...][d52f6d2b...] : critical
        section product-catalog
            07#58;43 PM : Error oteldemo.ProductCatalogService/GetProduct [0a2da977...][9cef039b...] : critical
            07#58;43 PM : Error oteldemo.ProductCatalogService/GetProduct [51c8a395...][f86b60f4...] : critical
The chart below shows the distribution of detected errors across the affected services (Mermaid pie source):
    pie
        title "Error Distribution"
        "load-generator" : 5
        "product-catalog" : 2
The service dependency graph below shows the key relationships between affected services:
    graph TD
        B["frontend"]
        G["frontend-proxy"]
        J["load-generator"]
        L["product-catalog"]
        M["recommendation"]
        B --> |20 calls| G
        B --> |27 calls| J
        B --> |9 calls| L
        B --> |2 calls| M
        G --> |19 calls| J
        G --> |7 calls| L
        G --> |3 calls| M
        J --> |19 calls| G
        J --> |14 calls| L
        J --> |9 calls| M
        L --> |9 calls| B
        L --> |7 calls| M
        L --> |7 calls| G
        L --> |14 calls| J
        M --> |7 calls| L
        M --> |9 calls| J
The chart below summarizes span duration anomalies by operation and service (Mermaid xychart source):

    xychart-beta
        title "Span Duration Anomalies"
        x-axis "Operation" ["GET", "ProductCatalogService", "GET /api/products", "API route /products", "API route /cart"]
        y-axis "Duration (ms)" 0 --> 100
        bar "frontend" [27, 24, 13, 13, 12]
        bar "load-generator" [91, 0, 0, 0, 0]
The first incident involved the product-catalog service. Observed symptoms:
- Errors in the product-catalog service with the message "Error: Product Catalog Fail Feature Flag Enabled"
- HTTP 500 status codes returned from the product catalog API
- Errors propagated to dependent services (recommendation, frontend)
The errors in the product catalog service were triggered by an activated feature flag named "productCatalogFailure". This feature flag was set to the "on" variant, deliberately causing failures in the GetProduct method of the ProductCatalogService.
From the span data:

    "Events.feature_flag.feature_flag.key": "productCatalogFailure",
    "Events.feature_flag.feature_flag.provider_name": "flagd",
    "Events.feature_flag.feature_flag.variant": "on"
This appears to be an intentional failure mode, likely for chaos engineering or testing purposes. The feature flag is managed by the "flagd" service, which is part of the OpenTelemetry Demo application.
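As a concrete illustration of the short-term fix, the flag's default variant can be switched to "off" in the flagd flag definitions. The snippet below is a minimal sketch, assuming the definitions live in a JSON file such as src/flagd/demo.flagd.json with "on"/"off" variants; the actual path and schema may differ in your deployment.

```python
# Minimal sketch, assuming the flag definitions live in a JSON file such as
# src/flagd/demo.flagd.json with "on"/"off" variants (path and schema are
# assumptions): switch productCatalogFailure's default variant to "off".
import json

FLAG_FILE = "src/flagd/demo.flagd.json"  # assumed location of the flag config

with open(FLAG_FILE) as f:
    flag_config = json.load(f)

flag_config["flags"]["productCatalogFailure"]["defaultVariant"] = "off"

with open(FLAG_FILE, "w") as f:
    json.dump(flag_config, f, indent=2)
```

If flagd is configured to watch its file source, the change should take effect without a restart; otherwise, restart the flagd service and confirm that GetProduct calls stop returning the induced error.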
The impact of this incident included:
- Failed product lookups for users
- Degraded user experience when browsing products
- Cascading errors to the recommendation service, which depends on product data
- 2 documented error occurrences, with likely additional unrecorded impact on user experience
Recommended actions:
- Short-term: Disable the "productCatalogFailure" feature flag in the flagd service (see the sketch above)
- Medium-term: Implement more robust fallback mechanisms in dependent services
- Long-term: Add better monitoring and alerting for feature flag-induced failures
The second incident involved the load-generator service. Observed symptoms:
- Multiple errors in the load-generator service
- Connection issues with the OpenTelemetry collector
- Browser context errors in the load testing framework
- HTTP errors during simulated user traffic
The load-generator service experienced two distinct issues:
- Telemetry Export Failures: The service was unable to export logs and traces to the OpenTelemetry collector, as evidenced by the "StatusCode.UNAVAILABLE" errors. This suggests network connectivity issues or resource constraints on the collector.
- Browser Automation Failures: The load-generator uses Playwright for browser automation to simulate user traffic. The "Browser.new_context: Target page, context or browser has been closed" errors indicate that browser contexts were being closed unexpectedly, likely due to resource constraints or timing issues.
From the error logs:

    "message": "Browser.new_context: Target page, context or browser has been closed\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.12/site-packages/locust/user/task.py\", line 340, in run\n self.execute_next_task()\n..."
The impact of these failures included:
- Reduced effectiveness of load testing
- Potential gaps in monitoring coverage due to incomplete test scenarios
- 5 documented error occurrences
- No direct impact on end-user experience, as this is a testing component
Recommended actions:
- Short-term: Restart the load-generator service and monitor for recurrence
- Medium-term:
  - Increase resource allocation for the load-generator pods
  - Optimize the OpenTelemetry collector configuration, and the exporters feeding it, to handle export spikes (see the sketch after this list)
- Long-term:
  - Implement better error handling and retry logic in the browser automation code
  - Consider implementing a more graceful shutdown procedure for browser contexts
  - Add dedicated monitoring for the load testing infrastructure
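For the export-spike item above, one option is to tune the load-generator's own OTLP exporter and batch processor in addition to the collector. The snippet below is a minimal sketch assuming the Python SDK's gRPC OTLP exporter (consistent with the otel-collector:4317 endpoint in the warnings); the specific limits are illustrative, not recommended values.

```python
# Hedged sketch of exporter-side tuning in the load-generator's OpenTelemetry
# SDK setup: a larger batch queue and a longer export timeout absorb short
# windows of collector unavailability instead of surfacing repeated
# StatusCode.UNAVAILABLE warnings. All numeric values are assumptions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="otel-collector:4317",  # collector endpoint seen in the warnings above
    insecure=True,
    timeout=10,  # seconds per export attempt
)

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        exporter,
        max_queue_size=4096,         # buffer more spans while the collector recovers
        schedule_delay_millis=5000,  # export less frequently, in larger batches
        max_export_batch_size=512,
    )
)
trace.set_tracer_provider(provider)
```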
The third incident was a broad performance degradation. Observed symptoms:
- Significant latency spikes in several API operations
- GET operations showing high durations, particularly in the load-generator service
- Multiple operations exceeding their normal performance thresholds
The performance degradation appears to be correlated with the other incidents, particularly the product catalog failures. When the product-catalog service fails, dependent services trigger retries and timeout handling, which increases latency.
The span duration anomalies visualization shows that:
- The load-generator service experienced the highest latency spikes (91ms for GET operations)
- The frontend service also showed elevated latencies across multiple operations
The timing of these performance issues aligns with the product catalog errors, suggesting a causal relationship.
The impact of the degradation included:
- Degraded user experience due to slow page loads and API responses
- Increased resource utilization from retries and timeouts
- Potential for cascading failures if timeouts are not properly handled
Recommended actions:
- Short-term: Address the root cause (the product catalog feature flag)
- Medium-term:
  - Optimize API response times, particularly for critical paths
  - Review and adjust timeout settings across services
- Long-term:
  - Implement circuit breakers to prevent cascading failures
  - Add better caching strategies for frequently accessed data (see the sketch after this list)
  - Enhance performance monitoring with more granular thresholds
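To illustrate the timeout and caching items above, the sketch below bounds product lookups with an explicit timeout and a short TTL cache. It uses a hypothetical HTTP client for brevity; the demo's catalog is actually reached via the gRPC oteldemo.ProductCatalogService, but the same pattern applies. Endpoint, timeout, and TTL values are assumptions.

```python
# Minimal sketch: bound product lookups with an explicit timeout and a short
# TTL cache so hot products keep serving while the catalog is slow or failing.
# PRODUCT_API is a hypothetical endpoint; the demo uses gRPC for this call.
import requests
from cachetools import TTLCache, cached

PRODUCT_API = "http://product-catalog:8080/products"  # assumed endpoint

@cached(cache=TTLCache(maxsize=1024, ttl=30))  # cache each product for 30 seconds
def get_product(product_id: str) -> dict:
    # Fail fast rather than holding request threads while the upstream degrades.
    response = requests.get(f"{PRODUCT_API}/{product_id}", timeout=2.0)
    response.raise_for_status()
    return response.json()
```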
CPU metrics showed some anomalies during the incident period, with values ranging from 1% to 20%, averaging around 10.5%. The sporadic nature of the data points suggests potential resource contention issues.
The app_recommendations_counter metric for the recommendation service showed a value of 340,330 at the end of the monitoring period. This metric was flagged as anomalous, potentially indicating an unusual number of recommendation requests or failures.
Recommended immediate actions:
- Disable Feature Flags: Turn off the "productCatalogFailure" feature flag in the flagd service
- Restart Services: Restart the load-generator service to clear any lingering issues
- Increase Monitoring: Temporarily increase the frequency of health checks on affected services
Longer-term recommendations:
- Feature Flag Governance: Establish clearer protocols for enabling failure-inducing feature flags, including:
  - Notification systems for when failure flags are enabled
  - Automatic time-based rollbacks for testing flags
  - Documentation requirements for chaos engineering experiments
- Resource Optimization:
  - Review resource allocation across the cluster, particularly for the load-generator
  - Optimize the OpenTelemetry collector configuration to handle export spikes
- Error Handling Enhancements:
  - Implement more robust error handling in services that depend on the product catalog
  - Add fallback mechanisms for the recommendation service when product data is unavailable
- Resilience Testing Framework:
  - Formalize chaos engineering practices with proper monitoring and automatic rollbacks
  - Develop a comprehensive test plan for failure scenarios
- Circuit Breaker Implementation (see the sketch after this list):
  - Add circuit breakers to prevent cascading failures when one service experiences issues
  - Implement retry budgets and backoff strategies
- Enhanced Monitoring:
  - Implement more granular monitoring for key service interactions
  - Add anomaly detection with automated alerting
  - Create dashboards specifically for tracking feature flag impacts
- Performance Optimization:
  - Conduct a thorough performance review of critical API paths
  - Implement caching strategies for frequently accessed data
  - Optimize database queries and connection pooling
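To make the circuit breaker recommendation concrete, the following is a minimal, illustrative sketch rather than a production implementation: after a threshold of consecutive failures the circuit opens and calls are short-circuited until a cooldown elapses. The thresholds and the catalog_client call in the usage comment are assumptions.

```python
# Minimal circuit breaker sketch: skip calls to a failing dependency until a
# cooldown period has elapsed, then allow a single trial call (half-open).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call skipped")
            self.opened_at = None  # half-open: allow a single trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Hypothetical usage around a product-catalog client call:
# breaker = CircuitBreaker()
# product = breaker.call(catalog_client.get_product, product_id)
```

A retry budget with exponential backoff would typically wrap the same call path, so retries stop amplifying load while the circuit is open.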
The observed incidents appear to be a combination of intentional testing (via feature flags) and resource constraints. While the impact on actual user experience may have been limited due to the testing nature of some components, these incidents highlight areas for improvement in the system's resilience and error handling capabilities.
The interconnected nature of the microservices architecture is evident in how failures propagate between services, underscoring the importance of robust fault tolerance mechanisms and comprehensive monitoring.
The analyzed trace (trace ID 51c8a3955f26289b206907abbd2e8de4) shows the propagation of errors from the load-generator through the recommendation service to the product-catalog service:
- Root span: GET request from load-generator to frontend-proxy for recommendations
- Error in product-catalog service: "Error: Product Catalog Fail Feature Flag Enabled"
- Feature flag "productCatalogFailure" set to "on" variant
- HTTP 500 status code returned to the client
This trace provides clear evidence that the product catalog errors were triggered by an intentional feature flag configuration.
The following tools and commands can be used to recreate the visualizations and analysis in this report:
mcp0_servicesGet (no parameters)

mcp0_errorsGetTop

    {
      "limit": 20,
      "timeRange": {"start": "2025-05-24T10:44:27-04:00", "end": "2025-05-25T10:44:27-04:00"}
    }

mcp0_traceAnalyze

    {
      "traceId": "51c8a3955f26289b206907abbd2e8de4"
    }

mcp0_spanGet

    {
      "spanId": "f86b60f460866b66"
    }

mcp0_detectMetricAnomalies

    {
      "startTime": "2025-05-24T10:44:27-04:00",
      "endTime": "2025-05-25T10:44:27-04:00"
    }

mcp0_generateMetricsRangeAggregation

    {
      "startTime": "2025-05-24T10:44:27-04:00",
      "endTime": "2025-05-25T10:44:27-04:00",
      "metricField": "app_recommendations_counter",
      "service": "recommendation"
    }

mcp0_generateMarkdownVisualizations (incident timeline)

    {
      "config": {
        "timeRange": {
          "start": "2025-05-24T10:44:27-04:00",
          "end": "2025-05-25T10:44:27-04:00"
        },
        "config": {
          "type": "incident-timeline",
          "services": ["load-generator", "product-catalog", "recommendation"],
          "maxEvents": 20,
          "correlateEvents": true
        }
      }
    }

mcp0_generateMarkdownVisualizations (service dependency graph)

    {
      "config": {
        "timeRange": {
          "start": "2025-05-24T10:44:27-04:00",
          "end": "2025-05-25T10:44:27-04:00"
        },
        "config": {
          "type": "service-dependency"
        }
      }
    }

mcp0_generateMarkdownVisualizations (error distribution pie)

    {
      "config": {
        "timeRange": {
          "start": "2025-05-24T10:44:27-04:00",
          "end": "2025-05-25T10:44:27-04:00"
        },
        "config": {
          "type": "error-pie",
          "services": ["product-catalog", "load-generator", "recommendation"],
          "maxResults": 10,
          "showData": true
        }
      }
    }

mcp0_generateMarkdownVisualizations (span duration chart)

    {
      "config": {
        "timeRange": {
          "start": "2025-05-24T10:44:27-04:00",
          "end": "2025-05-25T10:44:27-04:00"
        },
        "config": {
          "type": "xy-chart",
          "chartType": "bar",
          "dataType": "traces",
          "xField": "Name",
          "yField": "Duration",
          "title": "Span Duration Anomalies",
          "multiSeries": true,
          "seriesField": "Resource.service.name"
        }
      }
    }

mcp0_findLogs

    {
      "level": "error",
      "timeRange": {
        "start": "2025-05-24T10:44:27-04:00",
        "end": "2025-05-25T10:44:27-04:00"
      }
    }