@pradeep1991singh
Created March 14, 2024 21:04
operational-dashboard.md

I would recommend the following operational dashboards and metrics for a product built with a monolithic or microservice architecture:

  • Service Health Dashboard: This dashboard would display the health status of each service in the system. Key metrics might include:

    • Availability: The percentage of time each service is up and running.
    • Response Time: The average, median, 95th percentile, and 99th percentile response times for each service.
    • 2xx Responses: The number or percentage of requests that result in 2xx (success) HTTP status codes.
    • 4xx Responses: The number or percentage of requests that result in 4xx (client error) HTTP status codes. These indicate issues like bad requests or unauthorized access.
    • 5xx Responses: The number or percentage of requests that result in 5xx (server error) HTTP status codes. These indicate issues with your services.
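    • Error Rate: The number or percentage of requests that result in errors. This could be calculated as the sum of 4xx and 5xx responses, although 5xx responses alone are often a cleaner signal of service-side failures, since 4xx responses originate with the client.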

These HTTP response code metrics can provide valuable insights into the types of responses your services are generating. For example, a high number of 5xx responses could indicate a problem with a service, while a high number of 4xx responses could indicate a problem with the requests being sent to the service.
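
As a rough illustration of how the response-code and response-time metrics above could be derived from raw request data, here is a minimal Python sketch. The input shape (a list of `(status_code, duration_ms)` tuples) and the function names are assumptions made for this example, and the percentile uses the simple nearest-rank method; a real monitoring system would typically compute these from histograms or pre-aggregated counters instead.

```python
import math
from collections import Counter


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_requests(requests):
    """Summarize a non-empty list of (status_code, duration_ms) tuples.

    The tuple shape is a hypothetical input format chosen for this sketch.
    """
    total = len(requests)
    # Group status codes by class: 2 for 2xx, 4 for 4xx, 5 for 5xx.
    classes = Counter(status // 100 for status, _ in requests)
    durations = sorted(duration for _, duration in requests)
    return {
        "success_rate": classes[2] / total,
        "client_error_rate": classes[4] / total,
        "server_error_rate": classes[5] / total,
        # Error rate as defined in the list above: 4xx plus 5xx.
        "error_rate": (classes[4] + classes[5]) / total,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

For example, `summarize_requests([(200, 10), (200, 20), (404, 30), (500, 40)])` reports a success rate of 0.5, client and server error rates of 0.25 each, and a median response time of 20 ms.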

  • System Load Dashboard: This dashboard would display metrics related to the load on the system. Key metrics might include:

    • Request Rate: The number of requests per second the system is handling.
    • Concurrency: The number of requests being handled concurrently.
    • Queue Length: The number of requests waiting to be handled.
    • Load Average: The average system load over a certain period of time (on Unix-like systems, typically reported over the last 1, 5, and 15 minutes).
    • Throughput: The amount of work the system processes per unit of time, often measured in requests per second or bytes per second.
    • Latency: The time a request takes to travel from the sender to the receiver and be processed there.
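
Several of the load metrics above are related to each other, which can be useful for sanity-checking a dashboard. The sketch below uses Little's Law (average requests in flight equals arrival rate times average time in the system) to estimate concurrency from request rate and latency. The function names are hypothetical, and the estimate only holds for a system in a roughly steady state.

```python
def estimated_concurrency(request_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate * average latency."""
    return request_rate_per_s * avg_latency_s


def is_queue_growing(arrival_rate_per_s: float, completion_rate_per_s: float) -> bool:
    """A queue grows whenever requests arrive faster than they are completed."""
    return arrival_rate_per_s > completion_rate_per_s
```

For instance, a service receiving 200 requests per second with an average latency of 50 ms should show roughly 10 concurrent requests; a measured concurrency far above that would suggest queuing somewhere in the path.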
  • Resource Utilization Dashboard: This dashboard would display metrics related to the utilization of system resources. Key metrics might include:

    • CPU Utilization: The percentage of CPU capacity being used.
    • Memory Utilization: The amount of memory being used.
    • Disk Utilization: The amount of disk space being used.
    • Network Utilization: The amount of network bandwidth being used.
    • Database Connections: The number of active and idle connections to your database.
    • Garbage Collection: The time and resources spent on garbage collection in your services, which can impact performance.
    • Thread Usage: The number of active threads in your application's thread pool, which can impact concurrency and resource usage.
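
To give a feel for where some of these resource numbers come from, here is a small Python sketch that samples a few of them using only the standard library. The function name `resource_snapshot` and the dictionary keys are assumptions for this example; it covers only the current process and host, and a production setup would more likely rely on a metrics agent or exporter.

```python
import gc
import os
import shutil
import threading


def resource_snapshot(path: str = "/") -> dict:
    """Collect a few resource metrics for the current host and process."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    snapshot = {
        "cpu_count": os.cpu_count(),
        "disk_used_pct": 100 * disk.used / disk.total,
        # Threads currently alive in this Python process.
        "threads": threading.active_count(),
        # Total garbage collection runs across all generations so far.
        "gc_collections": sum(gen["collections"] for gen in gc.get_stats()),
    }
    # os.getloadavg is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained on this platform
    return snapshot
```

Sampling a function like this on a fixed interval and plotting the values over time is one simple way to populate such a dashboard.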
  • Business Metrics Dashboard: This dashboard would display metrics related to the business aspects of the product. Key metrics might include:

    • User Activity: The number of active users, sessions, page views, etc.
    • User Engagement: Metrics like session duration, pages per session, and bounce rate.
    • Conversion Rate: The percentage of users who complete a desired action (e.g., making a purchase, signing up for a trial).
    • Churn Rate: The percentage of users who stop using the product over a given period.
    • Revenue: The revenue generated by the product.
    • Customer Acquisition Cost (CAC): The cost associated with acquiring a new customer.
    • Lifetime Value (LTV): The predicted revenue that a customer can generate during their lifetime.
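
The ratio-style business metrics above reduce to simple arithmetic once the underlying counts are available. The following Python sketch shows one possible set of definitions. The function names are hypothetical, and the lifetime value formula (average revenue per period divided by churn rate) is a deliberately simplified model assumed for this example, not the only way to predict LTV.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Fraction of visitors who completed the desired action."""
    return conversions / visitors


def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of customers lost during the period."""
    return customers_lost / customers_at_start


def customer_acquisition_cost(acquisition_spend: float, new_customers: int) -> float:
    """Average spend required to acquire one new customer."""
    return acquisition_spend / new_customers


def simple_lifetime_value(avg_revenue_per_period: float, churn: float) -> float:
    """Simplified LTV: revenue per period times expected lifetime (1 / churn)."""
    return avg_revenue_per_period / churn
```

With 50 purchases from 1,000 visitors the conversion rate is 5%, and spending 5,000 to gain 100 customers gives a CAC of 50. Comparing LTV against CAC on the same dashboard shows whether acquisition spending pays for itself.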
  • Error and Exception Dashboard: This dashboard would display metrics related to errors and exceptions in the system. Key metrics might include:

    • Error Count: The number of errors occurring in the system.
    • Top Errors: The most common errors and their counts.
    • Error Locations: The services or parts of the system where errors are occurring.
    • Error Trends: How the error count is changing over time. This can help identify if a recent change has introduced new errors.
    • Unhandled Exceptions: The number of unhandled exceptions. These are exceptions that were not caught by your code and typically result in a crash.
    • Failed Jobs or Tasks: If your system involves background jobs or tasks, the number of those that failed.
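
As a minimal sketch of how the error metrics above might be aggregated, the Python function below summarizes a batch of error events. The event shape (dictionaries with `service`, `error_type`, and `handled` keys) and the function name are assumptions made for this example; in practice the events would come from a logging or error-tracking system with its own schema.

```python
from collections import Counter


def summarize_errors(events, top_n=3):
    """Summarize error events given as dicts with the hypothetical keys
    'service', 'error_type', and 'handled'."""
    by_type = Counter(event["error_type"] for event in events)
    by_service = Counter(event["service"] for event in events)
    unhandled = sum(1 for event in events if not event["handled"])
    return {
        "error_count": len(events),
        # Most frequent error types with their counts.
        "top_errors": by_type.most_common(top_n),
        # Services ordered by how many errors they produced.
        "error_locations": by_service.most_common(),
        "unhandled_exceptions": unhandled,
    }
```

Running the same summary over consecutive time windows and comparing the counts gives a basic view of error trends, for example before and after a deployment.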

These dashboards and metrics would provide a comprehensive view of the operational health and performance of the product. They would be useful for both the development team (for diagnosing and fixing issues) and the business team (for understanding the product's usage and performance).
