- Once our applications are deployed, our users don't care how we did it
- Our users only care whether the application is working or not
- Application latency: will it increase over time?
- Application outages: customer experience should not be degraded
- AWS CloudWatch
- Metrics: Collect and track key metrics
- Logs: Collect, monitor, analyze and store log files
- Events: Send notifications when certain events happen in your AWS account
- Alarms: React in real-time to metrics / events
- AWS X-Ray
- Troubleshooting application performance and errors
- Distributed tracing of microservices
- AWS CloudTrail
- Internal monitoring of API calls being made
- Audit changes to AWS resources by your users
- CloudWatch provides metrics for almost all the services in AWS
- "Metric" is a variable to monitor (CPUUtilization, NetworkIn, ...)
- Metrics belong to "namespaces"
- "Dimension" is an attribute of a metric (instance id, environment, etc...)
- Up to 30 dimensions per metric
- Metrics have "timestamps"
- Can create CloudWatch dashboards of metrics
- EC2 instances have metrics "every 5 minutes"
- With detailed monitoring (for a cost), you get data "every 1 minute"
- Use detailed monitoring if you want to scale your ASG more promptly!
- Note: EC2 Memory usage is by default not pushed (must be pushed from inside the instance as a custom metric)
- Possibility to define and send your own custom metrics to CloudWatch
- Ability to use dimensions (attributes) to segment metrics
- Instance.id
- Environment.name
- Metric resolution
- Standard: 1 minute
- High Resolution: up to 1 second (StorageResolution API parameter) - Higher cost
- Use API call "PutMetricData"
- Use exponential back off in case of throttle errors
- Alarms are used to trigger notifications for any metric
- Alarms can go to Auto Scaling, EC2 Actions, SNS notifications
- Various options (sampling, %, max, min, etc...)
- Alarm states:
- OK
- INSUFFICIENT_DATA
- ALARM
- Period:
- Length of time in seconds to evaluate the metric
- High resolution custom metrics: can only choose 10 secs or 30 secs
- Applications can send logs to CloudWatch using the SDK (a minimal sketch follows the lists below)
- CloudWatch can collect logs from:
- Elastic Beanstalk: collection of logs from application
- ECS: collection from containers
- AWS Lambda: collection from function logs
- VPC Flow Logs: VPC specific logs
- API Gateway
- CloudTrail based on filter
- CloudWatch log agents: for example on EC2 machines
- Route53: Log DNS queries
- CloudWatch logs can go to:
- Batch exporter to S3 for archival
- Stream to ElasticSearch cluster for further analytics
- By default, no logs from your EC2 machine will go to CloudWatch
- You need to run a CloudWatch agent on EC2 to push the log files you want
- Both agents are for virtual servers (EC2 instances, on-premises servers)
- CloudWatch Logs Agent
- Old version of the agent
- Can only send to CloudWatch Logs
- CloudWatch Unified Agent
- Collect additional system-level metrics such as RAM, processes, etc
- Collect logs to send to CloudWatch Logs
- Centralized configuration using SSM Parameter Store
- CloudWatch Logs can use filter expressions
- For example, find a specific IP inside of a log
- Or count occurrences of "ERROR" in your logs
- Metric filter can be used to trigger alarms then
- Filters do not retroactively filter data. Filters only publish the metric data points for events that happen after the filter was created.
- Schedule: Cron jobs
- Event Pattern: Event rules to react to a service doing something
- Example: CodePipeline state changes!
- Triggers to Lambda functions, SQS/SNS/Kinesis Messages
- CloudWatch Event creates a small JSON document to give information about the change
- EventBridge is the next evolution of CloudWatch Events
- Default event bus: generated by AWS services (CloudWatch Events)
- Partner event bus: receive events from SaaS services or applications (Zendesk, DataDog, Segment, Auth0, ...)
- Custom event buses: for your own applications
- Event buses can be accessed by other AWS accounts
- Rules: how to process the events (similar to CloudWatch Events)
- EventBridge can analyze events in your bus and infer the schema
- The Schema Registry allows you to generate code for your application that will know in advance how data is structured in the event bus
- Schema can be versioned
- Amazon EventBridge builds upon and extends CloudWatch Events
- It uses the same service API and endpoint, and the same underlying service infrastructure
- EventBridge adds the ability to create event buses for your custom applications and your third-party SaaS apps
- EventBridge has the Schema Registry capability
- EventBridge has a different name to mark the new capabilities
- Over time, the CloudWatch Events name will be replaced with EventBridge
- Debugging in Production, the good old way:
- Test locally
- Add log statements everywhere
- Re-deploy in production
- Log formats differ across applications, and doing analytics on them in CloudWatch is hard
- Debugging: monolith "easy", distributed services "hard"
- No common views of your entire architecture!
.....Enter AWS X-Ray.....
- Troubleshooting performance (bottlenecks)
- Understand dependencies in a microservice architecture
- Pinpoint service issues
- Review request behaviour
- Find errors and exceptions
- Are we meeting time SLA?
- Where am I throttled?
- Identify users that are impacted
- Tracing is an end-to-end way to follow a "request"
- Each component dealing with request adds its own "trace"
- Tracing is made of segments (+ sub segments)
- Annotations can be added to traces to provide extra information
- Ability to trace:
- Every request
- Sample requests (as a % for example, or a rate per minute)
- X-Ray Security
- IAM for authorization
- KMS for encryption at rest
- Your code must import the AWS X-Ray SDK
- Very little modification needed
- The application SDK will then capture:
- Calls to AWS services
- HTTP / HTTPS requests
- Database calls (MySQL, PostgreSQL, DynamoDB)
- Queue calls (SQS)
- Install the X-Ray daemon or enable X-Ray AWS Integration
- X-Ray daemon works as a low-level UDP packet interceptor (Linux, Windows, Mac)
- AWS Lambda / other AWS services already run the X-Ray daemon for you
- Each application must have the IAM rights to write data to X-Ray
- If X-Ray is not working on EC2:
- Ensure the EC2 IAM Role has the proper permissions
- Ensure the EC2 instance is running the X-Ray Daemon
- To enable on AWS Lambda:
- Ensure it has an IAM execution role with the proper policy (AWSXrayWriteOnlyAccess)
- Ensure that the X-Ray SDK is imported in the code
- Instrumentation means measuring a product's performance, diagnosing errors, and writing trace information
- Segments: Each application / service will send them
- Sub-segments: If you need more details in your segment
- Trace: Segments collected together to form an end-to-end trace
- Sampling: Decrease the amount of requests sent to X-Ray, reduce cost
- Annotations: Key-value pairs used to index traces and use with filters
- Metadata: Key-value pairs, not indexed, not used for searching
- The X-Ray daemon / agent has a config to send traces cross account
- Make sure the IAM permissions are correct - the agent will assume the role
- This allows you to have a central account for all your application tracing
- With sampling rules, you control the amount of data that you record
- You can modify sampling rules without changing your code
- By default, the X-Ray SDK records the first request "each second", and "five percent" of any additional requests
- One request per second is the "reservoir", which ensures that at least one trace is recorded each second as long as the service is serving requests
- Five percent is the "rate" at which additional requests beyond the reservoir size are sampled
- Provides governance, compliance and audit for your AWS account
- CloudTrail is enabled by default
- Get a history of events / API calls made within your AWS account by:
- Console
- SDK
- CLI
- AWS Services
- Can put logs from CloudTrail into CloudWatch Logs
- If a resource is deleted in AWS, look into CloudTrail first.