@huynhbaoan
Created December 9, 2024 02:56
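The examples below assume you are monitoring Route 53 DNS query volume through the DNSQueries metric in the AWS/Route53 namespace, which is reported per hosted zone (dimension HostedZoneId). Route 53 publishes its metrics in US East (N. Virginia), so run the CLI examples against us-east-1.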
1. Tweak Anomaly Detection Parameters
Adjust the Band Width
CloudWatch anomaly detection has no confidence-percentage setting. The width of the expected band is set by the second argument of ANOMALY_DETECTION_BAND, a number of standard deviations (the console default is 2); raising it, for example to 3, widens the band and reduces false positives. The detector itself is created per metric and statistic:
aws cloudwatch put-anomaly-detector \
--namespace "AWS/Route53" \
--metric-name "DNSQueries" \
--dimensions Name=HostedZoneId,Value=Z123456ABCDEFG \
--stat "Sum"
• A wider band reduces sensitivity, allowing more variability before a data point counts as anomalous.
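The effect of band width can be illustrated with a toy model. The sketch below is not CloudWatch's actual algorithm (which also models trend and seasonality); it assumes a simple band of mean plus or minus a number of standard deviations, and the sample values are hypothetical.

```python
from statistics import mean, stdev

def count_outside_band(values, num_std):
    """Count points outside mean +/- num_std * stdev (toy stand-in for the band)."""
    center = mean(values)
    spread = stdev(values)
    lower = center - num_std * spread
    upper = center + num_std * spread
    return sum(1 for v in values if v < lower or v > upper)

# Hypothetical per-minute query counts with one mild bump and one large spike.
samples = [1000, 1020, 980, 1010, 990, 1005, 1400, 995, 1015, 3000]

for width in (1, 2, 3):
    print(f"width={width} flagged={count_outside_band(samples, width)}")
```

The wider the band, the fewer points fall outside it, which is the trade-off between false positives and missed anomalies.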
2. Combine Metrics or Dimensions
Add More Context
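DNSQueries is reported per hosted zone, so query each HostedZoneId separately to isolate anomalies to specific zones. get-metric-data also requires a time range; the timestamps below are illustrative:
aws cloudwatch get-metric-data \
--start-time 2024-12-09T00:00:00Z \
--end-time 2024-12-09T01:00:00Z \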
--metric-data-queries '[{
"Id": "dnsqueries_zone1",
"MetricStat": {
"Metric": {
"Namespace": "AWS/Route53",
"MetricName": "DNSQueries",
"Dimensions": [{"Name": "HostedZoneId", "Value": "Z123456ABCDEFG"}]
},
"Period": 60,
"Stat": "Sum"
},
"ReturnData": true
}]'
• This lets you evaluate each hosted zone on its own rather than as part of an aggregate across zones.
Aggregate Metrics Smartly
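Monitor the total query volume across multiple hosted zones to detect trends affecting your entire setup. METRICS() returns every metric in the request (its optional string argument filters by query Id, not by metric name), so with one query per hosted zone the total is:
{
"Expression": "SUM(METRICS())",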
"Label": "Total DNS Queries Across Hosted Zones",
"Id": "total_dnsqueries",
"ReturnData": true
}
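A minimal sketch of how the per-zone queries and the total expression could be assembled into one --metric-data-queries payload. The function name build_queries and the second zone ID are hypothetical; only the JSON shape follows the examples above.

```python
import json

def build_queries(zone_ids, period=60):
    """Build one DNSQueries query per hosted zone plus a summed total."""
    queries = []
    for index, zone_id in enumerate(zone_ids):
        queries.append({
            # Ids must start with a lowercase letter, so index them instead
            # of embedding the (uppercase) zone ID.
            "Id": f"dnsqueries_{index}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Route53",
                    "MetricName": "DNSQueries",
                    "Dimensions": [{"Name": "HostedZoneId", "Value": zone_id}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": True,
        })
    queries.append({
        "Id": "total_dnsqueries",
        "Expression": "SUM(METRICS())",
        "Label": "Total DNS Queries Across Hosted Zones",
        "ReturnData": True,
    })
    return queries

# Z123456ABCDEFG comes from the examples above; Z999999EXAMPLE is a placeholder.
payload = build_queries(["Z123456ABCDEFG", "Z999999EXAMPLE"])
print(json.dumps(payload, indent=2))
```

The printed JSON can be saved to a file and passed as --metric-data-queries file://queries.json.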
3. Set Smarter Alarms
Use Multiple Consecutive Data Points
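Create an alarm that triggers only after three consecutive data points fall above the anomaly detection band. This is a metric alarm (not a composite alarm) with evaluation periods and datapoints-to-alarm both set to 3:
aws cloudwatch put-metric-alarm \
--alarm-name "Route53_QueryVolume_Anomaly" \
--comparison-operator GreaterThanUpperThreshold \
--evaluation-periods 3 --datapoints-to-alarm 3 \
--threshold-metric-id band \
--metrics '[
{"Id": "m1", "ReturnData": true, "MetricStat": {"Metric": {"Namespace": "AWS/Route53", "MetricName": "DNSQueries", "Dimensions": [{"Name": "HostedZoneId", "Value": "Z123456ABCDEFG"}]}, "Period": 60, "Stat": "Sum"}},
{"Id": "band", "ReturnData": true, "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"}
]' \
--actions-enabled \
--alarm-actions arn:aws:sns:us-east-1:123456789012:MySNSTopic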
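The "three consecutive data points" rule can be sketched in plain Python. This is a simplified illustration of the evaluation logic, not CloudWatch's implementation; the function name and the sample numbers are hypothetical.

```python
def should_alarm(values, upper_bounds, datapoints_to_alarm=3):
    """Return True if the most recent N points all exceed their upper band value."""
    pairs = list(zip(values, upper_bounds))
    if len(pairs) < datapoints_to_alarm:
        return False  # not enough data to evaluate yet
    recent = pairs[-datapoints_to_alarm:]
    return all(value > upper for value, upper in recent)

# Hypothetical per-minute query counts and the band's upper edge at each minute.
upper = [1200, 1200, 1250, 1250, 1300]
single_spike = [1000, 5000, 1100, 1000, 1050]
sustained = [1000, 1100, 4000, 4200, 4100]

print(should_alarm(single_spike, upper))
print(should_alarm(sustained, upper))
```

A single spike does not trigger the alarm, while three breaching points in a row do.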
Add a Baseline Threshold
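Combine anomaly detection with a static threshold by using a composite alarm. Composite alarm rules reference other alarms by name, so this assumes a second metric alarm (here given the placeholder name Route53_QueryVolume_High) that fires when queries exceed 1,000,000 in a minute:
{
"AlarmRule": "ALARM(\"Route53_QueryVolume_Anomaly\") AND ALARM(\"Route53_QueryVolume_High\")"
}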
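Composite alarm rules are boolean expressions over the states of other alarms, written with functions such as ALARM("name"). A small helper, offered only as a sketch, can build an AND rule from a list of alarm names; composite_rule and the second alarm name are hypothetical.

```python
def composite_rule(*alarm_names):
    """Build a composite alarm rule requiring every named alarm to be in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)

# Route53_QueryVolume_High is a placeholder for a static-threshold alarm.
rule = composite_rule("Route53_QueryVolume_Anomaly", "Route53_QueryVolume_High")
print(rule)
```

The resulting string is what would be passed as --alarm-rule to put-composite-alarm.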
4. Pre-Process Data
Smooth Your Data
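Smooth DNSQueries with Metric Math before applying anomaly detection. Metric Math has no moving-average function that takes a window size, and AVG does not accept a second argument. A practical substitute is to request the metric at a 5-minute period and divide the sum by 5, which gives the average queries per minute in each 5-minute bucket:
{
"Id": "smoothed",
"Expression": "m1 / 5",
"Label": "Avg DNS Queries per Minute (5-min buckets)",
"ReturnData": true
}
Here m1 is the DNSQueries query with Stat set to Sum and Period set to 300. Use smoothed as the input to the anomaly detection band.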
Example Workflow in the Console
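1. Go to CloudWatch Console > Metrics > select DNSQueries for the hosted zone, then set the statistic to Sum and the period to 5 minutes.
2. Add Metric Math:
• Expression: m1 / 5 (where m1 is the Id of the DNSQueries metric)
• Label: “Smoothed DNS Queries”.
3. Create the anomaly detection alarm on this smoothed expression.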
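One way to picture the smoothing step is averaging fixed 5-minute buckets of per-minute counts. The sketch below uses hypothetical data and a hypothetical function name; it shows bucketed averaging, which is simpler than a true sliding-window average.

```python
def bucket_average(per_minute_counts, bucket_size=5):
    """Average consecutive fixed-size buckets; an incomplete trailing bucket is dropped."""
    averages = []
    last_start = len(per_minute_counts) - bucket_size
    for start in range(0, last_start + 1, bucket_size):
        bucket = per_minute_counts[start:start + bucket_size]
        averages.append(sum(bucket) / bucket_size)
    return averages

# Hypothetical per-minute DNS query counts: one noisy minute inside a calm window.
raw = [100, 120, 900, 110, 95, 105, 98, 102, 101, 99]
print(bucket_average(raw))
```

The single 900-query minute is diluted across its bucket, so the smoothed series varies far less than the raw one.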
5. Validate Data Quality
Filter Noise from Metrics
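If certain hosted zones generate unreliable metrics, exclude them or apply anomaly detection only to zones with stable data. Start by confirming which zones report the metric (omit Value to list every hosted zone that has it):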
aws cloudwatch list-metrics \
--namespace "AWS/Route53" \
--metric-name "DNSQueries" \
--dimensions Name=HostedZoneId,Value=Z123456ABCDEFG
Exclude noisy zones by filtering metric data queries or focusing on key zones.
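What counts as "noisy" is a judgment call. One possible heuristic, shown here purely as an assumption, is the coefficient of variation (standard deviation divided by mean) of each zone's recent data; the threshold of 0.5, the function name, and the zone IDs are all hypothetical.

```python
from statistics import mean, pstdev

def stable_zones(series_by_zone, max_cv=0.5):
    """Return zone IDs whose coefficient of variation is at or below max_cv."""
    stable = []
    for zone_id, values in series_by_zone.items():
        if not values:
            continue  # no data points: cannot judge stability
        average = mean(values)
        if average == 0:
            continue  # all-zero series: nothing to monitor
        if pstdev(values) / average <= max_cv:
            stable.append(zone_id)
    return stable

# Hypothetical per-minute query counts keyed by placeholder zone IDs.
data = {
    "ZSTABLE": [100, 110, 90, 105],
    "ZNOISY": [0, 500, 3, 900],
    "ZEMPTY": [],
}
print(stable_zones(data))
```

Only the zones returned here would get their own anomaly detectors under this heuristic.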
6. Alternative Notification Channels
Suppression with EventBridge
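CloudWatch sends alarm state changes to the default EventBridge event bus, where a rule can match them and route them to a target that suppresses repetitive alerts.
EventBridge rule example (event pattern keys are lowercase, and the alarm state is nested under state.value):
{
"source": ["aws.cloudwatch"],
"detail-type": ["CloudWatch Alarm State Change"],
"detail": {
"alarmName": ["Route53_QueryVolume_Anomaly"],
"state": {"value": ["ALARM"]}
}
}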
Suppress duplicate alerts with a Lambda function that implements cooldown logic.
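A minimal sketch of such cooldown logic follows. It assumes a 15-minute cooldown and keeps state in memory, which only survives while the Lambda execution environment stays warm; a real deployment would likely persist the timestamp in an external store such as DynamoDB. The names should_notify and COOLDOWN_SECONDS are hypothetical.

```python
import time

COOLDOWN_SECONDS = 900  # assumption: suppress repeats for 15 minutes
_last_sent = {}  # alarm name -> time of last notification (in-memory only)

def should_notify(alarm_name, now=None, cooldown=COOLDOWN_SECONDS):
    """Return True and record the time unless the alarm notified within the cooldown."""
    now = time.time() if now is None else now
    last = _last_sent.get(alarm_name)
    if last is not None and now - last < cooldown:
        return False
    _last_sent[alarm_name] = now
    return True

def handler(event, context):
    """Lambda entry point for a CloudWatch Alarm State Change event."""
    alarm_name = event["detail"]["alarmName"]
    notified = should_notify(alarm_name)
    # Forwarding to SNS, chat, or a pager would go here when notified is True.
    return {"alarm": alarm_name, "notified": notified}
```

The first ALARM event for a given alarm passes through, and later ones are dropped until the cooldown has elapsed.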
7. Evaluate Alternative Tools
Amazon Lookout for Metrics
Use Lookout for Metrics for better handling of complex anomaly patterns in Route53 query volume data.
Steps:
1. Ingest Route53 query volume metrics into Lookout for Metrics.
2. Set up ML-based detectors with seasonal trends (e.g., high traffic during the day).
3. Apply fine-tuned alerting thresholds using Lookout’s UI.
Combine with CloudWatch
Use Lookout for Metrics for advanced detection. Its alerts are delivered through SNS or Lambda, so a Lambda function can republish them as a custom CloudWatch metric for centralized monitoring.
Example CloudFormation Template for Automation
Here’s an example combining anomaly detection, smoothing through a 5-minute period, and an alarm that requires three consecutive breaching data points. Deploy it in us-east-1, where Route 53 metrics are published. A static threshold such as 1,000,000 queries would go in a separate alarm and be combined with this one through a composite alarm.
Resources:
  Route53AnomalyDetector:
    Type: AWS::CloudWatch::AnomalyDetector
    Properties:
      MetricName: "DNSQueries"
      Namespace: "AWS/Route53"
      Dimensions:
        - Name: "HostedZoneId"
          Value: "Z123456ABCDEFG"
      Stat: "Sum"
  Route53Alarm:
    Type: AWS::CloudWatch::Alarm
    DependsOn: Route53AnomalyDetector
    Properties:
      AlarmName: "Route53_QueryVolume_Anomaly"
      # Alarm when the metric rises above the upper edge of the band.
      ComparisonOperator: GreaterThanUpperThreshold
      EvaluationPeriods: 3
      DatapointsToAlarm: 3
      ThresholdMetricId: "band"
      Metrics:
        - Id: "dnsqueries"
          MetricStat:
            Metric:
              MetricName: "DNSQueries"
              Namespace: "AWS/Route53"
              Dimensions:
                - Name: "HostedZoneId"
                  Value: "Z123456ABCDEFG"
            Period: 300
            Stat: "Sum"
        # The second argument is the band width in standard deviations.
        - Id: "band"
          Expression: "ANOMALY_DETECTION_BAND(dnsqueries, 2)"
      ActionsEnabled: true
      AlarmActions:
        - arn:aws:sns:us-east-1:123456789012:MySNSTopic
Would you like help implementing any specific example or further explanation?