@Pabl0Aceved0
Created May 6, 2025 23:48
AWS Backup Automation with Well-Architected Enhancements
AWSTemplateFormatVersion: 2010-09-09
Description: Advanced automation for backup, restore testing, and cleanup of EBS volumes with multi-volume support, reporting integration, and Well-Architected enhancements
Parameters:
  BackupVaultName:
    Type: String
    Default: MyBackupVault
    Description: Name of the backup vault
  BackupRuleName:
    Type: String
    Default: DailyBackupRule
    Description: Name of the backup rule
  SNSTopicName:
    Type: String
    Default: BackupStatusTopic
    Description: Name of the SNS topic for notifications
  LambdaFunctionName:
    Type: String
    Default: RestoreTestCleanupLambda
    Description: Name of the Lambda function
  EBSVolumeIds:
    Type: CommaDelimitedList
    Description: Comma-separated list of EBS volume IDs to back up (e.g., vol-1234567890abcdef0,vol-0987654321fedcba0)
  EC2InstanceId:
    Type: String
    Description: ID of the EC2 instance to use for restore testing (e.g., i-1234567890abcdef0)
  DeviceNamePrefix:
    Type: String
    Default: /dev/xvd
    Description: Prefix for device names (e.g., /dev/xvd, will append letters like a, b, etc.)
  ReporterAccountId:
    Type: String
    Description: AWS Account ID of the Reporter account
  ReporterRegion:
    Type: String
    Default: us-west-2
    Description: Region where the Reporter EventBus is hosted
  SSMDocumentName:
    Type: String
    Default: ValidateVolume
    Description: Name of the SSM document for validation
  DestinationRegion:
    Type: String
    Default: us-west-2
    Description: Region for cross-region backup replication
Resources:
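  BackupVault:
    Type: AWS::Backup::BackupVault
    # The topic policy must exist before AWS Backup is allowed to publish to the topic
    DependsOn: SNSPolicy
    Properties:
      BackupVaultName: !Ref BackupVaultName
      EncryptionKeyArn: !GetAtt BackupVaultKMSKey.Arn
      BackupVaultTags:
        Purpose: BackupAutomation
        Environment: Production
        CostCenter: BackupAutomation
      # Job status events for this vault are published to the SNS topic
      Notifications:
        SNSTopicArn: !Ref SNSTopic
        BackupVaultEvents:
          - BACKUP_JOB_STARTED
          - BACKUP_JOB_COMPLETED
          - BACKUP_JOB_FAILED
          - RESTORE_JOB_STARTED
          - RESTORE_JOB_COMPLETED
          - RESTORE_JOB_FAILED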
  BackupVaultKMSKey:
    Type: AWS::KMS::Key
    Properties:
      KeyPolicy:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root
            Action: kms:*
            Resource: '*'
          - Effect: Allow
            Principal:
              Service: backup.amazonaws.com
            Action:
              - kms:Encrypt
              - kms:Decrypt
              - kms:GenerateDataKey
            Resource: '*'
  BackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: !Ref BackupRuleName
        BackupPlanRule:
          - RuleName: !Ref BackupRuleName
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: cron(0 0 * * ? *) # Daily at midnight UTC
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
            # A cold-storage transition is omitted because AWS Backup requires
            # DeleteAfterDays to be at least 90 days greater than MoveToColdStorageAfterDays.
            Lifecycle:
              DeleteAfterDays: 30
            CopyActions:
              # The destination vault must already exist in DestinationRegion
              - DestinationBackupVaultArn: !Sub arn:${AWS::Partition}:backup:${DestinationRegion}:${AWS::AccountId}:backup-vault:${BackupVaultName}
                Lifecycle:
                  DeleteAfterDays: 90
            RecoveryPointTags:
              BackupType: Automated
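  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref BackupPlan
      BackupSelection:
        SelectionName: EBSBackupSelection
        IamRoleArn: !GetAtt BackupRole.Arn
        # CloudFormation has no list-mapping intrinsic outside the AWS::LanguageExtensions
        # transform, so the volume IDs are joined into one string with an ARN prefix
        # between them and then split back into a list. Fn::Join needs a literal
        # delimiter, hence the Region and account wildcards; this assumes AWS Backup
        # accepts wildcard patterns in these positions.
        Resources: !Split
          - ','
          - !Join
            - ''
            - - 'arn:aws:ec2:*:*:volume/'
              - !Join
                - ',arn:aws:ec2:*:*:volume/'
                - !Ref EBSVolumeIds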
  BackupRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: backup.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup
        # Allows this role to be used for restore jobs as well as backups
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForRestores
      Policies:
        - PolicyName: BackupCrossAccount
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: events:PutEvents
                Resource: !Sub arn:${AWS::Partition}:events:${ReporterRegion}:${ReporterAccountId}:event-bus/GlobalBackupJobStatusEventBus-${ReporterAccountId}
              - Effect: Allow
                Action:
                  - kms:Encrypt
                  - kms:Decrypt
                  - kms:GenerateDataKey
                Resource: !GetAtt BackupVaultKMSKey.Arn
  SNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Ref SNSTopicName
      Subscription:
        - Endpoint: !Sub ${AWS::AccountId}@example.com # Replace with actual email
          Protocol: email
  SNSPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref SNSTopic
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: backup.amazonaws.com
            Action: sns:Publish
            Resource: !Ref SNSTopic
  LambdaDLQ:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub ${LambdaFunctionName}-DLQ
  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref LambdaFunctionName
      Handler: index.lambda_handler
      Runtime: python3.12
      Architectures:
        - arm64
      Role: !GetAtt LambdaRole.Arn
      Code:
        ZipFile: |
          import boto3
          import json
          import time
          import os
          from botocore.exceptions import ClientError
          from datetime import datetime, timezone

          def send_to_reporter(event_bus_arn, event):
              # The bus lives in the Reporter account and Region, so call PutEvents in
              # that Region and address the bus by its full ARN.
              events_client = boto3.client('events', region_name=os.environ['REPORTER_REGION'])
              events_client.put_events(
                  Entries=[{
                      'Time': datetime.now(timezone.utc),
                      'Source': 'custom.backup.automation',
                      'DetailType': 'RestoreTestEvent',
                      'Detail': json.dumps(event),
                      'EventBusName': event_bus_arn
                  }]
              )

          def run_ssm_validation(instance_id, volume_id, device_name):
              ssm_client = boto3.client('ssm')
              try:
                  response = ssm_client.send_command(
                      InstanceIds=[instance_id],
                      DocumentName=os.environ['SSM_DOCUMENT_NAME'],
                      Parameters={
                          'VolumeId': [volume_id],
                          'DeviceName': [device_name],
                          'Action': ['validate']
                      }
                  )
                  command_id = response['Command']['CommandId']
                  # Poll until the command reaches a terminal state (up to about two minutes)
                  for _ in range(24):
                      time.sleep(5)
                      try:
                          output = ssm_client.get_command_invocation(
                              CommandId=command_id,
                              InstanceId=instance_id
                          )
                      except ssm_client.exceptions.InvocationDoesNotExist:
                          continue  # The invocation is not registered yet
                      if output['Status'] in ('Success', 'Failed', 'Cancelled', 'TimedOut'):
                          return output['Status'] == 'Success'
                  return False
              except ClientError as e:
                  print(f"SSM validation failed: {str(e)}")
                  return False

          def lambda_handler(event, context):
              if 'Records' not in event:
                  # Direct invocations (such as the state machine's {'volumeId': ...}) carry no SNS record
                  return {'statusCode': 400, 'body': json.dumps('Expected an SNS event')}
              # Assumption: the SNS message body is JSON with the keys read below. Confirm
              # this against a real AWS Backup notification before relying on it.
              sns_message = json.loads(event['Records'][0]['Sns']['Message'])
              job_status = sns_message.get('status', 'Unknown')
              job_type = sns_message.get('jobType', 'Unknown')
              recovery_point_arn = sns_message.get('recoveryPointArn', None)
              volume_id = sns_message.get('resourceId', None)
              instance_id = os.environ['EC2_INSTANCE_ID']
              ec2 = boto3.client('ec2')
              backup = boto3.client('backup')
              reporter_event_bus_arn = f"arn:aws:events:{os.environ['REPORTER_REGION']}:{os.environ['REPORTER_ACCOUNT_ID']}:event-bus/GlobalBackupJobStatusEventBus-{os.environ['REPORTER_ACCOUNT_ID']}"
              # Built before the retry loop so the error path can always report it
              event_data = {
                  'jobStatus': job_status,
                  'jobType': job_type,
                  'recoveryPointArn': recovery_point_arn,
                  'timestamp': time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime())
              }
              max_retries = 3
              retry_count = 0
              while retry_count < max_retries:
                  try:
                      # React to backup jobs only; reacting to restore jobs would start a
                      # new restore for every restore this function itself completes.
                      if job_status in ['COMPLETED', 'FAILED'] and job_type == 'BACKUP_JOB' and recovery_point_arn:
                          print(f"Processing backup job status: {job_status} for {volume_id} with recovery point: {recovery_point_arn}")
                          send_to_reporter(reporter_event_bus_arn, event_data)
                          if job_status == 'COMPLETED':
                              # Start from the recovery point's own restore metadata and restore into
                              # the test instance's Availability Zone so the volume can be attached.
                              metadata = backup.get_recovery_point_restore_metadata(
                                  BackupVaultName=os.environ['BACKUP_VAULT_NAME'],
                                  RecoveryPointArn=recovery_point_arn
                              )['RestoreMetadata']
                              instance = ec2.describe_instances(InstanceIds=[instance_id])['Reservations'][0]['Instances'][0]
                              metadata['availabilityZone'] = instance['Placement']['AvailabilityZone']
                              restore_job = backup.start_restore_job(
                                  RecoveryPointArn=recovery_point_arn,
                                  IamRoleArn=os.environ['BACKUP_ROLE_ARN'],
                                  ResourceType='EBS',
                                  Metadata=metadata
                              )
                              restore_job_id = restore_job['RestoreJobId']
                              print(f"Started restore job: {restore_job_id}")
                              while True:
                                  job_info = backup.describe_restore_job(RestoreJobId=restore_job_id)
                                  if job_info['Status'] in ['COMPLETED', 'FAILED', 'ABORTED']:
                                      break
                                  if context.get_remaining_time_in_millis() < 60000:
                                      break  # Leave time to report before the function times out
                                  time.sleep(10)
                              if job_info['Status'] == 'COMPLETED':
                                  restored_volume_id = job_info['CreatedResourceArn'].split('/')[-1]
                                  print(f"Restored volume ID: {restored_volume_id}")
                                  # The 'a' suffix is commonly taken by the root device, so use a later letter
                                  device_letter = 'f'
                                  device_name = f"{os.environ['DEVICE_NAME_PREFIX']}{device_letter}"
                                  ec2.attach_volume(
                                      VolumeId=restored_volume_id,
                                      InstanceId=instance_id,
                                      Device=device_name
                                  )
                                  waiter = ec2.get_waiter('volume_in_use')
                                  waiter.wait(VolumeIds=[restored_volume_id])
                                  print(f"Attached volume {restored_volume_id}")
                                  if run_ssm_validation(instance_id, restored_volume_id, device_name):
                                      print("Validation successful")
                                      event_data['validationStatus'] = 'Success'
                                  else:
                                      print("Validation failed")
                                      event_data['validationStatus'] = 'Failed'
                                  ec2.detach_volume(VolumeId=restored_volume_id, InstanceId=instance_id)
                                  waiter = ec2.get_waiter('volume_available')
                                  waiter.wait(VolumeIds=[restored_volume_id])
                                  print(f"Detached volume {restored_volume_id}")
                                  ec2.delete_volume(VolumeId=restored_volume_id)
                                  print(f"Deleted volume {restored_volume_id}")
                                  event_data['cleanupStatus'] = 'Completed'
                              else:
                                  event_data['restoreStatus'] = 'Failed' if job_info['Status'] in ('FAILED', 'ABORTED') else 'TimedOut'
                              send_to_reporter(reporter_event_bus_arn, event_data)
                      break
                  except ClientError as e:
                      retry_count += 1
                      if retry_count == max_retries:
                          print(f"Max retries reached. Error: {str(e)}")
                          event_data['error'] = str(e)
                          send_to_reporter(reporter_event_bus_arn, event_data)
                          raise
                      time.sleep(5 * retry_count)
              return {
                  'statusCode': 200,
                  'body': json.dumps(f'Processed {job_type} with status {job_status}')
              }
      MemorySize: 256
      Timeout: 900 # Lambda maximum; the restore polling loop can outlast a shorter timeout
      DeadLetterConfig:
        TargetArn: !GetAtt LambdaDLQ.Arn
      Environment:
        Variables:
          EBS_VOLUME_ID: !Join [',', !Ref EBSVolumeIds]
          EC2_INSTANCE_ID: !Ref EC2InstanceId
          DEVICE_NAME_PREFIX: !Ref DeviceNamePrefix
          REPORTER_ACCOUNT_ID: !Ref ReporterAccountId
          REPORTER_REGION: !Ref ReporterRegion
          SSM_DOCUMENT_NAME: !Ref SSMDocumentName
          BACKUP_ROLE_ARN: !GetAtt BackupRole.Arn
          BACKUP_VAULT_NAME: !Ref BackupVault
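  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaEBSBackupPolicy
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              # Read-only lookups and AWS Backup job calls; scope these further where possible
              - Effect: Allow
                Action:
                  - ec2:DescribeVolumes
                  - ec2:DescribeInstances
                  - backup:StartRestoreJob
                  - backup:DescribeRestoreJob
                  - backup:GetRecoveryPointRestoreMetadata
                  - ssm:GetCommandInvocation
                Resource: '*'
              # Attach and detach are authorized against both the volume and the instance
              - Effect: Allow
                Action:
                  - ec2:AttachVolume
                  - ec2:DetachVolume
                  - ec2:DeleteVolume
                Resource:
                  - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:volume/*
                  - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2InstanceId}
              # Lets the function hand the backup role to AWS Backup for restore jobs
              - Effect: Allow
                Action: iam:PassRole
                Resource: !GetAtt BackupRole.Arn
              - Effect: Allow
                Action: ssm:SendCommand
                Resource:
                  - !Sub arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:document/${SSMDocumentName}
                  - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2InstanceId}
              - Effect: Allow
                Action: events:PutEvents
                Resource: !Sub arn:${AWS::Partition}:events:${ReporterRegion}:${ReporterAccountId}:event-bus/GlobalBackupJobStatusEventBus-${ReporterAccountId}
              # Required for the dead-letter queue configured on the function
              - Effect: Allow
                Action: sqs:SendMessage
                Resource: !GetAtt LambdaDLQ.Arn
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: !Sub arn:${AWS::Partition}:logs:*:*:*
              - Effect: Allow
                Action:
                  - kms:Decrypt
                Resource: !GetAtt BackupVaultKMSKey.Arn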
  StepFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: StepFunctionPolicy
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - lambda:InvokeFunction
                Resource: !GetAtt LambdaFunction.Arn
  StepFunction:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      DefinitionString: !Sub |
        {
          "Comment": "Process multiple EBS volume restores",
          "StartAt": "ProcessVolumes",
          "States": {
            "ProcessVolumes": {
              "Type": "Map",
              "ItemsPath": "$.volumes",
              "MaxConcurrency": 5,
              "Iterator": {
                "StartAt": "InvokeLambda",
                "States": {
                  "InvokeLambda": {
                    "Type": "Task",
                    "Resource": "${LambdaFunction.Arn}",
                    "Parameters": {
                      "volumeId.$": "$$.Map.Item.Value"
                    },
                    "End": true
                  }
                }
              },
              "End": true
            }
          }
        }
      RoleArn: !GetAtt StepFunctionRole.Arn
  SNSToLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref LambdaFunction
      Action: lambda:InvokeFunction
      Principal: sns.amazonaws.com
      SourceArn: !Ref SNSTopic
  SNSSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref SNSTopic
      Protocol: lambda
      Endpoint: !GetAtt LambdaFunction.Arn
  # Vault notifications are configured through the Notifications property of
  # AWS::Backup::BackupVault (SNSTopicArn plus BackupVaultEvents); CloudFormation has
  # no standalone AWS::Backup::BackupVaultNotifications resource type.
  # Events this template publishes to SNSTopic:
  #   BACKUP_JOB_STARTED, BACKUP_JOB_COMPLETED, BACKUP_JOB_FAILED,
  #   RESTORE_JOB_STARTED, RESTORE_JOB_COMPLETED, RESTORE_JOB_FAILED
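  SSMValidationDocument:
    Type: AWS::SSM::Document
    Properties:
      # Named explicitly so that the SSMDocumentName parameter resolves to this document
      Name: !Ref SSMDocumentName
      Content:
        schemaVersion: '2.2'
        description: 'Validate EBS volume data integrity'
        # Parameters must be declared here before Run Command can pass them in
        parameters:
          VolumeId:
            type: String
            description: ID of the restored volume
          DeviceName:
            type: String
            description: Device name the volume is attached as
          Action:
            type: String
            default: validate
            description: Requested action (only validate is implemented)
        mainSteps:
          - action: 'aws:runShellScript'
            name: 'validateVolume'
            inputs:
              runCommand:
                - |
                  #!/bin/bash
                  VOLUME_ID="{{ VolumeId }}"
                  DEVICE_NAME="{{ DeviceName }}"
                  if [ -b "$DEVICE_NAME" ]; then
                    mount "$DEVICE_NAME" /mnt || { echo "Validation failed: Mount failed"; exit 1; }
                    if [ -f "/mnt/testfile.txt" ]; then
                      echo "Validation successful: Test file found"
                      RESULT=0
                    else
                      echo "Validation failed: Test file not found"
                      RESULT=1
                    fi
                    # Unmount before exiting so the volume can be detached cleanly
                    umount /mnt
                    exit $RESULT
                  else
                    echo "Validation failed: Device not found"
                    exit 1
                  fi
      DocumentFormat: YAML
      DocumentType: Command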
Outputs:
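  BackupVaultArn:
    Value: !GetAtt BackupVault.BackupVaultArn
  SNSTopicArn:
    Value: !Ref SNSTopic
  LambdaFunctionArn:
    Value: !GetAtt LambdaFunction.Arn
  SSMDocumentArn:
    # Ref on an SSM document returns its name, so the ARN is assembled here
    Value: !Sub arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:document/${SSMValidationDocument}
  StepFunctionArn:
    Value: !Ref StepFunction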
