AWS Backup Automation with Well-Architected Enhancements
AWSTemplateFormatVersion: '2010-09-09'
Description: Advanced automation for backup, restore testing, and cleanup of EBS volumes with multi-volume support, reporting integration, and Well-Architected enhancements
Parameters:
  BackupVaultName:
    Type: String
    Default: MyBackupVault
    Description: Name of the backup vault
  BackupRuleName:
    Type: String
    Default: DailyBackupRule
    Description: Name of the backup rule
  SNSTopicName:
    Type: String
    Default: BackupStatusTopic
    Description: Name of the SNS topic for notifications
  LambdaFunctionName:
    Type: String
    Default: RestoreTestCleanupLambda
    Description: Name of the Lambda function
  EBSVolumeIds:
    Type: CommaDelimitedList
    Description: Comma-separated list of EBS volume IDs to back up (e.g., vol-1234567890abcdef0,vol-0987654321fedcba0)
  EC2InstanceId:
    Type: String
    Description: ID of the EC2 instance to use for restore testing (e.g., i-1234567890abcdef0)
  DeviceNamePrefix:
    Type: String
    Default: /dev/xvd
    Description: Prefix for device names (e.g., /dev/xvd, will append letters like a, b, etc.)
  ReporterAccountId:
    Type: String
    Description: AWS Account ID of the Reporter account
  ReporterRegion:
    Type: String
    Default: us-west-2
    Description: Region where the Reporter EventBus is hosted
  SSMDocumentName:
    Type: String
    Default: ValidateVolume
    Description: Name of the SSM document for validation
  DestinationRegion:
    Type: String
    Default: us-west-2
    Description: Region for cross-region backup replication
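Resources:
  BackupVault:
    Type: AWS::Backup::BackupVault
    # The topic policy must exist before AWS Backup can publish to the topic.
    DependsOn: SNSPolicy
    Properties:
      BackupVaultName: !Ref BackupVaultName
      EncryptionKeyArn: !GetAtt BackupVaultKMSKey.Arn
      BackupVaultTags:
        Purpose: BackupAutomation
        Environment: Production
        CostCenter: BackupAutomation
      # Vault notifications are configured as a property of the vault.
      Notifications:
        SNSTopicArn: !Ref SNSTopic
        BackupVaultEvents:
          - BACKUP_JOB_STARTED
          - BACKUP_JOB_COMPLETED
          - BACKUP_JOB_FAILED
          - RESTORE_JOB_STARTED
          - RESTORE_JOB_COMPLETED
          - RESTORE_JOB_FAILED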
  BackupVaultKMSKey:
    Type: AWS::KMS::Key
    Properties:
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root
            Action: kms:*
            Resource: '*'
          - Effect: Allow
            Principal:
              Service: backup.amazonaws.com
            Action:
              - kms:Encrypt
              - kms:Decrypt
              - kms:GenerateDataKey
            Resource: '*'
  BackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      # The plan name and rules are nested under the BackupPlan property.
      BackupPlan:
        BackupPlanName: !Ref BackupRuleName
        BackupPlanRule:
          - RuleName: !Ref BackupRuleName
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: cron(0 0 * * ? *)  # Daily at midnight UTC
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
            # AWS Backup requires DeleteAfterDays to be at least 90 days greater than
            # MoveToColdStorageAfterDays, so a 30-day retention has no cold-storage transition.
            Lifecycle:
              DeleteAfterDays: 30
            CopyActions:
              # The destination vault must already exist in DestinationRegion.
              - DestinationBackupVaultArn: !Sub arn:${AWS::Partition}:backup:${DestinationRegion}:${AWS::AccountId}:backup-vault:${BackupVaultName}
                Lifecycle:
                  MoveToColdStorageAfterDays: 7
                  DeleteAfterDays: 97  # 7 + 90, the minimum for a 7-day transition
            RecoveryPointTags:
              BackupType: Automated
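  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref BackupPlan
      BackupSelection:
        SelectionName: EBSBackupSelection
        IamRoleArn: !GetAtt BackupRole.Arn
        # CloudFormation cannot loop over a CommaDelimitedList in a property value, so the
        # ARN list is built by joining the IDs with an ARN prefix and splitting the result.
        # The Fn::Join delimiter has to be a literal string, so Region and account are
        # wildcards here; this assumes AWS Backup match patterns accept them in that position.
        Resources: !Split
          - ','
          - !Join
            - ''
            - - 'arn:aws:ec2:*:*:volume/'
              - !Join
                - ',arn:aws:ec2:*:*:volume/'
                - !Ref EBSVolumeIds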
  BackupRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: backup.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup
        # Restore tests run under this role too, so it also needs the restore policy.
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForRestores
      Policies:
        - PolicyName: BackupCrossAccount
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: events:PutEvents
                Resource: !Sub arn:${AWS::Partition}:events:${ReporterRegion}:${ReporterAccountId}:event-bus/GlobalBackupJobStatusEventBus-${ReporterAccountId}
              - Effect: Allow
                Action:
                  - kms:Encrypt
                  - kms:Decrypt
                  - kms:GenerateDataKey
                Resource: !GetAtt BackupVaultKMSKey.Arn
  SNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Ref SNSTopicName
      Subscription:
        - Endpoint: !Sub ${AWS::AccountId}@example.com  # Replace with actual email
          Protocol: email
  SNSPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref SNSTopic
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: backup.amazonaws.com
            Action: sns:Publish
            Resource: !Ref SNSTopic
  LambdaDLQ:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub ${LambdaFunctionName}-DLQ
  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref LambdaFunctionName
      Handler: index.lambda_handler
      Runtime: python3.12
      Architectures:
        - arm64
      Role: !GetAtt LambdaRole.Arn
      Code:
        ZipFile: |
          import boto3
          import json
          import os
          import time
          from botocore.exceptions import ClientError
          from datetime import datetime, timezone


          def send_to_reporter(event_bus_arn, event):
              # The bus lives in another account and Region, so address it by ARN
              # from a client created in the bus's Region.
              events_client = boto3.client('events', region_name=event_bus_arn.split(':')[3])
              events_client.put_events(
                  Entries=[{
                      'Time': datetime.now(timezone.utc),
                      'Source': 'custom.backup.automation',
                      'DetailType': 'RestoreTestEvent',
                      'Detail': json.dumps(event),
                      'EventBusName': event_bus_arn
                  }]
              )


          def run_ssm_validation(instance_id, volume_id, device_name):
              ssm_client = boto3.client('ssm')
              try:
                  response = ssm_client.send_command(
                      InstanceIds=[instance_id],
                      DocumentName=os.environ['SSM_DOCUMENT_NAME'],
                      Parameters={
                          'VolumeId': [volume_id],
                          'DeviceName': [device_name],
                          'Action': ['validate']
                      }
                  )
                  command_id = response['Command']['CommandId']
                  # Poll until the command reaches a terminal state (up to about 60 s).
                  for _ in range(12):
                      time.sleep(5)
                      output = ssm_client.get_command_invocation(
                          CommandId=command_id,
                          InstanceId=instance_id
                      )
                      if output['Status'] not in ('Pending', 'InProgress', 'Delayed'):
                          return output['Status'] == 'Success'
                  return False
              except ClientError as e:
                  print(f"SSM validation failed: {str(e)}")
                  return False


          def lambda_handler(event, context):
              # Assumes a JSON message body with status, jobType, recoveryPointArn and
              # resourceId keys. AWS Backup vault notifications may instead arrive as plain
              # text with EventType and State in MessageAttributes; adapt the parsing if so.
              sns_message = json.loads(event['Records'][0]['Sns']['Message'])
              job_status = sns_message.get('status', 'Unknown')
              job_type = sns_message.get('jobType', 'Unknown')
              recovery_point_arn = sns_message.get('recoveryPointArn')
              volume_id = sns_message.get('resourceId')
              instance_id = os.environ['EC2_INSTANCE_ID']
              ec2 = boto3.client('ec2')
              backup = boto3.client('backup')
              reporter_event_bus_arn = (
                  f"arn:aws:events:{os.environ['REPORTER_REGION']}:{os.environ['REPORTER_ACCOUNT_ID']}"
                  f":event-bus/GlobalBackupJobStatusEventBus-{os.environ['REPORTER_ACCOUNT_ID']}"
              )
              # Built before the retry loop so the error handler can always report it.
              event_data = {
                  'jobStatus': job_status,
                  'jobType': job_type,
                  'recoveryPointArn': recovery_point_arn,
                  'timestamp': time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime())
              }
              max_retries = 3
              retry_count = 0
              while retry_count < max_retries:
                  try:
                      # Only finished backup jobs start a restore test. Reacting to RESTORE_JOB
                      # events would make each test restore trigger another one.
                      if job_status in ['COMPLETED', 'FAILED'] and job_type == 'BACKUP_JOB' and recovery_point_arn:
                          print(f"Processing backup job status: {job_status} with recovery point: {recovery_point_arn}")
                          send_to_reporter(reporter_event_bus_arn, event_data)
                          if job_status == 'COMPLETED':
                              restore_job = backup.start_restore_job(
                                  RecoveryPointArn=recovery_point_arn,
                                  IamRoleArn=os.environ['RESTORE_ROLE_ARN'],
                                  ResourceType='EBS',
                                  # EBS restores usually need more metadata, such as an
                                  # availabilityZone matching the test instance; see
                                  # get_recovery_point_restore_metadata.
                                  Metadata={'volumeId': volume_id}
                              )
                              restore_job_id = restore_job['RestoreJobId']
                              print(f"Started restore job: {restore_job_id}")
                              # Poll, but stop while there is still time left to report.
                              while True:
                                  job_info = backup.describe_restore_job(RestoreJobId=restore_job_id)
                                  if job_info['Status'] in ['COMPLETED', 'FAILED', 'ABORTED']:
                                      break
                                  if context.get_remaining_time_in_millis() < 60000:
                                      break
                                  time.sleep(10)
                              if job_info['Status'] == 'COMPLETED':
                                  restored_volume_id = job_info['CreatedResourceArn'].split('/')[-1]
                                  print(f"Restored volume ID: {restored_volume_id}")
                                  # 'a' is normally taken by the root volume, so attach as 'f'.
                                  device_letter = 'f'
                                  device_name = f"{os.environ['DEVICE_NAME_PREFIX']}{device_letter}"
                                  ec2.attach_volume(
                                      VolumeId=restored_volume_id,
                                      InstanceId=instance_id,
                                      Device=device_name
                                  )
                                  ec2.get_waiter('volume_in_use').wait(VolumeIds=[restored_volume_id])
                                  print(f"Attached volume {restored_volume_id}")
                                  if run_ssm_validation(instance_id, restored_volume_id, device_name):
                                      print("Validation successful")
                                      event_data['validationStatus'] = 'Success'
                                  else:
                                      print("Validation failed")
                                      event_data['validationStatus'] = 'Failed'
                                  ec2.detach_volume(VolumeId=restored_volume_id, InstanceId=instance_id)
                                  ec2.get_waiter('volume_available').wait(VolumeIds=[restored_volume_id])
                                  print(f"Detached volume {restored_volume_id}")
                                  ec2.delete_volume(VolumeId=restored_volume_id)
                                  print(f"Deleted volume {restored_volume_id}")
                                  event_data['cleanupStatus'] = 'Completed'
                              else:
                                  timed_out = job_info['Status'] not in ('FAILED', 'ABORTED')
                                  event_data['restoreStatus'] = 'TimedOut' if timed_out else 'Failed'
                              send_to_reporter(reporter_event_bus_arn, event_data)
                      break
                  except ClientError as e:
                      retry_count += 1
                      if retry_count == max_retries:
                          print(f"Max retries reached. Error: {str(e)}")
                          event_data['error'] = str(e)
                          send_to_reporter(reporter_event_bus_arn, event_data)
                          raise
                      time.sleep(5 * retry_count)
              return {
                  'statusCode': 200,
                  'body': json.dumps(f'Processed {job_type} with status {job_status}')
              }
      MemorySize: 256
      Timeout: 300
      DeadLetterConfig:
        TargetArn: !GetAtt LambdaDLQ.Arn
      Environment:
        Variables:
          EBS_VOLUME_ID: !Join [',', !Ref EBSVolumeIds]
          EC2_INSTANCE_ID: !Ref EC2InstanceId
          DEVICE_NAME_PREFIX: !Ref DeviceNamePrefix
          REPORTER_ACCOUNT_ID: !Ref ReporterAccountId
          REPORTER_REGION: !Ref ReporterRegion
          SSM_DOCUMENT_NAME: !Ref SSMDocumentName
          # Role that AWS Backup assumes for the test restore.
          RESTORE_ROLE_ARN: !GetAtt BackupRole.Arn
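  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaEBSBackupPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # Each statement is scoped to resources its actions support; a volume ARN
              # does not match Backup, SSM, or EventBridge actions.
              - Effect: Allow
                Action:
                  - ec2:AttachVolume
                  - ec2:DetachVolume
                  - ec2:DeleteVolume
                Resource:
                  - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:volume/*
                  - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2InstanceId}
              - Effect: Allow
                Action:
                  - ec2:DescribeVolumes
                  - backup:StartRestoreJob
                  - backup:DescribeRestoreJob
                  - ssm:GetCommandInvocation
                Resource: '*'
              - Effect: Allow
                Action: ssm:SendCommand
                Resource:
                  - !Sub arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:document/${SSMDocumentName}
                  - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2InstanceId}
              - Effect: Allow
                Action: events:PutEvents
                Resource: !Sub arn:${AWS::Partition}:events:${ReporterRegion}:${ReporterAccountId}:event-bus/GlobalBackupJobStatusEventBus-${ReporterAccountId}
              # Lets the function hand the backup service role to AWS Backup for restores.
              - Effect: Allow
                Action: iam:PassRole
                Resource: !GetAtt BackupRole.Arn
              # Required for the function's dead-letter queue.
              - Effect: Allow
                Action: sqs:SendMessage
                Resource: !GetAtt LambdaDLQ.Arn
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: !Sub arn:${AWS::Partition}:logs:*:*:*
              - Effect: Allow
                Action:
                  - kms:Decrypt
                Resource: !GetAtt BackupVaultKMSKey.Arn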
  StepFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: StepFunctionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - lambda:InvokeFunction
                Resource: !GetAtt LambdaFunction.Arn
  StepFunction:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      DefinitionString: !Sub |
        {
          "Comment": "Process multiple EBS volume restores",
          "StartAt": "ProcessVolumes",
          "States": {
            "ProcessVolumes": {
              "Type": "Map",
              "ItemsPath": "$.volumes",
              "MaxConcurrency": 5,
              "Iterator": {
                "StartAt": "InvokeLambda",
                "States": {
                  "InvokeLambda": {
                    "Type": "Task",
                    "Resource": "${LambdaFunction.Arn}",
                    "Parameters": {
                      "volumeId.$": "$$.Map.Item.Value"
                    },
                    "End": true
                  }
                }
              },
              "End": true
            }
          }
        }
      RoleArn: !GetAtt StepFunctionRole.Arn
  SNSToLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref LambdaFunction
      Action: lambda:InvokeFunction
      Principal: sns.amazonaws.com
      SourceArn: !Ref SNSTopic
  SNSSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref SNSTopic
      Protocol: lambda
      Endpoint: !GetAtt LambdaFunction.Arn
  # Vault notifications (SNSTopicArn plus the BackupVaultEvents list: BACKUP_JOB_STARTED,
  # BACKUP_JOB_COMPLETED, BACKUP_JOB_FAILED, RESTORE_JOB_STARTED, RESTORE_JOB_COMPLETED,
  # RESTORE_JOB_FAILED) are set through the Notifications property of
  # AWS::Backup::BackupVault. CloudFormation has no standalone
  # AWS::Backup::BackupVaultNotifications resource type.
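  SSMValidationDocument:
    Type: AWS::SSM::Document
    Properties:
      # Named explicitly so that the SSM_DOCUMENT_NAME value given to the Lambda function resolves to it.
      Name: !Ref SSMDocumentName
      DocumentFormat: YAML
      DocumentType: Command
      Content:
        schemaVersion: '2.2'
        description: 'Validate EBS volume data integrity'
        # Every {{ placeholder }} used in the script must be declared as a parameter.
        parameters:
          VolumeId:
            type: String
            description: ID of the restored volume
          DeviceName:
            type: String
            description: Device name the volume is attached as
          Action:
            type: String
            default: validate
        mainSteps:
          - action: 'aws:runShellScript'
            name: 'validateVolume'
            inputs:
              runCommand:
                - |
                  #!/bin/bash
                  VOLUME_ID="{{ VolumeId }}"
                  DEVICE_NAME="{{ DeviceName }}"
                  if [ ! -b "$DEVICE_NAME" ]; then
                    echo "Validation failed: Device not found"
                    exit 1
                  fi
                  if ! mount -o ro "$DEVICE_NAME" /mnt; then
                    echo "Validation failed: Could not mount $DEVICE_NAME"
                    exit 1
                  fi
                  if [ -f "/mnt/testfile.txt" ]; then
                    echo "Validation successful: Test file found"
                    RESULT=0
                  else
                    echo "Validation failed: Test file not found"
                    RESULT=1
                  fi
                  # Always unmount before exiting so the volume can be detached.
                  umount /mnt
                  exit $RESULT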
Outputs:
  BackupVaultArn:
    Value: !GetAtt BackupVault.BackupVaultArn
  SNSTopicArn:
    Value: !Ref SNSTopic
  LambdaFunctionArn:
    Value: !GetAtt LambdaFunction.Arn
  SSMDocumentArn:
    # !Ref on an SSM document returns its name, so the ARN is assembled here.
    Value: !Sub arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:document/${SSMValidationDocument}
  StepFunctionArn:
    Value: !Ref StepFunction
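# Example execution input for the state machine (a sketch; the volume IDs are placeholders).
# The Map state reads the list at $.volumes and passes each entry to the Lambda function
# as volumeId:
#
#   {"volumes": ["vol-1234567890abcdef0", "vol-0987654321fedcba0"]}
#
# Note that the handler in this template reads event['Records'] from SNS, so it would need
# a branch for this {"volumeId": ...} payload before the state machine can drive it.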
