One of my clients is a large company with a highly variable but predictable database usage pattern.
On weekday mornings they generate a huge volume of INSERTs and UPDATEs, while for the rest of the day, from noon on,
the workload is mostly a moderate amount of SELECTs with very little write activity.
The smallest instance class that can handle the morning's volume is db.r3.large,
but keeping such an instance running all day long and through the weekend is just a big waste of money
(or a big favour we do to Amazon, from their point of view).
So I went looking for an autoscaling mechanism for Aurora writers (there is one, but only for replicas) and, finding none, I came up with this solution, which so far has shown no downtime (though it can cause some, as I'll discuss later).
Let's say we have an Aurora cluster called aurora-demo,
composed of a writer, aurora-demo-writer-large,
and a replica, aurora-demo-replica.
Let's also say that we are in the eu-central-1
region and our endpoints are:
Read-Write: aurora-demo-cluster.cluster-qu4cchec0s4.eu-central-1.rds.amazonaws.com
Read-Only: aurora-demo-cluster.cluster-ro-qu4cchec0s4.eu-central-1.rds.amazonaws.com
And finally, let's say we are on a Unix box with the latest version of the AWS CLI installed and configured.
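Note in passing that the two endpoints differ only by the "ro" infix after "cluster", so if a script ever needs both, the read-only hostname can be derived from the cluster one with plain bash substitution (a small illustration using the demo hostnames above):

```shell
#!/bin/bash
# Read-write (cluster) endpoint from the setup above
RW=aurora-demo-cluster.cluster-qu4cchec0s4.eu-central-1.rds.amazonaws.com
# The read-only endpoint only adds the "ro" infix after "cluster"
RO=${RW/.cluster-/.cluster-ro-}
echo "$RO"   # aurora-demo-cluster.cluster-ro-qu4cchec0s4.eu-central-1.rds.amazonaws.com
```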
It's almost noon, and it's time to scale in the cluster...
#!/bin/bash
set -Eeuo pipefail
##
## cluster_scale_in.sh
##
# New writer's name - note that the first part of the identifier is the cluster's name -
IDENT=aurora-demo-writer-small
# Cluster's name
CLUST=aurora-demo
# Current writer's name - column 3 of the DBCLUSTERMEMBERS rows, filtered on the IsClusterWriter flag -
WRITER=$(aws rds describe-db-clusters --db-cluster-identifier "$CLUST" --output text | grep DBCLUSTERMEMBERS | grep True | awk '{print $3}')
# Create a replica of desired size
aws rds create-db-instance --db-instance-identifier $IDENT --db-cluster-identifier $CLUST --engine aurora --db-instance-class db.t2.small --promotion-tier 0 --output text
# Wait until it comes up
until aws rds describe-db-instances --db-instance-identifier $IDENT --output text | grep available; do sleep 15; echo Waiting for $IDENT to come up...; done
# Set the cluster in failover state, forcing the promotion of the new replica
aws rds failover-db-cluster --db-cluster-identifier $CLUST --target-db-instance-identifier $IDENT --output text
# Wait until the new replica gets promoted
until aws rds describe-db-clusters --db-cluster-identifier $CLUST --output text | grep $WRITER | grep False; do sleep 5; echo Waiting for $IDENT to be promoted...; done
# Delete the old expensive writer
aws rds delete-db-instance --db-instance-identifier $WRITER --output text
# Wait until the deletion is complete
while aws rds describe-db-instances --db-instance-identifier $WRITER --output text | grep deleting; do sleep 15; echo Waiting for $WRITER to be deleted...; done
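The `WRITER=` pipeline can be tried in isolation: in the `--output text` format, each `DBCLUSTERMEMBERS` row carries the instance identifier in the third column and the `IsClusterWriter` flag next to it, so grepping for `True` and printing `$3` yields the current writer. A sketch with mocked output (the sample rows are made up, mirroring the demo cluster; real output has more rows and fields):

```shell
#!/bin/bash
set -Eeuo pipefail

# Mocked excerpt of `aws rds describe-db-clusters --output text`
mock_describe() {
cat <<'EOF'
DBCLUSTERMEMBERS in-sync aurora-demo-replica False 1
DBCLUSTERMEMBERS in-sync aurora-demo-writer-large True 0
EOF
}

# The same filter chain used for WRITER= in cluster_scale_in.sh:
# keep the member rows, keep the one flagged True, print the identifier
WRITER=$(mock_describe | grep DBCLUSTERMEMBERS | grep aurora-demo | grep True | awk '{print $3}')
echo "$WRITER"   # aurora-demo-writer-large
```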
- The whole process takes about 15 minutes to complete.
- So far I have never experienced any downtime, even though some colleagues pointed out that there can be some DNS adjustment during the promotion phase.
- The most worrying part of the process is this note in the help page of failover-db-cluster:
You can force a failover when you want to simulate a failure of a primary instance for testing.
- So this command was never actually meant to be used in this scenario.
- I haven't measured the volume of data transfer involved in this process; I suppose it's not negligible.
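One more hardening worth considering: the wait loops in the scripts run forever if something gets stuck. A bounded variant is easy to write (a sketch; `wait_for` is a hypothetical helper, and in the real scripts its first argument would be the `aws rds describe-... | grep ...` test):

```shell
#!/bin/bash
set -Eeuo pipefail

# Run a condition command until it succeeds, but give up after $2 attempts.
# $1: command to test, $2: max attempts, $3: seconds to sleep between attempts
wait_for() {
  local cmd=$1 max=$2 pause=$3 i
  for ((i = 1; i <= max; i++)); do
    if eval "$cmd"; then return 0; fi
    echo "Attempt $i/$max failed, retrying in ${pause}s..."
    sleep "$pause"
  done
  echo "Timed out waiting for: $cmd" >&2
  return 1
}

# Trivial example: a condition that succeeds immediately
wait_for "true" 3 1
```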
When the next weekday morning approaches, we scale out the other way around:
#!/bin/bash
set -Eeuo pipefail
##
## cluster_scale_out.sh
##
# New writer's name - note that the first part of the identifier is the cluster's name -
IDENT=aurora-demo-writer-large
# Cluster's name
CLUST=aurora-demo
# Current writer's name - column 3 of the DBCLUSTERMEMBERS rows, filtered on the IsClusterWriter flag -
WRITER=$(aws rds describe-db-clusters --db-cluster-identifier "$CLUST" --output text | grep DBCLUSTERMEMBERS | grep True | awk '{print $3}')
# Create a replica of desired size
aws rds create-db-instance --db-instance-identifier $IDENT --db-cluster-identifier $CLUST --engine aurora --db-instance-class db.r3.large --promotion-tier 0 --output text
# Wait until it comes up
until aws rds describe-db-instances --db-instance-identifier $IDENT --output text | grep available; do sleep 15; echo Waiting for $IDENT to come up...; done
# Set the cluster in failover state, forcing the promotion of the new replica
aws rds failover-db-cluster --db-cluster-identifier $CLUST --target-db-instance-identifier $IDENT --output text
# Wait until the new replica gets promoted
until aws rds describe-db-clusters --db-cluster-identifier $CLUST --output text | grep $WRITER | grep False; do sleep 5; echo Waiting for $IDENT to be promoted...; done
# Delete the old cheap writer
aws rds delete-db-instance --db-instance-identifier $WRITER --output text
# Wait until the deletion is complete
while aws rds describe-db-instances --db-instance-identifier $WRITER --output text | grep deleting; do sleep 15; echo Waiting for $WRITER to be deleted...; done
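Putting the two scripts together, the obvious way to automate the cycle is cron. A sketch of a crontab (the paths, times, and log file are illustrative, not part of the original setup):

```shell
# m  h  dom mon dow  command
# Scale out to db.r3.large early on weekday mornings
30 6  *   *   1-5   /opt/aurora/cluster_scale_out.sh >> /var/log/aurora-scale.log 2>&1
# Scale back in to db.t2.small at half past noon, Monday through Friday
30 12 *   *   1-5   /opt/aurora/cluster_scale_in.sh  >> /var/log/aurora-scale.log 2>&1
```

With the `1-5` day-of-week field, the cluster simply stays on the small instance over the weekend.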
Hey, terrific write-up; Google pointed me to your script here on GitHub.
One improvement I've made to the process is adding an RDS Proxy, which speeds up the DNS failover to around 3 seconds, among other benefits.
RDS Proxy costs money, but for me downtime must be minimized, so it's worth it: https://aws.amazon.com/blogs/database/improving-application-availability-with-amazon-rds-proxy/