@zmccoy
Created October 23, 2024 20:25
Why are we talking about Active/Active so much, and how does it affect our goals as a business and the stability we provide to our customers? Active-active is a system posture where multiple nodes in a network, such as multiple GCP regions, are active and operational at the same time. These nodes can process requests and handle traffic independently of each other. Users are routed to their geographically closest region through GCP Load Balancers, which makes their experience faster.
This stance allows us to more easily meet our regulatory requirements around resilience, i.e., how long we can be unavailable as a system. Our regulators look at our mean uptime, comparing it against a regulatory requirement that is much stricter than the 99.5% in our SLA. Jack Henry is also required to run in alternate data centers or regions once per year. We translate that into RTO (Recovery Time Objective) criticality: at 72 hours, zonal redundancy is fine; with a requirement of less than 24 hours, regional redundancy is needed. Banno's RTO is currently 4 hours. RPO (Recovery Point Objective) is how much data we can afford to lose; ours is currently 1 hour, which is the lag of streaming replication in the DB. This puts us in a tighter spot than we'd like.
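To make those availability numbers concrete, here is a quick back-of-the-envelope sketch (plain Python, assuming only the 99.5% SLA figure above and Spanner's 99.999% discussed below) of the yearly downtime budget an SLA percentage implies:

```python
# Back-of-the-envelope: yearly downtime budget implied by an availability SLA.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_budget_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(downtime_budget_hours(99.5))    # ~43.8 hours/year at our 99.5% SLA
print(downtime_budget_hours(99.999))  # ~0.09 hours/year (about 5 minutes)
```

A regulatory bar much stricter than 99.5% means a budget well under those ~44 hours per year, which is why a multi-hour DR failover consumes so much of it.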
Incidents such as region-wide failures in Azure had only one out: a full disaster recovery failover. With that setup we needed to wait for the streaming replication of the Postgres databases to finish before we could serve traffic in another region. For example, this took about an hour during the GCP move. That's 1 hour of downtime for the DB master to recover _after_ we acknowledge that we need to perform a DR. Our Dobby team is working through January to get our DR posture into an acceptable state for the Banno GCP project.
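Because the RPO is effectively the replication lag, that lag is worth watching directly. A minimal sketch of measuring it on a Postgres replica (the connection details are hypothetical, and this assumes the psycopg2 client):

```python
# Sketch: observe streaming replication lag on a replica; this lag bounds our RPO.
import psycopg2  # assumes psycopg2 is installed and the replica is reachable

# Hypothetical connection details for illustration only.
with psycopg2.connect(host="replica.internal", dbname="banno", user="monitor") as conn:
    with conn.cursor() as cur:
        # pg_last_xact_replay_timestamp() is the commit time of the last
        # transaction replayed from the primary's WAL stream.
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;")
        (lag,) = cur.fetchone()
        print(f"replication lag (approximate RPO exposure): {lag}")
```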
In a world where we're fully active/active, a region failing means we direct traffic to the other region, which is already active and serving traffic, and scale up. This cuts our downtime tremendously, and disaster recovery becomes much simpler to perform. It's good to remember that downtime for maintenance also counts against us, both for our regulatory needs and from our customers' perspective. Making DR faster and easier is table stakes given how much DB-related downtime affects us: DB maintenance currently accounts for the majority of our downtime.
Looking at GCP today, we have a few choices for databases that can perform reads and writes across regions while achieving consistency. Beyond their data consistency story, these databases are managed, which fulfills regulatory requirements around patching, including critical vulnerability patching. The choices are MongoDB Atlas for document storage and Spanner for relational needs. Spanner has other capabilities, such as graph support, but we won't get into those here. Both RTO and RPO for Spanner are theoretically 0, and Spanner's uptime as defined by Google is five 9's (99.999%). This gives us a healthy starting point for our own reliability, since we can only be as reliable as what we're standing on.
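As a sketch of what Spanner looks like from the application side, here is a minimal write and strongly consistent read using the google-cloud-spanner Python client (the project, instance, database, and table names are hypothetical; the multi-region behavior comes from the instance configuration rather than anything in the code):

```python
from google.cloud import spanner  # assumes google-cloud-spanner is installed

# Hypothetical project/instance/database names for illustration.
client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("my-database")

# Writes are mutations committed atomically; Spanner replicates them
# synchronously across the instance's regions.
with database.batch() as batch:
    batch.insert(
        table="accounts",
        columns=("account_id", "balance"),
        values=[("a-1", 100)],
    )

# A strong read observes all previously committed writes, from any region.
with database.snapshot() as snapshot:
    for row in snapshot.execute_sql("SELECT account_id, balance FROM accounts"):
        print(row)
```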
BigQuery supports federated queries against Spanner (as well as AlloyDB and Cloud SQL), which let us send a query statement to Spanner and get the result back as a temporary table. This matters to us because of our product Data-Broker. Data-Broker is a per-customer BigQuery instance that the customer owns. That instance is fed by our data through many means, and one of those means could be federated queries. The benefit of this data living in BigQuery or Spanner is that Vertex AI can use it, via federated queries or directly from Spanner. This enhances what data we have available to our AI products at Jack Henry.
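For illustration, a federated query from the BigQuery side might look like the sketch below, using the google-cloud-bigquery Python client (the connection resource name, project, and table are hypothetical; the connection to Spanner must be created in advance):

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client(project="customer-project")  # hypothetical project

# EXTERNAL_QUERY pushes the inner statement down to Spanner through a
# pre-created connection resource (the name here is hypothetical).
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'us.spanner-connection',
  'SELECT account_id, balance FROM accounts'
)
"""
for row in client.query(sql).result():
    print(dict(row))
```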