How Amazon RDS Replication Works and Why the FAA’s Database Problem Won’t Happen in AWS

January 30, 2023
  • Bohan Zhang

  • Andy Pavlo

On January 11, 2023, all flights in the US were grounded because of the Federal Aviation Administration (FAA) NOTAM system outage. According to the FAA, the outage was due to a corrupt database file. An engineer then tried to replace the file with a backup but it turned out that the backup file was busted too. FAA officials said that it was an “honest mistake that cost the country millions [of dollars].”

This is a big screw up. And it shows the importance of high-availability and reliability for mission-critical systems. It’s 2023 afterall. Such a failure shouldn’t happen for such a service. But the FAA’s NOTAM system is 30 years old. We don’t know the technical details about NOTAM and how the file got corrupted (e.g., bit rot, software error), but clearly the FAA’s database street game was weak.

Since databases are the most important things in the world, we want to discuss how replication works in Amazon’s Relational Database Service (RDS) for both PostgreSQL and MySQL to ensure that your database continues to operate even when one of its parts fails or you get locked up. But as we describe below, we see people that don’t set up enable replication properly for their production databases. That means that they could also got lit up with an outage like the FAA.

For OtterTune’s customers, 70% of production Aurora clusters have read replicas. But only 32% of production non-Aurora RDS instances are replicating to at least one standby instance.

There are two parts of this story. The first is how to ensure that the database’s storage is replicated. Amazon RDS stores Elastic Block Storage (EBS) to maintain multiple copies of the files. Using EBS this way is not that interesting from a database perspective because the DBMS does not know about this replication. Furthermore, it does not solve the downtime issue that hit the FAA. If the DBMS instance crashes, you have to either wait for the instance to come back online (or start a new one) and then recover the database.

The more interesting challenge is how to ensure that if the DBMS crashes, the system can still stay online with minimal disruption and without any data loss. We’ll discuss how regular RDS databases achieve this first, then we’ll talk about how Aurora is different.

How Amazon RDS Replication Works

When you delete a file on your laptop by accident, you first see whether there is a backup of that file. The idea behind high-availability database systems is similar, and redundancy can make a system more robust and fault-tolerant against unexpected failures. A primary instance of the DBMS replicates the database to a standby replica instance. Some part of the system designates the replica as the new primary if the primary crashes or becomes unavailable. This failover process is either initiated from the replicas themselves or an external service. In the case of AWS, RDS coordinates this failover process.

With an Amazon RDS instance with Multi-AZ, the primary runs in one availability zone (AZ) and then replicates data to a standby instance located in a different AZ. Amazon’s Availability Zones are locations within a region that are isolated from each other (e.g., us-east-1a vs. us-east-1b). Otherwise, an issue affecting an entire AZ can simultaneously cause outages for both instances. When AWS detects a failure in the primary, it automatically fails over to the standby instance.

To understand how the replication works, let’s consider read-only workloads first. In this case, data in the primary instance does not change, so we only need to copy all the data to the standby instance once. For workloads with writes (e.g., INSERT, UPDATE, DELETE queries), the primary instance must propagate changes to the replica. MySQL and PostgreSQL record these changes in redo logs. These DBMSs use different names for these logs, but they are essentially the same thing: MySQL calls them the binary log (binlog), and PostgreSQL calls them the write-ahead log (WAL). During the replication, the primary continuously sends the latest redo log entries to the standby instance. When the standby receives a new redo log entry, it applies the changes to its local database.

Amazon RDS Multi-AZ Deployment

In a Multi-AZ DB instance deployment, this replication process is synchronous. Suppose the application starts a transaction that changes the database on the primary. When the application attempts to commit (i.e., save) those changes, the primary waits until all standbys acknowledge that they applied a transaction’s redo log entries to their copy of the database before it tells the application that the changes are saved (i.e., committed). This waiting ensures that all the primary and replicas are always in sync. Now, if the primary crashes and a replica becomes the new primary, the database will not lose any changes from previously committed transactions (i.e., the Durability guarantee in ACID). Of course, transactions will take longer to commit because of the extra communication costs.

PostgreSQL RDS goes a little further by also storing redo log entries in an S3 bucket. Replicas can then retrieve log entries that they are missing from S3 in case they cannot keep up with the primary.

How Amazon Aurora replication works

Aurora MySQL and Aurora PostgreSQL replicate data slightly differently than their RDS counterparts. For this discussion, we will refer to both systems as “Aurora” as they use the same replication procedure in their storage layer. Each Aurora storage node includes a computational component that natively processes redo log entries and applies them to database pages. This is the “secret sauce” that Amazon created for this service.

Aurora stores each database in a cluster volume that maintains multiple copies across different AZs. In an Aurora cluster, there is one primary instance that supports read and write operations, and multiple read-only replica instances. All instances connect to the same cluster volume in the storage layer. If a primary instance fails or crashes, one of the replica instances takes over as the new primary.

Amazon Aurora Replicated Deployment

When handling a write request, Aurora sends data to six storage nodes across three AZs and waits for the acknowledgment from at least four nodes before the write is committed (i.e., quorum writes). Replicating the data to six nodes can generate huge network IO, but the system only sends redo log records to minimize network traffic. Each storage node asynchronously applies log records to data pages.

By maintaining six copies of the data, Aurora detects failures in the disk volumes and repairs them using the data in other volumes. It can tolerate at most three damaged copies out of six copies. Even if three copies are damaged, Aurora can still rebuild them from other intact copies. This repair process is automatic. You do not need to hit up somebody on their pager at 2:00am to get them back into the cut to manually copy a file like the feds tried.

Why the FAA’s Problem Can’t Happen in Amazon RDS

Given all this, there are several reasons why we are so sure that you are not likely to have the NOTAM-style problem for your MySQL and PostgreSQL databases in RDS if you ride with replicated instances.

  • First, Amazon does not allow you to modify database and log files manually. This means nobody can log into nodes and manually overwrite files like the FAA’s engineer, which is a common practice for disaster recovery in on-prem databases. This restriction avoids many human errors.
  • RDS also replicates data to multiple locations at the storage layer to protect it from the failure of any single component. Even in the rare event of the system going off the chain accidentally deleting or corrupting a file, the system can recover the data from other copies or historical copies (e.g., EBS snapshots).
  • Lastly, Amazon RDS has replicated DB instances and supports failover. It synchronously replicates data from the primary instance to other instances deployed at different AZs. If the primary DB instance crashes due to any corrupted file, it will fail over to the standby or replica instance to continue operating correctly.

But it is important to understand that while Amazon makes sure that your RDS database files are safe (i.e., you are unlikely to face a file corruption problem), they do not provide standby replicas by default. You have to manually create replicas in AWS to get this capability. Without this, then if (when) your DBMS instance crashes, your datatabase will be unavailable until it comes back online. In the best case scenario, the DBMS will come back in under a minute. But depending on the size of the log and the last time the DBMS took a checkpoint, it could be hours or even days!

Not Everyone Uses Replication for their Production Databases!

When we examine the databases of OtterTune’s customers, we were surprised to see that not everyone is running their prodution databases with replicated instances. This is not an exact survey. We just checked whether the name or tag of the RDS instance contained the word prod (roughly 40% of all OtterTune databases). For Aurora, 70% of production Aurora clusters using OtterTune have read replicas (it was roughly the same for MySQL and PostgreSQL). But only 32% of production non-Aurora RDS instances are running with at least one standby instance. The following graph shows the percentage of these databases by instance size:

Percentage of RDS databases per instance type that do not have replication enabled.

As you can see, the majority of RDS databases are large or smaller. There is a surprisingly a few of the larger instance sizes (e.g., 24xlarge) that are running without replication! If the database is so important to run on an instance that costs roughly $40,000 a year, you would think that they would want to make sure that it is always available. But it doubles the costs of the deployment. So people choose to go without database replication like driving without a seatbelt.

Other Ways to Protect Your Database

OtterTune cannot magically make replicated instances for you. There are a lot of external costs in such a decision (e.g., latency SLAs, cloud spending) that our machine learning algorithms cannot divine on their own. We may get there at some point. But there are other things that OtterTune ensures are running correctly and smoothly in your RDS PostgreSQL and MySQL databases, like configuration knobs, indexes, table settings, and queries. Try it out on your first database for free!

Ready to put your database optimization on autopilot?

Use OtterTune to automatically check the health of your Amazon RDS MySQL and PostgreSQL databases.

Connect your first database for free