Multi-region S3 failover with Route 53

A couple of weeks before writing this post, AWS had a single-region failure of S3. It was the worst failure of S3 ever, and it took down many services. We at IOpipe survived fairly well: our dashboard went offline, but our APIs and metrics ingestion were unaffected.

Still, we’d like to keep these problems from happening in the future, and it turned out that resolving this was really, really easy. Yet I couldn’t find much in the way of documentation, tutorials, or how-tos on configuring S3 multi-region failover.

This tutorial will show you how to configure a URL which provides multi-region failover for S3 buckets, protecting against regional failures in S3. We use Route 53 for this. If you do not use Route 53 for your authoritative DNS servers, you may set up a subdomain and use CNAMEs to delegate specific records to Route 53. This tutorial does not cover the remediation of potential Route 53 failures.

Cross-region Replication

The first thing we need to do is have our S3 data replicated. Amazon offers a replication feature out of the box, but it only takes effect for new or updated objects, so anything already in the bucket has to be copied over separately. This replication comes with a couple of downsides:

  • Primary-secondary architecture: at the time of writing, there cannot be 3-4 copies
  • Versioning must be enabled on the buckets, so old copies are archived and kept in the bucket

For many websites, a simple two-region replication will be sufficient, and the size of objects will be small enough that versioning will not be an issue. However, it would be relatively easy to use an S3-triggered Lambda function instead, allowing non-versioned copies to multiple regions, as long as the data can be copied within the Lambda execution time limit (5 minutes at the time of writing).
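The Lambda-based alternative mentioned above could look roughly like the following. This is a minimal sketch, not something we run at IOpipe: the bucket names and regions in DESTINATIONS are hypothetical placeholders, and it assumes the function is subscribed to the source bucket's object-created events and has permission to read the source and write to each destination.

```python
from urllib.parse import unquote_plus

# Hypothetical destination buckets, keyed by region. Replace with your own.
DESTINATIONS = {
    "us-west-2": "example-assets-us-west-2",
    "eu-west-1": "example-assets-eu-west-1",
}


def planned_copies(event, destinations=DESTINATIONS):
    """Return one (region, dest_bucket, source_bucket, key) tuple per copy."""
    copies = []
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        for region, dest_bucket in destinations.items():
            copies.append((region, dest_bucket, source_bucket, key))
    return copies


def handler(event, context):
    """Lambda entry point: copy each new object to every destination bucket."""
    import boto3  # imported lazily so planned_copies can be tested without AWS

    for region, dest_bucket, source_bucket, key in planned_copies(event):
        s3 = boto3.client("s3", region_name=region)
        s3.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```

Note that a single copy_object call only handles objects up to 5 GB, so very large objects would need a multipart copy instead.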

At IOpipe, we chose to use the built-in replication.

S3 Bucket Replication Configuration:
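For readers who prefer code to the console, a rough equivalent can be expressed with boto3. This is a sketch under assumptions, not the exact settings we used: the function names, the rule ID, and every argument (bucket names, regions, the IAM role ARN) are placeholders, and the role is assumed to already exist with permission to replicate between the two buckets.

```python
def replication_config(role_arn, destination_bucket):
    """Build a ReplicationConfiguration that replicates every object."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule name
                "Status": "Enabled",
                "Prefix": "",  # an empty prefix matches all keys
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket, source_region,
                       destination_bucket, destination_region, role_arn):
    """Turn on versioning for both buckets, then replication on the source."""
    import boto3  # imported lazily so replication_config is testable offline

    for bucket, region in ((source_bucket, source_region),
                           (destination_bucket, destination_region)):
        boto3.client("s3", region_name=region).put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )
    boto3.client("s3", region_name=source_region).put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, destination_bucket),
    )
```

Versioning is switched on first because replication cannot be enabled without it.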

CloudFront

Each S3 bucket is put behind CloudFront. This provides TLS termination, caching, and other resiliency features. This is strictly optional, but we use it at IOpipe. If you leave CloudFront out, you will simply need fewer health checks.

It is also possible to perform failover between one bucket fronted by CloudFront and another accessed directly, guarding against CloudFront outages.

Route 53 Health Checks

This is where things get interesting. We set up three health checks per region: one checking the health of the S3 bucket, another checking the health of the CloudFront distribution, and a calculated health check requiring that BOTH of the previous checks be green. This last health check is the one we monitor from Route 53.

We use simple “Basic” health checks of AWS endpoints. As configured, each health check costs $0.50/mo. With 6 checks, this is a total of $3/mo per multi-region bucket. (If not using CloudFront, this would be $1/mo.)
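The arithmetic behind those figures can be spelled out in a few lines. The function name and parameters are made up for illustration, and the price is the one quoted above rather than a value looked up from AWS.

```python
CHECK_PRICE_USD = 0.50  # per health check per month, as quoted above


def monthly_cost(regions=2, use_cloudfront=True):
    """Monthly health check cost for one multi-region bucket."""
    # With CloudFront: bucket check + distribution check + calculated check.
    # Without it: the bucket check alone, so no calculated check is needed.
    checks_per_region = 3 if use_cloudfront else 1
    return regions * checks_per_region * CHECK_PRICE_USD


print(monthly_cost())                      # 2 regions x 3 checks
print(monthly_cost(use_cloudfront=False))  # 2 regions x 1 check
```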

Health Check Configuration:
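As a rough code equivalent of the three checks per region, the sketch below builds the health check settings and creates them with boto3. The helper names, the 30-second interval, the failure threshold of 3, and the resource path are assumptions for illustration, not our exact settings. The path should point at something that returns a 2xx or 3xx status, such as a small public object.

```python
import uuid


def endpoint_check(domain_name, path="/"):
    """Settings for a basic HTTPS check against one endpoint."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain_name,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": 30,   # seconds between checks (assumed value)
        "FailureThreshold": 3,   # consecutive failures before unhealthy (assumed)
    }


def calculated_check(child_ids):
    """Settings for a calculated check that needs ALL children healthy."""
    child_ids = list(child_ids)
    return {
        "Type": "CALCULATED",
        "ChildHealthChecks": child_ids,
        "HealthThreshold": len(child_ids),
    }


def create_region_checks(bucket_domain, distribution_domain, path="/"):
    """Create both endpoint checks plus the calculated check; return its ID."""
    import boto3  # imported lazily so the builders above are testable offline

    route53 = boto3.client("route53")
    child_ids = []
    for domain in (bucket_domain, distribution_domain):
        response = route53.create_health_check(
            CallerReference=str(uuid.uuid4()),
            HealthCheckConfig=endpoint_check(domain, path),
        )
        child_ids.append(response["HealthCheck"]["Id"])
    response = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig=calculated_check(child_ids),
    )
    return response["HealthCheck"]["Id"]
```

Calling create_region_checks once per region yields the two calculated check IDs that the failover records are associated with.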

Route 53 Routing Policies

For the DNS name pointing to these S3 buckets, we created one CNAME record per CloudFront distribution, then enabled a “Failover” Routing Policy on each. One CloudFront distribution (and underlying bucket) became Primary, and the other Secondary.

For each CNAME, we enabled “Associate with Health Check”, specifying the appropriate calculated health check.
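The same pair of records can be sketched in code. The record name, CloudFront domain names, health check IDs, hosted zone ID, and the 60-second TTL are all placeholders chosen for illustration, and the helper names are hypothetical.

```python
def failover_record(name, target, role, health_check_id, ttl=60):
    """Build one UPSERT change for a failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            # Records sharing a name need distinct set identifiers.
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


def apply_failover(hosted_zone_id, changes):
    """Submit the record changes to Route 53."""
    import boto3  # imported lazily so failover_record is testable offline

    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={"Changes": changes},
    )


# Example usage with placeholder values:
changes = [
    failover_record("assets.example.com.", "d111example.cloudfront.net",
                    "PRIMARY", "primary-calculated-check-id"),
    failover_record("assets.example.com.", "d222example.cloudfront.net",
                    "SECONDARY", "secondary-calculated-check-id"),
]
```

A short TTL keeps clients from caching the primary's address for long after a failover, at the cost of more DNS queries.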

Primary:

Secondary:

Conclusion

This configuration might, honestly, be overkill for the frequency at which S3 has serious outages, but it’s also not too difficult or costly to configure and maintain. The $3/mo extra we now pay to AWS has bought us some peace of mind.
