AWS AZ Failure Simulation

Another not sexy but actually cool new feature – AZ failure simulation. What and how and why?

AWS’s Fault Injection Service (FIS) has been around a while, allowing you to trigger system issues and failures. This is the sort of thing Netflix made famous with Chaos Monkey: we can randomly terminate instances, max out their CPU, or interrupt connectivity.

Some new features were unveiled at re:Invent that let us test resilience against entire AZ failures (e.g. power or network) and cross-region connectivity failures: https://aws.amazon.com/about-aws/whats-new/2023/11/aws-fault-injection-service-two-requested-scenarios/

Right now I’m more interested in AZ failure as most work on my current project is single region, but we do use multiple AZs as normal for HA. We assume that the HA will work as we expect, and we can test by terminating resources, but until now it’s not been simple to test the entire AZ (including networking) failing.

So how do we do this?

While a lot of simulations can be executed programmatically, there are scenarios which are purely console based, and AZ failure is one of them. It looks relatively straightforward, but my lack of IAM permissions is stopping me getting far with the account I’m using right now. Basically we specify the ARNs/tags of the resources we want hit by the various scenarios, press the button and watch the world burn. Or rather, watch the simulation of the AZ power failing for 30 minutes, followed by intermittent issues for the next 30.
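
To give a feel for the "specify the tags, pick the AZ" part outside the console, here is a minimal Python sketch that builds a request body shaped like the FIS CreateExperimentTemplate call. It is not the AZ power scenario itself, just a single stop-instances action against tagged EC2 instances in one AZ. The function name, tag, AZ and role ARN are all made up for illustration, and I've not run this against a real account.

```python
import json


def build_az_stop_template(az: str, tag_key: str, tag_value: str, role_arn: str) -> dict:
    """Build a dict shaped like an FIS CreateExperimentTemplate request.

    Hypothetical helper: targets running EC2 instances with the given tag
    in one AZ, stops them, and lets FIS restart them 30 minutes later.
    """
    return {
        "description": f"Stop tagged EC2 instances in {az} for 30 minutes",
        "roleArn": role_arn,  # assumed: a role FIS is allowed to assume
        # No automatic stop condition here; a real experiment would
        # more sensibly point at a CloudWatch alarm.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "az-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "filters": [
                    {"path": "Placement.AvailabilityZone", "values": [az]},
                    {"path": "State.Name", "values": ["running"]},
                ],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instances after 30 minutes
                "parameters": {"startInstancesAfterDuration": "PT30M"},
                "targets": {"Instances": "az-instances"},
            }
        },
    }


# Placeholder values, not from any real account.
template = build_az_stop_template(
    az="eu-west-2a",
    tag_key="AzImpairmentPower",
    tag_value="StopInstances",
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",
)
print(json.dumps(template, indent=2))
```

Assuming suitable IAM permissions, a dict like this could be passed to boto3 with something like boto3.client("fis").create_experiment_template(**template), but that step is left out so the sketch runs without AWS credentials.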

The sad thing for me is that this does not cover Amazon ECS tasks running on AWS Fargate, as these are the workloads my current project is moving toward, and I’d love to prove their availability. At least I’ll be able to demonstrate the EC2 and RDS failover.

I’ve not tried AWS FIS at all before now, but now I really want to. It gives me a chance to have a bit of fun testing what happens when things go wrong (beyond just terminating an instance and watching the ASG bring up another, possibly in the same AZ). Obviously not something I’ll try on a customer account on a whim, but I’ll propose investigating this in a future sprint.
