A massive Cloud Front hit Sydney, bringing a storm that took one of the Availability Zones in AWS’ Sydney region offline for some time on Saturday at 4pm AEST. This affected major websites that rely on AWS for their hosting infrastructure.
Updated: The issues that customers experienced during this outage are now outlined at the end of this article. I originally forgot to include them as an overview and for better context. The purpose of this article is to give you some tips on how to design a “fail safe” solution on AWS, but it is also interesting to note how the outage affected specific AWS services, particularly managed services such as RDS, ElastiCache and Redshift.
What better opportunity to review how you run and design applications in AWS than after an outage? Hindsight is a beautiful thing and ideally you will always take the approach of designing the solution right from the start, but with customer and business demands I understand the pressure to deliver a website and then continuously improve it.
If several of the high-profile websites that suffered outages had followed AWS architecture best practices and design patterns, they most likely wouldn’t have suffered outages to their websites, mobile apps or other critical applications. They might have had mild performance degradation, but only for a short time.
For example, Teem’s website (http://teem.nz) suffered only a few seconds of outage as the storms swept across Sydney, Australia. We host our website on S3 and use AWS Lambda for dynamic interaction. Both services are spread across all Availability Zones in the Sydney region to meet SLAs such as 99.99999…% availability, so we didn’t really have the problem that higher-profile sites running on EC2 instances would have had.
For websites that run custom code on EC2 or use third-party tools such as Drupal, WordPress or Sitecore, the risk of outage is a lot higher. How much higher depends entirely on the “decoupling” of all services associated with the website and on how well your AWS infrastructure is designed for failure.
Tip #1 Design for failure
This is one of the key concepts of cloud computing with AWS, and probably the most important one. If you design for failure correctly, you are inherently taking advantage of the services AWS offers to provide high availability, failover based on health checks, load balancing and scaling of the number of servers during traffic spikes.
This tip covers using:
· Elastic Load Balancers
· Multiple Availability Zones
· CloudWatch monitoring
· Decoupling to HA services such as AWS RDS.
Use the tools at your disposal — they don’t cost much to implement and will save your business $1,000s in costs/lost revenue from an outage.
If you decouple your site’s code, content, caching and databases into separate services such as RDS, ElastiCache and CloudFront/S3, you increase both availability and performance (see also Tip #3).
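To put a rough number on why multiple Availability Zones matter, here is a back-of-the-envelope sketch in Python. It assumes that each AZ fails independently and that your site stays up as long as at least one AZ is healthy. The 99.5% per-AZ figure is made up for illustration and is not an AWS number.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Chance that at least one of az_count AZs is up, assuming independent failures."""
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("per_az_availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The site is down only when every AZ is down at the same time.
    return 1.0 - (1.0 - per_az_availability) ** az_count


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    assumed_per_az = 0.995  # illustrative assumption, not an AWS figure
    for az_count in (1, 2, 3):
        availability = combined_availability(assumed_per_az, az_count)
        hours = expected_downtime_hours(availability)
        print(f"{az_count} AZ(s): {availability:.7f} available, ~{hours:.3f} h/year down")
```

Real failures are not perfectly independent, so treat the output as an argument for spreading across AZs rather than a prediction.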
Tip #2 Bake your “golden” image
I know this term is thrown around a lot, but a golden image is an exact copy of your production build. It is the EC2 AMI that holds your OS, app configuration and deployed code, and you can rebuild it whenever there are new code releases, configuration changes or patches. The image should do enough of the grunt work that, when you launch it, it needs only a minimal amount of “bootstrapping”, such as modifying a database config file or app setting. Each environment is different, so the amount of “pre-baking” you do depends on your requirements.
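As an illustration of how small that bootstrapping step can be, here is a minimal Python sketch. It assumes the baked image already contains a config template and that the only launch-time work is filling in environment-specific database settings. The template contents, the environment variable names (APP_DB_HOST, APP_DB_PORT, APP_DB_NAME) and the default port are all hypothetical.

```python
import os
from string import Template

# Hypothetical template baked into the AMI; only environment-specific
# values are left open for the bootstrap step.
DB_CONFIG_TEMPLATE = Template(
    "[database]\n"
    "host = $db_host\n"
    "port = $db_port\n"
    "name = $db_name\n"
)


def render_db_config(env) -> str:
    """Fill the baked template from launch-time settings.

    Raises KeyError if a required setting is missing, so a misconfigured
    instance fails at boot instead of serving traffic with a bad config.
    """
    values = {
        "db_host": env["APP_DB_HOST"],
        "db_port": env.get("APP_DB_PORT", "3306"),  # assumed default port
        "db_name": env["APP_DB_NAME"],
    }
    return DB_CONFIG_TEMPLATE.substitute(values)


def write_db_config(path: str, env=None) -> None:
    """Write the rendered config to disk; intended to run once at first boot."""
    settings = os.environ if env is None else env
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_db_config(settings))
```

Everything else (packages, code, web server config) would already be in the image, which is what keeps launch times short when an Auto Scaling group replaces instances.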
Tip #3 Media content distribution
If at all possible, shift all of your content distribution off servers and on to S3.
Make sure you use Amazon CloudFront. It is a must for high-traffic and global sites: it caches your content at edge locations so that it reaches your website visitors quickly.
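Once your media lives in S3 behind CloudFront, your application needs to hand out CDN URLs instead of local paths. The Python sketch below shows one way that mapping could look. The domain cdn.example.com and the /media/ prefix are placeholders I made up, so substitute your own distribution domain and path layout.

```python
from urllib.parse import quote

CDN_DOMAIN = "cdn.example.com"  # placeholder for your CloudFront domain
MEDIA_PREFIX = "/media/"        # hypothetical local path prefix for media


def cdn_url(local_path: str, cdn_domain: str = CDN_DOMAIN,
            media_prefix: str = MEDIA_PREFIX) -> str:
    """Map a local media path to its CDN URL; leave other paths untouched."""
    if not local_path.startswith(media_prefix):
        # Not a media asset, so the application keeps serving it.
        return local_path
    object_key = local_path[len(media_prefix):]
    # quote() keeps "/" and percent-encodes characters such as spaces.
    return f"https://{cdn_domain}/{quote(object_key)}"


if __name__ == "__main__":
    print(cdn_url("/media/img/hero banner.jpg"))
    print(cdn_url("/admin/login"))
```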
Tip #4 Automation and Continuous Integration
If you automate your deployments and test code before it hits production, you ensure that any configuration changes across infrastructure and app are maintained in a code repository.
Start by ensuring you use Git, SVN or some other version control software. Also make use of either AWS-provided automation tools such as CloudFormation (an appropriate name given the storm) or third-party tools. My personal favorite is Ansible.
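To show what infrastructure in a code repository can look like, here is a small Python sketch that generates a CloudFormation template for a load-balanced Auto Scaling group spread across the three Sydney AZs. It is a starting point under assumptions, not a validated production template: the resource names, instance type, sizes and the ami-00000000 placeholder are all mine, and it uses the Classic Load Balancer and launch configuration resource types, which newer setups may want to replace.

```python
import json

SYDNEY_AZS = ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"]


def build_template(ami_id: str, instance_type: str = "t2.micro",
                   min_size: int = 2, max_size: int = 6) -> dict:
    """Return a CloudFormation template (as a dict) for a multi-AZ web tier."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: web tier spread across Sydney AZs",
        "Resources": {
            "WebLoadBalancer": {
                "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
                "Properties": {
                    "AvailabilityZones": SYDNEY_AZS,
                    "Listeners": [{
                        "LoadBalancerPort": "80",
                        "InstancePort": "80",
                        "Protocol": "HTTP",
                    }],
                    # Instances failing this check stop receiving traffic.
                    "HealthCheck": {
                        "Target": "HTTP:80/",
                        "Interval": "30",
                        "Timeout": "5",
                        "HealthyThreshold": "2",
                        "UnhealthyThreshold": "3",
                    },
                },
            },
            "WebLaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                # ImageId would be your golden image from Tip #2.
                "Properties": {"ImageId": ami_id, "InstanceType": instance_type},
            },
            "WebAutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": SYDNEY_AZS,
                    "LaunchConfigurationName": {"Ref": "WebLaunchConfig"},
                    "LoadBalancerNames": [{"Ref": "WebLoadBalancer"}],
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    # Replace instances the load balancer reports as unhealthy.
                    "HealthCheckType": "ELB",
                    "HealthCheckGracePeriod": 300,
                },
            },
        },
    }


if __name__ == "__main__":
    # "ami-00000000" is a placeholder, not a real image ID.
    print(json.dumps(build_template("ami-00000000"), indent=2))
```

Committing the generator (or the JSON it prints) to version control gives you a reviewable history of every infrastructure change.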
Tip #5 Monitoring
Availability and application performance monitoring gives your web team the tools to troubleshoot faster and become aware of problems before your users do. You can use CloudWatch, Kinesis for real-time data feeds, syslog, and third-party tools such as New Relic, Elasticsearch, Sumo Logic and Splunk to ensure uptime and visibility into what’s going on in your environment, from the network layer up to the web front end.
This is just a quick summary of things you can do on AWS to improve website stability. There is obviously more to it than a few tips and a simple checklist, but hopefully this gets you started. If you need some advice, please get in touch. I’m passionate about helping businesses succeed on AWS.
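To illustrate the idea behind availability monitoring, here is a tiny Python sketch of a health probe plus an alarm that fires only after several consecutive failures, so a single blip does not page anyone. It is a simplified stand-in for what the tools above do, not how any of them is implemented, and the threshold of 3 is an arbitrary assumption.

```python
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers URLError, HTTPError, timeouts and refused connections.
        return False


class ConsecutiveFailureAlarm:
    """Fires once `threshold` checks in a row have failed."""

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result and return True while the alarm is firing."""
        # A single healthy check resets the streak.
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold
```

In practice you would call probe() on a schedule, feed the result into record(), and notify someone when it returns True.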
Ben Fellows is Head of Engineering @ Teem. He leads the development of solutions for cloud automation that reduce the time it takes to provision applications on cloud infrastructure and reduce complexity with cloud automation.
Updated: Some issues encountered during the outage.
EC2 instances and EBS Volumes
Instances in a single Availability Zone were having connectivity issues. This means servers would have shown failed status checks in the AWS console and in CloudWatch monitoring.
EBS Volumes and Instances were affected by a loss of power. This caused the connectivity issues above.
EC2 (and not only EC2) API calls in the AP-SOUTHEAST-2 Region experienced increased error rates and latencies, as well as delays in the propagation of instance state data in the affected Availability Zone. This was a region-wide issue, mainly because the API calls are made against endpoints that are load balanced across all 3 Availability Zones in Sydney. Naturally, one AZ having issues with connectivity and new instance launches was going to raise API error rates.
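When API error rates rise like this, scripts and deployment tools that call the API cope better if they retry with a growing, randomised delay rather than failing on the first error. The AWS SDKs generally include their own retry logic, so the Python sketch below is only meant to show the general pattern. The function name and the default attempt and delay values are my own assumptions.

```python
import random
import time


def call_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5,
                      max_delay: float = 8.0, retry_on=(Exception,),
                      sleep=time.sleep):
    """Call operation(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Double the ceiling each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random wait up to the ceiling so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Pass a narrower tuple to retry_on in real code so that genuine bugs are not retried along with transient errors.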
RDS (Amazon Relational Database Service)
Because RDS is fundamentally EC2 with a DB management abstraction on top, it was also affected in the same AZ. This caused customers’ Single-AZ databases to lose connectivity (note that Multi-AZ would have meant automatic failover).
Amazon Redshift and ElastiCache
The Redshift service manages petabyte-scale databases and is another piece of technology that relies heavily on the underlying EC2/EBS infrastructure. ElastiCache was affected in the same manner as Redshift because of its underlying infrastructure. A number of clusters in both services were unavailable due to the power issue.
Other services affected
CloudHSM, VPC, Route 53 and CloudFormation were also affected.