Amazon Web Services Outage: Cloud Computing Proof of Concept


In late April 2011, a major outage struck Amazon's cloud computing network, Amazon Web Services (AWS), after failures at the company's Virginia-based facilities effectively shut down many businesses running on the platform. Opponents of the cloud computing paradigm seized on the outage as proof that cloud computing is bad for business.
Except it isn't, and the outage proved exactly the opposite.
Quick Look at How Cloud Computing Works
The idea behind cloud computing is to pool and distribute resources so that they are better managed and more fully utilized. By spreading resources across shared infrastructure and letting many users draw on them at once, costs per user drop.
There are two basic models for cloud computing: 'design for failure' and 'traditional.' Despite its name, the traditional model is being displaced by the design for failure (DFF) model. AWS was a mixture of both.
In traditional cloud architecture, the workload is spread among various systems, but those systems must all sit within a relatively small geographic area. This model puts most of the burden of availability on the infrastructure and the redundancy built into it. The downside is that when an area-wide failure happens, the traditional cloud is likely to go down with it.
In the DFF model, responsibility for availability is shifted from the infrastructure to a combination of software management and physical design. This allows single or multiple parts of the cloud to fail without destroying the cloud's availability or the applications and data on it. In the best DFF setups, data is mirrored (saved in copies) across multiple geographic locations to avert catastrophic failure and data loss.
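The mirroring idea behind DFF can be sketched in a few lines. This is a minimal illustration, not any real AWS API: the region names and in-memory "stores" are hypothetical stand-ins for geographically separate datastores.

```python
# Sketch of the DFF idea: mirror every write to several geographically
# separate locations and tolerate partial failure.
# Region names and in-memory "stores" are hypothetical stand-ins.

REGIONS = ["us-east", "us-west", "eu-west"]  # assumed region names
stores = {region: {} for region in REGIONS}  # stand-ins for real datastores

def replicated_write(key, value, quorum=2):
    """Write to every region; succeed if at least `quorum` copies land."""
    successes = 0
    for region in REGIONS:
        try:
            stores[region][key] = value  # a real system would call a remote API
            successes += 1
        except Exception:
            pass  # one failed region does not abort the whole write
    return successes >= quorum

def read_any(key):
    """Read from the first region that still holds the data."""
    for region in REGIONS:
        if key in stores[region]:
            return stores[region][key]
    return None

# Even if one region loses its copy entirely, the data survives elsewhere.
replicated_write("order-42", "shipped")
del stores["us-east"]["order-42"]  # simulate losing one region's copy
print(read_any("order-42"))  # → shipped
```

The point of the quorum is the trade-off DFF makes: availability comes from software deciding how many copies are "enough," not from any single facility staying up.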
Again, the Amazon system was based on a hybrid of both models.
The Five Levels of Redundancy
Five types of redundancy can exist in cloud computing, and having all of them is optimal. These five are: physical, virtual resource, availability zones, regions, and the cloud itself. Having redundant resources and facilities at all five levels means the cloud will stay stable even in a major shakeup. AWS didn't provide for all five. The Virginia outage showed that its regional redundancy was lacking and its physical backups of some data were inadequate.
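The five levels above read naturally as a checklist. The following sketch is purely illustrative: the level names and the example deployment are assumptions for demonstration, not a real audit of any provider.

```python
# Hypothetical checklist for the five redundancy levels described above.
REDUNDANCY_LEVELS = [
    "physical",           # redundant hardware within a facility
    "virtual_resource",   # redundant VMs and other virtual resources
    "availability_zone",  # multiple isolated zones within a region
    "region",             # multiple geographic regions
    "cloud",              # fallback to a second cloud entirely
]

def missing_levels(deployment):
    """Return the redundancy levels a deployment does not cover."""
    return [lvl for lvl in REDUNDANCY_LEVELS if lvl not in deployment]

# A made-up deployment that stops at availability zones, as many did in 2011.
example = {"physical", "virtual_resource", "availability_zone"}
print(missing_levels(example))  # → ['region', 'cloud']
```

A list like this is what the Virginia outage effectively produced for AWS customers: the gaps at the regional and physical-backup levels only became visible once a whole area failed.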
The trouble with DFF systems is that they must be designed from the ground up to be DFF systems. AWS was not designed this way because it began before DFF was mainstream. So, like many cloud computing platforms today, it was retrofitted rather than built from scratch around design for failure.
The Good News, Even for Amazon
The good news here is that the AWS failure proved the DFF model works well when applied correctly. For Amazon, the news remains good because the provider learned a valuable lesson and gained the opportunity to rebuild its system to be more failure-resistant.
Amazon has stated that it is now developing more dynamic systems to better allow for load balancing and redundancy between its Virginia and California networks. This should solve most of the problems exposed when the Virginia facility failed.
