Analyzing the Amazon Outage with Kosten Metreweli of Zeus

By Dan Kusnetzky


If organizations had done the right things, Amazon’s outage would have been a momentary irritation, not a disaster. Why didn’t Amazon customers have a plan “B”?

During the marketing feeding frenzy after Amazon’s small outage a while ago, I had the opportunity to speak with Kosten Metreweli, Chief Strategy Officer for Zeus, about what happened, how folks were hurt and what can be done to prevent such occurrences from causing pain in the future.

Even though Amazon offered ways for customers to set up shop in several different data centers, or Zones as Amazon calls them, many didn’t have plans that included the use of alternative data centers; backup and recovery of critical data; and methods to detect a failure and redirect traffic to other resources. Since the technology to manage such outages has been available for ages, why didn’t we see evidence of planning for an Amazon outage?

Here is a summary of some of the steps that could have prevented a great deal of the pain (and opportunity for suppliers to market their products and services):

  • Organizations could have taken a page out of the planning and operational processes they already use for mainframe, midrange and x86-based system workloads and hosted critical applications in several places, with a workload manager routing traffic to the systems having the most available capacity.
  • They could have done the same thing with cloud-based storage.
  • They could have routinely tested failover processes by “unplugging something” to see if their processes really worked.
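
To make the first and third points concrete, here is a minimal sketch of the idea rather than any vendor's actual product. It assumes a hypothetical `Site` record that tracks capacity and health, and hypothetical zone names; a real workload manager would get these figures from live health checks. The router sends each request to the healthy site with the most spare capacity, and the "unplug something" test is simply marking a site unhealthy and checking that traffic moves.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One hosting location (hypothetical model, not an Amazon or Zeus API)."""
    name: str
    capacity: int          # total request slots the site can serve
    in_use: int = 0        # slots currently occupied
    healthy: bool = True   # result of the latest health check

    @property
    def available(self) -> int:
        return self.capacity - self.in_use


def pick_site(sites):
    """Return the healthy site with the most available capacity, or None."""
    candidates = [s for s in sites if s.healthy and s.available > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.available)


def route(sites):
    """Assign one request to the best site and return that site's name."""
    site = pick_site(sites)
    if site is None:
        raise RuntimeError("no healthy site with spare capacity")
    site.in_use += 1
    return site.name


# Two sites with made-up load figures.
sites = [
    Site("zone-a", capacity=10, in_use=2),
    Site("zone-b", capacity=10, in_use=5),
]

first = route(sites)       # zone-a has more spare capacity, so it is chosen

# Failover drill: "unplug" zone-a and confirm traffic is redirected.
sites[0].healthy = False
second = route(sites)      # only zone-b is healthy now

print(first, second)
```

The point of the drill at the end is that the failover path gets exercised on purpose, before an outage forces the issue.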

Since we heard so many stories about companies losing access to critical applications and data during Amazon’s outage, it is clear that the IT planners must not have been involved in the use of Infrastructure as a Service products.

  • This may have been because business decision makers, not IT, made the choice to use Amazon and didn’t ask for help. They may have purposely gone around IT to get things done that were always “two years” away in IT’s development plans.
  • It may also have been because they didn’t know how to read Amazon’s terms and conditions. They might have believed that Amazon was going to do more to back up their data and have disaster recovery plans and procedures in place, even though the Ts and Cs state those things are a customer’s responsibility (unless the customer purchases specific Amazon services).
  • They may not have known anything about redundancy, workload management, backup servers, multi-tier storage and the like, and so didn’t think about them.

All in all, this incident showed that Cloud computing environments, like on-premises IT infrastructure, need to be carefully architected, implemented and operated. Tools such as those offered by Zeus and others should have been baked in from the beginning.