Talk about a cloud raining on your parade. Early Thursday morning, around 1:41am PDT, Amazon suffered a major outage in one of the datacenters that serves its EC2 (Elastic Compute Cloud) service. Many people woke up to find popular services such as Reddit and Foursquare out of commission. As of 6:00pm PDT, many of the impacted services were still either down completely or operating in a degraded state. Just in time for a good Easter analogy, this is the danger of putting all your eggs in one basket.
Many companies outsource their data infrastructure, from smaller outfits like Quora and SCVNGR (both taken down by the EC2 outage) to Fortune 500 businesses. When done right, it’s a great deal for a company: the business avoids the operating costs and capital purchases of running its own servers, it can scale easily when needed by simply purchasing additional cloud resources, and it doesn’t need extra support personnel for maintenance. The service provider (Amazon in this case) guarantees a level of service, in this case 99.95% uptime (which equates to 4.38 hours of downtime per year), and builds its infrastructure to support that guarantee. That should include plenty of resilience and disaster recovery options on Amazon’s part so that it can keep the service up.
On the surface, it doesn’t look like Amazon built in the proper redundancies to support its 99.95% guarantee, since this outage alone has already blown well past the 4.38-hour annual allowance. The miss will mostly incur service penalty payouts to the impacted companies, but most likely those are built into Amazon’s cost model as well. There’s a delicate balancing act many companies perform when setting contracted Service Level Agreements. They might promise 99.99% but only build out to 99.95% to save money, putting some of the savings aside to pay for potential SLA misses like this one. It gets REALLY expensive for a business to add a “9” to its reliability numbers, so in some cases it’s cheaper to just pay for the SLA miss than to build the infrastructure.
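To put those percentages in perspective, here’s a quick sketch (my own illustration, not anything from Amazon’s SLA documents) converting an uptime guarantee into an annual downtime budget. It shows just how little slack each extra “9” leaves:

```python
def annual_downtime_hours(sla_percent: float) -> float:
    """Convert an uptime SLA percentage into allowed downtime per year."""
    hours_per_year = 365 * 24  # 8760 hours, ignoring leap years
    return hours_per_year * (1 - sla_percent / 100)

for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% uptime -> {annual_downtime_hours(sla):.2f} hours of downtime/year")
```

Running this shows 99.95% allows about 4.38 hours a year, while 99.99% allows under an hour, which is why each additional nine costs so much to engineer for.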
This might make people look more closely at the consumer cloud services coming into the limelight these days, such as Amazon’s Cloud Player and Cloud Drive offerings. There are great benefits to cloud storage and computing, but in the event of an outage the consumer is left high and dry with no way to access anything in the cloud. We also place a large amount of trust in those services, assuming that when they do come back up after a crash, all of our data will still be there. What would you do if you put all your music into Amazon Cloud Drive, a crash like this one took the drive down for a period of time (separating you from your music and other files), and when the drive came back all your files were gone? As people start to forgo local storage for the virtual external hard drive that services like Amazon Cloud Drive and Dropbox provide, that could be a real problem.
So, lesson learned today: don’t put all of your resources into one cloud. Disaster recovery options are like an insurance policy; you have to measure the cost of additional redundancy against the risk and cost of an extended outage. If you’re a company outsourcing your network infrastructure, ask your provider what disaster recovery plans and contingencies it has in place to support you in the case of a datacenter crash. Make sure you know the service levels you signed up for, and consider even using multiple cloud providers, balancing or redirecting services in the case of a crippling event like this.