Opinion on the Amazon S3 Outage; Checklist for Dealing with Outages

My journalist colleagues at Wired.com published some of my comments related to Amazon S3.[1] Wired also posted another article titled “Customers Shrug Off S3 Service Failure.” I agree with the views of many of the customers expressed in the article. Don MacAskill, CEO of the popular photo hosting site Smugmug, wrote an understanding post about it.

Throughout my career working for media companies, I’ve firmly held the belief that the uptime, reliability, performance, scalability and security of commercial Web sites are of paramount importance. When sites that I’ve been responsible for have had issues, my colleagues and I have given our personal time and energy to resolving them. With my teams, I spend considerable time on proactive measures. I’ve had the honor of working closely with and learning from some who do an excellent job running technology operations.

Experience has taught me that things can and sometimes do go wrong. Sometimes calculated risks don’t pan out. Sometimes mistakes cause problems. We are human. We should strive for perfection; we can get close to it, but not fully attain it. We should be prepared for such scenarios. When they happen, we should work diligently and expeditiously on resolution and communicate frequently and honestly with stakeholders and customers. Such communications during the incident should include:

During-Incident Communication Checklist

  • Current status
  • What is the full impact?
  • Estimated time to resolution
  • Any recommended workarounds until resolution, if practical
  • Assurance that it is being worked on
    • It often helps to mention who is working on it and what they are doing
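
For teams that send these updates often, the checklist above can be turned into a small template so that no item gets forgotten under pressure. The following is only a minimal sketch in Python: the names `IncidentUpdate` and `render_update` and all field names are hypothetical, invented here for illustration, and the wording of the output should be adapted to your own organization.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentUpdate:
    """One during-incident status update (hypothetical structure)."""
    current_status: str
    full_impact: str
    # None means there is no estimate yet; say so rather than guess.
    estimated_resolution: Optional[str] = None
    workarounds: List[str] = field(default_factory=list)
    # Who is working on it and what they are doing.
    responders: List[str] = field(default_factory=list)


def render_update(update: IncidentUpdate) -> str:
    """Render the update as plain text, one checklist item per section."""
    lines = [
        f"Current status: {update.current_status}",
        f"Impact: {update.full_impact}",
        "Estimated time to resolution: "
        + (update.estimated_resolution or "not yet known"),
    ]
    # Workarounds are listed only when there are practical ones to offer.
    if update.workarounds:
        lines.append("Recommended workarounds:")
        lines.extend(f"  - {item}" for item in update.workarounds)
    lines.append("This is being actively worked on.")
    lines.extend(f"  - {item}" for item in update.responders)
    return "\n".join(lines)


if __name__ == "__main__":
    example = IncidentUpdate(
        current_status="Investigating",
        full_impact="Image uploads are failing for all users",
        workarounds=["Retry uploads after the incident is resolved"],
        responders=["Operations team: reviewing storage error logs"],
    )
    print(render_update(example))
```

Making the estimate optional is deliberate in this sketch: an honest “not yet known” is better than an invented time.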

The post-incident communications to stakeholders and customers should include:

Post-Incident Communication Checklist

  • Summary
  • What happened, and how and why it happened
    • Including full description of all impact
    • Do not blame[2] third parties or say things like “beyond our control”. A technology leader takes responsibility equally for both insourced and outsourced products and services.[3]
  • How it was resolved
    • Whether the resolution is temporary or long-term
  • Next steps
  • Plan for preventing this and similar incidents from happening again, or at least minimizing their likelihood
  • Thank all those who helped resolve the incident, and thank the customers for their understanding
  • Mention the monetary credits you plan to give as per the Service Level Agreement (SLA)
    • Specify any additional ‘make goods’ or returns you plan to make to the customers above and beyond the credits as per SLA, if appropriate.
  • Double-check each recipient’s email address to make sure you are sending this memo, which may contain confidential information, to the correct person and not to someone else with a similar name in your address book. You don’t want your memo published on Gawker.
  • Speaking of Gawker, in the event someone does leak your memo beyond the intended recipients, take care not to say anything in it that would be an embarrassment. That’s another reason to be honest, own the problem and solution, and not pass the blame.

Stakeholders and customers here refer to internal customers of the technology operations team (e.g. the concerned folks in editorial, marketing, sales, finance, legal and other departments). External communications to the public should be handled in consultation with legal and public relations.

S3’s outage (or any outage) isn’t to be taken lightly, but I have faith Amazon and their customers will learn from it.

Disclaimers:

  • As explained in the terms of use of this site, any opinions expressed on my personal Web site do not reflect those of any employer, past or present. My Web site and I in my personal life neither represent nor speak for any corporation.
  • I have no affiliation, financial or otherwise, with Amazon.com. I happen to be a user of their products and services, some of which I like and some of which I don’t.
  • Personal Web sites like this are exempt from the performance requirements of corporate Web sites 🙂 My personal Web site is for expressing, learning and R&D. It also happens to be hosted on Amazon EC2 and S3.

Footnotes:

  1. Silicon Alley Insider and ValleyWag have amusing spins on it. 🙂
  2. There may be extreme instances, especially when criminal activity or malicious wrongdoing was the cause, where it would be appropriate to blame someone.
  3. It is OK to mention service providers or describe external events when explaining what happened, but don’t do it in an “it was their fault, not ours” tone. The technology leader should factually describe what happened and take responsibility.

By Rajiv Pant

Rajiv Pant राजीव पंत 潘睿哲