The Outage Aftermath

Greg Thomas

--

The hardest part of an outage isn’t the moment when things are going wrong, customers are calling you non-stop, and your app is unresponsive to anyone and anything that breathes on it.

Believe it or not, this is the easy part — primarily because everyone who has ever been near an emergency of any kind in their life knows what to do.

  1. Stop the Bleeding.
  2. Triage.
  3. Patch.
  4. Next.

In software speak…

  1. Reduce the impact on other customers.
  2. Identify what we have to do now to get through this.
  3. Make and apply patches to the system.
  4. Move on to the next issue that you know is going to arise and hit you smack dab in the face the moment you deploy the first patch.

And that is the easy part: everyone is gathered around, they have been told to drop everything and come together as a team, and each person works on whatever area they have been assigned. Triage, patch, next, move on.

The hard part comes after you’ve deployed that last fix. Your customers have started to calm down, the cases are dropping, things are getting back to normal, and you are trying to figure out exactly what happened, what went wrong, and, most importantly, what you need to do to keep this from happening…

--

Greg Thomas

Software Architect, Developer, Author and Leader helping organizations build scalable software delivery teams and implement cloud-based solutions