How to properly manage critical systems

It never fails to happen… something always fails.

Even though everyone fights to avoid unplanned outages and incidents, a solid grasp of how to deal with them can be extremely beneficial. With proper planning and forethought, you can mitigate the overall impact and maintain control of the situation through the toughest of times.

Some companies constantly have problems and suffer from them unnecessarily. Regular outages are unacceptable, but if you adopt a few key principles and design your systems properly, the few service incidents you do have will be much less intrusive on your operations.

Planning

Step 1. Know your weaknesses

It’s impossible to plan for every circumstance and impossible to put contingencies in place to counter them all. Google and Microsoft each spend billions of dollars a year, and even they still have outages and incidents. Instead, focus on the technological aspects of your operations that are critical to maintaining service, and on what could affect them.
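One way to decide where to focus is to list the weaknesses you know about and rank them. The sketch below is only an illustration: the `Weakness` class, the 1 to 5 scales, the likelihood-times-impact score and the sample entries are all assumptions, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Weakness:
    """A hypothetical entry in a list of known weak points."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (service down)

    @property
    def score(self) -> int:
        # Simple likelihood x impact score; other weightings are possible.
        return self.likelihood * self.impact


def rank_weaknesses(weaknesses):
    """Return the weaknesses ordered from highest to lowest score."""
    return sorted(weaknesses, key=lambda w: w.score, reverse=True)


# Made-up example entries, for illustration only.
inventory = [
    Weakness("single database server", likelihood=2, impact=5),
    Weakness("expiring TLS certificate", likelihood=3, impact=4),
    Weakness("office printer", likelihood=4, impact=1),
]

for w in rank_weaknesses(inventory):
    print(f"{w.score:>2}  {w.name}")
```

The numbers are subjective estimates. The point is to force a conversation about which failures would actually interrupt service.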

Step 2. Write it out

Not every company has the budget to deploy lots of precautionary tools and services to mitigate its weaknesses. What costs nothing and will be invaluable in minimizing the impact? Well-thought-out documentation outlining how to respond to various outages, incidents and downtime affecting mission-critical systems. Writing this down in advance ensures that when something does go wrong, everyone on your team knows what they should be doing.

Step 3. Be proactive

Don’t wait for things to happen and then react to them. By actively keeping track of vulnerable services and processes, you can stop incidents before they happen.
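As a rough illustration of keeping track of vulnerable services, the sketch below compares current readings against warning limits set below the point of failure, so someone is told before the service actually breaks. The metric names, the limits and the `find_warnings` helper are hypothetical. In practice the readings would come from whatever monitoring you already run.

```python
def find_warnings(readings, warn_limits):
    """Return (name, value, limit) for every reading at or above its warning limit.

    Readings without a configured limit are ignored.
    """
    warnings = []
    for name, value in readings.items():
        limit = warn_limits.get(name)
        if limit is not None and value >= limit:
            warnings.append((name, value, limit))
    return warnings


# Assumed warning limits, chosen to fire well before a hard failure.
warn_limits = {"disk_used_pct": 80, "memory_used_pct": 85}

# Made-up readings, for illustration only.
readings = {"disk_used_pct": 91, "memory_used_pct": 62, "queue_depth": 40}

for name, value, limit in find_warnings(readings, warn_limits):
    print(f"WARNING: {name} is {value}, warning limit is {limit}")
```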

Responding

Step 1. Don’t panic

As bad as it may seem, there is absolutely nothing to be gained by reacting emotionally to an outage. Taking the time to slow things down will keep you from spinning your wheels. You’re in trouble, but you have a plan to deal with it. This is the time to calmly implement the documented responses and start moving forward.

Step 2. Assess the situation

It’s extremely important to understand what actually happened. Understanding the cause, the source and the trigger of the incident is imperative to stopping it from happening again. A defined process documented in a checklist makes it easy to run through what’s happening and to communicate it to others in an incident postmortem.
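The checklist described above can be as simple as a record with one slot per question. The sketch below is a minimal example under assumed names: `IncidentAssessment` and its methods are hypothetical, and the three fields mirror the trigger, source and cause mentioned in the text.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentAssessment:
    """Hypothetical checklist: one field per question to answer."""
    trigger: Optional[str] = None  # what set the incident off
    source: Optional[str] = None   # where it originated
    cause: Optional[str] = None    # why it was able to happen

    def unanswered(self):
        """Names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def summary(self):
        """Plain-text summary suitable for pasting into a postmortem."""
        lines = []
        for f in fields(self):
            value = getattr(self, f.name) or "UNKNOWN"
            lines.append(f"{f.name}: {value}")
        return "\n".join(lines)


# Made-up incident, for illustration only.
assessment = IncidentAssessment(trigger="configuration change deployed")
print(assessment.unanswered())
print(assessment.summary())
```

Keeping unanswered items visible as "UNKNOWN" makes it obvious in the postmortem which questions still need investigation.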

Step 3. Learn from it

Regardless of whether the incident was resolved in a timely fashion or caused lasting damage, you still need to take a hard look at how you responded. Was there any point when things felt out of control or hopeless? Did it affect your staff or clients needlessly? Can you become more efficient in how incidents get handled in the future?

If you follow the steps above and continually learn and adapt, you can keep the same incident from happening again.