At the end of February, Amazon’s cloud service, AWS, had a major outage. The outage took down some very high-profile sites, including Netflix, Reddit, Quora, Medium and even some government sites.
Overall, AWS is extraordinarily reliable, but no service has 100% uptime, so it’s not surprising that AWS eventually had an outage. What is surprising is the initial cause of the outage, and how long recovery took.
The outage itself was caused by simple human error. An engineer working on a bug in the billing system took more servers offline than were needed. Unfortunately, like a set of dominos, the additional servers going down took more, and then more.
But that’s not where the story ends. Because so much of the system went down, the systems required full restarts to recover. It was these full restarts that had the systems down for multiple hours.
Amazon has said that they will be adding in additional safeguards to prevent this kind of issue from occurring again, certainly at this magnitude. They are taking the healthy and most appropriate path out of the problem. They are looking closely at the issues and finding ways to correct them so they never end up in the same place again.
And while Amazon should be praised for facing the problems that set them up for the massive outage, they are Amazon, the premier cloud service provider with a global presence. Thanks to their near-constant uptime, they can take a hit like this and not see significant damage to their bottom line.
In the same situation, though, you may not be so lucky. A time like this is the perfect time to think about your own system maintenance and outage plans.
Less than Disaster
Typically when we in IT talk about outage planning, we talk about disaster recovery. Don’t get me wrong, disaster recovery is a good thing and a worthwhile investment. If you don’t have a disaster recovery plan, you’re rolling the dice.
But we usually equate disaster recovery with exactly that – a disaster. Hurricanes, earthquakes, massive cyberattacks that cost you terabytes of data – these are the kinds of things mentioned in many disaster recovery documents and presentations.
Not to be dramatic, but for smaller companies, and even mid-sized enterprises, being down for a few hours at the wrong time is a disaster for your organization. If you’re a retail organization and you go down for half a day on Cyber Monday, well, that’s a disaster. If you’re a university and your registration system goes down just as registration opens, that’s a disaster.
Any outage during a peak business time can mean significant trouble. It’s why large IT organizations have blackout periods for new software releases during critical business periods. It’s not worth the risk.
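A blackout period can be enforced with something as simple as a date check in the release pipeline. The following is a minimal sketch, not a prescribed implementation: the window dates and the names `BLACKOUT_WINDOWS` and `release_allowed` are hypothetical and would come from your own business calendar.

```python
from datetime import date

# Hypothetical blackout windows (inclusive start and end dates).
# Real windows would be taken from your organization's business calendar.
BLACKOUT_WINDOWS = [
    (date(2017, 11, 20), date(2017, 11, 28)),  # e.g. a retail peak week
    (date(2017, 8, 14), date(2017, 8, 18)),    # e.g. a registration week
]


def release_allowed(day, windows=BLACKOUT_WINDOWS):
    """Return True if `day` falls outside every blackout window."""
    return not any(start <= day <= end for start, end in windows)
```

A deployment script could call `release_allowed(date.today())` and refuse to proceed when it returns `False`, leaving any override to an explicit human decision.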
And that’s what we’re really talking about here, risk management versus disaster recovery. For instance, think for a minute about driving your car. Risk management is like obeying traffic laws and driving defensively. You’re doing what you can to avoid getting into an accident. Disaster recovery is like car insurance. When the unexpected happens, you’re glad you have it.
Having a disaster recovery plan but not a risk management plan is like driving recklessly all the time because you have car insurance.
Managing the Risks
Depending on the needs of your organization, risk management can mean a number of things. That’s because it’s specific to your business risks. The retail organization in the example above will share some risks with the university and face others that are entirely its own.
What’s really important when looking at risk management is acknowledging that it’s a process that involves problem identification, fixing what you can and planning for what you can’t.
Those in an ITIL or COBIT managed organization are probably familiar with the problem identification piece, and have likely even participated in fixing some issues. But organizations shouldn’t stop there and hope that they’ll never have to deal with the problems they can’t fix.
Let’s take a quick look at an admittedly forced example.
In our example, your teams are evaluating the potential risks of their systems going down. They identify a system that takes 5 hours to completely reset, based on all of the server dependencies and additional processes needed to restart everything involved with that system. This is the identification phase.
The teams go through and find ways to reduce that reset time by removing out of date dependencies and better aligning parts of the systems that can be reset in tandem. Maybe they need to update old software or perform patches that were slowing things down. They have fixed part of the risk.
Suppose that work cuts the reset time from 5 hours to 3. That still leaves a 3-hour window where your system may be unavailable in the event of an unexpected system restart. Maybe that’s fine if it’s in the middle of the night. But that never seems to be when critical systems go down.
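To make the reset-time arithmetic concrete, here is a rough sketch of how a team might compare restarting everything one piece at a time against restarting independent pieces in tandem. The component names, durations and dependencies are invented for illustration, and the code assumes the dependency map has no cycles.

```python
from functools import lru_cache

# Hypothetical restart durations in hours for each component.
RESTART_HOURS = {
    "database": 1.0,
    "cache": 1.0,
    "auth": 1.0,
    "reporting": 1.0,
    "api": 1.0,
}

# Hypothetical dependencies: a component restarts only after these are up.
DEPENDS_ON = {
    "cache": ["database"],
    "auth": ["database"],
    "reporting": ["database"],
    "api": ["cache", "auth"],
}


def sequential_restart_hours(durations):
    """Total time if every component is restarted one after another."""
    return sum(durations.values())


def tandem_restart_hours(durations, depends_on):
    """Total time if components restart as soon as their dependencies are up.

    This is the longest chain of dependent restarts (the critical path).
    Assumes the dependency map is acyclic.
    """
    @lru_cache(maxsize=None)
    def finish_time(name):
        deps = depends_on.get(name, ())
        start = max((finish_time(dep) for dep in deps), default=0.0)
        return start + durations[name]

    return max(finish_time(name) for name in durations)
```

With these made-up numbers, `sequential_restart_hours(RESTART_HOURS)` gives 5.0 hours and `tandem_restart_hours(RESTART_HOURS, DEPENDS_ON)` gives 3.0 hours. The remaining 3 hours is the longest dependency chain, which is the part that reordering alone cannot shorten and that the workaround planning has to cover.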
Some companies stop here and just assume they have done what they can. Instead, take the time to consider any potential workarounds. Is there another system that can take up the slack? Can customers be offloaded to your call center during the outage? These may be quick and easy ways to address the downtime.
Perhaps it’s a more critical system than that. If you’ve got regional redundancy in your systems, through AWS for instance or even your own, private network, can the workload be shifted to the same system in another region? It might be slow, but slow is better than down. Think through your alternatives, including redundancy, to identify issues on systems that are critical for business continuity.
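The regional fallback idea can be sketched in a few lines. This is only an illustration under assumptions: the `Region` type, the region names and the latency figures are hypothetical, and a real setup would rely on health checks and routing from your cloud provider or load balancer rather than a hand-rolled function.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    """Hypothetical view of one regional deployment of the same system."""
    name: str
    healthy: bool
    latency_ms: float


def pick_region(regions: Sequence[Region], primary: str) -> Optional[Region]:
    """Prefer the primary region; otherwise fall back to the fastest healthy one.

    Returns None when no region is healthy, which is the point where the
    disaster recovery plan takes over.
    """
    healthy = [region for region in regions if region.healthy]
    for region in healthy:
        if region.name == primary:
            return region
    # Slow is better than down: accept a higher-latency region if it is up.
    return min(healthy, key=lambda region: region.latency_ms, default=None)
```

For example, if a hypothetical `"east"` primary is unhealthy while `"west"` (80 ms) and `"europe"` (140 ms) are up, `pick_region` returns the `"west"` entry.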
The last piece of risk management is as important as the first three. When an outage happens, take the time to do a root cause analysis, much like AWS did with their systems. Understand what went wrong and look at what can be fixed or what checks can be put in place to prevent that problem from happening again. And then implement those fixes. It might seem overwhelming at first, but over time it will become part of your regular workflow.
Fixing problems associated with risk might seem like adding more burden to your already overloaded IT teams. Bringing in a partner that can work through remedies to your biggest issues can relieve some of the stress on your teams, while still providing your organization the protection it needs to keep the business running smoothly. Regardless of how you cope with the additional work, risk management is one of the most important steps you can take to ensure your business can effectively operate when the inevitable happens.
The post How to Avoid an Epic Outage Like AWS’s originally appeared on the Curotec Blog