I remember a time when I was running data centers and had absolute responsibility and accountability for critical business systems. I also remember some cringe-worthy times when things would go down and I would be very quickly introduced to the cost of downtime. Dollars and cents, not to mention reputation, forever lost because of what amounted to human errors. Notice that I didn’t blame the technology.
We in the technology sector often spout off about all of our features and capabilities for redundancy, availability, security, automation, and the like. We are so proud of our terms and capabilities that we often neglect to look at what it really means for a business to implement these capabilities in its operations and culture.
On August 8, 2016, Delta Air Lines had a major computer system failure that blacked out flights and caused issues for many days after. The financial impact of this outage goes beyond the cost to Delta and its ability to serve customers; it also includes the impact on customers missing flights, cargo not making it out, and on and on.
In an interesting article, “The Real Reason Airline Computers Crash,” Thom Patterson from CNN notes that the real reason this kind of thing happens is “good old-fashioned screw-ups.” And I couldn’t agree more. This happens because the massive complexity of enterprise systems, infrastructure, and the ecosystem of siloed services has made it almost impossible to test systems and build contingency plans.
In the article, Patterson cites “airline experts” who state that there are three reasons why systems go down: no redundancy, hacking, and human error. I think it’s a little naïve to choose just those three, so I want to caveat something: every system today has the capability to have redundancy or high availability. The problem comes from not using it, or not testing it – or cutting budget and human resources.
So now that we’ve nailed that culpability down, what do we do about it? I would argue that some current trends in IT are contributing to these kinds of catastrophic events. Let’s talk financial first: our pursuit of the lowest-cost systems and rampant commoditization is pushing once-reliable vendors into the dumps on quality and interoperability. The cloud is offering infrastructure SLAs and hiding culpability for reliability: “It’s no longer my fault, it’s my vendor’s fault.” And in IT, the desperate adherence to “the way we’ve always done it” stifles innovation from product choice through operational controls.
Let me tell you how I know this. I worked for a rocket motor manufacturer where the disaster plan was “Run and don’t look back or your face might catch on fire, too.” I had to test data recoverability every week, test system recoverability for critical systems every month, and run full data center failure drills every year. At least, that’s what my process docs told me. Did I do it? Of course I did! We would call an actual failure a “drill” and so on. Am I admitting to anything so horrible that my former employer should ask for my paycheck back? Heck no! Time, resources, money, and impact to production are real reasons why you don’t “play” with systems to imagine a failure. Failures are all too real and close at hand.
Before we had ITIL and DevOps initiatives, we had CMMI and OSHA regulations, combined with oversight by the Missile Defense Agency and SARBOX. My auditors would give me a week to show them validation of process, and I could show them full datasets within minutes from my desk. I’m sure Delta had similar oversight. So how does a bona fide disaster happen when everyone can state a process, can execute to a process, and has all the tools to have near-zero-downtime systems?
The article concludes that for airlines to avoid these things in the future, “They can install more automated checkup systems. They can perform emergency drills by taking their systems offline during slow periods and going to their secondary and backup systems to make sure they are working properly.”
I need to call these “experts” out. There is no “automated checkup system.” That is a fantasy that holds only for homogeneous environments, the kind that “all-in-one” converged and hyper-converged systems tout. There are too many interconnects and systems outside the purview of these mini wonder-boxes.
I do like the emergency drills concept, but how would one do this without impacting production? I know how I can do it, but I also know that many of my prospects have tools from their preferred vendors that “almost” do this today, yet can’t quite test DR in systems mimicking production. One prospect told me that they had 4 petabytes of DR redundancy and no idea whether they could actually recover to it if push came to shove.
Note that the experts in the article stated, “during slow periods.” There are no slow periods. The Delta disaster occurred at 2:30 in the morning – it shouldn’t get more “slow” than that! Airlines and most modern businesses run 24/7 with almost zero slow-down. Legacy systems typically have to be shut down to test what would happen if a system failed.
If Delta Air Lines called me and asked me to share my experience and offer suggestions, I would tell them this:
- Review your technology with an open mind and heart: does it do what you think it does? If not, look for a new vendor and a new way to accomplish what you need. The old relationships won’t save you when disaster strikes because the commission check has been cashed.
- Look for a platform, not just a tool suite, that commoditizes the technology by optimizing interoperability and consistency in operations.
- Find a platform that will allow testing of data recovery, system recovery, service recovery, and data center recovery all without impacting production. If you can’t test your DR, you can’t provide an SLA that is real.
- Use an operational platform that takes historical and real-time data and correlates it through modern analytics to find weak spots and utilization patterns to build appropriate SLAs and understand the real requirements of your business.
- Find an IT analytics platform that can correlate across ALL infrastructure systems, including cooling, power, and security, to get a holistic view of your environment.
- Don’t offload responsibility to the lowest-cost provider without understanding the underpinnings of their environment as well. If you can’t find a good DR plan for your object store, there is a good chance your cloud vendor can’t either.
- Replication and redundancy are NOT high availability. Replication, even synchronous replication, still means a failover. Redundancy means you can fail twice or more before the dark settles in. If you need to be always-on, then you need an always-on technology.
- Finally, if you have a tool, use it. If the tool doesn’t work, find a new tool. I can’t tell you how many times I have heard from friends and customers, “We standardized on this tool, but it doesn’t work. Can you fix it?” Drop the politics and admit when choices weren’t so great. Get the right solution in place to support your business, not your aspirations and relationships.
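One way to keep the "test your DR" advice honest is to track when each kind of recovery test last passed and flag anything that has gone stale. The sketch below is only an illustration: the drill names, the `DrillRecord` type, and the `overdue_drills` function are hypothetical, and the intervals simply mirror the weekly, monthly, and yearly cadence I described from my own process docs. Adjust them to whatever your policy actually says.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumed cadence (hypothetical names): weekly data recovery tests,
# monthly system recovery tests, yearly full data center failure drills.
CADENCE = {
    "data_recovery": timedelta(days=7),
    "system_recovery": timedelta(days=31),
    "datacenter_failover": timedelta(days=365),
}


@dataclass
class DrillRecord:
    kind: str                    # must be a key in CADENCE
    last_passed: Optional[date]  # None means the drill has never passed


def overdue_drills(records: List[DrillRecord], today: date) -> List[str]:
    """Return the kinds of drill whose last successful run is older than
    the allowed interval, or that have never passed at all."""
    overdue = []
    for record in records:
        limit = CADENCE[record.kind]
        if record.last_passed is None or today - record.last_passed > limit:
            overdue.append(record.kind)
    return overdue
```

A report like this does not prove that you can recover. It only tells you which parts of the insurance policy have not been exercised recently.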
If nothing else, seeing a true system outage happen to a massive company like Delta should frighten the living daylights out of everyone else. We buy technology as if it is an insurance policy where we only find out our levels of coverage when a disaster happens. Inevitably, someone pretty low on the totem pole will get fired over what happened to Delta’s systems. Some poor schmoe hit a button, flipped a switch, blew a circuit. Or even crazier, someone was testing the system at a “slow” period and found a critical problem in what was an otherwise well-architected system. We might never know for sure. But the call to action for the rest of us sitting in the airports writing blogs while waiting for our flights is this: rethink everything. Break the mold of buying technology that serves itself. Find and implement technology that serves your business. And most importantly: test that insurance policy regularly so that you know what your exposures are when the worst comes to pass. Because if it happened to Delta, chances are, it will happen to you.
About the Author
Peter McCallum has been FalconStor’s VP of Datacenter Solutions since 2011. He has held various leadership roles in enterprise data center optimization, business development, strategic planning, and service delivery. Peter leverages over 20 years of system administration and infrastructure architecture experience to develop Software-Defined Storage and Cloud offerings, including pre-sales design tools, pricing, marketing strategies, and product development requirements. In addition, he provides guidance across a global sales team in storage virtualization, multi-site and cloud disaster recovery and business continuity solutions. Peter holds a B.S. degree from Penn State University and currently lives in Texas with his family.