Lori MacVittie, senior technical marketing manager at F5 Networks (www.f5.com), says:

…Please Press 1 for Product, 2 for OS, 3 for Hypervisor, or 4 for Management Troubles.



When there’s a problem with a virtual network appliance installed in “the cloud”, who do you call first?

An interesting thing happened on the way to troubleshoot a problem with a cloud-deployed application – no one wanted to take up the mantle of front line support. With all the moving parts involved, it’s easy to see why. The problem could be with any number of layers in the deployment: operating system, web server, hypervisor or the nebulous “cloud” itself. With no way to know where it is – the cloud has limited visibility, after all – where do you start?

Consider a deployment into ESX where the guest OS (hosting a load balancing solution) isn’t keeping its time within the VM. Time synchronization is a Very Important aspect of high-availability architectures. Synchronization of time across redundant pairs of load balancers (and really any infrastructure configured in HA mode) is necessary to ensure that a failover even isn’t triggered by a difference caused simply by an error in time keeping. If a pair of HA devices are configured to failover from one to another based on a failure to communicate after X seconds, and their clocks are off by almost X seconds…well, you can probably guess that this can result in … a failover event. Failover events in traditional HA architectures are disruptive; the entire device (virtual or physical) basically dumps in favor of the backup, causing a loss of connectivity and a quick re-convergence required at the network layer. A time discrepancy can also wreak havoc with the configuration synchronization processes while the two instances flip back and forth.

So where was the time discrepancy coming from? How do you track that down and, with a lack of visibility and ultimately control of the lower layers of the cloud “stack”, who do you call to help? The OS vendor? The infrastructure vendor? The cloud computing provider? Your mom?

We’ve all experienced frustrating support calls – not just in technology but other areas, too, such as banking, insurance, etc… in which the pat answer is “not my department” and “please hold while I transfer you to yet another person who will disavow responsibility to help you.” The time, and in business situations, money, spent trying to troubleshoot such an issue can be a definite downer in the face of what’s purportedly an effortless, black-box deployment. This is why the black-box mentality marketed by some cloud computing providers is a ridiculous “benefit” because it assumes the abrogation of accountability on the part of IT; something that is certainly not in line with reality. Making it more difficult for those responsible within IT to troubleshoot and having no real recourse for technical support makes cloud computing far more unappealing than marketing would have you believe with their rainbow and unicorn picture of how great black-boxes really are.

The bottom line is that the longer it takes to troubleshoot, the more it costs. The benefits of increased responsiveness of “IT” are lost when it takes days to figure where an issue might be. Black-boxes are great in airplanes and yes, airplanes fly in clouds but that doesn’t mean that black-boxes and clouds go together. There are myriad odd little issues like time synchronization across infrastructure components and even applications that must be considered as we attempt to move more infrastructure into public cloud computing environments.

So choose your provider wisely, with careful attention paid to support, especially with respect to escalation and resolution procedures. You’ll need the full partnership of your provider to ferret out issues that may crop up and only a communicative, partnership-oriented provider should be chosen to ensure ultimate success. Also consider more carefully which applications you may be moving to “the cloud.” Those with complex supporting infrastructure may simply not be a good fit based on the difficulties inherent not only in managing them and their topological dependencies but also their potentially more demanding troubleshooting needs. Rackspace put it well recently when they stated, “Cloud is for everyone, not everything.”

That’s because it simply isn’t that easy to move an architecture into an environment in which you have very little or no control, let alone visibility. This is ultimately why hybrid or private cloud computing will stay dominant as long as such issues continue to exist.