– Lori MacVittie, senior technical marketing manager at F5 Networks (www.f5.com), says:
Last time we dove into a “Load Balancing 101” discussion, we looked at the difference between architected for scale and architected for fail. The question that usually pops up after such a discussion is, “Why can’t I just provision an extra server and use it? If one fails, the other picks up the load.”
We call such a model N+1 – where N is the number of servers necessary to handle the load, plus one extra, just in case. The assumption is that all N+1 servers are active, so no resources are just hanging out idle and wasting money. This is also sometimes referred to as “active-active” when such architectures include a redundant pair of X (firewalls, load balancers, servers, etc.) because both the primary and backup are active at the same time.
So it sounds good, this utilization of all resources, and when everything is running rosy it can even improve performance, because utilization remains lower across all N+1 devices.
The problem comes when one of those devices fails.
HERE COMES the MATH
In the simplest case of two devices – one acting as backup to the other – everything is just peachy keen until utilization is greater than 50%.
Assume we have two servers, each with a maximum capacity of 100 connections. Let’s assume clients are generating 150 connections and a load balancing service distributes this evenly, giving each server 75 connections for a utilization rate of 75%.
Now let’s assume one server fails.
The remaining server must try to handle all 150 connections, which puts its utilization at … 150%. Which it cannot handle. Performance degrades, connections time out, and end-users become very, very angry.
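The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name and its parameters are made up for illustration, and it assumes the load balancer spreads connections perfectly evenly across whatever servers are still up.

```python
def utilization(total_connections, servers, capacity_per_server, failed=0):
    """Per-server utilization (1.0 means 100%) with load spread evenly
    across the servers that are still running. Hypothetical helper."""
    surviving = servers - failed
    if surviving <= 0:
        raise ValueError("no servers left to carry the load")
    return total_connections / (surviving * capacity_per_server)


# The example from the text: 150 connections, two servers of 100 each.
before = utilization(150, servers=2, capacity_per_server=100)
after = utilization(150, servers=2, capacity_per_server=100, failed=1)
print(f"{before:.0%} -> {after:.0%}")  # 75% -> 150%
```

Run the same function with 100 connections instead of 150 and the surviving server lands at exactly 100%, which is the 50% threshold mentioned earlier.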
Which is why, if you consider the resulting impact of performance and downtime on business revenue and productivity, redundancy is considered a best practice for architecting data center networks. N+1 works in the scenario in which only one device fails (because the idle one can take over), but the larger the pool of resources, the more likely it is that more than one device will fail at roughly the same time, making it necessary to take more of an N+“a couple or three spares” approach.
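To get a rough feel for why bigger pools call for more spares, here is a back-of-the-envelope sketch. It rests on assumptions that are not in the discussion above: every device fails independently, with the same probability `p`, within some window of time, and the 1% figure is purely illustrative. Real failures are often correlated, so treat this as a toy model rather than a sizing tool.

```python
def prob_multiple_failures(n, p):
    """Probability that two or more of n devices fail in the same window,
    assuming independent failures with identical probability p (a
    simplifying assumption, labeled as such)."""
    none_fail = (1 - p) ** n
    exactly_one = n * p * (1 - p) ** (n - 1)
    return 1 - none_fail - exactly_one


# Illustrative only: p = 1% per device per window is an invented number.
for pool_size in (2, 10, 50, 100):
    print(pool_size, round(prob_multiple_failures(pool_size, 0.01), 4))
```

Under these assumptions the chance of a double failure climbs steadily as the pool grows, which is the intuition behind keeping more than one spare.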
Yes, resources stand idle. Wasted. Money down the drain.
Until they’re needed. Desperately.
They’re insurance against failure, and they always have been. The cost of downtime and/or performance degradation has always been considered far greater than the operational and capital costs associated with a secondary, idle device.
The ability of a load balancing service to designate a backup server/resource that remains idle is paramount to enabling architectures built to fail. The ability of a load balancing service in the cloud to do this should be considered a basic requirement. In fact, much like leveraging cloud as a secondary “backup” data center for disaster recovery/business continuity strategies, having a “spare” resource waiting to assure availability should be a no-brainer from a cost perspective, given the much lower cost of ownership in the cloud.
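What “designating a backup that remains idle” might look like can be sketched in a few lines. This is not any particular load balancer's API; the `Server` class, the `backup` flag, and the `active_set` function are hypothetical names used only to illustrate the idea that a spare receives traffic only when a primary is down.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True
    backup: bool = False  # hypothetical flag: idle until a primary fails


def active_set(pool):
    """Servers that should receive traffic: every healthy primary, plus
    one healthy spare for each primary that is currently down."""
    primaries = [s for s in pool if not s.backup]
    healthy_primaries = [s for s in primaries if s.healthy]
    missing = len(primaries) - len(healthy_primaries)
    spares = [s for s in pool if s.backup and s.healthy]
    return healthy_primaries + spares[:missing]


pool = [Server("a"), Server("b"), Server("spare", backup=True)]
print([s.name for s in active_set(pool)])  # ['a', 'b']

pool[1].healthy = False  # one primary fails; the spare steps in
print([s.name for s in active_set(pool)])  # ['a', 'spare']
```

With the spare held in reserve, the number of servers carrying traffic stays at N after a single failure, so per-server utilization does not jump the way it does in the active-active example.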