– Lori MacVittie, Senior Technical Marketing Manager at F5 Networks (www.f5.com), says:
Like urban legends, every few years this one rears its head and makes its rounds.
It is certainly true that everyone who has an e-mail address has received some message claiming that something bad is going on, or someone said something they didn’t, or that someone influential wrote a letter that turns out to be wishful thinking. I often point the propagators of such urban legends to Snopes because the folks who run Snopes are dedicated to hunting down the truth regarding these tidbits that make their way to the status of urban legend.
It would nice, wouldn’t it, if there was such a thing for technical issues; a technology-focused Snopes, if you will.
But there isn’t, and every few years a technical urban legend rears its head again and sends some folks into a panic. And we, as an industry, have to respond and provide some answers.
This is certainly the case with Global Server load balancing (GSLB) and Round-Robin DNS (RR DNS).
CLAIM : DNS Based Global Server Load Balancing (GSLB) Doesn’t Work
STATUS : Inaccurate
ORIGINS
The origins of this skeleton in GLSB’s closet is a 2004 paper written by Pete Tenereillo, “Why DNS Based Global Server Load Balancing (GSLB) Doesn’t Work.” It is important to note that at the time of the writing Pete was not only very experienced with these technologies but was also considered an industry expert. Pete was intimately involved in the early days of load balancing and global server load balancing, being an instrumental part of projects at both Cisco and Alteon (Nortel, now Radware). So his perspective on the subject certainly came from experience and even “inside” knowledge about the way in which GSLB worked and was actually deployed in the “real” world.
The premise upon which Pete bases his conclusion, i.e. GSLB doesn’t work, is that the features and functionality over and above that offered by standard DNS servers are inherently flawed and in theory sound good, but don’t work. His ultimate conclusion is that the only way to implement true global high-availability is to use multiple A records, which are already a standard function of DNS servers.
An Axiom
The only way to achieve high-availability GSLB for browser based clients is to include the use of multiple A records.
It would be easy to dismiss Pete’s concerns based on the fact that his commentary is nearly seven years old at this point. But the basic principles upon which DNS and GSLB are implemented today are still based on the same theories with which Pete takes issue. What Pete missed in 2004 and what remains missing from this treatise is twofold. First, GSLB implementations at that time, and today, do in fact support returning multiple A records, a.k.a. Round-Robin DNS. Second, the features and functionality provided over and above standard DNS do, in fact, address the issues raised and these features and functionality have, in fact, evolved over the past seven years.
Is returning multiple A records to LDNS the only way of achieving High Availability? How is advanced health checking an important component of providing High Availability?
Many people misuse the term ‘high availability’ by indicating that it only equates to when a site is either up or down. This type of binary thinking is misguided and is purely technical in focus. Our customers have all indicated that high availability also includes performance of the application or site. The reason is that by business definitions if a site or application is too slow it is unavailable. Poor performance directly impacts productivity, one of the key performance indicators used to measure the effectiveness of business employees and processes.
As a result, high availability can be achieved in a number of different ways. Intelligent GSLB solutions, through advanced monitoring and statistical correlation, take into account not only whether the site is up or down, but such detail as hop count, packet loss, round-trip time, and data-center capacity to name a few. These metrics then transparently provide users with most efficient and intelligent way of steering traffic and achieving high availability. Geolocation is another means of steering traffic to the appropriate service location, as well as any number of client and business-specific variables. Context is important to application delivery in general, but is a critical component of GSLB to maintain availability – including performance.
The round robin handling of the A records by the Local DNS (LDNS) is a well known problem in the industry. When multiple A records are handed back to the LDNS for an address resolution, the LDNS shuffles the list and returns the A records list back to the client without honoring the order in which it received it. The next time the client requests an address, the LDNS responds with a different ordered list of A records. This LDNS behavior makes it very difficult to predict the order in which A records are being returned to the client. In order to overcome this problem, many prefer to configure a GSLB solution to send back one A record to the LDNS. When compared with just a ‘plain’ DNS server that would send back any one of the site addresses with a TTL value, an intelligent GSLB sends back the address of the best performing site, based on the metrics that are important to the business, and sets the TTL value.
A majority of the LDNS that are RFC compliant will honor the TTL value and resolve again after the TTL value has expired. The GSLB performs advanced health checking and sends back the address of the best performing site taking into account metrics like application availability, site capacity, round trip time, hops and completion rate hence providing the best user experience and meeting applicable business service level agreements.
In the event of a site failure (when the link is down or because of a catastrophic event), existing clients would connect to the unavailable site for the period of time equal to the TTL value. The GSLB sets a TTL value of 30 seconds when returning an A record back to the LDNS. As soon as the 30 second time period expires, the LDNS resolves again and the GSLB uses its advanced health checking capability and determines that one of the multiple sites is unavailable. The GSLB then starts to direct users transparently to the best performing site by returning the address of that site back to the LDNS. A flexible GSLB will also provides a Manual Resume option that gives them the option of letting the unavailable site stay down to mitigate the commonly known back end database synchronization problem.
An intelligent GSLB also has the option of sending multiple A records that allows delivery of content from the best performing sites. For example, let’s say an enterprise wants to deliver their content using 10 sites and wants to provide high availability. Using sophisticated health checking, the GSLB can determine the two best performing sites and return their addresses to the users. The GLSB would track each site’s performance and send back the best sites based on current network and site conditions (context) for every resolution. Slow sites or sites that are down would never be sent back to the user.
What about the issues with DNS Browser Caching?
Of all the issues raised by Pete in his seminal work of 2004, this is likely the one that is still relevant. Browser technology has evolved, yes, but the underlying network stack and functionality has not, mainly because DNS itself has not changed much in the past ten years. Most modern browsers may or may not (evidence is conflicting and documentation nailing it down scant) honor DNS TTL but they have, at least, reduced the caching interval on the client side. This may or may not – depending on timing – result in a slight delay in the event of a failure while resolution catches up but it does not have nearly the dramatic negative impact it once had. In early days, a delay of 15 minutes could be expected. Today that delay can generally be counted in seconds. It is, admittedly, still a delay but one that is certainly more acceptable to most business stakeholders and customers alike.
And yet while the issue of DNS browser caching is still technically accurate, it is not all that relevant; the same solution Pete suggests to address the issue – RR DNS – has always been available as an option for GSLB. Any technology, when not configured to address the issue for which it was implemented, can be considered a failure. This is not a strike against the technology, but the particular implementation. The instances of browser caching impacting site availability and performance is minimal in most cases and for those organizations for which such instances would be completely unacceptable it is simply a matter of mitigation using the proper policies and configuration.
SUMMARY
What it comes down to is that Pete, in his paper, is pushing for the use of Round-Robin DNS (RR DNS). Modern Global Server Load Balancing (GSLB) solutions fully support this option today and generally speaking always have. However, the focus on the technical aspects completely ignores the impact of business requirements and agreements and does not take into consideration the functions and features over and above standard DNS that assist in supporting those requirements.
For example, health-checking has come a long way since 2004, and includes not only simply up-down indicators or even performance-based indicators but is now able to incorporate a full range of contextual variables into the equation. Location, client-type, client-network, data center network, capacity… all these parameters can be leveraged to perform “health” checks that enable a more accurate and ultimately adaptable decision. Interestingly, standard DNS servers leveraged to implement a GSLB solution are not capable of nor do they provide the means by which such health checks can be implemented. Such “health monitoring” is, however, a standard offering for GSLB solutions.
NEW FACTORS to CONSIDER
Given the dynamism inherent not only to local data centers but global implementations and the inclusion of cloud computing and virtualization, GSLB must also provide the means by which management and maintenance and process automation can be accomplished. Traditional DNS solutions like BIND do not provide such means of control; they are enabled with the ability to participate in the collaborative processes necessary to automate the migration and capacity fulfillment functions for which virtualization and cloud computing are and will be used. Thus a simple RR DNS implementation may be desirable, but the solution through which such implementations will be implemented must be more modern and capable of addressing management and business concerns as well. These are the “functions and features” over and above standard DNS servers that provide value to organizations regardless of the technical details of the algorithms and methods used to distribute DNS records.
Additionally, traditional DNS solutions – while supporting new security initiatives like DNSSEC – are less able to handle such initiatives in a dynamic environment. A GSLB must be able to provide dynamic signing of records to enable global server load balancing as a means to support DNSSEC. DNSSEC introduces a variety of challenges associated with GSLB that cannot be easily or efficiently addressed by standard DNS services.
Modern GLSB solutions can and do address these challenges while enabling integration and support for other emerging data center models that make use of cloud computing and virtualization.
This skeleton is sure to creep out of the closet yet again in a few years, primarily because DNS itself is not changing. Extensions such as DNSSEC occasionally crop up to address issues that arise over time, but the core principles upon which DNS have always operated are still true and are likely to remain true for some time.
What has changed are the data center architectures, technology, and business requirements that IT organizations must support, in part through the use of DNS and GSLB. The fact is that GSLB does work and modern GSLB solutions do provide the means by which both technical and business requirements can be met while simultaneously addressing new and emerging challenges associated with the steady march of technology toward a dynamic data center.