Bryan Semple, CMO at VKernel (http://www.vkernel.com/), says:

Any data center with a virtualized environment has a real need for effective capacity management. This white paper discusses the reasons why capacity management is critical to achieving the benefits of server virtualization and outlines the three key requirements to consider when evaluating capacity management systems.

Why Capacity Management Matters in Virtualized Environments
A major advantage of virtualized environments is their ability to improve resource utilization by running multiple virtual machines (VMs) on the physical servers in a shared infrastructure. With such an architecture, utilization can increase from as low as 10% for dedicated servers to 60% or more for virtualized servers. The enhanced resource efficiencies make it possible to more fully utilize ever-increasing server power and provide significant savings in capital expenditures, power consumption, rack space and cooling.

This concept of greater efficiencies through resource-sharing is not new. Mainframe systems have long employed time-slicing to enable multiple applications to run concurrently. With mainframe systems, the dedicated and quite sophisticated “capacity planning” is performed by the operating system, which ensures that no application can cause any others to suffer from resource contention issues. The high cost of mainframes created a strong incentive for IT departments to maximize mainframe resource utilization by running as many concurrent applications as physically possible.

Today’s server virtualization solutions operate in a similar manner. Hypervisors enable multiple virtualized applications to run on the same physical x86 server, all sharing common CPU, memory, storage and networking resources. Through the magic of the hypervisor, each application operates as if running alone on a dedicated server. As with the mainframe, however, each virtual machine is actually sharing resources with other virtual machines. And as with the mainframe, the multiple applications sometimes contend for shared resources, causing performance to degrade, especially during peak periods.

The goal with both mainframes and virtualized servers is the same: optimize resource utilization without degrading performance, thereby maximizing cost-saving efficiencies. Organizations undertaking server consolidation projects invariably experience such savings, at least initially. Where their data centers had been filled with row after row of underutilized servers, each running a single application, the post-consolidation data center may have seemed almost deserted given the reduction in the number of racks required. A very successful consolidation effort, for example, might run 10 or 20 different applications on each server, requiring only one-tenth to one-twentieth the number of servers. Fewer servers consuming less space and power and requiring less cooling led to significant savings.

This dream of dramatic savings through consolidation and virtualization has the potential to become a real performance nightmare, however, without good capacity planning and management. The key to successful capacity management, therefore, is to ensure satisfactory application performance (prevent the nightmare) while maximizing efficiencies and savings (preserve the dream).

At a high level, managing virtualized server capacity is not that much different from managing mainframes, which also have shared CPU, memory and storage resources. But looking deeper at the details reveals some dramatic differences that might make the mainframe’s systems engineer (read: capacity manager) feel completely unqualified to deal with the complexity inherent in open systems capacity management in virtualized environments.

Clearly, for an organization to benefit the most from its virtualized infrastructure, robust capacity management must be an integral component of that infrastructure. Leading analyst firms Gartner, Forrester and others all concur on this need. In Jean-Pierre Garbani’s report titled I&O’s New Capacity Planning Organization, for example, the Forrester analyst states emphatically: “Capacity management and planning are the keys to virtualization.”

Distributed Resource Schedulers Are Not Capacity Managers
An obvious question to ask here is: “Don’t applications like distributed resource schedulers solve the capacity management problem?” And the answer is an emphatic no. Distributed resource scheduler (DRS) applications are intended to balance the load virtual machines place on a hardware cluster. So just as a hypervisor provisions and balances the resources a virtual machine is able to consume on a single host, distributed resource schedulers perform the same provisioning and balancing across a cluster of hosts.

Balancing load across resources in an environment is important. But a balanced environment can still lack sufficient capacity, or have too much of it. In addition, virtual machines in a balanced environment can still suffer performance problems caused by noisy neighbors or by the underlying resource availability of the hosts they run on. So while operating distributed resource schedulers is good practice, system administrators need more management capabilities at the host, cluster and data center levels to fully and effectively plan and manage capacity.
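
To make the distinction concrete, consider the trivial sketch below: a two-host cluster that any distributed resource scheduler would consider perfectly balanced, yet that is nearly out of capacity. The utilization figures and thresholds are assumed for illustration only.

```python
# Balanced is not the same as adequately sized: both hosts carry identical
# load (perfectly balanced), yet the cluster is nearly out of capacity.
hosts = {"h1": 0.92, "h2": 0.92}  # fraction of each host's capacity in use (assumed)

spread = max(hosts.values()) - min(hosts.values())
print(f"balanced: {spread < 0.05}")                    # True: a DRS would be satisfied
print(f"capacity risk: {max(hosts.values()) > 0.80}")  # True: more capacity is needed
```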

Capacity Management Challenges Will Only Increase
While daunting already for some today, the capacity management challenges faced by most IT organizations are certain to increase. The following trends are driving the need for more sophisticated capacity management solutions:

• Environment scale – Relentless growth in applications will cause virtualized environments to become increasingly larger and denser, making capacity management more complex.
• Mission-critical applications – The growing number of mission-critical applications being virtualized will require enhanced performance monitoring.
• Multi-hypervisor deployments – The use of multiple hypervisors will require an agnostic approach to capacity and performance management.
• Cost optimization – With the “low-hanging fruit” savings from initial server consolidation projects now in the past for most organizations, future savings will need to come from cost optimization initiatives. And while chargeback is not as effective as initially thought at curbing waste, CFOs will continue to demand annual improvements.
• VM mobility – Mobility among private, public and hybrid clouds and even among development, production and DR environments, will add complexity to the decision-making process for determining the optimal allocation of VM capacity and workloads.

All of these trends combine to make controlling and predicting virtualized capacity increasingly challenging.

The Cost of Delay
Many organizations do not yet have a purpose-built capacity management system for their virtual environments, relying instead on other tools to somehow perform this essential function. Without dedicated and sophisticated capacity management, however, one of two scenarios inevitably unfolds: either the environment is so over-provisioned that there are no performance issues (and no one has yet caught on to the tremendous waste!); or administrators are using spreadsheets and other manual procedures in a daily struggle to maintain service levels by constantly reallocating an increasingly complex array of resources (often by trial-and-error!).

Delaying the inevitable need to implement fully effective capacity management has real costs to an organization, which often manifest as:

• Application performance problems as VMs contend with each other for resources
• Hours spent firefighting either perceived or actual problems throughout the virtual environment
• Loss of confidence in virtual infrastructure performance
• Wasteful resource allocations that undermine the cost-saving advantages of virtualization
• Over-purchasing of server hardware, memory or storage on a routine basis
• Hours of staff time spent maintaining spreadsheets for management reporting (time spent not being able to work on more productive projects)
• Incorrectly sizing new servers during a hardware refresh, paying a premium for:
     o Expensive scale-up systems when scale-out systems are more efficient
     o Excessive support overhead for scale-out systems where scale-up systems are more appropriate
     o The latest CPUs with maximum clock speeds when slower, earlier-generation (and far less expensive) CPUs would suffice
     o The latest, highest-density memory when far more economical lower-density memory is sufficient for the actual VM load

But perhaps the greatest cost of delay is not getting started aligning IT services with costs. Public clouds now give internal IT consumers alternatives when shopping for services, and published cloud price points create the perception that public cloud services are “cheaper.” That perception is difficult to counter when IT has yet to develop a workable cost model for its own services.

Even ignoring the public cloud “competition”, nearly every IT executive is now focused on maximizing resource utilization to drive down capital and operational expenditures. Virtualization provides the ability to begin aligning IT costs to the services provided. But it is critical to begin this journey with a full understanding of the linkages among capacity, performance and cost. This is perhaps the biggest reason not to delay implementing a genuine and capable capacity management system. Senior IT management is focusing on the problem throughout IT; solving it sooner rather than later in the virtualized infrastructure just makes sense.

Requirements for Capacity Management Solutions
What are the requirements for capacity management in a virtualized environment? At a high level, a capable capacity management solution must:

• Offer enterprise-wide visibility into performance, capacity, cost and resource efficiency of the entire virtualized infrastructure
• Provide actionable intelligence from this information
• Be simple to deploy, operate and maintain

Enterprise-Wide Visibility
Performance, capacity, cost and resource efficiency are all intertwined in a virtual environment. Without sufficient capacity, performance suffers. With too much capacity, infrastructure costs soar. Even with the right amount of capacity, efficiency can still suffer if virtual machines are consuming more expensive resources than are required.

Therefore, performance, capacity, cost and resource efficiency must be viewed across the enterprise in a holistic fashion to provide visibility for the administrator, as well as information that is both sufficient and accurate enough to facilitate fully informed decision making. Such visibility requires roll-ups of information across:

• Data centers
• Different types of hypervisors
• Different resource pools, such as CPU, memory, storage and networking

Simply being able to roll up information is not enough, however. As environments scale, functionality must be added to view all of this information in a meaningful way.
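
As a minimal illustration of what such a roll-up involves, the sketch below aggregates hypothetical per-VM CPU and memory figures up to the host, cluster and data center levels. The data layout and field names are assumptions made for illustration, not any particular vendor’s API.

```python
from collections import defaultdict

# Hypothetical per-VM samples tagged with their place in the hierarchy.
# The field names (datacenter, cluster, host) are illustrative assumptions.
samples = [
    {"datacenter": "dc1", "cluster": "c1", "host": "h1", "vm": "vm01", "cpu_mhz": 800,  "mem_mb": 2048},
    {"datacenter": "dc1", "cluster": "c1", "host": "h1", "vm": "vm02", "cpu_mhz": 1200, "mem_mb": 4096},
    {"datacenter": "dc1", "cluster": "c1", "host": "h2", "vm": "vm03", "cpu_mhz": 400,  "mem_mb": 1024},
]

def roll_up(samples, level):
    """Sum CPU and memory consumption at the given level of the hierarchy."""
    totals = defaultdict(lambda: {"cpu_mhz": 0, "mem_mb": 0})
    for s in samples:
        key = s[level]
        totals[key]["cpu_mhz"] += s["cpu_mhz"]
        totals[key]["mem_mb"] += s["mem_mb"]
    return dict(totals)

for level in ("host", "cluster", "datacenter"):
    print(level, roll_up(samples, level))
```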

Cost
Understanding the virtual environment cost structure is critical as enterprises move toward the cloud. Because the cloud enables self-service portals, end users can quickly drive up operating costs in the absence of a thorough understanding of the underlying costs. Indeed, the sheer ease with which virtual environments enable the deployment of virtual machines has led to virtual machine sprawl. Understanding the cost component of a virtual environment is, therefore, essential to good capacity planning. Support for cost visibility requires:

• Chargeback (or at least “showback”) capabilities by customer for either allocated or utilized resources

• Robust reporting and potential integration with financial management systems

Chargeback/showback may encounter some significant organizational and computational limitations, however. These include:

• Financial systems that lack the ability to integrate chargeback information

• Generally accepted accounting principles that make chargeback difficult

• Budgeting cycles that are based on assumptions of fixed costs, not variable consumption models. IT customers faced with a consumption-based chargeback model must then confront the challenge of estimating uncertain computational demand.

• IT charging back for services may not be “politically palatable” for an organization

• Determining the chargeback amounts is also a non-trivial exercise if the intention is to get an accurate model of pricing

• Finally, chargeback is a measure of the price IT is charging for services and is not necessarily a measure of the actual cost to deliver that service. The actual cost to deliver all services at a high level is the total cost to own and operate all IT infrastructure divided by the number of virtual machines on that infrastructure. IT needs to focus on its actual costs, not the costs charged. This makes chargeback, without cost awareness, less beneficial as a management tool.

The barriers to chargeback are many. Nevertheless, they should not prevent IT from starting down the path to understanding and managing its cost structure. A key element is to implement a cost index that reflects what it costs IT to deploy a VM. Cost indices are a fairly new and advanced tool for IT. Using a cost index, systems administrators can identify their most expensive virtual machines based on resource consumption, the cost of the underlying hardware and the density of deployment relative to other virtual machines. By identifying the most expensive virtual machines, administrators can take action to reduce costs or at least understand the impact on overall efficiency. Combining cost indices with cost visibility provides a solid foundation for lowering IT costs over time and understanding the main cost drivers throughout the IT infrastructure.
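
The sketch below illustrates one way such a cost index might be computed. It assumes a deliberately simple model: the high-level “actual cost” per VM described above (total infrastructure cost divided by VM count), scaled by each VM’s resource consumption relative to the fleet average. All figures, field names and weightings are illustrative assumptions, not a vendor-defined formula.

```python
# Minimal cost-index sketch; the cost model and figures are assumptions.
total_infra_cost = 500_000.0  # annual cost to own and operate the infrastructure

# Hypothetical per-VM resource consumption (relative units).
vms = {
    "vm01": {"cpu": 1.0, "mem": 2.0, "storage": 0.5},
    "vm02": {"cpu": 4.0, "mem": 8.0, "storage": 2.0},
    "vm03": {"cpu": 0.5, "mem": 1.0, "storage": 4.0},
}

avg_cost_per_vm = total_infra_cost / len(vms)  # the high-level "actual cost" per VM

def cost_index(usage, fleet):
    """Index a VM's consumption against the fleet average (1.0 = average VM)."""
    scores = []
    for resource in ("cpu", "mem", "storage"):
        fleet_avg = sum(v[resource] for v in fleet.values()) / len(fleet)
        scores.append(usage[resource] / fleet_avg)
    return sum(scores) / len(scores)

# Rank VMs from most to least expensive by indexed cost.
for name, usage in sorted(vms.items(), key=lambda kv: cost_index(kv[1], vms), reverse=True):
    idx = cost_index(usage, vms)
    print(f"{name}: index {idx:.2f}, indexed cost ${avg_cost_per_vm * idx:,.0f}")
```

Even a crude index like this is enough to rank virtual machines from most to least expensive and focus attention on the outliers.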

Resource Efficiency
Understanding efficiency in virtualized environments is critical because wasted or under-utilized resources are what drive up capital and operational expenditures. More importantly, since one of the original goals for virtualization was server consolidation and efficiency improvements, poor efficiency of the virtualized environment undermines this fundamental and worthy goal. Of course, IT can perform chargeback or showback, yet still have tremendous inefficiencies throughout the environment. Chargeback can, however, be a tool to help reveal such inefficiencies.

How is virtualized resource efficiency monitored and analyzed?

• The cost index, introduced above, is a way for IT teams to understand the relative costs of operating a virtual machine. While one VM could be expensive to operate relative to others, it may itself be running efficiently, with the underlying system being the actual culprit driving up costs. A capacity management system must, therefore, be able to rank the indexed virtual machine costs accordingly.

• Over-allocating VM resources is a major source of inefficiency. Over-allocation occurs when applications have more CPU, memory or storage than needed to perform adequately. The capacity management system must be able to identify over-allocations continuously, preferably by monitoring peak and average resource utilization across CPU, memory and storage (a simple detection sketch follows this list). As some hypervisor vendors shift to consumption-based licensing models, removing over-allocated memory will become an increasingly important aspect of cost control.

• Wasted resources occur in virtualized environments from normal operations. These wasted resources include zombie VMs, abandoned VMs, unused templates and unused snapshots. Capacity management systems must effectively distinguish these wasteful resources from similar resources that are actually in production use.
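
As a rough sketch of how the last two checks might work, the code below flags over-allocated VMs whose peak utilization never approaches their allocation, and likely zombies that show almost no activity across the observation window. The thresholds and data shapes are assumptions for illustration, not vendor-defined values.

```python
# Illustrative waste-detection sketch; thresholds and data layout are assumptions.
OVERALLOC_PEAK = 0.40  # peak below 40% of allocation suggests over-allocation
ZOMBIE_PEAK    = 0.02  # near-zero activity over the whole window suggests a zombie

# Hypothetical per-VM utilization samples (fraction of allocated resource used).
history = {
    "vm01": {"cpu": [0.05, 0.10, 0.08], "mem": [0.20, 0.25, 0.22]},
    "vm02": {"cpu": [0.70, 0.90, 0.85], "mem": [0.60, 0.75, 0.80]},
    "vm03": {"cpu": [0.00, 0.01, 0.00], "mem": [0.01, 0.01, 0.01]},
}

for vm, metrics in history.items():
    peaks = {m: max(vals) for m, vals in metrics.items()}
    if all(p <= ZOMBIE_PEAK for p in peaks.values()):
        print(f"{vm}: likely zombie {peaks}; candidate for reclamation")
    elif all(p < OVERALLOC_PEAK for p in peaks.values()):
        print(f"{vm}: over-allocated {peaks}; candidate for downsizing")
    else:
        print(f"{vm}: sized appropriately")
```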

Actionable Intelligence
While enterprise-wide visibility is the first major requirement for a capacity management solution, generating actionable intelligence from the data collected is just as important. Capacity management is essentially an analysis problem. Correctly performing capacity and performance management requires the analysis of about 20 different metrics per virtual machine at the VM, host, cluster and data center levels, sampled at least every five minutes. For a simple 100 virtual machine environment, for example, this requires the analysis of 100 VMs x 20 metrics x 12 samples/hour x 24 hours x 4 levels of analysis, yielding roughly 2.3 million data points per day. Given the sheer volume of data, it is not difficult to see why manual processes simply fail to scale. The better capacity management solutions are able to perform this multi-variable analysis on a massive scale.
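
The arithmetic behind that figure is easy to verify; the snippet below simply multiplies out the assumptions stated above.

```python
# Data-point volume for a 100-VM environment, per the assumptions above.
vms              = 100
metrics_per_vm   = 20
samples_per_hour = 12   # five-minute sampling interval
hours_per_day    = 24
analysis_levels  = 4    # VM, host, cluster, data center

data_points = vms * metrics_per_vm * samples_per_hour * hours_per_day * analysis_levels
print(f"{data_points:,} data points per day")  # 2,304,000
```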

The visibility requirement of capacity management solutions demands a significant amount of computational horsepower simply to make sense of the wealth of data. The need for creating actionable intelligence requires even more computations to enable system administrators to move beyond basic visibility into various problems and efficiency issues to being empowered to take action to address the underlying cause(s), either in a manual or automated fashion.

For performance issues, actionable intelligence involves:

• Root cause analysis with specific recommendations for clearing the problem that require no additional analysis

• Impact analysis to point out any related virtual objects that might be affected by an issue

• Automated actions to clear performance issues, such as moving a virtual machine to a different cluster, or working with native distributed resource schedulers to accomplish the task

• Automated resizing of a virtual machine, with or without a restart, within the limitations imposed by the operating system(s) or corporate policies (a toy sketch of such resizing logic follows this list)
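
A toy sketch of such resizing logic follows: it recommends a new allocation based on observed peak demand plus a headroom factor, and flags whether a restart would be required when the guest cannot hot-unplug the resource. The headroom value and capability flag are illustrative assumptions, not taken from any product.

```python
# Toy rightsizing sketch; the headroom factor and capability flag are assumptions.
HEADROOM = 1.25  # target allocation = observed peak demand + 25%

def recommend_resize(allocated, peak_used, hot_unplug_supported=False):
    """Return (recommended_allocation, restart_required) for one resource."""
    target = round(peak_used * HEADROOM, 1)
    if target >= allocated:
        return allocated, False  # nothing to reclaim
    # Shrinking CPU or memory typically requires a restart unless the
    # guest OS and hypervisor support hot-unplug of that resource.
    return target, not hot_unplug_supported

# Hypothetical VM: 8 vCPU-equivalents allocated, observed peak demand of 2.4.
alloc, restart = recommend_resize(allocated=8.0, peak_used=2.4)
print(f"recommend {alloc} vCPU-equivalents; restart required: {restart}")
```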

For cost accounting and efficiency, actionable intelligence involves:

• Specific recommendations for ways to improve efficiency and lower costs

• Automated zombie destruction, template cleansing and abandoned-VM cleanup, especially for QA environments that potentially contain thousands of virtual machines

• Automated downsizing of virtual machines within the limitations imposed by the operating system(s)

• Automated reporting of cost and efficiency numbers for key stakeholders in a variety of formats

Visibility without actionable intelligence, while a step in the right direction, leaves the administrators, especially in larger environments, with a significant labor burden to maintain efficiency and performance of their virtualized infrastructures.

Starts to Work Out of the Box
Why does a capacity management system have to be hard to deploy? It doesn’t. Here are some reasons ease of deployment and use are important requirements:

• Lack of dedicated capacity planners – Most virtualization teams, even very large ones, do not have dedicated capacity planners. Any application to manage capacity and performance must, therefore, be usable by all team members without significant training.

• Lack of capacity management skills in the existing team – Capacity management is an analytics problem. It is certainly possible to develop this skill set on a virtualization team without a dedicated capacity planner. But there currently exists no formal certification authority similar to vExpert for managing capacity. Even with sufficient training, the scale of the analytics problem would still require a sizable investment in software (whether bought or built) to mine the raw metrics data generated by the hypervisors.

• Return on investment – Long deployment times and high training costs both diminish the return on the investment in a capacity management system. The better systems are able to recover their purchase price in just a few months.

• Pace of expansion – Virtualized environments are in a constant state of change, which normally involves both reconfiguring and expanding resources. If the capacity management solution cannot keep pace with the expansion, or requires constant configuration changes to do so, it will eventually need to be replaced.

• “Expert in the Box” – Similar to having a vExpert on staff to address technical issues, the capacity management solution needs to function like an expert itself from day one and it should not require an expert operator to get great results—ever.

The specific requirements for a capacity management system to work “out of the box” are:

• Little to no configuration or maintenance required for operation

• A minimal learning curve for basic operation with intuitive interfaces to facilitate usage by part-time capacity planners

• Automation to support repetitive tasks

• Automatic creation of user views to eliminate the need for manual customization

• Pre-configured and easily customizable reporting for different audiences

Conclusion
Capacity management for virtualized environments is an absolute necessity in any infrastructure of reasonable scale. Virtualization has reintroduced the mainframe model of computing, but with significantly more complexity for sharing CPU, memory, storage and networking resources. While implementing a capacity management solution could be postponed, most environments will incur some very real and potentially substantial costs by doing so. Robust capacity management systems—those that meet the requirements outlined here for enterprise-wide visibility, actionable intelligence and “out of the box” productivity—pay for themselves almost immediately, with the cost savings continuing to accumulate year after year. It is perhaps the best investment an organization will ever make to get the most from its virtualized infrastructure.