Guide to Selecting a Data Center Monitoring System

– Steve Francis, Founder and CEO of LogicMonitor (www.logicmonitor.com), says:

While the process of selecting a monitoring system is necessarily unique to every enterprise, this post provides some guidance on the issues to consider when making that decision. Selecting the best monitoring system for your enterprise boils down to a single selection criterion: Pick the monitoring system that adds the most value to your business.

A monitoring system adds value if the benefits of the system are greater than the acquisition, implementation and operational costs.

Generally, the benefits an enterprise will obtain from a monitoring system fall into the following categories:

reducing the cost of outages and service degrading events
reducing staff cost (time) of investigations into performance and availability issues
improved information efficiency

Note that the focus of assessing a monitoring system’s positives should always be on the business benefits, not the features.

Balanced against these benefits will be the costs of the monitoring system:

acquisition cost
implementation costs
operational costs
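
Once each category has an estimate, the comparison of benefits against costs is simple arithmetic. The following is a minimal sketch in Python, assuming a fixed evaluation horizon of three years; the class, its field names, and every figure are hypothetical placeholders for illustration, not data from any real system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Estimated benefits and costs for one monitoring system (all hypothetical)."""
    name: str
    outage_savings: float        # reduced cost of outages and degradations, per year
    staff_savings: float         # reduced investigation time, per year
    efficiency_savings: float    # improved information efficiency, per year
    acquisition_cost: float      # one-time
    implementation_cost: float   # one-time
    operational_cost: float      # per year

    def net_value(self, years: int = 3) -> float:
        """Benefits minus costs over the evaluation horizon."""
        benefits = years * (
            self.outage_savings + self.staff_savings + self.efficiency_savings
        )
        costs = (
            self.acquisition_cost
            + self.implementation_cost
            + years * self.operational_cost
        )
        return benefits - costs


# Illustrative estimates only.
candidates = [
    Candidate("System A", 60_000, 30_000, 10_000, 50_000, 40_000, 20_000),
    Candidate("System B", 40_000, 20_000, 5_000, 10_000, 5_000, 15_000),
]

best = max(candidates, key=lambda c: c.net_value())
print(best.name, best.net_value())
```

The hard part is not the subtraction but producing defensible estimates for each field, which is what the rest of this post discusses.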

Assessing the Benefits of Monitoring
A monitoring system is an efficiency tool – it allows enterprises to avoid and minimize expenses and revenue loss, rather than contributing directly to increased revenue. (Managed Service Providers that sell monitoring and value-added response services are an obvious exception.) Thus in order to assess the business value of a monitoring system, and to compare possible systems, one must have an idea of the possible expenses the tools will mitigate.

Minimizing the Cost of Outages and Service degrading events

Quantifying Outage Costs
Avoiding outage costs is a common justification of monitoring, but is often hard to quantify, and is different for every enterprise. For some enterprises (although increasingly few), downtime may matter very little, and only the simplest of monitoring is justified.

Each enterprise should consider both the immediate impacts of outages and the brand impacts, but both cases will require thought and discussion specific to the enterprise.

Consider the case of online retailers with directly measurable dollar/minute metrics attributable to web site sales. Does an outage mean that revenue for the duration of the outage is lost? Perhaps customers will simply purchase later, when the site is online. Perhaps the outage means customers lose trust in the brand, and not only make their immediate purchases at a competitor, but also make all future purchases at the competitor. In this case, the outage cost for a small but growing site could be much greater than at an established brand, despite a much lower sales volume. The established brand may impact $1 million in sales during an hour-long outage – but those sales will likely be made up later. A similar outage on a smaller, growing site may only directly impact $2,000 in sales – but the sales are likely to be permanently lost, and worse, the loss of goodwill by early evangelists of the site can significantly affect growth.

An outage on a site that provides a subscription service may have less impact on longer term customers, but customers are more likely to churn if they experience an outage before they have internalized the value of the service – new customers, or those in trial. In this case, the outage costs not the customers' subscription fees for a month, but the lifetime customer value of those who did not convert.

An outage of an internal IT virtualization infrastructure that idles the workstations of 150 engineers (at $150 an hour fully loaded salary) is superficially an obvious direct cost – but as exempt employees, the engineers may complete their work anyway, perhaps by staying late. Then the cost becomes one of employee satisfaction – and if it results in employee turnover, the cost becomes much higher. If an outage of IT systems affects sales people at the end of the quarter, preventing them from accessing their CRM, or perhaps their phone systems, there can be a very large cost – in sales staff dissatisfaction, revenue for the quarter, and even corporate stock price.
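
To put a number on the engineering example, here is a rough sketch of the direct cost calculation. The function name and the `recovered_fraction` parameter are assumptions introduced for illustration; the 50% recovery figure is invented, and the result ignores the satisfaction and turnover costs described above.

```python
def idle_staff_cost(staff: int, hourly_rate: float, outage_hours: float,
                    recovered_fraction: float = 0.0) -> float:
    """Direct cost of idled staff, discounted by the share of work made up later."""
    if not 0.0 <= recovered_fraction <= 1.0:
        raise ValueError("recovered_fraction must be between 0 and 1")
    return staff * hourly_rate * outage_hours * (1.0 - recovered_fraction)


# 150 engineers at $150/hour, idle for one hour: the superficial direct cost.
upper_bound = idle_staff_cost(150, 150.0, 1.0)

# Assumption: half of the lost work is made up by staying late.
discounted = idle_staff_cost(150, 150.0, 1.0, recovered_fraction=0.5)

print(upper_bound, discounted)  # 22500.0 11250.0
```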

There are non-market driven costs too – downtime in a business unit may be valued disproportionately to its revenue contribution due to political clout of its executives. Thus determining the cost of an outage is not a simple matter of entering data into a formula, but requires knowledge of the revenue models of the enterprise.

Quantifying Service Degradation Costs
Service degradation issues can often cost more than outages. With an outage, there is a clear, identifiable situation – a service is down. With a degradation, there is often a lag before the issue is reported, another before it is acknowledged, and further complications with identifying the systems and personnel responsible (networking staff, server staff, and storage staff may each insist their respective systems are working correctly). This longer duration of the issue can result in larger costs. The costs may be lower sales revenue on an ecommerce site (slower site performance directly correlates with fewer conversions [1]). For internal systems, costs may be inefficient use of engineers' time as they wait for compilations or other resources, or less effective sales staff if their CRM system is slow. Given the high fully loaded cost of personnel, any system impact that detracts from productivity can quickly become a large drain.

Analysis of past Outages
Each organization will have to rely on its own experience to assess the historical frequency of outages, whether the outage would have been averted given ideal monitoring, the direct costs of the outage and the indirect, brand costs of the outage.

Some questions to discuss that can help guide this assessment:

Why do you want a monitoring system?
What do you want the monitoring system to do? What benefits do you anticipate getting from it?
How many outages or adverse performance events occurred over the last month? 6 months?

For each historical incident, as best can be determined:

What were the direct costs of this outage or performance issue?
What were the ‘brand’ costs of this event?
How many hours of staff time were involved in determining the cause of the outage?
What is the fully loaded cost of staff time for the staff involved?
What capabilities would a monitoring system have required in order to alert on the issue and identify the cause during the event?
What capabilities would a monitoring system have required in order to alert on the impending issue before the event?
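
The answers to these questions can be collected into a simple ledger of past incidents. This is a sketch only: the `Incident` fields mirror the questions above, the `avoidable` flag is an assumed way of recording whether adequate monitoring could have warned of the event, and the two incidents and their figures are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Incident:
    description: str
    direct_cost: float         # e.g. lost revenue, SLA credits
    brand_cost: float          # estimated 'brand' cost
    staff_hours: float         # hours spent determining the cause
    loaded_hourly_rate: float  # fully loaded cost of the staff involved
    avoidable: bool            # could monitoring have warned before the event?

    def total_cost(self) -> float:
        return (self.direct_cost + self.brand_cost
                + self.staff_hours * self.loaded_hourly_rate)


def summarize(incidents: List[Incident]) -> Tuple[float, float]:
    """Return (total historical cost, portion that monitoring might have avoided)."""
    total = sum(i.total_cost() for i in incidents)
    avoidable = sum(i.total_cost() for i in incidents if i.avoidable)
    return total, avoidable


# Hypothetical history for illustration.
history = [
    Incident("database volume filled up", 8_000, 2_000, 20, 150, avoidable=True),
    Incident("upstream ISP failure", 5_000, 1_000, 10, 150, avoidable=False),
]

total, avoidable = summarize(history)
print(total, avoidable)
```

The avoidable portion is a rough ceiling on the outage-related benefit a monitoring system could have delivered over the same period.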

A question that is always useful to ask is “So what?” If some devices went down, and there was no monitoring – so what? Why does it matter? This is a good way to flush out who cares about the issue.

Reduction of staff cost for investigations into performance and availability issues

With increased complexity of applications and infrastructure, the time spent to determine the root cause of performance or availability issues can be a substantial expense that good monitoring can significantly reduce.

Consider the example of a performance issue on an e-commerce web site. Troubleshooting the issue could involve bringing in staff resources to look at the network, the web server operating systems, the front end application, the load balancers, the back end database, the virtualization platform that runs the database virtual machine, Fibre Channel systems that connect the virtualization platform to the storage, and the storage system. Any one of these areas could reasonably be the cause of the issue. Further, silos of information can increase the time required to determine that a system is not contributing to the poor performance. For example, the database server operating system may be observed to be running slowly, leading troubleshooting efforts to focus on OS level tuning and issues – but the issue may be the underlying virtualization platform being memory starved, and transparently swapping out memory from the virtualized OS. In such a case, if the monitoring system alerted that the virtualization layer was low on memory and that swapping of virtual machines was occurring, and this information was available to all team members, troubleshooting would be much quicker, involve fewer resources, and the issue would be resolved sooner.

Of course, not every situation is going to be alerted on by monitoring, but even in such cases monitoring can still greatly reduce the time to resolution of the issue. This will only be true if the monitoring is collecting a wide variety of information, from a wide variety of systems, and making this information visible in chart form, so that trends and changes can be spotted by human intelligence, and the issue correlated with these changes. A simple example: after a software release, the performance of an application is worse. A quick examination of charts can show if there are differences in request load. If this is the same as recent historical levels, the monitoring can show if the database is performing significantly more table scans after the release, perhaps because a needed index was not created. Charts will also show that the increase in sequential scans was attributable to the release, and not a gradual increase over time with load; and also show how much extra Disk IO is being put on the storage system as a result, and how this is affecting request latency. Without historical charts, resolution of such an issue would take much longer – translating to a significant expense.

Improved information efficiency

By providing accurate data as to where resource bottlenecks are, and by aggregating data from multiple systems, monitoring systems can provide actionable data about costs and performance that improve enterprise efficiency. A simple example is that in the face of performance issues and inadequate monitoring and analysis, it is not uncommon for organizations to purchase new capital infrastructure that does not address the root issue. (For example, upgrading front end CPU capacity when the issue is the storage system IO operations per second capacity.)

Another example where monitoring can optimize capital expenditures is to ensure equipment purchases meet current and future needs, but avoid overspending on overcapacity. (“Buying out of fear”, as one customer calls it – spending $80,000 on storage, in case the $50,000 storage is not performant – without knowing exactly what the requirements are.) It also allows purchases to be planned – trends can clearly show when circuit or equipment upgrades will be required, giving months of warning with commensurate negotiation power, rather than requiring immediate outlays to maintain service levels.
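
The planning benefit can be illustrated with a simple trend projection. The sketch below fits a least-squares line to monthly utilization samples and estimates how many months remain before a threshold is reached. It assumes roughly linear growth, which real traffic may not follow; the function name, the sample values, and the 80% threshold are all hypothetical.

```python
from typing import Optional, Sequence


def months_until_threshold(samples: Sequence[float],
                           threshold: float) -> Optional[float]:
    """Fit a least-squares line to monthly samples and return the number of
    months after the last sample at which the line reaches the threshold.
    Returns None if the trend is flat or falling."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope  # month index where the line hits the threshold
    return max(0.0, crossing - (n - 1))


# Hypothetical circuit utilization (%), one sample per month.
utilization = [50, 55, 60, 65]
print(months_until_threshold(utilization, threshold=80))
```

With these sample values the projection gives about three months of warning, which is time that can be spent negotiating rather than paying for an emergency upgrade.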

Monitoring systems collect a lot of information about a lot of systems, and this data can, if presented efficiently, allow new insights into the enterprise’s operations, that can realize better planning and expense control. Aggregating all the ISP bandwidth used per ISP, or per datacenter, can reveal opportunities for contract negotiation savings. Being able to track storage usage by business unit across all storage assets in an enterprise may not fall under the traditional rubric of monitoring, but given that monitoring systems collect the data underlying this information (storage capacity of every volume on every storage system), it is a reasonable item to extract from them. Being able to track real time and historical trends of a variety of performance and utilization metrics can provide unanticipated benefits to enterprises.

Translating business requirements to features.

Features required for Proactive Warning of Outages
Certainly one of the business goals is to proactively warn about, and hopefully prevent, impending outages. This is one of the easier business drivers to convert to a feature list, as it is driven largely by technical requirements. While any monitoring system should be able to alert on an outage of a system, and thus speed time to resolution, being able to proactively provide warnings of impending failures and performance issues requires different capabilities. It may require a monitoring system that can alert when a load balancer detects that a Virtual IP has less than the desired level of server redundancy; or when request latency is increasing on a storage array, or when database replication is lagging more than the desired time offset, or when the number of server threads on a Java application is approaching a limit. Being able to prevent outages requires a much more capable monitoring system – but the capabilities must match the infrastructure deployed.
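
The proactive conditions listed above can be thought of as rules evaluated against collected metrics. The following sketch shows the idea only; it is not how any particular product implements alerting. The metric names, the 30 second lag limit, the 90% thread threshold, and the 1.5x latency factor are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    is_breached: Callable[[Metrics], bool]


# Thresholds below are illustrative assumptions, not recommendations.
RULES: List[Rule] = [
    Rule("VIP redundancy below desired level",
         lambda m: m["vip_healthy_servers"] < m["vip_desired_servers"]),
    Rule("Database replication lag too high",
         lambda m: m["replication_lag_s"] > 30),
    Rule("Java server threads near limit",
         lambda m: m["threads_in_use"] >= 0.9 * m["thread_limit"]),
    Rule("Storage request latency rising",
         lambda m: m["latency_ms_now"] > 1.5 * m["latency_ms_baseline"]),
]


def warnings(metrics: Metrics) -> List[str]:
    """Return the names of all rules breached by the current metrics."""
    return [rule.name for rule in RULES if rule.is_breached(metrics)]


# Hypothetical snapshot: nothing is down yet, but two warnings fire.
snapshot = {
    "vip_healthy_servers": 2, "vip_desired_servers": 3,
    "replication_lag_s": 5,
    "threads_in_use": 185, "thread_limit": 200,
    "latency_ms_now": 4.0, "latency_ms_baseline": 3.5,
}
print(warnings(snapshot))
```

Each rule depends on a metric that the monitoring system must actually be able to collect from the deployed infrastructure, which is why this part of the feature list is technical.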

Converting other business requirements to features.
As noted above, the process for selecting a monitoring system should care less about features and more about evaluating how the system will impact business, positively or negatively. To align features with business value, an enterprise should detail the way their organization works (or how they want it to work), and translate that into capabilities that help meet their business goals. The important issue to remember is that except for specific technical goals as mentioned in the above section, the feature list should detail business goals and capabilities, not specific ways of achieving the goals.

For example, an organization may operate with the following operational constraints: they run east and west coast datacenters, with staff at both locations, and applications run at both. They have infrastructure from 3 business units at each location, and some infrastructure is shared. They employ virtualization technology, and have little staff time to devote to their monitoring. Their custom applications are a mix of Java and Windows .NET, and they also use Tomcat, IIS, MySQL and SQL Server. They want alerts to be routed to the appropriate teams, differentiating between roles even within the same host (e.g. Storage and DB groups may both be paged for different reasons for the same host), and escalated to people to ensure coverage. They want morning alerts handled by their east coast staff, and later switch to the west coast staff. There is frequent change in their datacenter in terms of reconfiguring or adding devices or applications, but not all the devices are production, warranting production alerting. They plan to grow some infrastructure into Amazon’s EC2 cloud in the future.
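
The routing requirements in this example can be stated precisely as a small function, which is a useful exercise when turning a description like the one above into something that can be tested against candidate systems. This is a sketch under assumptions: the handover hour, the on-call group naming, and the choice to suppress paging for non-production devices are hypothetical, and escalation is left out.

```python
from dataclasses import dataclass
from typing import Optional

HANDOVER_HOUR = 13  # assumed: east coast staff cover alerts before 13:00 Eastern


@dataclass
class Alert:
    host: str
    role: str          # owning team for this alert, e.g. "storage" or "db"
    production: bool
    hour_eastern: int  # hour of day (0-23) in US Eastern time


def route(alert: Alert) -> Optional[str]:
    """Return the on-call group to page, or None if no page is warranted."""
    if not alert.production:
        return None  # non-production devices are monitored but do not page
    coast = "east" if alert.hour_eastern < HANDOVER_HOUR else "west"
    return f"{alert.role}-oncall-{coast}"


# Two alerts on the same host go to different teams, by role and time of day.
print(route(Alert("db01", "storage", True, 9)))   # storage-oncall-east
print(route(Alert("db01", "db", True, 15)))       # db-oncall-west
print(route(Alert("lab07", "db", False, 9)))      # None
```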

Their business goals are to allow the growth of service revenue, which will require additional infrastructure to handle the load. They wish to target their capital expenditures for this growth correctly; avoid headcount growth; minimize downtime and its impact on revenue and get better information for cost allocation among business units.

Each feature should be prioritized in terms of how much value each feature brings to the enterprise. This value will vary by enterprise – an organization with a fairly static infrastructure may decide that relying on manual workflow is sufficient for ensuring changes to infrastructure are reflected in monitoring (although I would suggest that processes done rarely are also rarely done when needed!) One enterprise may initially desire role based access control, but on reflection find that it adds no business value. Another may determine it is essential, as it allows them to unify monitoring while meeting contractual requirements of confidentiality for their customers.

Having determined the list of features and their relative value to the enterprise, an organization can then narrow down a list of proposed solutions that meet the most important of these features, in order to accurately assess the value to the enterprise.

Evaluating Candidate Software

Each candidate solution should be evaluated for the prioritized list of features – as they relate to business value – weighted as appropriate for the typical actions of the enterprise.
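
One common way to carry out such a weighted evaluation is a simple scoring matrix. The sketch below is one possible approach, not a prescribed method: the weights and trial scores are invented, and the goal names are taken from the example organization described earlier.

```python
from typing import Dict


def weighted_score(weights: Dict[str, int], scores: Dict[str, int]) -> int:
    """Sum of weight * score over business goals; a missing score counts as zero."""
    return sum(w * scores.get(goal, 0) for goal, w in weights.items())


# Weights express value to the enterprise (hypothetical, 1-5 scale).
weights = {
    "minimize downtime": 5,
    "avoid headcount growth": 4,
    "target capital expenditures": 3,
    "cost allocation by business unit": 2,
}

# Scores from a trial deployment (hypothetical, 0-5 scale).
trial_scores = {
    "Candidate A": {
        "minimize downtime": 4,
        "avoid headcount growth": 2,
        "target capital expenditures": 3,
        "cost allocation by business unit": 1,
    },
    "Candidate B": {
        "minimize downtime": 3,
        "avoid headcount growth": 5,
        "target capital expenditures": 3,
        "cost allocation by business unit": 4,
    },
}

ranking = sorted(
    trial_scores,
    key=lambda name: weighted_score(weights, trial_scores[name]),
    reverse=True,
)
print(ranking)
```

Scoring against business goals rather than feature check boxes leaves room for a candidate to deliver the same value in an unexpected way.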

With a trial deployment, the realistic costs and benefits of a system can be assessed, always keeping a focus on business value comparison, not feature comparison. There will likely be multiple ways to deliver the same business value, that may not fall into the same “feature” check box.

A simple example is system security. The business goal is to prevent the disclosure of information that may be embarrassing to the enterprise or provide intelligence to competitors or vendors. Yet this goal may be translated to a feature checklist as “all data stored locally in corporate datacenter.” This is one way of achieving the goal (although it makes many assumptions about the deployment). But the goal may be better achieved through a SaaS model, even though it would not meet the checklist requirement. A SaaS system is likely to be delivered from audited, tested datacenters with 24 hour manned guards, biometrics, cameras, external penetration tests, and from a system designed explicitly with security in mind and encryption used at many levels (transmission and storage of data, etc). A premises-based system, even if operated behind the corporate firewall, is likely to be deficient in many of these areas – so while it would meet the checkbox, it would not deliver the business value as efficiently. This illustrates why it is important to detail the business drivers for each feature (“maintain security of data”) rather than just the feature as the end users expect it to be delivered (“all data stored locally in corporate datacenter”) – no one will be able to predict the ways in which all the business drivers can be delivered, so listing the driver makes the assessment far more likely to be based on the business driver, rather than the anticipated way of delivery.

Conclusion
We hope this post illustrates some of the issues involved in selecting a data center monitoring system. Selection of such a system will always require a good knowledge of the enterprise to be monitored, so that business value can be accurately aligned with the benefits of the systems. Selection lists should be driven by business values, except for specific technical requirements such as the ability to monitor a specific protocol. Some of the questions above should help bring out the expected benefits and costs of a monitoring system. After all the discussions and dialog have occurred, the selection of a monitoring system comes down to the simple statement made at the beginning of this post:

Forget about features. Pick the monitoring system that adds the most value to your business.
