The ideal is a monitoring system that is comprehensive (alerting on every condition that needs attention) and noise free (not alerting on any condition that does not). A noisy alert system is almost as bad as no monitoring at all: it trains people to ignore their alerts.
These goals are diametrically opposed, and further complicated by the fact that different users of the monitoring system have different criteria for what needs attention and what is noise. However, with appropriate alert escalations and routing, and some good processes in place, you can get close to this ideal.
What processes can help?
Use Scheduled Down Time
This is probably the most important process to enforce. If someone is going to be working on a system, schedule downtime! Prevent the alerts from going out in the first place. If you have regular maintenance windows for sets of hosts, set up the scheduled downtime to recur automatically. If there are processes that will trigger alerts periodically (such as CPU alerts triggered by disk scrubs on NetApp filers), schedule recurring downtime for just that alert.
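If your monitoring tool exposes an API, you can even create these recurring windows from your change-management tooling. The sketch below is a minimal example in Python; the endpoint, field names, and token are hypothetical placeholders, not any specific product's API.

```python
# Sketch: create a weekly recurring scheduled-downtime window for a host group
# via a monitoring system's REST API. The base URL, endpoint, field names, and
# authentication below are hypothetical placeholders; adapt them to your tool.
import requests

MONITORING_API = "https://monitoring.example.com/api/v1"   # hypothetical base URL
API_TOKEN = "REPLACE_ME"                                   # hypothetical token

def schedule_weekly_downtime(group_name, weekday, start_hour, duration_minutes, comment):
    """Suppress alerts for `group_name` every week during the maintenance window."""
    payload = {
        "type": "recurring",
        "targetGroup": group_name,
        "weekday": weekday,               # e.g. "Sunday"
        "startHour": start_hour,          # e.g. 2 for 02:00 local time
        "durationMinutes": duration_minutes,
        "comment": comment,
    }
    resp = requests.post(
        f"{MONITORING_API}/scheduled-downtime",
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Example: patch window for the database servers, Sundays 02:00-04:00.
    schedule_weekly_downtime("db-servers", "Sunday", 2, 120, "OS patching window")
```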
Get Rid of Unneeded Alerts
For every alert received during the initial deployment of the monitoring, assess whether the alert was needed. If so, acknowledge the alert and go fix the problem. If not, decide how broadly the alert can be removed or tuned. If the alert is about buffer discards on a switch, but only on the port where a 1Gbps network links to a 100Mbps uplink, discards there are normal and expected, so the threshold should be adjusted just for that one port. If the alert is about swap space used on a QA system, and QA regularly runs stress tests that would trigger it, disable the alert or adjust the threshold for the whole QA group. (For resources shared across systems, such as storage array volumes, add a filter to disable the QA alerts, or set different thresholds.)
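Conceptually, this tuning is just layering overrides, with the most specific one winning. Here is a rough Python sketch of that idea; the metric names, groups, and port are illustrative, and real monitoring tools implement this natively rather than in script.

```python
# Sketch: layered threshold overrides where the most specific match wins.
# All names and values are illustrative only.

GLOBAL_THRESHOLDS = {"buffer_discards_per_sec": 1, "swap_used_pct": 50}

# Overrides keyed by (group, host, instance); None disables the alert entirely.
OVERRIDES = {
    # Discards are expected on the 1Gbps -> 100Mbps uplink port, so raise only that port.
    ("network", "core-sw-1", "Gi0/24"): {"buffer_discards_per_sec": 500},
    # QA stress tests routinely push swap usage, so disable the alert for the whole group.
    ("qa", None, None): {"swap_used_pct": None},
}

def effective_threshold(metric, group, host=None, instance=None):
    """Return the threshold to apply, or None if alerting is disabled."""
    for key in [(group, host, instance), (group, host, None), (group, None, None)]:
        if key in OVERRIDES and metric in OVERRIDES[key]:
            return OVERRIDES[key][metric]
    return GLOBAL_THRESHOLDS.get(metric)

# The one noisy uplink port gets a higher threshold; everything else keeps the default.
assert effective_threshold("buffer_discards_per_sec", "network", "core-sw-1", "Gi0/24") == 500
assert effective_threshold("swap_used_pct", "qa", "qa-web-3") is None
assert effective_threshold("swap_used_pct", "prod", "web-1") == 50
```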
Send alerts to the right place
For every valid alert received, make sure it's going to all the right people, only the right people, by the right methods, and escalating at appropriate intervals. (An example of inappropriate escalation would be sending warning alerts on production systems via email instead of pager, but escalating them every 5 minutes. If three warnings occur at 1.00 am and no one checks email until 8.00 am, everyone will have around 250 email alerts cluttering their inbox.) The set of people who want to know about network retransmissions is probably different from the set who want to know about a commerce site being completely down.
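One way to make routing and escalation decisions explicit and reviewable is to keep them as data. Below is a small illustrative sketch with made-up recipients and intervals, assuming routing is keyed by host group and severity; most monitoring tools express this as escalation chains in their UI or API rather than in code.

```python
# Sketch: routing and escalation rules kept as data, keyed by (group, severity).
# Recipients, methods, and intervals are illustrative only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Route:
    recipients: List[str]              # who should be notified
    method: str                        # "email", "sms", "voice", ...
    escalate_after_min: Optional[int]  # None = notify once, do not keep re-sending

ROUTING = {
    # Production warnings go to email only, and are NOT re-sent every few minutes;
    # that is how inboxes end up with hundreds of duplicates overnight.
    ("prod", "warning"):    Route(["ops-team@example.com"], "email", None),
    # Critical production alerts page the on-call and escalate if unacknowledged.
    ("prod", "critical"):   Route(["oncall-pager"], "sms", 15),
    # Network-detail alerts (e.g. retransmissions) go only to the network team.
    ("network", "warning"): Route(["netops@example.com"], "email", None),
}

def route_for(group: str, severity: str) -> Route:
    """Return the routing rule for an alert, falling back to a safe default."""
    return ROUTING.get((group, severity), Route(["ops-team@example.com"], "email", None))
```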
Review
We recommend a weekly review, especially during the initial roll-out of a monitoring system, to verify the points above (a small script to summarize the week's alerts, sketched after this list, can help frame the discussion):
- every alert was valid. If not, reach consensus on how, and at what level, the alert should be tuned. If alerts were the result of scheduled staff actions, but no one told the monitoring system about the scheduled downtime, liberal use of the clue stick (or stronger action) is recommended.
- every alert was delivered to all, and only, the correct recipients.
- the escalations for each alert were valid and appropriate.
- no incident that was not alerted on is closed until the alerts needed to detect it have been created.
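To keep the review focused, a short script that summarizes the previous week's alerts by type and flags how many were judged non-actionable can drive the discussion. This is a sketch only, assuming the monitoring tool can export an alert log as CSV with hypothetical column names:

```python
# Sketch: summarize last week's alerts for the review meeting. Assumes an
# exported CSV with the (hypothetical) columns:
# timestamp, host, alert, severity, acknowledged_by, was_actionable
import csv
from collections import Counter

def weekly_summary(csv_path):
    total = Counter()   # alert count per (alert, severity)
    noise = Counter()   # alerts the responder marked as not actionable
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["alert"], row["severity"])
            total[key] += 1
            if row["was_actionable"].lower() != "yes":
                noise[key] += 1
    print(f"{'Alert':40} {'Severity':10} {'Count':>6} {'Noise':>6}")
    for (alert, severity), count in total.most_common():
        print(f"{alert:40} {severity:10} {count:6} {noise[(alert, severity)]:6}")

if __name__ == "__main__":
    weekly_summary("alerts_last_week.csv")
```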
Putting these processes in place will likely determine whether your monitoring deployment succeeds or fails, and it is another area where a SaaS monitoring service is likely to be superior to an on-premises system. A SaaS provider is invested in your continued use of, and satisfaction with, the monitoring, since it doesn't get all the money up front. At LogicMonitor, at least, we help our customers with the whole implementation, including processes.