– Richard Dolewski, Chief Technology Officer and Vice President of Business Continuity Services for WTS (www.wts.com), says:
Disaster recovery scenario: The servers are all down. The computer room is dark. A major disaster has occurred and you need to determine your next steps. What are your priorities? What task do you do first? In which order do you start your server recovery? Everything is a business priority, according to the business experts. Quick, lock the doors because a stampede of self-proclaimed experts is about to come charging into the computer room and start barking out orders.
Are you going to listen to the person with the loudest bark and get his server back up and running first? If not, what IS your top priority? The computer systems may or may not be recoverable in the short term. Maybe they are not available for the long term either. You take a deep breath and tell yourself this is what we have been documenting and practicing for all these years. But does your current disaster recovery plan include prioritization of server recovery in a disaster?
Managing Mission Critical Servers for Business Continuity
A lot of work goes into managing the ongoing requirements for mission-critical servers. When you have downtime, for whatever reason, data is unavailable to your customers, and this usually means that business, yours and your customers’, simply stops. When business stops, it gets very expensive in a hurry. This is why critical server requirements should be reviewed twice a year: to ensure that effective server processes are being carried out to support the true needs of the business, and that the identified servers are still in alignment with business goals and priorities. Listed below are the elements that should be reviewed on a regular basis to support the critical server definition requirements.
• Business impact analysis and risk assessment
• Strategy for server recovery
• Change in prioritization based on different business cycles
• Application dependencies and interdependencies
• Application downtime considerations for planned and unplanned outages
• Backup procedures
• Offsite storage for vital records
• Data retention policies
• Recovery time objectives (RTO)
• Recovery point objectives (RPO)
• Hardware for critical server recovery
• Alternate recovery site selection
• IT and business management signoff
Classifying Systems for Disaster Recovery Priority
When you walk into the computer room it’s easy to be overwhelmed by rows and rows of servers. Numerous hardware platforms are powered on and ready to serve some business purpose. Typically you’ll find that the servers span several hardware generations. What’s required is a planned roadmap and prioritized recovery of your complete critical server infrastructure. You need to understand the supporting business needs of all servers before any disaster occurs. Don’t wait for that phone call at 4 a.m. to decide your server recovery strategy. Not all the servers that reside in your computer room are equally important to your business. That is why you need to consider the difference between:
- what you need
- what you want to have
- what you don’t need at all to run your business in a disaster.
The backup recovery team should assign priorities to the servers as they relate to your business support priorities. There will be a mixed bag of opinions, of course, but a good Business Impact Analysis will reveal which of those opinions carry the most weight. You should categorize the business requirements and supporting servers as Critical, Essential, Necessary, or Optional, as follows:
- Critical Systems – These servers absolutely must be in place for any business process to continue at all. These systems have a significant financial impact on the viability of your organization. Extended loss of these servers will cause a long-term disruption to the business and may have legal and financial ramifications. These should be on the A-List of your disaster recovery strategy.
- Essential Systems – These servers must be in place to support day-to-day operations and are typically integrated with Critical Systems. These systems play an important role in delivering your business solution. These should also be on the A-List recovery strategy.
- Necessary Systems – These servers contribute to improved business operations and provide improved productivity for employees. However, they are not mandatory at a time of disaster. These might include business forecasting tools, reporting, or improvement tools utilized by the business. In other words, their loss has minimal business or financial impact. The targeted systems can be easily restored as part of the B-List recovery strategy.
- Optional Systems – These servers may or may not enhance the productivity of your organization. Optional systems may include test systems, archived or historical data, company Intranet and non-essential complementary products. These servers can be excluded from your recovery strategy.
These server classifications will provide you with the baseline for your decision-making matrix. The key is that your IT recovery team and your business management team must agree on the disaster recovery planning scope for the server classifications. Differentiating between critical, essential, necessary and optional reduces the number of servers the disaster recovery plan must support, which not only increases backup and recovery efficiency but also helps reduce your disaster recovery budget.
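The four classifications above map naturally onto a small lookup that turns a server inventory into A-List, B-List, and excluded groups. The following Python is only a minimal sketch of that idea: the server names, the `Tier` and `Server` types, and the `build_recovery_plan` function are hypothetical and not part of any product. It also assumes that Critical servers are restored ahead of Essential ones within the A-List, which the classification suggests but does not state outright.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """The four classifications described above, in recovery order."""
    CRITICAL = 1
    ESSENTIAL = 2
    NECESSARY = 3
    OPTIONAL = 4


@dataclass(frozen=True)
class Server:
    name: str   # hypothetical server name
    tier: Tier  # classification agreed by IT and business management


def build_recovery_plan(servers):
    """Group servers into the A-List, the B-List, and the excluded set."""
    plan = {"A-List": [], "B-List": [], "Excluded": []}
    # Sort by tier first (assumption: Critical before Essential), then by name.
    for server in sorted(servers, key=lambda s: (s.tier.value, s.name)):
        if server.tier in (Tier.CRITICAL, Tier.ESSENTIAL):
            plan["A-List"].append(server.name)
        elif server.tier is Tier.NECESSARY:
            plan["B-List"].append(server.name)
        else:
            plan["Excluded"].append(server.name)
    return plan


# Hypothetical inventory, for illustration only.
inventory = [
    Server("reporting", Tier.NECESSARY),
    Server("order-entry", Tier.CRITICAL),
    Server("test-lab", Tier.OPTIONAL),
    Server("email", Tier.ESSENTIAL),
]
plan = build_recovery_plan(inventory)
```

Keeping the classification in one agreed table like this means the recovery order is decided before the disaster, not negotiated during it.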
The Big Picture
When compiling the list of mission-critical applications, you must also consider application interdependencies. First, many software solutions are considered modular in design, yet the software must be 100 percent intact (in other words, fully restored) to function correctly. You cannot separate the applications from the server’s supporting infrastructure. You may choose not to use specific business functions, but the entire solution must be rebuilt 100 percent to function normally.
Second, consider the flow of information. Follow the flow of a transaction from order inception to product delivery. You may find that a server not considered critical by the Business Impact Analysis does indeed have a significant role in feeding information to another identified mission-critical application. Therefore, IT input is needed in addition to the defined business needs. Most servers are generally restored in their entirety, which includes every user library saved on the system. The question is, are you restoring too much? Omitting non-critical libraries can save hours, which translates to the business coming online more quickly in a disaster. The libraries and user directories that could be omitted include:
• Performance data
• Audit journals
• Test libraries
• ERP walk-through libraries
• Online education
• Developer libraries
• User test environments
• Data archives
• EDI successful transmission objects
• Trial software
• Temporary product work directories
• Auxiliary Storage Pools (ASPs)
• Independent Auxiliary Storage Pools (IASPs)
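Both points in this section can be checked mechanically before a disaster. The Python below is a rough sketch under stated assumptions, not a description of any backup product: `promote_feeders` walks a hypothetical map of which server feeds data to which, and pulls in every server that directly or indirectly supplies a mission-critical application; `libraries_to_restore` drops the libraries the recovery team has agreed to omit. All server and library names are invented for illustration.

```python
from collections import deque


def promote_feeders(critical, feeds):
    """Return the critical servers plus every server that feeds them.

    `feeds` maps a server to the servers it sends data to (hypothetical
    input that would come from tracing a transaction end to end).
    """
    # Invert the map: for each server, which servers supply its data?
    suppliers = {}
    for source, targets in feeds.items():
        for target in targets:
            suppliers.setdefault(target, set()).add(source)

    needed = set(critical)
    queue = deque(critical)
    while queue:
        server = queue.popleft()
        for supplier in suppliers.get(server, ()):
            if supplier not in needed:
                needed.add(supplier)
                queue.append(supplier)
    return needed


def libraries_to_restore(all_libraries, omit):
    """Keep the saved order, dropping libraries agreed to be non-critical."""
    omitted = set(omit)
    return [lib for lib in all_libraries if lib not in omitted]


# Hypothetical example data.
feeds = {
    "edi-gateway": ["order-entry"],
    "order-entry": ["erp"],
    "reporting": [],
}
needed = promote_feeders({"erp"}, feeds)
restore = libraries_to_restore(
    ["ERPDATA", "TESTLIB", "DEVLIB", "ORDERS"],
    omit=["TESTLIB", "DEVLIB"],
)
```

In this example the EDI gateway was never named as critical, yet it ends up in `needed` because the ERP system depends on it through order entry. That is exactly the kind of server the Business Impact Analysis alone can miss.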
Required Hardware for Your Disaster Recovery Plan
In the development of every disaster recovery plan, you must determine the minimum hardware requirements for your mission critical servers. Some IT professionals will say: “Obviously, you want your mission-critical servers to run the exact same equipment. However, in an emergency, any equipment is better than none. After all, it’s a disaster, not production.”
This statement should not be accepted at face value. The reality is, only mission-critical applications absolutely need to be restored in a disaster, not everything. However, you will need to ask whether your business will accept running the “mission-critical” business functions at, say, 50 percent less capacity or throughput. In most cases, the answer will be no: totally unacceptable.
In the Business Impact Analysis you identified the financial impact on your organization of being down for an extended period of time. Running your business at half speed will only further cripple your long-term business capabilities and will not ensure customer satisfaction. Reduce the disaster recovery footprint by eliminating non-essential applications rather than by providing less processing capability. Invest your disaster recovery budget wisely by supporting your business requirements in a disaster, and that means getting the right hardware. The last thing you want is your sales order desk telling customers, “Please be patient; we can only process half the orders right now because we had a disaster and we are still working things out.”
The Human Element
What if you declared a disaster and your staff did not show? Your servers can’t recover themselves. Many companies have plans that address their equipment requirements and recovery processes but underestimate the number of staff required to successfully execute their plan. Equipment only works if somebody is able to operate it. In Gulf Coast hurricanes, key personnel have been displaced or unavailable due to health risks or personal priorities.
When regional disasters hit, transportation within the area can be difficult and may result in your staff being unable to reach their assigned locations. Equipment may be accessible, but it will be ineffective if your staff cannot access the recovery site. What is the level of expertise your employees possess when they finally do reach the recovery site?
Too many companies, especially those that perform recovery tests with no more than their data center staff, count on IT heroics to pull them out of a crisis. Expecting IT to perform a miracle in an outage is hard on your staff, and it is avoidable now that full recovery tests can be performed without impacting your production users. When your disaster recovery plan includes cross-departmental staffing, it is important to have detailed and precise documentation. Companies should create recovery documentation so that anyone in the business, from the shipping manager to the CFO, can start a recovery. In a well-tested plan, an employee from another department should be able to start the recovery in the event employees from your IT staff are not available. You can never know whether all your key personnel will be able to assist with the recovery. After identifying your critical equipment, it is a good idea to test your disaster recovery plan with a subgroup of assigned individuals while leaving the remainder of the team to run normal business operations. The success or failure of that test will be a good indicator of your corporate readiness.
When the servers are down, your disaster recovery plan will determine the precise server recovery strategy and recovery priorities. So, lock the doors to keep the stampeding herd of users away. Fire up the iPod, plug in your earphones, and start recovering the business as stated in the plan. Step through the tasks and follow the precise order of server recovery by predetermined importance criteria rather than listening to whoever screams the loudest. And tune out the noise while listening to your favorite disaster recovery iPod tunes!