Network Disaster Recovery
Keeping the Network Running
System outages can be devastating to a business. Regardless of the cause, any outage can cost a company hundreds of thousands or even millions of dollars per hour of system downtime.
Network disaster recovery is the planning and implementation of systems and practices that ensure the core business functions continue to operate when disasters do occur.
Many people prefer to use the term “business continuance” rather than disaster recovery, because the former term implies that you can actually avoid disaster (business stoppage) with the proper planning and implementation.
What Are Typical Causes of Disasters?
Disasters come in all shapes and sizes. For simplicity, we organize the “typical” causes of business disruptions into a few categories:
- Natural disasters
  - Earthquakes
  - Flood
  - Hurricane or typhoon
  - Blizzard
- Unintentional man-made disasters
  - Backhoes
  - Fire
  - Illness (loss of staff)
  - Power outages
- Intentional man-made disasters
  - Acts of war
  - Hacking
  - Work stoppages
What Are the Problems to Solve?
A disaster-recovery plan has four phases: assessment, planning, testing, and implementation/recovery. You must put a plan in place for each risk assessed. Although disruptions can come in many forms, we concentrate on network services and critical applications and data.
Before Disaster Strikes
The first step in a business-continuance plan is to assess the business criticality and downtime impact of each business application. The risk assessment should consider how a temporary or extended loss of each application and function impacts the business, regarding the following:
- Financial losses (lost revenue)
- Operation disruption
- Customer satisfaction and retention
- Lost productivity
- Brand dilution
- Legal liability
- Stock price
- Credit rating
For each critical system, application, or function, you must implement a backup and recovery plan.
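As a rough illustration of the assessment step, the sketch below ranks applications by estimated financial exposure. It covers only the lost-revenue factor from the list above. The application names, hourly loss figures, and outage durations are hypothetical placeholders, not data from any real assessment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Application:
    name: str
    hourly_loss_usd: float        # assumed loss per hour of downtime (hypothetical)
    expected_outage_hours: float  # assumed outage length with no recovery plan

    @property
    def exposure_usd(self) -> float:
        # Financial exposure = hourly loss x expected outage duration.
        return self.hourly_loss_usd * self.expected_outage_hours


def rank_by_exposure(apps: List[Application]) -> List[Application]:
    """Return applications ordered from highest to lowest exposure."""
    return sorted(apps, key=lambda a: a.exposure_usd, reverse=True)


# Hypothetical inventory used only to exercise the ranking.
apps = [
    Application("payroll", 5_000, 24),
    Application("order-entry", 250_000, 4),
    Application("email", 20_000, 8),
]

ranked = rank_by_exposure(apps)
for app in ranked:
    print(f"{app.name}: ${app.exposure_usd:,.0f}")
```

A real assessment would also weigh the nonfinancial factors, such as customer retention and legal liability, which do not reduce to a single dollar figure as easily.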
Planning for Disasters
After you identify and assess the critical systems, data, and applications, you must develop a plan. A business-continuance plan has two primary components: designing the network for high availability and backing up critical systems in geographically diverse buildings.
Networks designed for high availability are resilient to disruptions such as faulty hardware, disconnected or broken cables (“backhoe failures”), and power outages.
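The effect of redundancy on availability can be estimated with a simple calculation. The sketch below assumes that redundant components fail independently, which is optimistic: two cables in the same conduit can be cut by the same backhoe. The 99 percent figure is an illustrative assumption.

```python
HOURS_PER_YEAR = 8760


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when service
    survives as long as at least one of them is up.

    Assumes failures are independent (an idealization).
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # Service fails only if every copy fails at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = parallel_availability(0.99, 1)
dual = parallel_availability(0.99, 2)
print(f"single link: {downtime_hours_per_year(single):.2f} h/year")
print(f"dual links:  {downtime_hours_per_year(dual):.2f} h/year")
```

Under these assumptions, adding a second independent path cuts expected downtime by a factor of 100, which is why high-availability designs duplicate links, devices, and power feeds.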
More severe disasters (such as a building fire or earthquake), however, can wipe out entire data centers and application-server farms. The only way to recover gracefully from such an event is to have a completely backed-up secondary data center, as shown in this figure.
Backing Up Systems
You can back up data centers and application farms in many ways. Some companies back up systems each night after the close of business hours. With this approach, the worst-case data loss is a single day.
Another backup scheme is synchronous data mirroring, which writes data to both sites in real time so that companies lose virtually no data in the event of a disaster. An added benefit is that both systems can be online at the same time, providing load and application sharing, which can increase overall productivity. The main challenge is that application performance can decrease significantly, because each write must be confirmed by the remote site before it completes. To achieve synchronous remote mirroring without affecting application performance, you need a high-speed, low-latency connection, such as dense wavelength division multiplexing (DWDM) over optical fiber.
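The trade-off between the two schemes can be put into rough numbers. The sketch below assumes light travels through optical fiber at about 200,000 km/s (roughly 5 microseconds per kilometer one way) and that a synchronous write waits for at least one round trip to the remote site. It ignores equipment and protocol overhead, and the distance and local write time are illustrative assumptions.

```python
# Approximate one-way propagation delay in optical fiber (~200,000 km/s).
FIBER_DELAY_US_PER_KM = 5.0


def worst_case_loss_hours(backup_interval_hours: float) -> float:
    """With periodic backups, everything since the last backup can be lost."""
    return backup_interval_hours


def mirror_round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay to the mirror site, in ms."""
    return 2 * distance_km * FIBER_DELAY_US_PER_KM / 1000.0


def max_serial_writes_per_second(distance_km: float, local_write_ms: float) -> float:
    """Upper bound on back-to-back synchronous writes from one writer,
    assuming each write waits for the local write plus one round trip."""
    return 1000.0 / (local_write_ms + mirror_round_trip_ms(distance_km))


print(f"nightly backup, worst-case loss: {worst_case_loss_hours(24):.0f} h")
print(f"round trip at 100 km: {mirror_round_trip_ms(100):.1f} ms")
print(f"local only:  {max_serial_writes_per_second(0, 0.5):.0f} writes/s")
print(f"mirrored:    {max_serial_writes_per_second(100, 0.5):.0f} writes/s")
```

Even with an ideal link, distance alone adds delay to every write, which is why the connection between mirrored sites needs to be both fast and short enough to keep latency low.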
Practicing for Disaster
One of the best ways to ensure a smooth recovery from a disaster is to provide staff with real-world training simulations. Allowing IT staff to practice different disaster scenarios greatly improves their ability to cope with actual disasters.
After a Disaster Occurs
Practice and planning are put to the test if a disaster strikes. To avoid confusion or worse (such as causing more damage), develop a checklist as part of the planning effort and follow it when the time comes. The checklist varies from business to business and situation to situation, but most should closely resemble this example:
- Make sure that your people are safe. Are all personnel accounted for? Consider sending noncritical personnel home to avoid confusion.
- Make sure the backup systems are online.
- Assess the likelihood of additional or secondary disasters. An earthquake, for example, could spark fires or burst gas or water mains.
- Monitor the network to ensure business continuation.
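One way to keep such a checklist in order is to encode it as an ordered list of steps. The sketch below is a hypothetical illustration: the step descriptions paraphrase the example above, and the checks are hard-coded placeholders where a real plan would rely on human confirmation or monitoring queries. Stopping at the first unconfirmed step is a design choice made here so that a problem is escalated before later steps proceed.

```python
from typing import Callable, List, Tuple

# A step pairs a description with a check that returns True when confirmed.
Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run the steps in order and stop at the first unconfirmed one."""
    results = []
    for description, check in steps:
        confirmed = check()
        results.append((description, confirmed))
        if not confirmed:
            break  # escalate this step before continuing
    return results


# Placeholder checks; the return values are hard-coded for illustration.
steps: List[Step] = [
    ("All personnel accounted for", lambda: True),
    ("Backup systems online", lambda: False),
    ("Secondary disaster risk assessed", lambda: True),
    ("Network monitored for business continuation", lambda: True),
]

results = run_checklist(steps)
for description, confirmed in results:
    print(f"[{'x' if confirmed else ' '}] {description}")
```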
Restoring Primary Systems
Depending on the severity of damage and the duration of downtime for primary systems, restoring these systems might also disrupt business.
The plan must therefore account for both backing up the data stored on the backup systems and restoring the primary systems.
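Part of restoring the primary systems is copying back whatever changed on the backup systems while they were in service. The sketch below is a simplified illustration that assumes each record carries a version counter. The record layout and names are hypothetical, and production systems would normally rely on the replication features of their storage or database platform.

```python
from typing import Dict, Tuple

# Each record maps a key to (version, value); the layout is hypothetical.
Records = Dict[str, Tuple[int, str]]


def records_to_restore(primary: Records, backup: Records) -> Records:
    """Return backup records that are missing from, or newer than,
    the copy held by the primary system."""
    changed: Records = {}
    for key, (version, value) in backup.items():
        # Treat a record absent from the primary as older than any version.
        primary_version = primary[key][0] if key in primary else -1
        if version > primary_version:
            changed[key] = (version, value)
    return changed


# State of the primary at the moment it failed.
primary = {"order-1": (1, "open"), "order-2": (1, "open")}
# State of the backup after carrying the business during the outage.
backup = {
    "order-1": (2, "shipped"),
    "order-2": (1, "open"),
    "order-3": (1, "open"),
}

to_copy = records_to_restore(primary, backup)
print(to_copy)
```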