Network High Availability
High availability means that a network and its applications are both operational and accessible at all times. As more businesses rely on networking to conduct day-to-day operations, the network becomes critical to the business. To put this in perspective, look at the cost of a network outage. The following numbers reflect the cost of one hour of downtime for various business functions:
- ATM fees: $14,000
- Package shipping: $28,000
- Teleticket sales: $69,000
- Airline sales: $89,500
- Catalog sales: $90,000
- Credit-card authorization: $2.6 million
- Brokerage operations: $6.24 million
Designing a network for high availability does the following:
- Prevents financial loss
- Prevents productivity loss
- Reduces reactive support costs
- Improves customer satisfaction and loyalty
What Affects Network Availability?
The following three types of errors are the most common causes of network failures:
- Operational errors account for 40 percent of network failures; they are usually the result of poor change-management processes or a lack of training and documentation.
- Failures within the network itself account for 30 percent of network failures; these include failures at single points of failure.
- Software failures account for 30 percent of network failures. They can be caused by software crashes, unsuccessful switchovers, or latent-code failures.
How Do You Measure Availability?
The two most common methods for measuring availability are “number of 9s” and defects per million (DPM). Number of 9s expresses availability as a percentage. For example, five 9s means that the network is available 99.999 percent of the time (and unavailable 0.001 percent of the time). Although this measurement is still common, it is really a holdover from the mainframe world, which measured only the availability of the mainframe hosts. Modern networks, however, are distributed and consist of hundreds or even thousands of devices. In this case, DPM is a more realistic measurement. DPM is the number of defects per million hours of operation.
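The arithmetic behind both measurements can be sketched in a few lines of Python. This is only an illustration: the function names are invented for the example, a 365-day year is assumed, and DPM is read here as unavailable hours per million hours of operation, which is one plausible interpretation of the definition above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, for example 5 -> 0.99999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in one 365-day year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def dpm(availability: float) -> float:
    """Unavailable hours per million hours of operation (assumed reading of DPM)."""
    return (1.0 - availability) * 1_000_000


for n in range(2, 6):
    a = availability_from_nines(n)
    print(f"{n} nines: {a * 100:.3f}% available, "
          f"{downtime_minutes_per_year(a):,.1f} min/year down, DPM {dpm(a):,.0f}")
```

Under these assumptions, five 9s leaves a budget of a little over five minutes of downtime per year.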
Best Practices
Hardware Availability
Hardware redundancy means redundant hardware, processors, line cards, and links. You should design the network so that critical hardware (for example, core switches) has no single points of failure. Hardware availability also allows you to hot-swap cards or other devices without interrupting the device’s operation (online insertion and removal).
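The effect of redundancy, and of a remaining single point of failure, can be estimated with the standard series and parallel availability formulas. The sketch below assumes that components fail independently, which real networks only approximate, and the availability figures are made up for illustration.

```python
from math import prod


def series_availability(parts: list[float]) -> float:
    """All parts must work, so the availabilities multiply."""
    return prod(parts)


def parallel_availability(parts: list[float]) -> float:
    """At least one part must work, so the unavailabilities multiply."""
    return 1.0 - prod(1.0 - a for a in parts)


single_switch = 0.999  # assumed figure, for illustration only
redundant_pair = parallel_availability([single_switch, single_switch])
# The redundant pair still depends on one non-redundant uplink (assumed 0.9995).
path = series_availability([redundant_pair, 0.9995])

print(f"single switch : {single_switch:.6f}")
print(f"redundant pair: {redundant_pair:.6f}")
print(f"pair + uplink : {path:.6f}")
```

Doubling the switch raises that stage from three 9s to six 9s, but the end-to-end figure can never exceed the availability of the single uplink.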
Reduction of Network Complexity
Although some redundancy is good (and necessary), overdoing it can cause more problems than it solves. Selecting a simple, logical, and repetitive network design over a complex one improves the ability to troubleshoot and grow the network. There is a trade-off between expense and risk. A good design maintains the proper balance between the two extremes.
Software Availability
Software availability refers both to reliability-based protocols, such as Spanning Tree Protocol (STP) and Hot Standby Router Protocol (HSRP), and to reliable code and nondisruptive upgrades.
STP, HSRP, and other protocols provide instructions to the network and to components of the network on how to behave in the event of a failure. Failure could be a power outage, a hardware failure, a disconnected cable, or any number of things. These protocols provide rules to reroute packets and reconfigure paths. Convergence is the process of applying these rules to the resolution of any such network errors. A converged network is one that, from a user standpoint, has recovered from a failure and can process instructions and requests.
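As a rough illustration of such a rule, the toy model below mimics the way an HSRP group chooses its active router: the highest priority wins, and a tie goes to the higher IP address. It is a deliberately simplified sketch, not an implementation of the protocol. It ignores hello and hold timers, preemption settings, and the virtual IP address, and the `Router` class and `elect_active` function are invented for the example.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional


@dataclass
class Router:
    name: str
    address: IPv4Address
    priority: int = 100  # HSRP's default priority is 100
    up: bool = True


def elect_active(group: list[Router]) -> Optional[Router]:
    """Pick the reachable router with the highest priority; ties go to the higher address."""
    candidates = [r for r in group if r.up]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.address))


group = [
    Router("r1", IPv4Address("10.0.0.2"), priority=110),
    Router("r2", IPv4Address("10.0.0.3")),
]

print("active before failure:", elect_active(group).name)
group[0].up = False  # simulate a power outage or a disconnected cable on r1
print("active after failure: ", elect_active(group).name)
```

In this model, convergence is simply the moment the election is rerun and traffic has a working next hop again. In a real network it also includes the time needed to detect the failure.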
You should thoroughly test software in a real (quarantined) or simulated environment before putting it on the network. Avoid “bleeding-edge” or inadequately tested code. You should also follow procedures for introducing new or updated code. Shutting down the network, loading new code, and hoping it all works is usually a bad idea. You should first introduce new code on segmented, noncritical parts of the network. Plan for the worst case when loading new code.
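One way to picture such a procedure is a staged rollout that starts with the least critical segment and stops at the first sign of trouble. The sketch below is a hypothetical outline only: the device names are invented, and the `upgrade`, `healthy`, and `rollback` callbacks stand in for whatever tooling a given network actually uses.

```python
from typing import Callable


def staged_rollout(
    stages: list[list[str]],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Upgrade one stage at a time; on a failed health check, roll everything back and stop."""
    done: list[str] = []
    for stage in stages:
        for device in stage:
            upgrade(device)
            done.append(device)
        if not all(healthy(d) for d in stage):
            # Worst case: undo every upgrade so far, most recent first.
            for device in reversed(done):
                rollback(device)
            return False
    return True


# Hypothetical device names, ordered from least to most critical.
stages = [["lab-sw1"], ["branch-sw1", "branch-sw2"], ["core-sw1"]]
log: list[str] = []

ok = staged_rollout(
    stages,
    upgrade=lambda d: log.append(f"upgrade {d}"),
    healthy=lambda d: d != "branch-sw2",  # pretend this device fails its checks
    rollback=lambda d: log.append(f"rollback {d}"),
)
print("rollout succeeded:", ok)
print("\n".join(log))
```

Because the simulated failure occurs in the branch stage, the core switch is never touched, which is the point of introducing new code on noncritical segments first.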
Link and Carrier Availability
Another component in building highly available networks is understanding your service provider’s plans and policies for network availability. For business-critical applications, it might be worthwhile to purchase a secondary service from an additional service provider. You can sometimes use this second link for load sharing.
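Load sharing across two provider links, with automatic fallback to the surviving link, might look like the following sketch. It is an assumption-laden illustration: the provider names and flow keys are invented, and a CRC32 hash stands in for whatever per-flow hashing a real router performs.

```python
import zlib


def pick_link(flow: str, links: dict[str, bool]) -> str:
    """Map a flow to one of the links that are up, using a stable hash of the flow key."""
    available = sorted(name for name, up in links.items() if up)
    if not available:
        raise RuntimeError("no provider link is up")
    return available[zlib.crc32(flow.encode()) % len(available)]


links = {"isp-a": True, "isp-b": True}  # hypothetical provider names
flows = [f"10.1.1.{host}->198.51.100.7:443" for host in range(1, 9)]

for flow in flows:
    print(flow, "via", pick_link(flow, links))

links["isp-a"] = False  # simulate an outage at the first provider
for flow in flows:
    print(flow, "via", pick_link(flow, links))
```

Hashing on the flow keeps all packets of one conversation on the same link while both providers are up. When one provider fails, every flow lands on the remaining link.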
Clean Implementation and Cable Management
This best practice might seem like a waste of time when first implementing a network, but disorganized cabling and poor implementation can increase the probability of network disasters and hinder their timely resolution.
You can save time, grief, and money by taking some simple steps, such as labeling cables, tying cables down, using simple network designs, and keeping up-to-date network diagrams.