If everyone is thinking the same, someone isn't thinking

Lori MacVittie


Related Topics: DevOps Journal

DevOpsJournal: Blog Feed Post

Data Center Feng Shui: Fault Tolerance and Fault Isolation

Like most architectural decisions, the two goals do not require mutually exclusive choices.

The difference between fault isolation and fault tolerance is not necessarily intuitive. The differences, though subtle, are profound and have a substantial impact on data center architecture.

Fault tolerance is an attribute of systems and architectures that allows them to continue performing their tasks in the event of a component failure. Fault tolerance of servers, for example, is achieved through the use of redundancy in power supplies, hard drives, and network cards. In an architecture, fault tolerance is also achieved through redundancy by deploying two of everything: two servers, two load balancers, two switches, two firewalls, two Internet connections. The fault-tolerant architecture includes no single point of failure; no component that can fail and cause a disruption in service. Load balancing, for example, is a fault tolerance-based strategy that leverages multiple application instances to ensure that failure of one instance does not impact the availability of the application.
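The load balancing example can be made concrete with a minimal sketch. It assumes a simple round-robin policy and a `healthy` flag that some monitor (not shown) would set; the names `Instance` and `RoundRobinBalancer` are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical application instance; `healthy` would be set by a monitor."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Illustrative round-robin balancer that skips failed instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0

    def pick(self):
        # Try each instance at most once, starting from the rotation point.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if candidate.healthy:
                return candidate
        # Only when every redundant instance has failed is service disrupted.
        raise RuntimeError("no healthy instances: service unavailable")


pool = [Instance("app-1"), Instance("app-2")]
lb = RoundRobinBalancer(pool)

pool[0].healthy = False  # simulate failure of one instance
served_by = [lb.pick().name for _ in range(3)]
```

With one of the two instances marked as failed, every request in `served_by` lands on the surviving instance, so the application as a whole stays available.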

Fault isolation, on the other hand, is an attribute of systems and architectures that isolates the impact of a failure such that only a single system, application, or component is impacted. Fault isolation allows that a component may fail as long as it does not impact the overall system. That sounds like a paradox, but it’s not. Many intermediary devices employ a “fail open” strategy as a method of fault isolation. When a network device is required to intercept data in order to perform its task – a common web application firewall configuration – it becomes a single point of failure in the data path. To mitigate the potential failure of the device, if something should fail and cause the system to crash, it “fails open” and acts like a simple network bridge, forwarding packets on to the next device in the chain without performing any processing. If the same component were deployed in a fault-tolerant architecture, two devices would be deployed, ideally leveraging non-network-based failover mechanisms.
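The “fail open” behavior can be sketched in a few lines. This is an illustration only: `InlineDevice` and the toy `inspect` rule are hypothetical, and real devices typically implement the bypass in hardware or in the network driver rather than in application code.

```python
class InlineDevice:
    """Hypothetical inline device (e.g. a web application firewall)."""

    def __init__(self, inspect):
        self.inspect = inspect  # callable: packet -> packet, or None to drop
        self.failed = False

    def handle(self, packet):
        if self.failed:
            # Fail open: act as a plain bridge, no processing at all.
            return packet
        try:
            return self.inspect(packet)
        except Exception:
            # The inspection engine crashed; isolate the fault by bypassing it.
            self.failed = True
            return packet


def inspect(packet: bytes):
    """Toy inspection rule, an assumption made for this example."""
    if packet == b"":
        raise ValueError("simulated inspection engine crash")
    return None if b"DROP TABLE" in packet else packet


device = InlineDevice(inspect)
suspicious = b"id=1; DROP TABLE users"

before = device.handle(suspicious)  # inspected and dropped
device.handle(b"")                  # triggers the simulated crash
after = device.handle(suspicious)   # forwarded untouched after failing open
```

The sketch also shows the cost of the strategy: traffic keeps flowing after the failure, but nothing is inspected any longer.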

Similarly, application infrastructure components are often isolated through a contained deployment model (like sandboxes) that prevents a failure – whether an outright crash or sudden massive consumption of resources – from impacting other applications. Fault isolation is of increasing interest as it relates to cloud computing environments as part of a strategy to minimize the perceived negative impact of shared network, application delivery network, and server infrastructure.


It may sound at first as though designing for fault tolerance is not very different from designing for fault isolation. On the surface this is true. But the importance assigned to fault tolerance is generally higher, and in fault-tolerant architectures the “secondary” or “fallback” component typically remains in “standby” until it is needed. Sometimes IT management decides that, since the secondary components have not been needed for as long as anyone can remember, they should be engaged and leveraged as additional resources. After all, idle resources are the devil’s playground and a source of inefficiency that cannot be tolerated in today’s increasingly Maxwell House “to the last drop” paradigm. The problem with this approach is that the MTBF (Mean Time Between Failures) for a component is based on its use, and the more it is used the closer it comes to experiencing a failure. Thus leveraging what appear to be “idle” resources actually increases the possibility that, in the event of a primary component failure, the secondary will fail too. In a truly fault-tolerant architecture or system this is unacceptable. A truly fault-tolerant architecture will not allow secondary components to be utilized on a day-to-day basis.
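The wear argument can be illustrated with a rough calculation. The sketch below assumes a Weibull wear-out model with a shape parameter greater than 1, which is an assumption of this example and not something stated above; under a constant failure rate (shape equal to 1) accumulated use would make no difference. It also assumes, as a simplification, that a standby unit accrues no operating hours. All numbers are made up for illustration.

```python
import math


def failure_prob_in_window(age_hours, window_hours, scale_hours, shape):
    """Probability of failing within the next `window_hours`, given the
    component has already survived `age_hours` of use (Weibull model)."""

    def survive(t):
        return math.exp(-((t / scale_hours) ** shape))

    return 1.0 - survive(age_hours + window_hours) / survive(age_hours)


# Hypothetical figures: a 72-hour window while the primary is being repaired.
WINDOW = 72
SCALE = 50_000   # Weibull scale in operating hours (assumed)
SHAPE = 2.0      # > 1 models wear-out (assumed)

standby_secondary = failure_prob_in_window(0, WINDOW, SCALE, SHAPE)
worked_secondary = failure_prob_in_window(20_000, WINDOW, SCALE, SHAPE)
```

Under these assumptions `worked_secondary` comes out larger than `standby_secondary`: the secondary that has been carrying load every day is more likely to fail during exactly the window in which it is the only component left.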

A fault isolation strategy is about designing an architecture in which a failure on the part of a component does not impact other applications. For example, an architecture that employs fault isolation will ensure that a rogue or runaway process in an application does not negatively impact other applications also deployed on that same server. This is one of the biggest benefits of virtualization and one that is rarely discussed. Virtualization, like sandboxes in a browser, can isolate individual applications and ensure rogue or runaway processes/applications cannot impact the overall system or other applications. Virtualization, however, is better at fault isolation than sandboxes because it can constrain the compute resources that can be consumed by a given application or process, while browsers more often than not do not and cannot impose this restriction.
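The effect of constraining compute resources can be shown with a toy allocator. This is a deliberately simplified model: the `allocate` function, the first-come-first-served policy, and the application names are all hypothetical and do not reflect how any specific hypervisor schedules resources.

```python
def allocate(requests, capacity, caps=None):
    """Grant CPU shares in arrival order.

    `caps=None` models a shared server with no isolation; passing a dict of
    per-application caps models hypervisor-enforced limits.
    """
    granted, remaining = {}, capacity
    for app, wanted in requests.items():
        limit = wanted if caps is None else min(wanted, caps[app])
        granted[app] = min(limit, remaining)
        remaining -= granted[app]
    return granted


# A runaway process asks for the whole machine before anyone else is served.
requests = {"runaway": 100, "billing": 30, "web": 30}

uncapped = allocate(requests, capacity=100)
capped = allocate(
    requests,
    capacity=100,
    caps={"runaway": 40, "billing": 30, "web": 30},
)
```

Without caps the runaway application takes everything and its neighbors are starved; with caps it is held to its own share and the other two applications receive what they asked for.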


Data center Feng Shui is about the right solution in the right place in the right form factor. So when we look at application delivery controllers (a.k.a. load balancers) we need to look at both the physical (pADC) and the virtual (vADC) and how each one might – or might not – meet the needs for each of these fault-based architectures.

In general, when designing an architecture for fault tolerance, provisions need to be made to address any single component-level failure. Hence the architecture is redundant, comprising two of everything. The mechanisms through which fault tolerance is achieved are failover and finely grained monitoring capabilities from the application layer through the networking stack down to the hardware components that make up the physical servers. pADC hardware designs are carrier-hardened for rapid failover and reliability. Redundant components (power, fans, RAID, and hardware watchdogs) and serial-based failover make for extremely high uptimes and MTBF numbers.

vADCs are generally deployed on commodity hardware and lack the redundancy, serial-based failover, and finely grained hardware watchdogs, as these types of components are costly and would negate much of the savings achieved through standardization on commodity hardware for virtualization-based architectures. Thus if you are designing specifically for fault tolerance, a physical (hardware) ADC should be employed.

Conversely, vADC more naturally allows for isolation of application-specific configurations a la architectural multi-tenancy. This means fault isolation can be readily achieved by deploying a virtualized application delivery controller on a per-application or per-customer basis. This level of fault isolation cannot be achieved on hardware-based application delivery controllers (nor on most hardware network infrastructure today) because the internal architecture of these systems is not designed to completely isolate configuration in a multi-tenant fashion. Thus if fault isolation is your primary concern, a vADC will be the logical choice.

It follows, then, that if you are designing for both fault tolerance and fault isolation, a hybrid virtualized infrastructure architecture will be best suited to implementing such a strategy. An architectural multi-tenant approach in which the pADC is used to aggregate and distribute requests to individual vADC instances serving specific applications or customers will allow for fault tolerance at the aggregation layer while ensuring fault isolation by segregating application- or customer-specific ADC functions and configuration.
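A rough model of the hybrid arrangement follows. It is a sketch under stated assumptions: a redundant pair of physical units at the aggregation layer and one virtual instance per tenant. The class names, tenant names, and response strings are all hypothetical.

```python
class PhysicalADC:
    """Hypothetical physical unit in a redundant aggregation pair."""

    def __init__(self, name):
        self.name = name
        self.up = True


class VirtualADC:
    """Hypothetical per-tenant virtual instance with its own configuration."""

    def __init__(self, tenant):
        self.tenant = tenant
        self.up = True

    def handle(self, path):
        if not self.up:
            raise RuntimeError(f"vADC for {self.tenant} is down")
        return f"{self.tenant} served {path}"


class AggregationTier:
    def __init__(self, units, tenants):
        self.units = units      # redundant pADCs: fault tolerance
        self.tenants = tenants  # one vADC per tenant: fault isolation

    def route(self, tenant, path):
        # Fault tolerance: use the first physical unit that is still up.
        unit = next((u for u in self.units if u.up), None)
        if unit is None:
            return "503 aggregation tier unavailable"
        try:
            return f"200 via {unit.name}: " + self.tenants[tenant].handle(path)
        except RuntimeError:
            # Fault isolation: only this tenant is affected.
            return f"503 {tenant} unavailable"


tier = AggregationTier(
    [PhysicalADC("padc-a"), PhysicalADC("padc-b")],
    {"shop": VirtualADC("shop"), "blog": VirtualADC("blog")},
)

tier.tenants["shop"].up = False          # one tenant's vADC fails
shop_result = tier.route("shop", "/cart")
blog_result = tier.route("blog", "/feed")

tier.units[0].up = False                 # the active pADC fails
blog_after_failover = tier.route("blog", "/feed")
```

A failed vADC takes down only its own tenant, while the loss of one physical unit is absorbed by its peer and remains invisible to the tenants that are still healthy.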


More Stories By Lori MacVittie

Lori MacVittie is responsible for education and evangelism of application services available across F5’s entire product suite. Her role includes authorship of technical materials and participation in a number of community-based forums and industry standards organizations, among other efforts. MacVittie has extensive programming experience as an application architect, as well as network and systems development and administration expertise. Prior to joining F5, MacVittie was an award-winning Senior Technology Editor at Network Computing Magazine, where she conducted product research and evaluation focused on integration with application and network architectures, and authored articles on a variety of topics aimed at IT professionals. Her most recent area of focus included SOA-related products and architectures. She holds a B.S. in Information and Computing Science from the University of Wisconsin at Green Bay, and an M.S. in Computer Science from Nova Southeastern University.