Thursday, December 4, 2008

Defining High Availability

In today's business computing environments, a wide variety of terms are used to describe systems management, system performance and system availability. One commonly used term is High Availability (HA). It is a very broad term that can encompass many different levels of availability, each with its own associated cost. Because the term is open to quite a bit of interpretation, it often leads to confusion about exactly what level of HA an application, device or service provides. Below are the items to factor in when assessing the actual availability of a given service, to ensure that it meets your specific interpretation of HA.

Level-setting Expectations
High Availability can mean something different to each person who says or hears the term. It is important to level-set expectations about HA and its meaning before having an in-depth discussion about how to meet the objectives laid out for an HA environment. Properly defining HA and calculating the costs associated with implementing it involves four components:

Time to recovery – It is important to understand how long a failure will take to recover from. This will allow you to choose solutions that can identify and recover from a failure within a given time frame. A failure can be a hardware problem, a software malfunction or a human error that causes the service to act in a way other than it was designed to. There are many valid cases where recovery can take on the order of minutes or hours, and other valid cases where recovery should be near-instantaneous.

Method of recovery – Method of recovery is an important component of planning an HA solution and its associated cost. Recovery from failures is often automated, but it is not uncommon to have an error that requires manual intervention to clear the problem. Manual recovery is often reserved for categories of problems that are neither critical to the operation of the business nor customer impacting.

Data Loss and Corruption – Data loss and corruption is an important part of developing a strategy for HA. Data loss and corruption can occur during a failover of services between nodes, while the network settles back into a stable state after a change, or during periods when a given service is down. All data has a value associated with it, and when calculating the maximum allowable downtime for a service, the value of the data at risk should be factored in as well.

Performance Impact – Oftentimes a failure of a service component will cause a degradation in service, yet leave the service online for users. This degradation is often acceptable, assuming it lasts for a short, limited period of time. Understanding how users will use the service will enable you to understand what level of performance loss is acceptable.
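To make these four components concrete, here is a minimal Python sketch that writes them down as numbers and checks a candidate solution against a requirement. The class, field names and figures are all hypothetical; they are my illustration of the idea, not part of any product or standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HaProfile:
    """The four components expressed as numbers (illustrative fields only)."""
    recovery_seconds: float    # time to recovery
    automated_recovery: bool   # method of recovery: True if no manual step is needed
    data_loss_seconds: float   # how many seconds' worth of data may be lost
    degraded_capacity: float   # fraction of normal performance during a failure


def unmet_requirements(required: HaProfile, offered: HaProfile) -> List[str]:
    """Return the components where `offered` falls short of `required`."""
    gaps = []
    if offered.recovery_seconds > required.recovery_seconds:
        gaps.append("time to recovery")
    if required.automated_recovery and not offered.automated_recovery:
        gaps.append("method of recovery")
    if offered.data_loss_seconds > required.data_loss_seconds:
        gaps.append("data loss and corruption")
    if offered.degraded_capacity < required.degraded_capacity:
        gaps.append("performance impact")
    return gaps


# Made-up example: users tolerate a one-minute outage, the candidate needs five.
required = HaProfile(60, True, 0, 0.5)
offered = HaProfile(300, True, 0, 0.75)
print(unmet_requirements(required, offered))
```

Writing the expectations down this way, even roughly, gives everyone in the discussion the same definition of HA to argue about.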

A Perfect World
Before we continue into a discussion about how to achieve a given level of High Availability, I want to define my expectation when I hear the term. When I use the term HA, I expect an application or service that can transparently handle failures from a user perspective. I expect an application that, despite a failure on the back end (a server, disk, network connection or otherwise), will automatically fail over in such a way that the end user does not see a disruption in how they are used to interacting with the application. The user should see no degradation in service or loss of data because of the failure.

My definition of HA is assuming a perfect world and adequate funding to architect and implement such a solution. But as we know, IT is not always funded with the necessary money to make dreams into reality. In these cases we must refer back to the first list of components that make up HA to determine which items can be compromised on.

Defining HA for your Environment
Now that we have covered what items are used to define HA, and my definition of HA in a perfect IT world, let's discuss the process for defining a level of HA appropriate for your needs and balancing it against the costs associated with that level. First is to understand your user base and what their expectations are around application performance, response time and recovery. Things to consider are when your users use the application, how they enter data and what response time they are used to when interacting with the application.

Second is to define what the technical solution will look like for the above customer requirements. This stage is where you will evaluate various levels of redundancy and capability in any database servers, network components, data centers and applications. This stage should include an evaluation of both vendor-packaged solutions and home-grown solutions that will meet your needs. This assessment should also include a review of staff capabilities to determine whether training will be needed when implementing new technologies.

Third, we will define the cost for each component of the architecture developed above. This is the cost for an optimal solution, broken down by individual component. It should include all hardware, implementation and software licensing costs for a given period of time. A three-year costing is standard within IT and is a good basis for comparing several different solutions on an equal footing.
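As a rough illustration of the three-year costing described above, the sketch below adds the one-time hardware and implementation costs to three years of licensing for each component. The function name and every dollar figure are placeholders I made up for the example.

```python
def three_year_cost(hardware, implementation, annual_licensing, years=3):
    """One-time costs plus licensing over the costing period."""
    return hardware + implementation + annual_licensing * years


# Placeholder figures, broken down by component of the architecture.
components = {
    "database servers": three_year_cost(40000, 10000, 5000),
    "network components": three_year_cost(20000, 5000, 1000),
}

for name, cost in components.items():
    print(f"{name}: ${cost:,}")
print(f"total: ${sum(components.values()):,}")
```

Keeping the breakdown per component, rather than only a grand total, is what makes the next step possible.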

Finally, we must evaluate the potential cost savings for each component of the solution if we were to cut back from an optimal solution to a more cost-effective one. This evaluation should show the portions of the solution that can be implemented via multiple methods, and the associated cost of each method. This information is then used to balance the required level of HA against the budget available for the project. By understanding how much each component of the solution will cost, you can evaluate what level of HA is possible with each potential increase or decrease in project funding.
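One way to picture this tradeoff is a small Python sketch that lists the alternative methods for each component, with a three-year cost and an expected recovery time for each, and then picks the combination with the best worst-case recovery time that still fits the budget. The option names, costs and recovery times are invented for illustration, and judging a combination by its slowest component is a simplifying assumption of mine.

```python
from itertools import product

# component -> list of (method, three-year cost, recovery time in seconds)
options = {
    "database": [("clustered", 65000, 5), ("hot standby", 30000, 120)],
    "network": [("redundant fabric", 28000, 1), ("single switch", 8000, 3600)],
}


def best_within_budget(options, budget):
    """Return (worst recovery, total cost, chosen methods), or None if nothing fits."""
    best = None
    for combo in product(*options.values()):
        cost = sum(c for _, c, _ in combo)
        if cost > budget:
            continue
        worst_recovery = max(r for _, _, r in combo)
        # Prefer faster recovery; break ties on lower cost.
        if best is None or (worst_recovery, cost) < (best[0], best[1]):
            best = (worst_recovery, cost, [name for name, _, _ in combo])
    return best


for budget in (100000, 60000, 10000):
    print(budget, best_within_budget(options, budget))
```

Running the same comparison at several budget levels shows how much recovery time each increase or decrease in funding actually buys.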

Methods for implementing HA
For most of this document I have avoided discussing the actual technical solutions available on the market for implementing HA. This omission was to ensure that HA was defined per your specific needs before defining possible hardware and software solutions. Now I am going to dive into several popular options on the market for making applications HA capable.

Linux-HA – Linux-HA is an open source solution for managing services across multiple nodes within a cluster, providing a basic high availability solution. Linux-HA is often used to provide automated failover for applications like JBoss, Apache, Lustre or FTP. While Linux-HA will not provide the sub-second failover that some environments need, it will allow administrators to easily set up a pair of servers to act as hot standbys for one another.

Redundant Switch Fabrics – Modern Ethernet switches offer multiple levels of redundancy, including redundant controllers within a switch, redundant power supplies and, at the high end, redundant switch fabrics: should one complete set of switches and routers fail, a second will seamlessly take over the network traffic. Routing protocols like OSPF will route IP traffic around a failure, and protocols like spanning tree will keep redundant links between switches loop-free during regular operation and bring a standby path into service when an active path fails.

RAID – Redundant Arrays of Independent Disks (RAID) is a common method of ensuring that a single disk failure within a server does not cause data loss or corruption. RAID capability can be added through specialized hardware solutions or via low cost software solutions. Both provide a level of protection above standard disks, while keeping total solution costs low.

Oracle RAC – Oracle's Real Application Clusters (RAC) is a clustering solution for Oracle's database products that provides both high availability functionality and a platform to scale a database's performance. While Oracle RAC is often more expensive than other clustering solutions, such as those for MySQL, it provides a very scalable and reliable platform for ensuring very high levels of availability for applications and their associated databases.

Fibre Channel – Fibre Channel solutions for attaching storage to servers often implement redundancy via dual, redundant Fibre Channel fabrics. These are often implemented using completely separate switches, cables and power connections. This type of solution can ensure that common failures, like those of cables and PCI cards, will not cause a server to lose access to its storage or suffer data corruption.

High Availability is often taken to mean something different by each person. Ultimately, HA is ensuring that customer and end user expectations are met for how an application performs and recovers in the event of a failure. When setting up an application, you must first define HA for your specific needs. You can then develop a solution that will meet those expectations. As with most projects within Information Technology, you will then have to assess each component of the solution and make possible tradeoffs to ensure the project's budget is met. Ensuring an application is available and properly recovers is a part of all major Information Technology projects, and today there are many possible technical solutions to ensure your customers' expectations of HA are met.
