Wednesday, December 10, 2008

Networking in IT

There is no doubt we are seeing a challenging time in the economy, and it is trickling down to the Information Technology (IT) sector. One of the most important parts of weathering a storm like this in the IT industry is ensuring your network of coworkers, alumni and friends is strong. By developing a strong network, you have a team of folks you can turn to for advice, recommendations, job postings and the inside track on potential job leads.

The other benefit to networking, beyond looking for a new position, is to develop professionally in your current position. By interacting with others in your field and similar fields you can build your personal toolbox by learning from other people's experiences and skills. By regularly working with others you can see what methodologies they use to be successful and what tools they have developed and found to ensure they are efficient in their roles.

Here are some common methods that I and others have used to build a community within the IT space:

Users Groups – Most cities today have multiple user groups, including Linux, Oracle, MySQL, Dell and DB2 groups, just to name a few. These organizations are always looking for speakers and folks to hold lab sessions. Volunteer to present, volunteer to organize meetings and volunteer to recruit other speakers. It is a wonderful way to meet folks in similar roles, as well as share your knowledge and experience with others.

Brown Bag Events – Host a brown bag at your office: invite your coworkers and do a short talk about a topic that interests you or that you think would be relevant in your environment. This gives you visibility within the company, and allows members of other teams to see the skills and experience available to them when new projects come up.

Operate your company's test bed – Companies often have a test and quality assurance environment used for testing new software deployments, completing software builds and evaluating new vendor hardware. Often this environment does not fall to corporate IT or the quality departments to manage, but sits somewhere in the middle. Volunteer to manage this environment and take real ownership of it. This will give you a great forum to meet people in other departments, as well as share your ideas in a way that allows them to be utilized in production for the company.

Blogging – Blogging is a simple, effective way to put your ideas out in public for comment and development, and to show your level of expertise in a field. Blogging allows you to share ideas and findings as you develop them. While a blog is not peer-reviewed, others can comment on your postings and post additional follow-up information.

Conferences – Presenting at conferences is a wonderful way to show both your level of experience and the new ideas and methodologies you can bring to your field. Conferences provide a peer-reviewed environment in which you can submit papers and give talks. These types of environments show not only your level of expertise, but also that others in your field value your contributions and capabilities.

Thursday, December 4, 2008

Defining High Availability

In today's business computing environments a wide variety of terms are used to describe systems management, systems performance and system availability. One commonly used term is High Availability (HA). This is a very broad term that can encompass many different levels of availability and the costs associated with those levels. The term is open to quite a bit of interpretation, and this interpretation often leads to confusion about exactly what level of HA an application, device or service provides. Below are the items to factor in when assessing the actual availability of a given service, to ensure it meets your specific interpretation of HA.

Level-setting Expectations
High Availability can mean something different for each person who says or hears the term. It is important to level-set expectations about HA and its meaning before having an in-depth discussion about how to meet the objectives laid out in an HA environment. Properly defining HA and calculating the costs associated with implementing it has four components:

Time to recovery – It is important to understand how long recovery from a failure will take; this will allow you to choose solutions that can identify and recover from a failure within a given time frame. A failure can be a hardware problem, a software malfunction or a human error that causes the service to act in a way other than it was designed. There are many valid cases where time to recovery can be on the order of minutes or hours; in other valid cases recovery should be near-instantaneous.

Method of recovery – Method of recovery is an important component of planning an HA solution and its associated cost. Many times recovery from failures is automated, but it is not uncommon to have an error that requires manual intervention to clear. Manual recovery is often reserved for categories of problems that are neither critical to the operation of the business nor customer impacting.

Data Loss and Corruption – Data loss and corruption are an important part of developing a strategy for HA. They can occur during a failover of services between nodes, while the network settles into equilibrium after a change, or during periods when a given service is down. All data has a value associated with it, and when calculating the maximum allowable downtime for a service, the value of the data should be factored in as well.

Performance Impact – Often a failure of a service component will cause a degradation in service yet leave the service online for users. This degradation is often acceptable, assuming it lasts for a short, limited period of time. Understanding how users will use the service will enable you to understand what level of performance loss is acceptable.
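The "time to recovery" component becomes concrete once an availability target is translated into a downtime budget. A minimal sketch of that arithmetic, using illustrative targets that are not taken from this post:

```shell
#!/bin/sh
# Rough downtime budget implied by an availability target: the
# unavailable fraction of a 365-day year, in minutes. The targets
# below are illustrative placeholders, not figures from the post.
minutes_per_year=525600
for target in 99 99.9 99.99; do
    budget=$(awk -v t="$target" -v m="$minutes_per_year" \
        'BEGIN { printf "%.1f", (100 - t) / 100 * m }')
    echo "$target% availability allows $budget minutes of downtime per year"
done
```

A "three nines" (99.9%) target, for example, leaves only about 525 minutes of downtime per year, which rules out any recovery method that takes hours of manual intervention.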

A Perfect World
Before we continue into a discussion about how to achieve a given level of High Availability, I want to define my expectation when I hear the term. When I use the term HA I expect an application or service that can transparently handle failures from a user perspective. I expect an application that, despite a back-end failure of a server, disk, network connection or otherwise, will automatically fail over in a way that the end user does not see a disruption in how they are used to interacting with the application. The user should see no degradation in service or loss of data because of the failure.

My definition of HA assumes a perfect world and adequate funding to architect and implement such a solution. But as we know, IT is not always funded with the money necessary to make dreams into reality. In these cases we must refer back to the list of components above to determine which items can be compromised on.

Defining HA for your Environment
Now that we have covered the items used to define HA, and my definition of HA in a perfect IT world, let's discuss the process for defining a level of HA appropriate for your needs and balancing it against the cost of a given level of HA. First, understand your user base and their expectations around application performance, response time and recovery. Things to consider are when your users use the application, how they enter data and what response time they are used to when interacting with the application.

Second, define what the technical solution will look like for the above customer requirements. This stage is where you evaluate various levels of redundancy and capability in database servers, network components, data centers and application capabilities. It should include an evaluation of both vendor-packaged solutions and home-grown solutions that will meet your needs, as well as a review of staff capabilities to determine whether training will be needed when implementing new technologies.

Third, define the cost for each component of the architecture developed above. This is the cost of an optimal solution, broken down by individual component, and should include all hardware, implementation and software licensing costs for a given period of time. A three-year costing is standard within IT and is a good basis for comparing several different solutions in an equal fashion.
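A three-year costing for one component is simply the one-time costs plus three years of recurring costs. A minimal sketch; all figures are made-up placeholders, not numbers from this post:

```shell
#!/bin/sh
# Illustrative three-year cost roll-up for a single solution
# component. Every figure below is a made-up placeholder.
hardware=50000          # one-time hardware purchase
implementation=20000    # one-time implementation labor
license_per_year=10000  # recurring software licensing
years=3
total=$((hardware + implementation + license_per_year * years))
echo "three-year cost: $total"
```

Running the same roll-up for each candidate solution puts them on an equal footing for the comparison described above.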

Finally, evaluate the potential cost savings for each component of the solution if we were to cut back from an optimal solution to a more cost-effective one. This evaluation should show the portions of the solution that can be implemented via multiple methods, and the associated cost of each method. This information is then used to balance the required level of HA against the budget available for the project. By properly understanding how much each component of the solution will cost, you can evaluate what the possible level of HA will be with each potential increase or decrease in project funding.

Methods for implementing HA
For most of this document I have avoided discussing the actual technical solutions available on the market for implementing HA. This omission was to ensure that HA was defined for your specific needs before defining possible hardware and software solutions. Now I will dive into several popular options on the market for helping make applications HA capable.

Linux-HA – Linux-HA is an open source solution for managing services across multiple nodes within a cluster to provide a basic high availability solution. Linux-HA is often used to provide automated failover for applications like JBoss, Apache, Lustre or FTP. While Linux-HA will not provide the sub-second failover that some environments need, it will allow administrators to easily set up a pair of servers to act as hot standbys for one another.

Redundant Switch Fabrics – Modern Ethernet switches have multiple levels of redundancy, including redundant controllers within a switch, redundant power supplies and, at the high end, redundant switch fabrics, so that should one complete set of switches and routers fail, a second will seamlessly handle the failover and subsequent network traffic. Technologies like OSPF will ensure that routing of IP traffic continues uninterrupted, and protocols like spanning tree will keep switches with multiple paths between them loop-free while allowing an alternate path to take over in a failover scenario.

RAID – Redundant Array of Independent Disks (RAID) is a common method of ensuring that a single disk failure within a server does not cause data loss or corruption. RAID capability can be added through specialized hardware solutions or via low-cost software solutions. Both provide a level of protection above standard disks, while keeping total solution costs low.
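On Linux, software RAID health is visible in /proc/mdstat, and a degraded mirror is easy to detect from its status line. A minimal sketch; the sample output is inlined so the script is self-contained, whereas on a real system you would read /proc/mdstat directly:

```shell
#!/bin/sh
# Check software RAID health by parsing /proc/mdstat-style output.
# The sample below is inlined for illustration; device names and
# sizes are hypothetical.
mdstat='md0 : active raid1 sdb1[1] sda1[0]
      488254464 blocks [2/2] [UU]'
status=$(echo "$mdstat" | awk '/blocks/ {
    if ($0 ~ /\[UU\]/) print "md0: healthy (all mirrors up)"
    else               print "md0: DEGRADED - check member disks"
}')
echo "$status"
```

A failed member shows up as an underscore in the bracketed status (e.g. [U_]), which is what a monitoring check would alert on.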

Oracle RAC – Oracle's Real Application Clusters (RAC) is a clustering solution, often associated with Oracle's database products, that provides both high availability functionality and a platform to scale a database's performance. While Oracle RAC is often more expensive than other clustering solutions such as MySQL's, it provides a very scalable and reliable platform for ensuring very high levels of availability for applications and their associated databases.

Fibre Channel – Fibre Channel solutions for attaching storage to servers often implement redundancy via dual, redundant Fibre Channel fabrics. These are often implemented utilizing completely separate switches, cables and power connections. This type of solution can ensure that common failures, like a cable or PCI card, will not cause a server to lose access to its storage or corrupt data.

High Availability often means something different to each person. Ultimately, HA is about ensuring that customer and end-user expectations are met for how an application performs and recovers in the event of a failure. When setting up an application, you must first define HA for your specific needs; you can then properly develop a solution that will meet those expectations. As with most projects within Information Technology, you will then have to assess each component of the solution and make tradeoffs to ensure the project's budget is met. Ensuring an application is available and properly recovers is a part of all major Information Technology projects, and today there are many technical solutions available to ensure your customers' expectation of HA is met.

Building a Lustre Patchless Client

One common need within Lustre environments is the requirement to build Lustre clients using standard Linux kernels. Lustre servers commonly have a custom kernel with specific patches to optimize performance, but clients do not always require these kernel patches.

These directions will enable you to build the RPMs necessary to install the Lustre client bits on a system with a standard Red Hat kernel.

1) Unmount all Lustre clients and comment out their entries in /etc/fstab

2) Reboot the node into the standard Red Hat kernel you would like to build the client for. These directions assume RHEL kernel 2.6.18-92.1.13 on x86_64.

3) Install the full kernel source tree for the running kernel
- cd ~
- yum install rpm-build redhat-rpm-config unifdef
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- rpm -i

4) Unzip the lustre bits
- Download from
- mv lustre-1.6.6.tar.gz /usr/src
- gunzip lustre-1.6.6.tar.gz
- tar -xvf lustre-1.6.6.tar

5) Prep the kernel tree for building Lustre
- cd /usr/src/linux
- cp /boot/config-$(uname -r) .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts

6) Configure the build - configure will detect an unpatched kernel and only build the client
- cd /usr/src/lustre-1.6.6
- ./configure --with-linux=/usr/src/linux

7) Create RPMs
- make rpms

8) You should get a set of Lustre RPMs in the build directory.
- ls ~/rpmbuild/RPMS
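As a quick sanity check on the build output, the client RPM name embeds the kernel release with dashes replaced by underscores. A small sketch deriving the expected filename, with version strings assumed from this post's example:

```shell
#!/bin/sh
# Derive the expected Lustre client RPM filename from a kernel
# release: the RPM name embeds the release with '-' turned into '_'.
# Version strings below are assumed from the post's example.
kernel="2.6.18-92.1.13.el5"
lustre="1.6.6"
arch="x86_64"
suffix=$(echo "$kernel" | tr '-' '_')
rpmname="lustre-client-${lustre}-${suffix}.${arch}.rpm"
echo "$rpmname"
```

If the filename in the RPMS directory does not match the running kernel this way, the build was done against the wrong kernel tree.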

9) Remove the existing Lustre packages
- rpm -e $(rpm -qa | grep -i lustre)

10) Install new client bits
- rpm -ivh lustre-client-1.6.6-2.6.18_92.1.13.el5.x86_64.rpm
- rpm -ivh lustre-modules-1.6.6-2.6.18_92.1.13.el5.x86_64.rpm

11) Remount all Lustre mounts
- vi /etc/fstab
uncomment lustre lines
- mount -a

Monday, December 1, 2008

Implementing Lustre Failover

Linux-HA, also referred to as Heartbeat, is an open source tool for managing services across multiple nodes within a cluster. Linux-HA ensures that a given service or disk is only running or mounted on a single server within the cluster at a given time. It also ensures that if a server within the cluster fails, the other server becomes active for the service automatically, minimizing downtime for users.

A default install, as I will document today, only catches a server in the cluster failing to respond to Linux-HA communication. If a node has other problems, like failed disks, failed auxiliary network connections or errors in I/O access, Heartbeat will not catch and respond to those failures without additional instrumentation.

The directions below describe how to implement Linux-HA to provide more automated failover of Lustre services. These directions were developed and tested with Linux-HA version 2.1.4, and assume the following environment:

  • 4 total nodes (2 node-pairs)
    • 1 MGS (Lustre Management Server)
    • 1 MDS (Lustre Metadata Server)
    • 1 MDT (Metadata Target) on the MDS
    • 2 OSSs (Lustre Object Storage Servers) (OSS01 and OSS02)
    • 2 OSTs (Object Storage Targets) per OSS (OST00-OST03)
  • The MGS and MDS will be on a pair of clustered servers
  • Nodes MGS and MDS have access to the same shared physical disks
  • Nodes OSS01 and OSS02 have access to the same shared physical disks
  • The name of the filesystem is 'lustre'
  • STONITH method is IPMI and the IPMI interface is configured for remote access
  • No software RAID, all RAID is implemented via hardware solutions
Configuring Linux-HA
1) Install Linux-HA
# yum -y install heartbeat

2) Comment out all Lustre mounts from /etc/fstab and umount existing Lustre server and client filesystems. This will ensure no data corruption or contention issues when starting Heartbeat.
mgs # cat /etc/fstab | grep lus
#/dev/MGTDISK /mnt/lustre/mgt lustre defaults,_netdev 0 0
mds # cat /etc/fstab | grep lus
#/dev/MDTDISK /mnt/lustre/mdt lustre defaults,_netdev 0 0

OSS Pair
oss01 # cat /etc/fstab | grep lus
#/dev/OST00DISK /mnt/lustre/ost00 lustre defaults,_netdev 0 0
#/dev/OST02DISK /mnt/lustre/ost02 lustre defaults,_netdev 0 0
oss02 # cat /etc/fstab | grep lus
#/dev/OST01DISK /mnt/lustre/ost01 lustre defaults,_netdev 0 0
#/dev/OST03DISK /mnt/lustre/ost03 lustre defaults,_netdev 0 0

3) Create all mount points on both nodes in each node-pair
# mkdir /mnt/lustre/mgt
# mkdir /mnt/lustre/mdt
OSS Pair
# mkdir /mnt/lustre/ost00
# mkdir /mnt/lustre/ost01
# mkdir /mnt/lustre/ost02
# mkdir /mnt/lustre/ost03

4) Execute '/sbin/chkconfig --level 345 heartbeat on' on all 4 nodes

5) /etc/ha.d/ha.cf changes
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 10100
auto_failback off
stonith_host mgs external/ipmi mds admin adminpassword
stonith_host mds external/ipmi mgs admin adminpassword
node mgs
node mds

OSS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
# different from MGS/MDS node-pair
udpport 10101
auto_failback off
stonith_host oss01 external/ipmi oss02 admin adminpassword
stonith_host oss02 external/ipmi oss01 admin adminpassword
node oss01
node oss02
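The timing directives in ha.cf interact: deadtime must comfortably exceed keepalive so a few lost heartbeats do not trigger a false failover, and the Linux-HA documentation recommends an initdead of at least twice deadtime to allow for slow boots. A small sketch checking the values above against those rules of thumb:

```shell
#!/bin/sh
# Sanity-check ha.cf timing values: deadtime should be well above
# keepalive, and initdead at least twice deadtime (Linux-HA rules
# of thumb). The values below match this post's configuration.
cfg='keepalive 2
deadtime 30
initdead 120'
verdict=$(echo "$cfg" | awk '
    $1 == "keepalive" { k = $2 }
    $1 == "deadtime"  { d = $2 }
    $1 == "initdead"  { i = $2 }
    END {
        if (d >= 2 * k && i >= 2 * d) print "timings look sane"
        else print "WARNING: review keepalive/deadtime/initdead"
    }')
echo "$verdict"
```

A deadtime that is too close to keepalive is a common cause of spurious STONITH events on busy networks.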

6) /etc/ha.d/authkeys changes
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourMGSMDSPassphraseHere
OSS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourOSSPassphraseHere

7) /etc/ha.d/haresources changes
# cat haresources | grep -v '#'
mgs Filesystem::/dev/MGTDISK::/mnt/lustre/mgt::lustre
mds Filesystem::/dev/MDTDISK::/mnt/lustre/mdt::lustre
OSS Pair
# cat haresources | grep -v '#'
oss01 Filesystem::/dev/OST00DISK::/mnt/lustre/ost00::lustre
oss02 Filesystem::/dev/OST01DISK::/mnt/lustre/ost01::lustre
oss01 Filesystem::/dev/OST02DISK::/mnt/lustre/ost02::lustre
oss02 Filesystem::/dev/OST03DISK::/mnt/lustre/ost03::lustre
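Because each disk must only ever be mounted by one node at a time, it is worth verifying that no device appears twice across a pair's haresources entries. A quick consistency-check sketch using the OSS entries above:

```shell
#!/bin/sh
# Verify no device is listed twice in a node-pair's haresources;
# a duplicate entry would let both nodes mount the same disk.
# The entries mirror this post's OSS pair configuration.
res='oss01 Filesystem::/dev/OST00DISK::/mnt/lustre/ost00::lustre
oss02 Filesystem::/dev/OST01DISK::/mnt/lustre/ost01::lustre
oss01 Filesystem::/dev/OST02DISK::/mnt/lustre/ost02::lustre
oss02 Filesystem::/dev/OST03DISK::/mnt/lustre/ost03::lustre'
dups=$(echo "$res" | awk -F'::' '{ print $2 }' | sort | uniq -d)
if [ -z "$dups" ]; then
    result="no duplicate devices"
else
    result="duplicate devices: $dups"
fi
echo "$result"
```

Remember that the haresources file must be identical on both nodes of a pair, so a check like this only needs to run against one copy.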

8) Specify the address of the failover node for each Lustre target
mds # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode= /dev/MDTDISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode= /dev/OST00DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode= /dev/OST01DISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode= /dev/OST02DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode= /dev/OST03DISK

9) Execute 'service heartbeat start' on MGS/MDS pair

10) Execute 'service heartbeat start' on OSS pair

11) Mount the Lustre filesystem on all clients
client # mount -t lustre, /mnt/lustre
client # cat /etc/fstab | grep lustre
, /mnt/lustre lustre defaults 0 0

With the above setup, if a single node within each pair (MGS/MDS and OSS01/OSS02) were to fail, after the specified timeout period the clients would be able to successfully recover and continue their I/O operations. Linux-HA is not designed for immediate failover, and a recovery can often take on the order of minutes when resources need to move from one node in a pair to the other. While this solution will not provide immediate failover, it will allow administrators to set up an inexpensive system that automatically recovers from hardware failures without lengthy downtimes and impacts to users.