Monday, December 1, 2008

Implementing Lustre Failover

Linux-HA, also referred to as Heartbeat, is an open-source tool for managing services across multiple nodes within a cluster. Linux-HA ensures that a given service or disk is only running or mounted on a single server within the cluster at any given time. It also ensures that if a server within the cluster were to fail, another server would automatically become active for that service, minimizing downtime for users.

A default install, as documented here, only catches problems where a server in the cluster stops responding to Linux-HA communication. If a node were to have other problems, such as failed disks, failed auxiliary network connections, or I/O errors, Heartbeat would not catch and respond to those failures without additional instrumentation.
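
As a purely illustrative sketch of what such additional instrumentation could look like, a small script run from cron on each server could probe the locally mounted Lustre targets and stop Heartbeat when a probe fails, forcing the partner node to take over. The probe, the threshold, and the decision to stop Heartbeat are assumptions for illustration only, not part of the configuration below:

#!/bin/sh
# Hypothetical local health probe -- not part of the stock Heartbeat install.
# Walk every Lustre target currently mounted on this node and run a simple
# stat against it; if the probe fails, stop Heartbeat so the partner node
# in the pair takes over the resource.
for mnt in $(awk '$3 == "lustre" {print $2}' /proc/mounts); do
    if ! stat -t "$mnt" > /dev/null 2>&1; then
        logger "lustre-check: probe of $mnt failed, stopping heartbeat"
        service heartbeat stop
        break
    fi
done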

The directions below describe how to implement Linux-HA to provide more automated failover of Lustre services. They were developed and tested with Lustre version 1.6.5.1 and Linux-HA version 2.1.4.

Assumptions
  • 4 total nodes (2 node-pairs)
    • 1 MGS (Lustre Management Server)
    • 1 MDS (Lustre Metadata Server)
    • 1 MDT (Metadata Target) on the MDS
    • 2 OSSs (Lustre Object Storage Servers) (OSS01 and OSS02)
    • 2 OSTs (Object Storage Targets) per OSS (OST00-OST03)
  • The MGS and MDS will be on a pair of clustered servers
  • Nodes MGS and MDS have access to the same shared physical disks
  • Nodes OSS01 and OSS02 have access to the same shared physical disks
  • The name of the filesystem is 'lustre'
  • STONITH method is IPMI and the IPMI interface is configured for remote access
  • No software RAID, all RAID is implemented via hardware solutions
Configuring Linux-HA
1) Install Linux-HA
# yum -y install heartbeat
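
You can confirm what was pulled in with rpm -q heartbeat heartbeat-stonith. The external/ipmi STONITH plugin used later shells out to ipmitool, so the IPMI utilities should be installed as well; package names vary by distribution, but on RHEL/CentOS 5 era systems something like the following works:

# rpm -q heartbeat heartbeat-stonith
# yum -y install OpenIPMI-tools    # provides ipmitool here; other distributions package it as 'ipmitool'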


2) Comment out all Lustre mounts in /etc/fstab and unmount the existing Lustre server and client filesystems. This ensures there are no data corruption or contention issues when Heartbeat starts managing the mounts.
MGS/MDS Pair
mgs # cat /etc/fstab | grep lus
#/dev/MGTDISK /mnt/lustre/mgs lustre defaults,_netdev 0 0
mds # cat /etc/fstab | grep lus
#/dev/MDTDISK /mnt/lustre/mds lustre defaults,_netdev 0 0

OSS Pair
oss01 # cat /etc/fstab | grep lus
#/dev/OST00DISK /mnt/lustre/oss00 lustre defaults,_netdev 0 0
#/dev/OST02DISK /mnt/lustre/oss02 lustre defaults,_netdev 0 0
oss02 # cat /etc/fstab | grep lus
#/dev/OST01DISK /mnt/lustre/oss01 lustre defaults,_netdev 0 0
#/dev/OST03DISK /mnt/lustre/oss03 lustre defaults,_netdev 0 0
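
With the fstab entries commented out, unmount whatever is still mounted. Lustre is normally brought down clients first, then the MDT, then the OSTs; the server paths below follow the old fstab entries above and the client path matches the mount used in step 11:

client # umount /mnt/lustre
mds # umount /mnt/lustre/mds
oss01 # umount /mnt/lustre/oss00 /mnt/lustre/oss02
oss02 # umount /mnt/lustre/oss01 /mnt/lustre/oss03
mgs # umount /mnt/lustre/mgs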

3) Create all mount points on both nodes in each node-pair
MGS/MDS Pair
# mkdir /mnt/lustre/mgt
# mkdir /mnt/lustre/mdt
OSS Pair
# mkdir /mnt/lustre/ost00
# mkdir /mnt/lustre/ost01
# mkdir /mnt/lustre/ost02
# mkdir /mnt/lustre/ost03
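
If passwordless ssh between the nodes is available, the mount points can be created on both members of each pair in one pass, for example:

# for node in mgs mds; do ssh $node mkdir -p /mnt/lustre/mgt /mnt/lustre/mdt; done
# for node in oss01 oss02; do ssh $node mkdir -p /mnt/lustre/ost00 /mnt/lustre/ost01 /mnt/lustre/ost02 /mnt/lustre/ost03; done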

4) Execute '/sbin/chkconfig --level 345 heartbeat on' on all 4 nodes
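
A quick way to confirm the service is registered for those runlevels on each node:

# /sbin/chkconfig --list heartbeat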

5) /etc/ha.d/ha.cf changes
MGS/MDS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 10100
auto_failback off
stonith_host mgs external/ipmi mds 10.0.1.100 admin adminpassword
stonith_host mds external/ipmi mgs 10.0.1.101 admin adminpassword
node mgs
node mds

OSS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
# different from MGS/MDS node-pair
udpport 10101
auto_failback off
stonith_host oss01 external/ipmi oss02 10.0.1.102 admin adminpassword
stonith_host oss02 external/ipmi oss01 10.0.1.103 admin adminpassword
node oss01
node oss02
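
A few notes on the values above: deadtime 30 means a node is declared dead after 30 seconds without heartbeat traffic, initdead 120 allows extra time while nodes are still booting, the two pairs use different udpport values so their heartbeat traffic does not mix, and auto_failback off leaves resources where they are when a failed node returns. A heartbeat communication medium (for example a bcast or ucast line naming the interface on your systems) also needs to be declared in ha.cf; it is omitted from the listings above. The stonith_host lines assume the external/ipmi plugin is present, which the stonith utility shipped with Heartbeat can confirm (it should list external/ipmi among the supported types):

# stonith -L | grep ipmi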

6) /etc/ha.d/authkeys changes
MGS/MDS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourMGSMDSPassphraseHere
OSS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourOSSPassphraseHere
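
Heartbeat refuses to start if authkeys is readable by anyone other than root, so tighten the permissions on all four nodes:

# chmod 600 /etc/ha.d/authkeys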

7) /etc/ha.d/haresources changes
MGS/MDS Pair
# cat haresources | grep -v '#'
mgs Filesystem::/dev/MGTDISK::/mnt/lustre/mgt::lustre
mds Filesystem::/dev/MDTDISK::/mnt/lustre/mdt::lustre
OSS Pair
# cat haresources | grep -v '#'
oss01 Filesystem::/dev/OST00DISK::/mnt/lustre/ost00::lustre
oss02 Filesystem::/dev/OST01DISK::/mnt/lustre/ost01::lustre
oss01 Filesystem::/dev/OST02DISK::/mnt/lustre/ost02::lustre
oss02 Filesystem::/dev/OST03DISK::/mnt/lustre/ost03::lustre
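
Each line names the preferred node for a resource, and Filesystem::device::mountpoint::fstype tells Heartbeat's Filesystem agent how to mount it. Mounting a target with type lustre is what starts the corresponding Lustre service, so on whichever node currently owns OST00 Heartbeat is effectively running:

# mount -t lustre /dev/OST00DISK /mnt/lustre/ost00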

8) Specify the address of the failover MGS node for all Lustre filesystem components
mds # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/MDTDISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST00DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST01DISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST02DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST03DISK
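
To double-check what was written to each target, tunefs.lustre can report the parameters back without modifying anything (shown for OST00; --dryrun only prints what would be done):

oss01 # tunefs.lustre --dryrun /dev/OST00DISK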

9) Execute 'service heartbeat start' on MGS/MDS pair

10) Execute 'service heartbeat start' on OSS pair
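
Once both pairs are up, confirm cluster membership and that each target has been mounted on its preferred node (cl_status ships with Heartbeat 2.x; shown here from oss01):

oss01 # cl_status listnodes
oss01 # cl_status nodestatus oss02
oss01 # mount -t lustre
oss01 # tail /var/log/ha-log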

11) Mount the Lustre filesystem on all clients
client # mount -t lustre 10.0.0.101@tcp0:10.0.0.100@tcp0:/lustre /mnt/lustre
client # cat /etc/fstab | grep lustre
10.0.0.101@tcp0:10.0.0.100@tcp0:/lustre /mnt/lustre lustre defaults,_netdev 0 0
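
From a client, a quick sanity check is to confirm that the MDT and all four OSTs are visible and reporting space:

client # lfs df -h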

With the above setup, if a single node within either pair (MGS/MDS or OSS01/OSS02) were to fail, the clients would, after the specified timeout period, successfully recover and continue their I/O operations. Linux-HA is not designed for instantaneous failover, and recovery can take on the order of minutes when resources need to move from one node in a pair to the other. While this solution does not provide immediate failover, it does allow administrators to set up an inexpensive system that recovers from hardware failures automatically, without lengthy downtime or impact to users.
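
Before relying on the setup, it is worth exercising the failover path by hand: put one node of a pair into standby (or simply stop Heartbeat on it) and watch its resources appear on the partner. The hb_standby helper ships with Heartbeat, though its installed path varies between packages:

oss01 # /usr/lib64/heartbeat/hb_standby   # ask Heartbeat to give up this node's resources
oss02 # mount -t lustre                   # OST00 and OST02 should now be mounted here
oss01 # service heartbeat stop            # a blunter test that simulates a node failure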
