
Monday, August 3, 2009

Lustre 1.8 and Pools

Beginning with Lustre 1.8, the concept of pools was introduced. Pools are a method for isolating groups of OSTs based on common characteristics, most commonly similar hardware type or RAID configuration. An example would be a pool of very high performance SAS disks and a lower performance pool of SATA disks within the same filesystem. Pools allow users to specify which pool their files are read from and written to.

Next to each section of commands is the system they must be run from.

For these commands, 'lusfs01' is the name of the Lustre file system, pool1 and pool2 are the names of the example pools, and there are a total of 10 OSTs within this file system.

Creating a new pool (MGS)
# lctl pool_new lusfs01.pool1
# lctl pool_new lusfs01.pool2

Assigning OSTs to a pool (MGS)
# lctl pool_add lusfs01.pool1 lusfs01-OST000[0-3]_UUID
# lctl pool_add lusfs01.pool2 lusfs01-OST000[4-7]_UUID

Listing Available pools (MGS)
# lfs pool_list lusfs01

List OSTs in a given pool (MGS)
# lfs pool_list lusfs01.pool1
# lfs pool_list lusfs01.pool2

Setting a file/directory stripe to use a specific pool (Client)
# lfs setstripe -p pool1 /lusfs01/dir1
# lfs setstripe -p pool1 /lusfs01/dir1/file1
# lfs setstripe -p pool2 /lusfs01/dir2
# lfs setstripe -p pool2 /lusfs01/dir2/file1
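
To confirm that new files are actually landing on OSTs from the intended pool, the stripe layout can be checked from a client. This is an optional verification step, not part of the pool setup itself; the file names follow the examples above.

Verifying pool placement (Client)
# lfs getstripe /lusfs01/dir1/file1
# lfs getstripe /lusfs01/dir2/file1

The obdidx values in the output should correspond to the OST indexes assigned to pool1 and pool2 respectively.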

Thursday, May 21, 2009

Understanding Lustre Internals

Lustre can be a complex package to manage and understand. The folks at ORNL, with assistance from the Lustre Center of Excellence have put out a wonderful paper on Understanding Lustre Internals.

I recommend that all Lustre administrators read it; it is very useful information for understanding how all the Lustre pieces plug together.

Tuesday, April 21, 2009

Lustre Users Group 2009

Last week we held the 2009 Lustre Users Group. It was a success; we had the largest user turnout ever.

All slides can be found here.

I did a presentation on Best Practices for the Sun Lustre Storage System, those slides can be found here.

Saturday, March 14, 2009

LUG 2009

For those attending the Lustre Users Group 2009, I will be presenting on Best Practices for the Sun Storage Cluster. A full agenda is at http://blogs.sun.com/HPC/resource/agenda.pdf

More information on the LUG can be found at http://www.regonline.com/LUG09

Tuesday, February 10, 2009

Lustre Monitoring

I have had several Lustre deployments recently that included performance monitoring; here are the common tools folks use for monitoring their Lustre environments. As time allows, I will get a HOWTO posted for each on installing and configuring them.
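
In the meantime, a very rough way to watch Lustre activity without any extra tools is to sample the stats files that Lustre exports under /proc. The sketch below is illustrative only; the exact /proc paths vary by Lustre version and by the role of the node (this example assumes a 1.6-era OSS with obdfilter stats).

#!/bin/sh
# Crude sampler: print per-OST read/write byte counters once a minute.
# Path assumes a Lustre 1.6 OSS; adjust for your version and node role.
while true; do
    date
    for f in /proc/fs/lustre/obdfilter/*/stats; do
        echo "== $f =="
        grep -E 'read_bytes|write_bytes' "$f"
    done
    sleep 60
done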

Sunday, January 25, 2009

Lustre 1.6.6 with MX 1.2.7

Below is the process for installing Lustre 1.6.6 while using MX (Myricom) as the transport.

1) Compile and install Lustre Kernel
- yum install rpm-build redhat-rpm-config
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- echo '%_topdir %(echo $HOME)/rpmbuild' > .rpmmacros
- rpm -ivh kernel-lustre-source-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64.rpm (can be obtained from http://www.sun.com/software/products/lustre/get.jsp)
- cd into the installed kernel source tree under /usr/src (the configure step in section 3 assumes /usr/src/linux)
- make distclean
- make oldconfig dep bzImage modules
- cp /boot/config-`uname -r` .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts
- make rpm
- rpm -ivh ~/rpmbuild/kernel-lustre-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64.rpm
- mkinitrd /boot/initrd-2.6.18-92.1.10.el5_lustre.1.6.6.img 2.6.18-92.1.10.el5_lustre.1.6.6
- Update /etc/grub.conf with new kernel boot information
- /sbin/shutdown -r now

2) Compile and install MX Stack
- cd /usr/src/
- gunzip mx_1.2.7.tar.gz (can be obtained from www.myri.com/scs/)
- tar -xvf mx_1.2.7.tar
- cd mx-1.2.7
- ln -s common include
- ./configure --with-kernel-lib
- make
- make install

3) Compile and install Lustre
- cd /usr/src/
- gunzip lustre-1.6.6.tar.gz (can be obtained from http://www.sun.com/software/products/lustre/get.jsp)
- tar -xvf lustre-1.6.6.tar
- cd lustre-1.6.6
- ./configure --with-linux=/usr/src/linux --with-mx=/usr/src/mx-1.2.7
- make
- make rpms (at the bottom of the output it will show location of the generated RPMs)
- rpm -ivh lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm \
      lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm \
      lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm

4) Add the following lines to /etc/modprobe.conf
options kmxlnd hosts=/etc/hosts.mxlnd
options lnet networks=mx0(myri0),tcp0(eth0)
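
Once the module options are in place, it can be worth sanity-checking that LNET sees both networks before any Lustre targets are mounted. This is an optional check, not part of the original procedure:

- modprobe lnet
- lctl network up
- lctl list_nids (should show both the mx0 and tcp0 NIDs for this node)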

5) Populate myri0 Configuration with proper IP addresses
- vim /etc/sysconfig/network-scripts/ifcfg-myri0

6) Populate /etc/hosts.mxlnd with the following information
# IP HOST BOARD EP_ID
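
As an illustration only, a populated /etc/hosts.mxlnd might look like the following; the IP addresses, hostnames, board numbers and endpoint IDs here are made-up examples and must be replaced with values matching your Myrinet fabric:

# IP HOST BOARD EP_ID
10.0.10.1 mds01 0 3
10.0.10.2 oss01 0 3
10.0.10.3 client01 0 3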

7) Start Lustre by mounting the disks that contain the MGS, MDT and OSS data stores
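
For illustration, the server and client mounts might look like the commands below. The device names, mount points, file system name and MGS NID are placeholders; substitute the values from your own configuration (the mx0 network comes from the lnet options above):

mgs # mount -t lustre /dev/sdb /mnt/lustre/mgs
mds # mount -t lustre /dev/sdc /mnt/lustre/mdt
oss # mount -t lustre /dev/sdd /mnt/lustre/ost00
client # mount -t lustre 10.0.10.1@mx0:/lusfs01 /mnt/lusfs01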

Thursday, December 4, 2008

Building a Lustre Patchless Client

One common need within Lustre environments is to build Lustre clients using standard Linux kernels. Lustre servers commonly run a custom kernel with specific patches to optimize performance, but clients do not always require these kernel patches.

These directions will enable you to build the RPMs necessary to install the Lustre client bits on a system with a standard Redhat kernel.

1) Unmount all Lustre clients and comment out their entries in /etc/fstab

2) Reboot a node into the standard Redhat kernel you would like to build the client for. These directions assume RHEL 2.6.18-92.1.13 x86_64.

3) Install the full kernel source tree for the running kernel
- cd ~
- yum install rpm-build redhat-rpm-config unifdef
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- rpm -i http://mirror.centos.org/centos/5/updates/SRPMS/kernel-2.6.18-92.1.13.el5.src.rpm

4) Unzip the lustre bits
- Download from http://www.sun.com/software/products/lustre/get.jsp
- mv lustre-1.6.6.tar.gz /usr/src
- gunzip lustre-1.6.6.tar.gz
- tar -xvf lustre-1.6.6.tar

5) Prep the kernel tree for building Lustre
- cd /usr/src/linux
- cp /boot/config-`uname -r` .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts

6) Configure the build - configure will detect an unpatched kernel and only build the client
- cd /usr/src/lustre-1.6.6
- ./configure --with-linux=/usr/src/linux

7) Create RPMs
- make rpms

8) You should get a set of Lustre RPMs in the build directory.
- ls ~/rpmbuild/RPMS

9) Remove any previously installed Lustre packages
- rpm -e lustre*

10) Install new client bits
- rpm -ivh lustre-client-1.6.6-2.6.18_92.1.13.el5.x86_64.rpm
- rpm -ivh lustre-modules-1.6.6-2.6.18_92.1.13.el5.x86_64.rpm

11) Remount all Lustre mounts
- vi /etc/fstab (uncomment the Lustre lines)
- mount -a
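
To verify the new client bits are working, a couple of quick checks can be run (optional, not part of the original steps):

- mount | grep lustre (the Lustre mount should be listed again)
- lfs df -h (should report the MDT and all OSTs)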

Monday, December 1, 2008

Implementing Lustre Failover

Linux-HA, also referred to as Heartbeat, is an open source tool for managing services across multiple nodes within a cluster. Linux-HA ensures that a given service or disk is only running or mounted on a single server within the cluster at any given time. If a server within the cluster were to fail, Linux-HA automatically makes the other server active for the service, minimizing downtime for users.

A default install, as I will document today, only catches problems with a server in the cluster not responding to Linux-HA communication. If a node were to have other problems, such as failed disks, failed auxiliary network connections, or errors in I/O access, Heartbeat would not catch and respond to those failures without additional instrumentation.
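
As a sketch of what such additional instrumentation could look like (illustrative only, not part of the Heartbeat configuration below), a small cron-driven script on each server could check that its local Lustre mounts still respond and stop Heartbeat if they do not, letting the partner node take over. A real deployment would want timeouts, device-level checks, and careful testing before relying on anything like this:

#!/bin/sh
# Illustrative watchdog sketch: if a local Lustre mount stops responding,
# release the Heartbeat resources so the partner node takes over.
for mnt in /mnt/lustre/*; do
    [ -d "$mnt" ] || continue
    if ! ls "$mnt" > /dev/null 2>&1; then
        logger -t lustre-watchdog "check failed on $mnt, stopping heartbeat"
        /sbin/service heartbeat stop
        break
    fi
done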

The directions below describe how to implement Linux-HA to provide more automated failover of Lustre services. These directions were developed and tested with Lustre version 1.6.5.1 and Linux-HA version 2.1.4.

Assumptions
  • 4 total nodes (2 node-pairs)
    • 1 MGS (Lustre Management Server)
    • 1 MDS (Lustre Metadata Server)
    • 1 MDT (Metadata Target) on the MDS
    • 2 OSSs (Lustre Object Storage Servers) (OSS01 and OSS02)
    • 2 OSTs (Object Storage Targets) per OSS (OST00-OST03)
  • The MGS and MDS will be on a pair of clustered servers
  • Nodes MGS and MDS have access to the same shared physical disks
  • Nodes OSS01 and OSS02 have access to the same shared physical disks
  • The name of the filesystem is 'lustre'
  • STONITH method is IPMI and the IPMI interface is configured for remote access
  • No software RAID, all RAID is implemented via hardware solutions
Configuring Linux-HA
1) Install Linux-HA
# yum -y install heartbeat


2) Comment out all Lustre mounts from /etc/fstab and umount existing Lustre server and client filesystems. This will ensure no data corruption or contention issues when starting Heartbeat.
MGS/MDS Pair
mgs # cat /etc/fstab | grep lus
#/dev/MGTDISK /mnt/lustre/mgs lustre defaults,_netdev 0 0
mds # cat /etc/fstab | grep lus
#/dev/MDTDISK /mnt/lustre/mds lustre defaults,_netdev 0 0

OSS Pair
oss01 # cat /etc/fstab | grep lus
#/dev/OST00DISK /mnt/lustre/oss00 lustre defaults,_netdev 0 0
#/dev/OST02DISK /mnt/lustre/oss02 lustre defaults,_netdev 0 0
oss02 # cat /etc/fstab | grep lus
#/dev/OST01DISK /mnt/lustre/oss01 lustre defaults,_netdev 0 0
#/dev/OST03DISK /mnt/lustre/oss03 lustre defaults,_netdev 0 0

3) Create all mount points on both nodes in each node-pair
MGS/MDS Pair
# mkdir /mnt/lustre/mgt
# mkdir /mnt/lustre/mdt
OSS Pair
# mkdir /mnt/lustre/ost00
# mkdir /mnt/lustre/ost01
# mkdir /mnt/lustre/ost02
# mkdir /mnt/lustre/ost03

4) Execute '/sbin/chkconfig --level 345 heartbeat on' on all 4 nodes

5) /etc/ha.d/ha.cf changes
MGS/MDS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 10100
auto_failback off
stonith_host mgs external/ipmi mds 10.0.1.100 admin adminpassword
stonith_host mds external/ipmi mgs 10.0.1.101 admin adminpassword
node mgs
node mds

OSS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
# different from MGS/MDS node-pair
udpport 10101
auto_failback off
stonith_host oss01 external/ipmi oss02 10.0.1.102 admin adminpassword
stonith_host oss02 external/ipmi oss01 10.0.1.103 admin adminpassword
node oss01
node oss02

6) /etc/ha.d/authkeys changes
MGS/MDS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourMGSMDSPassphraseHere
OSS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourOSSPassphraseHere

7) /etc/ha.d/haresources changes
MGS/MDS Pair
# cat haresources | grep -v '#'
mgs Filesystem::/dev/MGTDISK::/mnt/lustre/mgt::lustre
mds Filesystem::/dev/MDTDISK::/mnt/lustre/mdt::lustre
OSS Pair
# cat haresources | grep -v '#'
oss01 Filesystem::/dev/OST00DISK::/mnt/lustre/ost00::lustre
oss02 Filesystem::/dev/OST01DISK::/mnt/lustre/ost01::lustre
oss01 Filesystem::/dev/OST02DISK::/mnt/lustre/ost02::lustre
oss02 Filesystem::/dev/OST03DISK::/mnt/lustre/ost03::lustre

8) Specify the address of the failover MGS node for all Lustre filesystem components
mds # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/MDTDISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST00DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST01DISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST02DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST03DISK

9) Execute 'service heartbeat start' on MGS/MDS pair

10) Execute 'service heartbeat start' on OSS pair

11) Mount the Lustre filesystem on all clients
client # mount -t lustre 10.0.0.101@tcp0,10.0.0.100@tcp0:/lustre /mnt/lustre
client # cat /etc/fstab | grep lustre
10.0.0.101@tcp0,10.0.0.100@tcp0:/lustre /mnt/lustre lustre defaults 0 0

With the above setup, if a single node within each pair (MGS/MDS and OSS01/OSS02) were to fail, the clients would be able to successfully recover and continue their I/O operations after the specified timeout period. Linux-HA is not designed for immediate failover, and a recovery can often take on the order of minutes when resources need to move from one node in a pair to the other. While this solution will not provide immediate failover, it does allow administrators to set up an inexpensive system that will automatically recover from hardware failures without lengthy downtimes and impacts to users.
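
A simple way to exercise the failover (a hedged example; the commands assume the node names used above) is to stop Heartbeat on, or power off, one node of a pair and watch its partner pick up the resources:

oss01 # service heartbeat stop (or power the node off to simulate a crash)
oss02 # tail -f /var/log/ha-log
oss02 # df -h | grep lustre (the OST mounts from oss01 should appear here)
client # dd if=/dev/zero of=/mnt/lustre/failover_test bs=1M count=100 (I/O should pause during recovery, then complete)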

Friday, October 17, 2008

Building a new Lustre Filesystem

Here are the quick and dirty steps to create a new Lustre filesystem for testing purposes. I use this at times to try out commands and benchmarking tools, not to test performance but to ensure they operate correctly.

This is a simple test environment on a single system with a single physical disk. Lustre is designed for scalability, so these commands can be run on multiple machines and across many disks to ensure that a bottleneck does not occur in larger environments. The purpose here is to generate a working Lustre filesystem for testing and sandbox work.

This set of directions assumes you have compiled and installed both the Lustre kernel and the Lustre userspace bits. Check my previous blog posting for how to complete those items if necessary. This also assumes that you have a spare physical disk that can be partitioned to create the various components of the filesystem. In the example case below I created the filesystem within a Xen virtual machine.

1) Create a script to partition the disk that will be used for testing (using /dev/xvdb for example purposes)
#!/bin/sh
sfdisk /dev/xvdb << EOF
,1000,L
,1000,L
,2000,L
,2000,L
EOF

2) Format the MGS Partition
- mkfs.lustre --mgs --reformat /dev/xvdb1
- mkdir -p /mnt/mgs
- mount -t lustre /dev/xvdb1 /mnt/mgs

3) Format the MDT Partition
- mkfs.lustre --mdt --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb2
- mkdir -p /mnt/lusfs01/mdt
- mount -t lustre /dev/xvdb2 /mnt/lusfs01/mdt

4) Format the First OST Partition
- mkfs.lustre --ost --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb3
- mkdir -p /mnt/lusfs01/ost00
- mount -t lustre /dev/xvdb3 /mnt/lusfs01/ost00

5) Format the Second OST Partition
- mkfs.lustre --ost --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb4
- mkdir -p /mnt/lusfs01/ost01
- mount -t lustre /dev/xvdb4 /mnt/lusfs01/ost01

6) Mount the client view of the filesystem
- mkdir -p /mnt/lusfs01/client
- mount -t lustre 127.0.0.1@tcp0:/lusfs01 /mnt/lusfs01/client

At this point you should be able to do an ls, touch, rm or any other standard file manipulation command on files in /mnt/lusfs01/client.
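
A few optional sanity checks on the new filesystem (not required, just convenient; the paths match the mounts above):

# lfs df -h /mnt/lusfs01/client (should list the MDT and both OSTs)
# dd if=/dev/zero of=/mnt/lusfs01/client/testfile bs=1M count=10
# lfs getstripe /mnt/lusfs01/client/testfile (shows which OST objects back the file)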

Thursday, October 16, 2008

Building Lustre 1.6.5.1 against the latest Redhat Kernel

I was at a customer site this week and had the need to build Lustre 1.6.5.1 against the latest kernel from Redhat, 2.6.18-92.1.13. Since this process has multiple steps, I thought I would document it so that others do not have to reinvent the wheel.

1) Prep a build environment
- cd ~
- yum install rpm-build redhat-rpm-config unifdef
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- echo '%_topdir %(echo $HOME)/rpmbuild' > .rpmmacros
- rpm -i http://mirror.centos.org/centos/5/updates/SRPMS/kernel-2.6.18-92.1.13.el5.src.rpm
- cd ~/rpmbuild/SPECS
- rpmbuild -bp --target=`uname -m` kernel-2.6.spec 2> prep-err.log | tee prep-out.log

2) Download and install quilt (quilt is used for applying kernel patches from a series file)
- cd ~
- wget http://download.savannah.gnu.org/releases/quilt/quilt-0.47.tar.gz
- gunzip quilt-0.47.tar.gz
- tar -xvf quilt-0.47.tar
- cd quilt-0.47
- ./configure
- make
- make install

3) Prepare the Lustre source code
- Download from http://www.sun.com/software/products/lustre/get.jsp
- mv lustre-1.6.5.1.tar.gz /usr/src
- gunzip lustre-1.6.5.1.tar.gz
- tar -xvf lustre-1.6.5.1.tar

4) Apply the Lustre kernel-space patches to the kernel source tree
- cd /root/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/
- ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-rhel5.series series (there are several different series files in the series dir; choose the one closest to your environment)
- ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/patches patches
- quilt push -av

5) Compile a new kernel from source
- make distclean
- make oldconfig dep bzImage modules
- cp /boot/config-`uname -r` .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts
- make rpm
- rpm -ivh ~/rpmbuild/RPMS/kernel-2.6.18prep-1.x86_64.rpm
- mkinitrd /boot/initrd-2.6.18-prep.img 2.6.18-prep
- Update /etc/grub.conf with new kernel boot information
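
As a hedged example only, the new grub.conf stanza might look something like the following; the exact kernel and initrd file names depend on what the kernel RPM installed into /boot (check ls /boot after the previous steps), and the root= and hd() values must match your system:

title Linux 2.6.18-prep (Lustre patched)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-prep ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-prep.img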

6) Reboot system with new, patched kernel

7) Compile Lustre with the new kernel running
- cd /usr/src/lustre-1.6.5.1
- ./configure --with-linux=/root/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18.x86_64
- make rpms (the built RPMs will be in ~/rpmbuild/RPMS)

8) Install the appropriate RPMs for your environment
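
The exact package names depend on your kernel version string, so as an illustration only (using wildcards rather than exact names), installing the server packages might look like this. cd to the directory reported by make rpms, then:

- rpm -ivh lustre-1.6.5.1-*.x86_64.rpm lustre-modules-1.6.5.1-*.x86_64.rpm lustre-ldiskfs-*.x86_64.rpm

On a client-only node the lustre-ldiskfs package is not needed, since ldiskfs is only used as the backing filesystem on servers.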