To continue my previous post on cluster management, I want to focus on the tools available for implementing and monitoring cluster health, including process, hardware and configuration management.
There are two primary ways to build a change management and cluster management system. The first is to go with a complete Linux stack solution that is integrated with a scheduler, monitoring utilities and OS deployment tools. The second is to build a suite of tools from the commercial and open source offerings in the field. Both have their benefits and tradeoffs; ultimately, most firms use a combination of the two.
Types of Tools
There are several types of tools that are necessary to manage any cluster, large or small. The tools are categorized by the need they fill in the overall management of a cluster, including request tracking, change management, availability monitoring, performance monitoring and operating system deployment.
It is important when evaluating an HPC software stack, whether complete or built from individual pieces, to ensure that each of these components is included and evaluated for the capability it provides versus similar, competing products.
Complete HPC stacks are becoming more common because of their ease of integration and integrated support models. Complete stacks usually consist of all the base software needed to deploy and manage a cluster, as well as the libraries needed for parallel job execution. These stacks significantly cut the time needed to deploy new clusters and ensure that all initial software on the system is compatible and fully tested.
The difficulty with stacks is their fixed library versions and smaller compatibility matrices. These stacks are very tightly integrated solutions that ensure the components are compatible and stable. They can present a challenge for sites that have outside requirements for different versions of libraries and compilers than the complete stack provides. While this is a challenge for some complex installations, the standard set of tested and integrated libraries provides a much easier solution for companies just using mainstream ISV applications. The developers of the primary stacks on the market work to ensure their kernel and library versions are within the framework that the primary ISVs support and expect.
Even in environments where a complete HPC stack solution has been deployed, there may be a need for additional tools to meet all operational requirements. The individual tools mentioned below can be used to fill some of these needs, as well as serve as a starting point for companies that decide not to use an integrated stack solution, but instead roll their own.
The primary benefit of rolling your own stack based on these and other tools is that it will much more closely meet your company's needs. The integrated stacks are meant to meet very broad HPC needs within a given customer base, but by developing a custom stack, a company can ensure all its specific needs are met and that the stack integrates with existing company platforms. This integration can include management APIs that are similar to existing platforms, as well as data integration to ensure reporting, authentication and logging meet company standards.
Sun HPC Software, Linux Edition (http://www.sun.com/software/products/hpcsoftware/index.xml) – The Sun Linux HPC Stack is an integrated solution of open source software for deploying and managing the compute resources within an HPC environment. It includes a variety of tools for performance and availability monitoring, OS deployment and management, troubleshooting and necessary libraries to support the primary interconnects on the market.
Rocks (http://www.rocksclusters.org/wordpress/) - Rocks is an open source, community driven integrated solution for deploying and managing clusters. It is based on the concept of rolls; each roll is specific to an application or set of tools that may be needed in an HPC environment. This modularity allows users to add components as their needs evolve.
Trac (http://trac.edgewall.org/wiki/TracDownload) – Trac is a toolkit originally designed for software development organizations. It has integrated capabilities for tracking bugs, release cycles and source code, and a wiki for documenting notes and process information. These may all seem like capabilities specific to software development, but they can all be used very effectively to better manage and document the associated processes for a cluster.
Request Tracker (http://bestpractical.com/rt/) - Request Tracker is an integrated tool for tracking, responding to and reporting on support requests. It is heavily used in call center environments, and works very well in HPC environments for tracking customer requests for support, requests for upgrades and other system changes.
RASilience (http://sourceforge.net/projects/rasilience/) - RASilience is built around Request Tracker with the Asset Tracker and Event Tracker add-ons. It is an interface and general-purpose engine for gathering, filtering, and dispatching system events. It can be used to provide event correlation across all nodes and other components within a cluster.
Nagios (http://www.nagios.org/) – Nagios is an open source monitoring solution built on the idea of plugins: plugins can be developed to monitor a wide variety of platforms and applications while reporting back to a central interface for notification management, escalation and reporting.
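As a sketch of what the plugin model looks like, the fragment below follows the standard Nagios plugin contract: print a single status line and return 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The load-average metric, thresholds and the optional file argument are arbitrary choices for illustration.

```shell
#!/bin/sh
# Minimal Nagios-style check written as a shell function (illustrative
# sketch). Nagios plugins follow a simple contract: one human-readable
# status line on stdout, and an exit code of 0=OK, 1=WARNING,
# 2=CRITICAL, 3=UNKNOWN.
check_load() {
    warn=${1:-4}              # warning threshold (1-minute load average)
    crit=${2:-8}              # critical threshold
    src=${3:-/proc/loadavg}   # override the source file for testing

    load=$(cut -d' ' -f1 "$src" 2>/dev/null) ||
        { echo "UNKNOWN - cannot read $src"; return 3; }
    whole=${load%%.*}         # compare the integer part only

    if [ "$whole" -ge "$crit" ]; then
        echo "CRITICAL - load average is $load"; return 2
    elif [ "$whole" -ge "$warn" ]; then
        echo "WARNING - load average is $load"; return 1
    fi
    echo "OK - load average is $load"; return 0
}
```

In a real deployment Nagios would invoke a script like this on each node (for example via NRPE or ssh) and use the exit code to drive notifications and escalation.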
Ganglia (http://ganglia.info/) - Ganglia is a highly scalable, distributed monitoring tool for clusters. It can provide historical information on node utilization rates and performance via XML feeds from individual nodes, which can subsequently be aggregated for centralized viewing and reporting.
OneSIS (http://www.onesis.org/) - OneSIS is a tool for managing system images, both diskless and diskful. OneSIS is an effective way to ensure that all images within a cluster are stored in a central repository, and it integrates with the appropriate tools to use kickstart for installing new operating system images, as well as to boot nodes in a diskless environment.
Sun Grid Engine (http://gridengine.sunsource.net/) - SGE is a distributed resource manager with proven scalability to 38,000 cores within a Grid environment. SGE is rapidly being updated by Sun to more efficiently handle multi-threading and to improve launch times for jobs, as well as tty output for non-interactive jobs.
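As an illustration of how work is typically handed to SGE, jobs are usually submitted as shell scripts whose "#$" comment lines carry qsub directives. The parallel environment name "mpi" below is site-specific, and the job name and slot count are arbitrary examples.

```shell
#!/bin/sh
# Sketch of an SGE batch script; the "#$" lines are directives that qsub
# parses out of the comments (to the shell itself they are just comments).
#$ -N sample_job     # job name shown in qstat
#$ -cwd              # run from the directory the job was submitted in
#$ -j y              # merge stderr into stdout
#$ -pe mpi 16        # request 16 slots in a parallel environment
                     # ("mpi" is a site-specific PE name)

echo "job running on $(hostname)"
```

Submitting this with qsub and then watching qstat would show the job under the name given by the -N directive.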
Cluster Administration Package (http://www.capforge.org/cgi-bin/trac.cgi) – CAP is a set of tools for integrating clusters. It is designed and tested to accomplish three main objectives: Information Management, Control and Installation. CAP is a proven tool for deploying and managing a centralized set of configuration files within a cluster, and for ensuring that any changes to master configuration files are correctly propagated to all nodes within the cluster.
Cbench (http://cbench.sourceforge.net/) – Cbench is a set of tools for benchmarking and characterizing performance on clusters. Cbench can be used for both initial bring up of new systems, as well as testing of hardware that has been upgraded, modified or repaired.
ConMan (http://home.gna.org/conman/) - ConMan is a console management utility. It is most often used as an aggregator for the large number of serial console outputs within clusters. It can both capture console output to a file for later reference and allow administrators to connect to a console in read-write mode.
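A minimal conman.conf might look something like the fragment below; this is an illustrative sketch only (the node names and device paths are hypothetical, and directive spellings should be checked against the conman.conf man page).

```
# Illustrative conman.conf fragment (hypothetical names and devices):
server  logdir="/var/log/conman"
global  log="%N.log"                 # one logfile per console; %N = console name
console name="node001" dev="/dev/ttyS0"
console name="node002" dev="/dev/ttyS1"
```

An administrator could then run "conman -m node001" to watch that node's console read-only, while the daemon keeps logging everything to file.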
Netdump (http://www.redhat.com/support/wpapers/redhat/netdump/) - Netdump is a crash dump logging utility from Red Hat. Its purpose is to ensure that if a node with no console attached crashes, administrators have a reference point within the logs to catch the crash and debug output.
Logsurfer (http://www.crypt.gen.nz/logsurfer/) - Logsurfer is a regular expression driven utility for matching incoming log entries and taking action based upon matches. Logsurfer can take a variety of actions when a match occurs, including running an external script, or counting the number of entries until a threshold is met.
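A Logsurfer rule couples a match expression with an action. The sketch below is illustrative only: the field layout (match, no-match, stop and no-stop regexes plus a timeout, with "-" marking an unused field) should be verified against the logsurfer man page, and the alert script path is hypothetical.

```
# Illustrative Logsurfer rule: when a filesystem error appears in the
# stream, run an external alert script (path is hypothetical).
'kernel: EXT3-fs error' - - - 0
    exec "/usr/local/sbin/node-alert.sh"
```

Rules like this are what make the central-log-host pattern described later so effective: one daemon, one rule set, every node covered.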
Specific Tool Integration Techniques
These are some specific methods that colleagues and I have used to integrate these tools into larger frameworks for change management and monitoring in enterprise environments. They are meant to show how the different tools, used in combination, can simplify cluster management and lower administration costs. All of these methods have also been tested at scales well beyond typical HPC systems today, including OneSIS and Cbench, which have been tested at scales of up to 4,500 nodes.
OneSIS can be used in two primary ways within a cluster, independently or in combination. The first and most common is to assemble an image that is then deployed to all compute nodes and installed locally. OneSIS can also distribute that image to all compute nodes so they run diskless, using the image from a central management server.
These methods can also be used in combination when preparing to upgrade a cluster. A new image can be developed and booted in diskless mode on a subset of a cluster's nodes. Those nodes can then be used to test all applications and cluster uses to ensure the image is correct. Once that testing is complete, OneSIS can be used to ensure an exact copy of the tested image is installed on all compute nodes. This method ensures that no bad images are installed on the cluster, and that the majority of the cluster nodes can be left in place for production users while the new image is tested.
Nagios is a very dynamic tool because of its ability to use plugins for monitoring and response. Plugins can be written for any variety of hardware within a cluster to ensure devices are online, are not showing excessive physical errors and do not need proactive attention. Nagios's dynamic nature also allows it to communicate with centralized databases of node information and report any hardware or node problems to RT for proper tracking and attention.
Nagios plugins can easily be used to remotely execute health check scripts on compute nodes. These health check scripts can verify that nodes are operating and responding correctly, that there are no hung processes that might affect future jobs, and that the node's configuration files and libraries are the expected versions. If Nagios detects an error on a given node, it can easily be configured to automatically open an RT ticket for staff to repair the node, and to mark the node offline in the job scheduler until the node is repaired.
Cbench is a wonderful tool for automating the process of bringing up new clusters, as well as testing hardware that has been repaired or replaced to ensure it meets the same benchmarks as other hardware in the cluster. Cbench has a collection of benchmarks that can be used to verify that a new system's processors, storage, memory and attached file systems perform as designed. This can be a valuable tool for locating issues that were introduced during deployment and would otherwise cause performance decreases for users.
Cbench can also be used to ensure that repaired hardware was fixed correctly before being reintroduced into the cluster. By properly benchmarking a cluster at installation time, support staff can run identical benchmarks on nodes that have subsequently been repaired. These new results can be compared to the initial results from the cluster to ensure that the node is now operating at peak, expected performance.
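The comparison itself can be as simple as a tolerance check. The helper below is a hypothetical sketch (not part of Cbench) of that step: it flags a repaired node whose benchmark score has dropped more than a few percent below the baseline recorded at installation time.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cbench) illustrating the comparison
# step: check a repaired node's benchmark score against the baseline
# recorded at cluster installation, allowing a small tolerance.
# Usage: check_bench <baseline_score> <measured_score> [tolerance_pct]
check_bench() {
    baseline=$1
    measured=$2
    tol=${3:-5}    # allow the score to be up to 5% below baseline

    awk -v b="$baseline" -v m="$measured" -v t="$tol" 'BEGIN {
        # Fail if the measured score fell more than t percent below baseline.
        if (m < b * (1 - t / 100)) { print "FAIL"; exit 1 }
        print "PASS"
    }'
}
```

For example, a node scoring 1190 against an installation-time baseline of 1200 would pass at the default 5% tolerance, while a node scoring 1100 would be flagged for another look before rejoining the cluster.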
Logsurfer is best used as an aggregator and automated response mechanism within a cluster. By having all nodes send their respective logs to a central log host, it enables the cluster administrators to configure a single Logsurfer daemon to monitor and respond to appropriate log entries.
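With classic syslog, that forwarding is a single configuration line on each compute node:

```
# /etc/syslog.conf on each compute node: forward all messages to the
# central log host that Logsurfer watches ("loghost" is a placeholder
# for your aggregation server's name).
*.*     @loghost
```

Every node then streams its log entries to the one host where the Logsurfer daemon and its rule set live.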
Many sites will subsequently configure Logsurfer to proactively mark nodes offline in the scheduler if an error relating to that node is found in the logs. This ensures that no future jobs are run on the node until repair staff are able to verify the node is operating correctly and resolve the cause of the initial error.
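The handler such a rule executes can be very small. The sketch below assumes SGE, whose qmod -d command disables queue instances; the function name and the DRY_RUN test hook are hypothetical, not part of SGE.

```shell
#!/bin/sh
# Hypothetical handler a log monitor could exec when a node reports a
# fault: disable every SGE queue instance on that host so the scheduler
# starts no new jobs there. "qmod -d" disables queues; the DRY_RUN
# variable is only a hook for testing this sketch, not part of SGE.
# Usage: offline_node <hostname>
offline_node() {
    node=$1
    [ -n "$node" ] || { echo "usage: offline_node <hostname>" >&2; return 2; }

    # "*@host" addresses all queue instances on that one host.
    cmd="qmod -d *@$node"
    if [ -n "$DRY_RUN" ]; then
        echo "$cmd"    # show what would run instead of running it
    else
        $cmd
    fi
}
```

A fuller version would also open an RT ticket, mirroring the Nagios workflow described above, so the offline node is tracked until repair staff clear it.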
Clusters are complex mixes of hardware and software; the more effectively the tools are chosen and integrated early in system design, the more efficiently the system can be managed. There are many tools available, both commercial and open source, that can be used in cluster environments. It is critical that each one's benefits, tradeoffs and scalability be weighed when picking the tools for your environment.
As a final thought, clusters are complex solutions that often require customization at every level, and that extends to the applications used to manage the cluster. It is always an option to develop a tool in house for your needs; chances are, if you have a need, so does someone else. The majority of the tools above were developed because a single company had a need, built a tool to meet it and put the tool back into the community for everyone else to use. This is a wonderful way not only to keep improving the capabilities we as a community have around clusters, but also to gain company recognition in a rapidly growing field.