
Thursday, March 15, 2012

Isn't Big Data just a new name for HPC?

High Performance Computing (HPC) is a field of study and associated technologies that has been around for multiple decades. Big Data is an emerging term used to describe the new types of operational challenges and complexity that come with today's growing data sets, which must be stored, analyzed and understood. Big Data has many roots in the HPC space, including parallel programming methods, data set size and complexity, and algorithms used for data analysis, manipulation and understanding.

Ultimately, HPC and Big Data are not technologies. They are a common set of concepts and practices, supported by specific tools and technology. Each is commonly used to represent a set of problems and the technical solutions for solving those problems. HPC and Big Data overlap in many places, but each also has domains that are unique to it and do not overlap.

HPC commonly includes technologies and concepts like the following (certainly not an exhaustive list):

  • Message Passing Interface (MPI) – MPI provides a common set of functions that enable distributed processes to communicate at high speeds (a launch sketch follows this list).
  • Parallel File Systems – Parallel file systems allow for high levels of throughput by simultaneously writing and reading across many different storage servers that appear as a single file system and name space. Parallel file systems enable access to single data sets from many different systems, at high speed, and ensure data integrity while many different systems are simultaneously reading and writing to different files and locations within the file system.
  • Lustre – Lustre is an open source, commercially supported parallel file system that is highly scalable, commonly allowing for use on multi-thousand node clusters and multi-thousand processor HPC systems.
  • Infiniband – Infiniband is a network technology that provides very high levels of bi-directional bandwidth in an HPC environment at lower latency than is commonly available with traditional Ethernet-based technologies.
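To make that workflow concrete, a typical MPI application is compiled against an MPI library and then launched across many nodes by a process launcher. This is only a minimal sketch, assuming an Open MPI-style toolchain; hello.c and hosts.txt are placeholder files:

mpicc hello.c -o hello
mpirun -np 16 --hostfile hosts.txt ./hello

Each of the 16 processes can then exchange messages through MPI calls at whatever speed the interconnect allows.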

Big Data commonly includes technologies and concepts like:

  • Map Reduce – Map Reduce is a programming model and set of functions that allow for the distributed analysis of large data sets. Map Reduce is the result of many years of work in the computer science field and research papers from a variety of universities and technology-focused companies.
  • Distributed File Systems – Distributed file systems provide the ability to store large data sets with no pre-defined structure, distributed across many commodity nodes. Distributed file systems work in the Big Data space to provide scalability, data locality and data integrity through replication and checksum validation on read.
  • Hadoop – Hadoop is an open source, commercially supported implementation of Map Reduce and the Hadoop Distributed File System (HDFS). Hadoop has enabled a large ecosystem of additional tools to process and exploit the data stored in a Hadoop environment, and it integrates into data pipelines, allowing for complex storage and analysis of data across a variety of tools (see the sketch after this list).
  • HPCC Systems (LexisNexis) – Using the Enterprise Control Language (ECL) for development, HPCC Systems is an open source application stack from LexisNexis for storing and analyzing large, complex data sets.
  • NoSQL – Not-only-SQL (NoSQL) is an emerging set of tools for storing loosely structured data and providing access through SQL-like interfaces, while removing some of the more complex functionality that is common in traditional relational databases but unnecessary for many current NoSQL use cases.
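To illustrate the Hadoop model mentioned above, a small job can be staged and run entirely from the command line. This is a sketch only; the example jar name and HDFS paths are assumptions that vary by Hadoop version:

hadoop fs -mkdir /user/demo/input
hadoop fs -put weblogs.txt /user/demo/input
hadoop jar hadoop-examples-*.jar wordcount /user/demo/input /user/demo/output
hadoop fs -cat /user/demo/output/part-r-00000

The data is copied into HDFS once, and the wordcount job is then shipped to the nodes that hold the blocks, which is the data-locality idea discussed later in this post.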

Common workloads for use in Big Data environments:

  • Better advertising – Many online retailers and businesses utilize technologies within the Big Data space to provide targeted advertising, ensuring higher acceptance rates and more purchases by customers.
  • Social Networking – The rapid rise of social networking sites and tools is the most common example of Big Data. Today, tools like Hadoop, MongoDB and Cassandra are the supporting technology for the majority of social networking sites. These tools have been developed specifically to meet the needs and requirements of social networking companies.
  • Recommendation engines and matching – Big Data tools like Hadoop are commonly used to make purchase recommendations to customers; these recommendations are driven by large data sets specific to the customer type, based on previous purchases, recommendations by friends and other items priced and reviewed online.
  • Differential Pricing – Differential pricing is becoming more and more common as tools that can quickly determine market value get deployed. Differential pricing is the adjustment of a good or service’s price, up or down, to influence demand. While retailers have long used this to control inventory, Big Data technologies allow it to be done more rapidly, with prices set automatically based on market dynamics.

Common workloads for use in HPC environments:

  • Chemistry and Physics research – Many workloads related to chemistry and physics modeling require a level of inter-process communication that Big Data technologies do not provide. These workloads are commonly run in traditional HPC environments, providing researchers with proven methods to model new chemicals and the physical reactions expected in actual experiments.
  • Oil & Gas Modeling – The types of modeling that the Oil & Gas industry commonly does involve large data sets captured from the field that require processing before decisions can be made about how to properly exploit an energy reserve. This modeling is commonly run in traditional HPC environments and has many years of proven technology behind it.

Fundamentally, Big Data and HPC differ in one major aspect – what data moves and where it moves. The big difference between traditional HPC environments and Big Data environments is where the data is relative to where it is processed. Within HPC environments the data is always moved to the location where it will be processed. In Big Data environments the job is moved to the location of the data to minimize data movement. The practical struggle is that HPC users and Big Data users commonly demand dedicated environments; this can increase operational costs for IT departments by requiring separate sets of hardware, each managed with its own efficiency metrics to be monitored.

So, can these different methods be mixed? It is becoming more and more common for HPC departments to receive requests to enable newer Big Data type applications in traditional HPC environments. Ultimately both HPC and Big Data are about taking very large, complex data sets and analyzing the information to enable better understanding and decisions. The methods taken are what differ.

I will target some future postings on the technical implementations of running Hadoop and other Map Reduce frameworks on traditional HPC environments.

Monday, June 6, 2011

Hadoop Whitepapers

Below are links to two whitepapers I recently wrote and published as part of my role at Dell.
Enjoy!

Tuesday, May 5, 2009

"Cloud" and HPC?, Huh?

I have tried for the most part to not post on this phenomenon known as "cloud computing." "Cloud" is still evolving and as such has many different meanings. The reason this whitepaper caught my attention is its attempt at connecting high performance computing (HPC) with "cloud computing." The way I see it, "cloud" is still more of an evolving idea than a true product. True, many companies are offering "cloud" products, but the standards are still evolving, as is the true meaning of "cloud computing."

In my mind "cloud" is the next logical evolution of computing - better resource management through enabling applications to better communicate with their supporting infrastructures (servers, storage, network, CPU and memory resources) so that applications have the intelligence to scale up and down based on demand. "Cloud computing" also has a valid connection to outsourcing in the sense that shared infrastructures will at some point overtake the privately managed information technology (IT) infrastructures that are common today.

There are several points about the above listed whitepaper from UnivaUD that caught my attention:
  • MPI was only mentioned once. The Message Passing Interface (MPI) is the standard on which most HPC applications and platforms are built. For a paper to truly look at the potential of outsourcing HPC to a "cloud" environment, an in-depth review of MPI will need to be done to ensure the proper updates are made to handle the additional physical layer errors that could occur in a shared environment, as well as the added challenges of communication in an unknown environment.
  • There was very little mention of the actual applications that are common in HPC. Applications like Fluent, NAMD, NWChem, Gaussian, and FFTW are commonly used on clusters built in house to meet the specific needs of a given community. Moving those applications out of these small, in-house environments will take time and review to ensure they are able to scale in shared environments, as well as properly handle the increased variation possible in hardware and configurations.
  • There was no mention of parallel file systems. This is a fundamental requirement of modern HPC environments. To truly move common HPC environments into the "cloud" a solution will be needed for data management and transfer at the high speeds required by today's applications.
In short, the above linked whitepaper is typical of what I am seeing in the "cloud" space; lots of talk of the possible benefits around the use of shared environments. What we need to stop doing as a community is trying to associate all things IT with "cloud." I have no doubt that in time we will evolve to more use of shared resources - this has been occurring for quite a while with the migration to larger clusters within universities and national laboratories, as well as the ongoing outsourcing of email and specific applications - but as a community we need to ensure that each time we change how we do things for a given area of IT it is with specific goals in mind. Without those clearly defined goals we will not know if we were successful.

As time allows I hope to explore the above issues, particularly looking at alternatives for parallel file systems in environments that may have varying latency, and are distributed over various data centers.

Sunday, March 1, 2009

Ethernet Improvements

I wanted to call attention to a couple of interesting projects in the Ethernet space. They both have the goal of lowering the overhead that is commonly associated with using Ethernet within an HPC environment.

Open-MX
GAMMA

There is a very good write up at http://www.linux-mag.com/id/7253, giving a little more background on each.

Sunday, January 25, 2009

Lustre 1.6.6 with MX 1.2.7

Below is the process for installing Lustre 1.6.6 while using MX (Myricom) as the transport.

1) Compile and install Lustre Kernel
- yum install rpm-build redhat-rpm-config
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- echo '%_topdir %(echo $HOME)/rpmbuild' > .rpmmacros
- rpm -ivh kernel-lustre-source-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64.rpm (can be obtained from http://www.sun.com/software/products/lustre/get.jsp)
- make distclean
- make oldconfig dep bzImage modules
- cp /boot/config-`uname -r` .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts
- make rpm
- rpm -ivh ~/rpmbuild/kernel-lustre-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64.rpm
- mkinitrd /boot/initrd-2.6.18-92.1.10.el5_lustre.1.6.6.img 2.6.18-92.1.10.el5_lustre.1.6.6
- Update /etc/grub.conf with new kernel boot information
- /sbin/shutdown 0 -r

2) Compile and install MX Stack
- cd /usr/src/
- gunzip mx_1.2.7.tar.gz (can be obtained from www.myri.com/scs/)
- tar -xvf mx_1.2.7.tar
- cd mx-1.2.7
- ln -s common include
- ./configure --with-kernel-lib
- make
- make install

3) Compile and install Lustre
- cd /usr/src/
- gunzip lustre-1.6.6.tar.gz (can be obtained from http://www.sun.com/software/products/lustre/get.jsp)
- tar -xvf lustre-1.6.6.tar
- cd lustre-1.6.6
- ./configure --with-linux=/usr/src/linux --with-mx=/usr/src/mx-1.2.7
- make
- make rpms (at the bottom of the output it will show location of the generated RPMs)
- rpm -ivh lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm \
      lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm \
      lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm

4) Add the following lines to /etc/modprobe.conf
options kmxlnd hosts=/etc/hosts.mxlnd
options lnet networks=mx0(myri0),tcp0(eth0)
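Once the modules load, the configured networks can be sanity checked; this is just a quick verification step I find useful, and it should list one NID per network defined above:

- modprobe lnet
- lctl network up
- lctl list_nids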

5) Populate myri0 Configuration with proper IP addresses
- vim /etc/sysconfig/network-scripts/ifcfg-myri0

6) Populate /etc/hosts.mxlnd with the following information
# IP HOST BOARD EP_ID
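For example, a node with IP 10.0.0.1, hostname node01, a single Myricom board (board 0) and endpoint ID 3 would get an entry like the following (values are illustrative only):

10.0.0.1 node01 0 3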

7) Start Lustre by mounting the disks that contain the MGS, MDT and OSS data stores

Monday, November 17, 2008

Security Planning in HPC

Today's high performance compute (HPC) solutions have many components including compute nodes, shared storage systems, high capacity tape archiving systems and shared interconnects such as Ethernet and Infiniband. One primary reason companies are turning to HPC solutions is the cost benefit of shared infrastructure that can be leveraged across many different projects and teams. While this shared usage model can allow for managed, cost effective growth, it also introduces new security risks and requirements for policies and tools to ensure previously separate data is managed properly in a shared environment.

This shared infrastructure model that is often used in HPC has several areas around data security that should be addressed prior to deploying shared solutions. Often, companies will have some departments working on sensitive projects while others work on very public ones; other firms may be working with their customers' proprietary data; and most companies face the threat of outside competitors trying to gain access to confidential work. All of these issues must be addressed in shared HPC solutions to ensure that data is always secure, that a reliable audit platform is in place and that security policies can be changed rapidly as company needs and policies change.

When evaluating an HPC solution to ensure data access is managed within company policy, there are several components within the cluster that should be reviewed separately:

Shared file systems – Today's HPC solutions owe much of their rapid success to the availability of massively parallel file systems. These are scalable solutions for doing very high speed I/O and are often available on all nodes within a cluster.

Databases – More than ever, companies are utilizing databases as a way to organize massive amounts of both transactional and reporting data. Often these databases are paired with HPC solutions to evaluate the data in a very scalable and reliable manner. These databases often contain a variety of data including sales, forecasting, payroll, procurement and scheduling, just to name a few.

Local disk – More often than not, compute nodes have local disks that provide a local operating system and swap space. This swap space, and possibly temporary file systems, can give users a place to store data while jobs are running, but it is also a location that must be assessed to ensure access is limited to those who need it.

Compute node memory – Compute nodes also have local physical memory that could be exploited by software flaws to allow unexpected access.

Interconnects – Today's HPC systems often use a high speed interconnect like Infiniband or 10Gbit Ethernet; these, like any other type of network connection, present the opportunity for sniffing or otherwise monitoring traffic.


Policies
Today's companies often work for a variety of customers, as well as on internal projects. It can be a complicated balancing act to ensure that data access policies are in place to properly handle those cases. Some data will require very restrictive policies, while other data will require a very open policy around usage and access. Often, separate filesystems can be utilized to ensure data is stored in manageable locations and access is granted pursuant to company policies.

There are two primary components to developing these security policies. The first is to assess the risk associated with each component of the system; this risk assessment can include costs in dollars, costs in time and damage to public perception if data were to be handled incorrectly per industry best practices or legal guidelines. The second is to develop policies that mitigate that risk to acceptable levels.

Some common methods to mitigate risk across the above components are:

Data Isolation – Within a shared computing environment data can be isolated in a variety of ways including physical isolation using different storage arrays, logical isolation using technology like VLANs and access restrictions like file permissions.

Audit Trails – Considering audit trails and how to implement them is important. This ensures both a path to isolating and resolving problems and that legal compliance regulations are met. Audit trails can include system log files, authentication log files, resource manager logs and many others to provide end-to-end documentation of a user and their activities.

Consistent Identity Management – To properly ensure that data is accessed by the correct individuals and that audit trails are consistent, it is important to have identity management solutions in place that handle HPC environments, as well as other enterprise computing resources, in a consistent manner. Identity management can be provided by tools like LDAP and Kerberos, as well as more advanced authentication and authorization systems.

Notifications – Notifications are an important part of the change management process within an HPC environment. Notifications can include alerts to security staff, administrators or management that portions of the cluster are out of company compliance, or that attempts to access restricted resources have occurred. Notifications can come from a variety of tools within an HPC environment, but should be uniform in format and content so that staff can respond rapidly to unexpected cluster issues.

Data Cleanup – Often jobs within an HPC environment will create temporary files on individual nodes, as well as on shared filesystems. These files have an impact on a system's risk assessment and should be properly cleaned up once they are no longer needed. Removing all data that is not needed limits the data that must be accounted for, as well as the potential exposure if a single system is compromised.

We have just finished reviewing risk assessments within an HPC environment. These allow management and administrators of HPC systems to understand the costs (political, financial, time) of any failure in security plans or processes. In addition to understanding risk, there is the added complication of enforcing these policies in a way that is consistent across the cluster, consistent across the company and provides a proper audit trail. The most common software methods for implementing these security policies are:

File System Permissions – File system permissions are the most common place to implement security controls, as well as one of the easiest items to complete and verify compliance with. These permissions allow administrators, at the lowest level, to grant and deny access to data based on need. They do not close every avenue of unauthorized access, but they do contribute to ensuring that day to day operation of the system is done reliably and securely.
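As a simple sketch of per-project access control using nothing but standard POSIX permissions (the group and path names are placeholders), a setgid project directory restricts data to members of a single group:

groupadd projalpha
mkdir -p /lusfs01/projects/projalpha
chgrp projalpha /lusfs01/projects/projalpha
chmod 2770 /lusfs01/projects/projalpha   # setgid keeps new files group-owned by projalpha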

Centralized Monitoring – Centralized monitoring and policy management are key to ensuring consistent security and minimizing human error. By using a central repository for all log entries, staff can implement tools to rapidly catch any activity that is unauthorized or unexpected and respond with the proper speed. Centralized policy management, through the use of tools like identity management, allows staff to quickly add or remove access based on business needs. By centralizing this policy management, a company can eliminate the often manual process of removing access and ensure proper checks are in place so that access changes are applied accordingly.

Resource Manager – Most modern clusters make use of a job scheduler, or resource manager, to allocate nodes and other resources to individual users to complete jobs. Most schedulers allow the creation of resource groups and restrictions on those groups to an individual user or users. By extending this functionality it is possible to restrict users' jobs to run on systems that hold data they are allowed to see, and ensure they cannot access nodes with filesystems they do not have permission to utilize. The resource manager is a centralized tool that provides great flexibility in ensuring users have access to the resources they need, but no other resources or data.
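As one hedged example using Sun Grid Engine (one of the tools covered in a later post), an access list can control which users may submit to the queue instances that mount a sensitive filesystem; the user, list and queue names here are placeholders:

qconf -au alice projalpha_users   # add user alice to the projalpha_users access list
qconf -mq projalpha.q             # edit the queue and set user_lists to projalpha_users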

Mounted File Systems – HPC environments will often utilize automated tools to unmount and remount filesystems based on user access needs. Unmounting a filesystem that is not required for a given user adds a level of access protection above file permissions, helping to ensure that only authorized users can access the data contained on that filesystem.
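One common way to implement this is with the automounter, so a project filesystem is only mounted while it is actively in use. A minimal autofs sketch, with placeholder server and export names:

# /etc/auto.master
/projects  /etc/auto.projects  --timeout=300

# /etc/auto.projects
projalpha  -fstype=nfs,rw  fileserver:/export/projalpha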


Shared infrastructure is a challenge in all environments when assessing security solutions. A shared infrastructure means that additional precautions must be taken in implementation and security policies to ensure that data and resources are used as expected and only by authorized individuals. When planning a shared environment the initial process should begin with a risk assessment to understand what components of the solution could be exploited and what the costs in time and money would be if that were to occur. That risk assessment can then be used to ensure the proper safeguards are implemented with available technologies to reduce the risk to a manageable and acceptable level for the company. Ultimately all safeguards should be implemented in a way that limits the potential for accidental failures in safeguards and reduces the need for manual administration and intervention. Shared resources are a challenge, but when properly managed, they can ensure better overall utilization for a company without sacrificing security.

Friday, October 17, 2008

Building a new Lustre Filesystem

Here are the quick and dirty steps to create a new Lustre filesystem for testing purposes. I use this at times to test out commands and test benchmarking tools, not to test performance, but to ensure they operate correctly.

This is a simple test environment on a single system with a single physical disk. Lustre is designed for scalability, so these commands can be run on multiple machines and across many disks to ensure that a bottleneck does not occur in larger environments. The purpose of this is to generate a working Lustre filesystem for testing and sandbox work.

This set of directions assumes you have compiled and installed both the Lustre kernel and the Lustre userspace bits. Check my previous blog posting for how to complete those items if necessary. This also assumes that you have a spare physical disk that can be partitioned to create the various components of the filesystem. In the example case below I created the filesystem within a Xen virtual machine.

1) Create a script to partition the disk that will be used for testing (using /dev/xvdb for example purposes)
#!/bin/sh
sfdisk /dev/xvdb << EOF
,1000,L
,1000,L
,2000,L
,2000,L
EOF
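The resulting partition table can be verified before formatting:

- sfdisk -l /dev/xvdb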

2) Format the MGS Partition
- mkfs.lustre --mgs --reformat /dev/xvdb1
- mkdir -p /mnt/mgs
- mount -t lustre /dev/xvdb1 /mnt/mgs

3) Format the MDT Partition
- mkfs.lustre --mdt --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb2
- mkdir -p /mnt/lusfs01/mdt
- mount -t lustre /dev/xvdb2 /mnt/lusfs01/mdt

4) Format the First OST Partition
- mkfs.lustre --ost --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb3
- mkdir -p /mnt/lusfs01/ost00
- mount -t lustre /dev/xvdb3 /mnt/lusfs01/ost00

5) Format the Second OST Partition
- mkfs.lustre --ost --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb4
- mkdir -p /mnt/lusfs01/ost01
- mount -t lustre /dev/xvdb4 /mnt/lusfs01/ost01

6) Mount the client view of the filesystem
- mkdir -p /mnt/lusfs01/client
- mount -t lustre 127.0.0.1@tcp0:/lusfs01 /mnt/lusfs01/client

At this point you should be able to do an ls, touch, rm or any other standard file manipulation command on files in /mnt/lusfs01/client.
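A few quick checks I find useful at this point (the test file name is arbitrary): lfs df shows the MDT and both OSTs with their usage, and lfs getstripe shows how a newly written file was striped across the OSTs.

- lfs df -h /mnt/lusfs01/client
- dd if=/dev/zero of=/mnt/lusfs01/client/testfile bs=1M count=100
- lfs getstripe /mnt/lusfs01/client/testfile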

Thursday, October 16, 2008

Building Lustre 1.6.5.1 against the latest Redhat Kernel

I was at a customer site this week and had the need to build Lustre 1.6.5.1 against the latest kernel from Redhat, 2.6.18-92.1.13. Since this process has multiple steps, I thought I would document it so that others do not have to reinvent the wheel.

1) Prep a build environment
- cd ~
- yum install rpm-build redhat-rpm-config unifdef
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- echo '%_topdir %(echo $HOME)/rpmbuild' > .rpmmacros
- rpm -i http://mirror.centos.org/centos/5/updates/SRPMS/kernel-2.6.18-92.1.13.el5.src.rpm
- cd ~/rpmbuild/SPECS
- rpmbuild -bp --target=`uname -m` kernel-2.6.spec 2> prep-err.log | tee prep-out.log

2) Download and install quilt (quilt is used for applying kernel patches from a series file)
- cd ~
- wget http://download.savannah.gnu.org/releases/quilt/quilt-0.47.tar.gz
- gunzip quilt-0.47.tar.gz
- tar -xvf quilt-0.47.tar
- cd quilt-0.47
- ./configure
- make
- make install

3) Prepare the Lustre source code
- Download from http://www.sun.com/software/products/lustre/get.jsp
- mv lustre-1.6.5.1.tar.gz /usr/src
- gunzip lustre-1.6.5.1.tar.gz
- tar -xvf lustre-1.6.5.1.tar

4) Apply the Lustre kernel-space patches to the kernel source tree
- cd /root/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/
- ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-rhel5.series series (there are several different series files in the series dir, choose the one closest to your environment)
- ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/patches patches
- quilt push -av

5) Compile a new kernel from source
- make distclean
- make oldconfig dep bzImage modules
- cp /boot/config-`uname -r` .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts
- make rpm
- rpm -ivh ~/rpmbuild/RPMS/kernel-2.6.18prep-1.x86_64.rpm
- mkinitrd /boot/initrd-2.6.18-prep.img 2.6.18-prep
- Update /etc/grub.conf with new kernel boot information

6) Reboot system with new, patched kernel

7) Compile Lustre with the new kernel running
- cd /usr/src/lustre-1.6.5.1
- ./configure --with-linux=/root/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18.x86_64
- make rpms (Build RPMs will be in ~/rpmbuild/RPMS)

8) Install the appropriate RPMs for your environment
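As a hedged example of step 8, assuming the default RPM names produced by the build: the server nodes (MGS/MDT/OST) need the ldiskfs, modules and utilities packages, while clients only need the modules and utilities.

- On MDS/OSS nodes: rpm -ivh lustre-ldiskfs-*.rpm lustre-modules-*.rpm lustre-1.6.5.1-*.rpm
- On client nodes: rpm -ivh lustre-modules-*.rpm lustre-1.6.5.1-*.rpm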

Wednesday, August 27, 2008

Risk Management in HPC

Risk management is a very broad topic within the project management space. It covers planning for and understanding the most unimaginable of possibilities within a project so that a plan is in place to respond to these situations and mitigate risk across the project. I will focus on risk management specifically in High Performance Computing (HPC) deployments. HPC, like any other specialty area, has its own specific risks and possibilities. Within HPC these risks are both procedural and technical in nature, but have equal implications for the overall delivery of a successful solution.

Risk management in any project begins with a risk assessment; this includes both identifying risks and possible risk mitigation techniques. This can be done through a variety of methods including brainstorming, the Delphi technique, or referencing internal documentation about similar, previous projects. This initial assessment phase is critical to ensure that both risks and responses are captured. By capturing both of these up front, it allows for better communication around the known risks, and better preparation for managing unknown risks. This risk assessment will produce a risk matrix: the documented list of possible risks to a project and their mitigation, or response, plans. The risk matrix will become part of the overall project delivery plan.

Risk Matrix
When beginning any HPC project, either an initial deployment or an upgrade, it is important to develop a risk matrix. This can include both known risks (late delivery, poor performance, failed hardware) and unknown risks. The unknown risks category is much more difficult to define for just that reason, but a common approach is to define levels of severity and responses. These responses can include procedural details, escalation details, communication information and documentation about the problem to prevent a reoccurrence.

This matrix should include a variety of information including:
  1. Risk name – Should be unique within the company to facilitate communication between groups and departments
  2. Risk Type – Minimal, Moderate, Severe, Extreme, etc.
  3. Cost if this risk occurs – This can be in time, money or loss of reputation, or all of the above.
  4. Recovery process – It is important to document early on how to respond to the risk and correct any problems that have developed because of it
  5. Risk Owner – Often a specific individual has additional experience dealing with a specific risk and can act as a Subject Matter Expert (SME) for the project team
  6. Outcome documentation – Clearly defining what should be documented should the risk occur, so that it can be responded to
  7. Communication Channels – Different risks require that different staff and management become engaged; it is important to document who should be involved should a risk occur
  8. Time Component – Every risk has a response, and every response has a time component associated with it. Understanding these time components up front allows project management staff to adjust schedules accordingly should a risk occur

Known Risks
Often, known risks are the easiest to plan for, but very difficult to handle. Understanding and anticipating a risk up front can fool us into believing we know the best response to the problem, when often the only way to truly understand how to respond to a problem is to get it wrong one or more times.

Let's explore some common risks that are specific to HPC deployments, and the most common mitigation strategies to combat them:

Application Scaling
A fundamental premise of HPC is that applications should scale in a way that makes more hardware produce more accurate results and/or more efficient production of data. Because of this, an application is often expected to perform with the same scalability on 64 nodes as it does on 128, and often many more. This type of scalability must be architected into the application as it is written and improved on as hardware performance evolves over time. Every time a newer, faster or bigger cluster is installed, there is an inherent risk that the applications previously used will not properly scale on the new platform.

Often the best mitigation strategy for this risk is proper planning, testing and benchmarking before system deployment. The most difficult time to manage an application scaling problem is after a customer's hardware has been delivered and installed. By benchmarking and testing the application prior to shipment, expectations with the customer can be properly set. It also allows proper time for working with any development teams to troubleshoot scaling problems and correct them before presenting results and completing acceptance testing with the customer.

Facility Limitations
HPC solutions often use large amounts of power, cooling and space within a data center compared to a company's business support systems or database-centric systems. Because of the large facility needs of HPC it is very common for customers to underestimate those needs, or for the numbers to be poorly communicated between vendor and customer. The power and cooling requirements can also vary widely based upon the customer's final use and intended application of the cluster.

All facility design issues should be managed and planned for before hardware is shipped or systems are assembled. To ensure a smooth cluster delivery, it is critical that site planning and assessment be done as part of the system design. This site planning should ensure there is enough power, cooling and space to accommodate the cluster. It should additionally work to ensure the power and cooling are in the proper places and can be directed to the cluster in the recommended fashion.

Mean Time between Failure (MTBF)
MTBF is a calculation used to understand how often components across a single node or cluster will fail. It averages the known and designed life cycles of all individual components to provide a time between each individual component failure. These component failures can either be severe enough to impact the whole system, or just portions of a cluster, based on the cluster's design. Often a completed cluster will fail in unexpected ways because of the MTBF characteristics of putting large numbers of compute nodes in a single fabric. If proper redundancy is not built into critical systems of the cluster, a customer satisfaction issue can develop because of prolonged and unplanned outages.

By properly assessing all uptime requirements from the customer, a system can be designed that will provide the uptime necessary to conduct business regardless of the collective MTBF across all components. Each individual service and capability of the cluster should be assessed to ensure that the proper level of redundancy, including clustered nodes, redundant power and redundant disks, is included with the complete solution.

Performance Targets, I/O and Compute
Performance guarantees are often included in HPC proposals to customers to provide a level of comfort when planning times for job completion and capacity planning for an organization. These numbers can often be sources of concern as a system is brought online if compute capacity is not as promised or I/O is not operating as fast as expected or promised.

There are often misunderstandings with complete cluster deployments about a cluster's capability for sustained versus peak performance. Sustained performance is most often the number used as a representative measure of how the system will perform over its life cycle, whereas peak is the level of performance often stated for bragging rights because it is the theoretical maximum potential of a given cluster.

There is very little that can be done after delivery of a system if this type of risk comes up, other than giving the customer additional hardware to pull the sustained performance number up to the peak performance number. This can be a very expensive response. This is the reason that the staff doing HPC architecture must fully understand application benchmarking and performance when designing new clusters. All numbers should also be reviewed by multiple people; this will ensure errors in math or testing methodology do not go unnoticed.

Unknown Risks
Often, planning for unknown risks can be the most stressful, but it can yield the most gains when actually responding. This is because of a lack of prior perceptions and the ability to be very creative with responses and future mitigation strategies. Risk planning for unknown risks is often an exercise in understanding the levels of severity that could occur with a problem, and associating each with the appropriate level of response and future prevention.

When defining response strategies for unknown risks, often the first step is to define levels of severity that could develop from any given problem. A common list is:
  1. Most severe level of risk, requires executive management level response to the customer and has a high percentage cost to the project (greater than 50% of project revenue is at risk).
  2. Severe level of risk, requires executive level of response and carries a medium level of financial risk (less than 50% of project revenue is at risk).
  3. Medium level project risk, requires senior management response, could or could not have a financial impact on the project, but does have a deliverable and schedule component.
  4. Lower level risk, has an impact on project schedule, but no negative impact on project financials.
  5. The lowest level of project risk, often just a communication issue with a customer or potential misunderstanding. Often no schedule impact or financial impact to the project.
The next step, after defining a list of possible problem levels, is to define mitigation strategies for each. Each mitigation strategy should include the following:
  1. Steps to research and understand the problem.
  2. Communication channels: who needs to be communicated with for a problem of this magnitude and how they are reached. This needs to include both customer and company contacts that will be necessary to correct the problem.
  3. Flow chart for responding: this is the path to determining the appropriate response and deciding if more resources, either financial or staffing, are needed to correct the risk.
  4. Documentation to prevent future occurrences. It is important to ensure that any information about the problem is gathered and documented to be used in house to prevent future occurrences of the same risk.
  5. Risk closure document: a checklist to document that all protocol was followed and the risk was corrected. This should include confirmation that the risk will not return on the same project because mitigation techniques have been implemented.
The mitigation strategies for the various levels of unknown risks can all be the same, or each level can have its own. Often a different strategy is used for each level because different executives or financial analysts will need to be involved in responses to problems of differing severity. The mitigation strategies are the company's last line of defense within a project to ensure that all problems, no matter the level, can be resolved and the project delivered smoothly.

Path Forward
The most important component of risk management is skill and experience development. It is important to ensure that, as a company, you have processes to document all experience gained while managing risk within your projects. This knowledge must be documented so that other teams, new teams and new staff can learn from the previous experience of the company.

The better a job that is done of documenting risk responses and lessons learned, the more efficiently companies can scope future projects. This allows companies to much more accurately assess costs for future projects, as well as risk versus reward tradeoffs for large, complex projects. Ultimately the best way to manage risk is to understand it before beginning actual deployment and implementation on a project. This comes from a combination of utilizing all data collected on previous projects as well as techniques like brainstorming and the Delphi technique to ensure as many risks as possible are documented with appropriate response plans.

Tuesday, August 5, 2008

Tools for Effective Cluster Management

To continue my previous post on cluster management, I wanted to focus on the tools that are available for implementing and monitoring cluster health including process, hardware and configuration management.


There are two primary ways that one can go about building a change management and cluster management system. The first is going with a complete Linux stack solution that is integrated with a scheduler, monitoring utilities and OS deployment tools. The second is to build a suite from commercially available or open source tools in the field. Both have their benefits and tradeoffs; ultimately most firms use a combination of the two.

Types of Tools
There are several types of tools that are necessary to manage any cluster, large or small. The tools are categorized by the need they fill in the overall management of a cluster, including request tracking, change management, availability monitoring, performance monitoring and operating system deployment.

It is important when evaluating an HPC software stack, either complete or built from individual pieces, to ensure that each of these components is included and evaluated for the capability it will provide versus similar, competing products.

Complete Stacks
Complete HPC stacks are becoming more common because of their ease of integration and integrated support models. Complete stacks usually consist of all the base software that is needed to deploy and manage a cluster, as well as the libraries needed for parallel job execution. These stacks significantly cut the time needed to deploy new clusters, as well as ensure that all initial software on the system is compatible and fully tested.

The difficulty with stacks is their fixed versions of libraries and smaller compatibility matrices. These stacks are very tightly integrated solutions that ensure they are compatible and stable. They can present a challenge for sites that have outside requirements for different versions of libraries and compilers than the complete stack provides. While this is a challenge for some complex installations, this standard set of tested and integrated libraries provides a much easier solution for companies just using mainstream ISV applications. The developers of the primary stacks on the market work to ensure their kernel and library versions are within the framework that the primary ISVs support and expect.

Individual Tools
Even in environments where a complete HPC stack solution has been deployed, there could be the need for additional tools to meet all operational requirements. The individual tools mentioned below can be used to fill some of these needs, as well as serve as a starting point for companies that decide not to use an integrated stack solution, but instead roll their own.

The primary benefit to rolling your own stack based on these and other tools is that it will much more closely meet your company's needs. The integrated stacks are meant as a solution to meet very broad HPC needs within a given customer base, but by developing a custom stack, a company can ensure all their specific needs are met and integrate with existing company platforms. This integration can include management APIs that are similar to existing platforms, as well as data integration to ensure reporting, authentication and logging meet company standards.

Specific Tools

Sun HPC Software, Linux Edition (http://www.sun.com/software/products/hpcsoftware/index.xml) – The Sun Linux HPC Stack is an integrated solution of open source software for deploying and managing the compute resources within an HPC environment. It includes a variety of tools for performance and availability monitoring, OS deployment and management, troubleshooting and necessary libraries to support the primary interconnects on the market.

Rocks (http://www.rocksclusters.org/wordpress/) - Rocks is an open source, community driven integrated solution for deploying and managing clusters. It is based on a concept of rolls; each roll is specific to an application or set of tools that could be needed in an HPC environment. This modularity allows users to add the components they need as their needs evolve.

Trac (http://trac.edgewall.org/wiki/TracDownload) – Trac is a toolkit originally designed to be used in software development organizations. It has integrated capabilities for tracking bugs, release cycles, source code and a wiki for documenting notes and process information. These may all seem like software development specific capabilities, but they can all be used in very effective ways to better manage and document the associated processes for a cluster.

Request Tracker (http://bestpractical.com/rt/) - Request Tracker is an integrated tool for tracking, responding to and reporting on support requests. It is heavily used in call center environments, and works very well in HPC environments to track customer requests for support, requests for upgrades and other system changes.

RASilience (http://sourceforge.net/projects/rasilience/) - RASilience is built around Request Tracker with the Asset Tracker and Event Tracker add-ons. It is an interface and general-purpose engine for gathering, filtering, and dispatching system events. It can be used to provide event correlation across all nodes and other components within a cluster.

Nagios (http://www.nagios.org/) – Nagios is an open source monitoring solution built on the idea of plugins; plugins can be developed to monitor a wide variety of platforms and applications, while reporting back to a central interface for notification management, escalation and reporting.

Ganglia (http://ganglia.info/) - Ganglia is a highly scalable, distributed monitoring tool for clusters. It is capable of providing historical information on node utilization rates and performance via XML feeds from individual nodes, which can subsequently be aggregated for centralized viewing and reporting.

OneSIS (http://www.onesis.org/) - OneSIS is a tool for managing system images, both diskless and diskful. OneSIS is effective for ensuring that all images within a cluster are served from a central repository and integrated with the appropriate tools, using Kickstart for installing new operating system images as well as booting nodes in a diskless environment.

Sun Grid Engine (http://gridengine.sunsource.net/) - SGE is a distributed resource manager which has proven scalability to 38,000 cores within a grid environment. SGE is rapidly being updated by Sun to more efficiently handle multi-threading and to improve launch times for jobs, as well as tty output for non-interactive jobs.

Cluster Administration Package (http://www.capforge.org/cgi-bin/trac.cgi) – CAP is a set of tools for integrating clusters. It is designed and tested to accomplish three main objectives: Information Management, Control and Installation. CAP is a proven tool for deploying and managing a centralized set of configuration files within a cluster, and for ensuring that any changes to master configuration files are correctly propagated to all nodes within the cluster.

Cbench (http://cbench.sourceforge.net/) – Cbench is a set of tools for benchmarking and characterizing performance on clusters. Cbench can be used for both initial bring up of new systems, as well as testing of hardware that has been upgraded, modified or repaired.

ConMan (http://home.gna.org/conman/) - ConMan is a console management utility. It is most often used as an aggregator for a large number of serial console outputs within clusters. It can both capture console output to a file for later reference and allow administrators to attach to a console in read-write mode.

Netdump (http://www.redhat.com/support/wpapers/redhat/netdump/) - Netdump is a crash dump logging utility from Redhat. The purpose of Netdump is to ensure that if a node with no console attached crashes, administrators have a reference point within logs to catch the crash and debug output.

Logsurfer (http://www.crypt.gen.nz/logsurfer/) - Logsurfer is a regular expression driven utility for matching incoming log entries and taking action based upon matches. Logsurfer can perform a variety of actions when a match occurs, including running an external script or counting the number of entries until a threshold is met.

Specific Tool Integration Techniques
These are some specific methods that colleagues and I have used to integrate these tools into larger frameworks for change management and monitoring within enterprise environments. These are meant to show how the different tools, used in combination, can simplify cluster management and lower administration costs. All of these methods have also been tested at scales well beyond typical HPC systems today, including OneSIS and Cbench, which have been tested up to scales of 4500 nodes.

OneSIS
OneSIS can be used in two primary ways within a cluster; each can be used independently or in combination. The first and most common is to assemble an image that is then deployed to all compute nodes and installed locally. OneSIS can also be used to distribute that image to all compute nodes so they can run in a diskless fashion, using the image from a central management server.

These methods can also be used in combination when preparing to upgrade a cluster. A new image can be developed and booted in diskless mode on a subset of a cluster's nodes. Those nodes can then be used to test all applications and cluster uses to ensure the image is correct. Once that testing is complete, OneSIS can be used to ensure an exact copy of the tested image is installed on all compute nodes. This method ensures that no bad images are installed on the cluster, and that the majority of the cluster nodes can be left in place for production users while the new image is tested.

Nagios
Nagios is a very dynamic tool because of its ability to use plugins for monitoring and response. Plugins can be written for any variety of hardware within a cluster to ensure components are online, are not showing excessive physical errors and do not need proactive attention. Nagios's dynamic nature also allows plugins that communicate with centralized databases of node information and report any hardware or node problems to RT for proper tracking and attention.

Nagios plugins can easily be used to remotely execute health check scripts on compute nodes. These health check scripts can verify that nodes are operating and responding correctly, that there are no hung processes that might affect future jobs, and that the node's configuration files and libraries are the expected versions. If Nagios does detect an error on a given node, it can easily be configured to automatically open an RT ticket for staff to repair the node, and mark the node offline in the job scheduler until such time as the node is repaired.
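A minimal sketch of such a health check plugin, following the standard Nagios exit code convention (0=OK, 1=WARNING, 2=CRITICAL); the mount point and load threshold are illustrative:

#!/bin/sh
# check_node_health - simple compute node health check run via NRPE or remote execution
# Verify the shared filesystem is mounted
if ! grep -q " /mnt/lusfs01/client " /proc/mounts; then
    echo "CRITICAL: shared filesystem not mounted"
    exit 2
fi
# Warn if the 1-minute load average is unexpectedly high for an idle node
load=$(awk '{print int($1)}' /proc/loadavg)
if [ "$load" -gt 64 ]; then
    echo "WARNING: load average is $load"
    exit 1
fi
echo "OK: node healthy"
exit 0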

Cbench
Cbench is a wonderful tool for automating the process of both bringing up new clusters and testing hardware that has been repaired or replaced to ensure it meets the same benchmarks as other hardware in the cluster. Cbench has a collection of benchmarks that can be used to verify that a new cluster's systems, storage, memory and attached file systems perform as designed. This can be a valuable tool in locating issues that were introduced during deployment and would ultimately cause performance decreases for users.

Cbench can also be used to ensure that repaired hardware was fixed correctly before being reintroduced into the cluster. By properly benchmarking a cluster at installation time, support staff can run identical benchmarks on nodes that have subsequently been repaired. These new results can be compared to the initial results from the cluster to confirm that the node is now operating at peak, expected performance.

Logsurfer
Logsurfer is best used as an aggregator and automated response mechanism within a cluster. Having all nodes send their respective logs to a central log host enables cluster administrators to configure a single Logsurfer daemon to monitor and respond to the appropriate log entries.

Many sites will subsequently configure Logsurfer to proactively mark nodes offline in the scheduler if an error is found in the logs relating to that node. This ensures that no future jobs are run on the node until repair staff are able to verify the node is operating correctly and repair the cause of the initial error.
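A sketch of the two pieces involved: classic syslog forwarding from each compute node to the log host, and the kind of command a log-triggered action script could run to take a suspect node out of service (Sun Grid Engine syntax; the queue and host names are placeholders):

# /etc/syslog.conf on each compute node - forward all messages to the central log host
*.*    @loghost

# Action a monitoring rule could trigger on the log host
qmod -d all.q@node014    # disable the queue instance so no new jobs start on node014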

Final Thoughts
Clusters are complex mixes of hardware and software; the more effectively the tools are picked and integrated early in system design, the more efficiently the system can be managed. There are many tools available, both commercial and open source, that can be used in cluster environments. It is critical that each one's benefits, tradeoffs and scalability be weighed when picking the tools for your environment.

As a final thought, clusters are complex solutions that often require customization at every level. This can also be extended to the applications used to manage the cluster, but was not mentioned previously in this document. It is always an option to develop a tool in house for your needs, chances are, if you have a need, so does someone else. The majority of the tools above were developed because a single company had a need, developed a tool to meet that need and put the tool back into the community for everyone else to use. This is a wonderful way to not only continue improving the capabilities we as a community have around clusters, but is a great way to get company recognition in a rapidly growing field.

Monday, July 28, 2008

Defining Effective Cluster Management Processes

Abstract

Today's high performance compute clusters are more complex than ever; they are an intricate set of hardware, middleware and processes that ensure a robust, reliable platform for companies to conduct even their most critical business processing. When designing the processes to manage these systems, companies must ensure they factor in today's needs, as well as tomorrow's possibilities. This paper works toward addressing the complex issue of defining the processes that will ultimately be used to manage these clusters as they are implemented and grow over time.

Definitions

I want to begin by defining some terms that I use throughout the document; this will ensure a shared understanding of common terms as I use them:

system – A host that contains a single operating system image for multiple processor sockets, with all memory addressed from that single operating system image.

cluster – More than one system interconnected and managed through a common fabric.

enterprise – A class of system that supports the operations of a business; this could include systems running Oracle DB, application servers, SAP, etc.

scheduler – A resource manager used within a cluster to ensure maximum effective use of all resources.

jobs – Submissions by individual users to the scheduler to accomplish a task on the cluster.

resources – Capabilities of the cluster to include processors, memory, storage and interconnects.

interconnect – The fabric that a cluster uses for communication. It can be Gigabit Ethernet, Infiniband, Quadrics or others.

Intro

Today's clusters are much larger than they have been at any time in the past. As grid and cloud computing continue to increase in popularity, the number of systems that must be managed in a single environment will only continue to grow. As the number of systems continues to grow, administrators are going to continue to struggle with keeping software consistent across all systems, recovering failed systems and managing the locations and versions of installed applications and data.

As the number of systems grows, companies and administrators will need more refined tools and processes to ensure systems are configured as expected and to properly report problems so they can be tracked in a meaningful way for upgrades, rotations and maintenance. Tools must report where problems are located and how to correct them; this will ensure companies are not relying on senior staff for on-call duties and general troubleshooting.

When designing and managing these complex clusters, process must be the number one item considered. The more clearly the process for managing the cluster is defined, including upgrades, changes, failures and testing, the more reliable the system will be over time, and the fewer unexpected problems will result from failed processes, a lack of process or unexpected consequences of changes.

Second only to process is metrics. It must be clearly defined how these complex clusters will be monitored and measured for success. These metrics can encompass many things including uptime, job completion time, jobs completed, staff metrics and scalability metrics. The process of defining the metrics to gauge success must begin with an evaluation of the business goals that are to be met by utilizing a cluster for company workloads. These metrics must accurately gauge what factors show success in migrating existing applications to a cluster, as well as in implementing new tools now that the capabilities are in place.

Another key attribute of these processes and tools is that they must be designed to scale as the customer's cluster grows. Any designed solution must factor in not only today's systems but also the expected growth in the coming years. This will ensure that all processes and tools are scalable and do not need to be replaced or upgraded as the cluster grows.

Overall, this problem is two-fold. The proper tools must be in place to support clearly defined and tested processes. There is a plethora of tools available today that provide change management, process tracking and cluster monitoring. It is important that companies understand the benefits and tradeoffs of each available tool when deciding how to implement these processes for their environment. Some companies will find that the currently available tools are more than sufficient to meet their business needs, while others will find that developing new tools in house will better suit their needs.

The realm of high performance computing is no longer the island it once was. Today, staff, processes and tools can be, and are being, shared between departments. Over time, high performance computing clusters will simply become another set of systems maintained within a company, rather than the separate department they often are now.

Defining Business Goals
Now we will explore defining change management and associated support processes for a large clustered environment. This begins with defining the business goals; some questions to ask when defining these goals are:
  1. What is the maximum allowable downtime that can be afforded this cluster?
  2. How much time per month will the support staff need to handle routine maintenance?
  3. What recurring events might impact performance on the cluster? This could include end of quarter financial processing, data warehousing activities and compliance reporting activities that are given priority over standard users.
  4. How will users be grouped on the system in relation to job function, priorities and load types?

Our business goals are a key part of the information that will later be used when documenting all processes in detail. They serve as the targets that must be hit and will shape the metrics we define in the next section. These business goals should be aligned closely with the mission and vision of the company, as well as with the specific teams that will be utilizing the cluster.

This step should be mostly a business discussion. By avoiding technical architecture discussions at this stage, I believe a better set of business-aligned goals can be achieved, without yet having to weigh technology and cost tradeoffs. The cost and technology tradeoffs can be discussed and factored in after the goals and the metrics for success have been defined. This ensures that tradeoffs are fully understood in their own context, rather than as part of the business goals discussion.

When defining these business goals, an honest assessment of both minimal and optimal goals must be made. By defining both, we can have a proper discussion later in the process about tradeoffs, with a real understanding of what levels of compromise are acceptable and what are not. The optimal goals are what management would like to see accomplished if time and money were no object. The minimal goals should reflect the minimum outcome from the project over its life cycle such that the company still receives a financial benefit from the system, even without all of the features and possible results.

Scoping and Metrics
Second, we must ask a variety of questions to define the scope of system management and the metrics used to gauge success for each component. This scope can include:
  1. File systems
  2. Hardware
  3. Interconnect
  4. Facilities – This can include the data center that houses the cluster, the offices that the users reside in and any facilities where data is stored and managed in relation to this cluster
  5. User training and support

Some examples of metrics that can be gathered and tracked are:
  1. Average job completion time
  2. Maximum job completion time
  3. Users accessing the system over time
  4. Support requests logged by users
  5. Number of supported applications on the cluster
  6. User data volume and churn per month
  7. Measured MTBF versus vendor expected MTBF
  8. Any company specific information that will later be used to gauge success
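
The first two metrics above can be computed directly from scheduler accounting data. The sketch below assumes job records have been exported to a CSV file with illustrative column names, not those of any particular scheduler.

```python
# Minimal sketch: average and maximum job completion time from accounting records.
# Assumes records were exported as CSV with illustrative columns
# "job_id,submit_time,end_time" holding Unix timestamps.
import csv

def completion_times(csv_path):
    """Yield completion time in seconds for each finished job in the CSV export."""
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield float(row["end_time"]) - float(row["submit_time"])

def summarize(csv_path):
    """Return counts and average/maximum completion times, or None if no jobs."""
    times = list(completion_times(csv_path))
    if not times:
        return None
    return {
        "jobs_completed": len(times),
        "average_seconds": sum(times) / len(times),
        "maximum_seconds": max(times),
    }

if __name__ == "__main__":
    print(summarize("job_accounting.csv"))
```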

Metrics are key to ensuring that the business goals are objectively measured and monitored over time. These metrics must be defined before implementing any new cluster management processes so that the proper usage and user information is tracked and kept for future analysis. Metrics can be defined in a variety of ways, including along company management lines, along functional organization lines, or across lines of business or customers. By accurately assessing all the possible structures, and providing metrics for each, the company will have a usable set of metrics that can evolve as the company's structure does.

Metrics are also a constantly evolving item and should be kept up to date to match evolving company goals and structures. As the company's management structure changes, or its goals and mission change, the metrics should be updated so that costs and benefits continue to be tracked accurately.

Scoping is the process of defining the boundaries for these metrics and associated processes. Scoping is important to ensure that we do not try to touch too much at the same time, while still ensuring that all relevant information and teams are included in discussions. Scoping is part of the metrics section because the metrics are directly related to scoping and vice versa. To properly understand how we are going to assess progress, we must fully understand what we are assessing. When defining scope, it is important to have readily available charts of staff alignment, project alignment and budgeting information so that clear lines can be drawn around what will be included and what will be handled as a separate project.

Defining Process
Third, we will define the processes themselves around the answers to the questions above. Some additional items to consider are:
  1. How often does the site anticipate new software to be installed?
  2. How often does the site anticipate upgrading the host operating systems of the cluster?
  3. How will we understand and manage interactions between libraries and applications?
  4. How will we document and track installed applications, libraries and versions?
  5. What shared file systems will be utilized on this cluster?
  6. What dependencies are in place for cluster monitoring?

After defining the business goals and the metrics for tracking those goals, we can begin to define the process that will be followed to meet those objectives. The process must be flexible enough to evolve with the cluster and its applications, but rigid enough to ensure that all business goals are met and metrics are correctly tracked and reported.
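
Question 4 in the list above benefits from a concrete answer early on. One possible approach, sketched below with a purely illustrative file name and fields, is a flat registry of installed applications and their library dependencies; sites that already use configuration management or environment modules would record the same information there instead.

```python
# Minimal sketch: a simple registry of applications and libraries installed on
# the cluster. The file name, fields and example versions are illustrative only.
import json
from pathlib import Path

REGISTRY = Path("cluster_software_registry.json")

def record_install(name, version, prefix, libraries):
    """Append an installed application and its library dependencies to the registry."""
    entries = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    entries.append({
        "name": name,
        "version": version,
        "install_prefix": prefix,
        "libraries": libraries,   # e.g. {"openmpi": "1.6.5", "hdf5": "1.8.9"}
    })
    REGISTRY.write_text(json.dumps(entries, indent=2))

if __name__ == "__main__":
    record_install(
        name="weather-model",                    # hypothetical application
        version="2.3",
        prefix="/apps/weather-model/2.3",
        libraries={"openmpi": "1.6.5", "hdf5": "1.8.9"},
    )
```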

Defining process involves two major components: managing the cluster when things are running optimally, and responding properly when things fail to work as designed and expected. To address both categories properly, it is critical to define the process as three separate components.

The three components of the final process will be:
  1. Change management – How to effectively assess changes and plan maintenance so that it has a minimal, well-understood impact and risk.
  2. Failure response – A detailed process for handling known types of failures, including the technical response, escalation process, documentation process and user notification process. This process covers only known failure types; it is updated through the next process whenever an unknown or new type of failure occurs.
  3. Unknown situations – Finally, a general process for handling situations that have not been encountered before. This should include how to contact and involve the appropriate staff for resolution, how to escalate problems that span multiple teams, and how to document the correct response should the problem occur again.
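
These three components lend themselves to a common record format so that changes, known failures and new situations are documented in a consistent way. The sketch below shows one hypothetical structure; the field names and example values are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: a common record structure for the three process components.
# Field names, categories and the example values are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class ProcessRecord:
    category: str                    # "change", "failure" or "unknown"
    summary: str
    affected_systems: List[str]
    risk_assessment: str             # expected impact and rollback plan for changes
    escalation_contacts: List[str]   # who to involve if the issue spans teams
    user_notification: str           # how and when users were informed
    opened: datetime = field(default_factory=datetime.utcnow)
    resolution: str = ""             # filled in afterwards; this becomes the documented
                                     # response if the same situation occurs again

records: List[ProcessRecord] = []
records.append(
    ProcessRecord(
        category="change",
        summary="Upgrade parallel file system clients during monthly maintenance",
        affected_systems=["node001-node512"],
        risk_assessment="Low; roll back by reinstalling previous client packages",
        escalation_contacts=["storage-team", "cluster-admins"],
        user_notification="Maintenance notice sent one week in advance",
    )
)
```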

Tradeoffs
Ultimately, most organizations are also going to have to discuss tradeoffs. It would be ideal to track every possible service-related detail, report on all of them and provide 24x7 support, but this is often cost prohibitive. After a firm has defined its business goals, its metrics for success and the policies to meet those goals, a tradeoffs discussion is next. This evaluates the cost of meeting not only the best case business goals but also the minimum acceptable for success.

Tradeoffs are a difficult component because each team involved in the discussions will have goals it cannot change, or processes that must be kept intact. By ensuring that all teams affected by these decisions are at the table for the tradeoffs discussion, a company can ensure that all relevant voices are heard and that tradeoffs are fully understood when decisions are made. The tradeoff discussion does not necessarily need to be a decision about what must be given up; it can also be a decision about what can be deferred until a later date, or done by a different team for better efficiency.

Tradeoffs must also balance costs and benefits: an increased cost must be fully justified by increased benefits, just as a decrease in benefits must be balanced by an appropriate cut in costs. By assessing each business requirement against the cost to implement versus the long term benefits, a company can accurately judge whether the benefit is worth the cost.
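
As a purely hypothetical illustration of that assessment, the sketch below nets implementation and recurring support costs against an estimated annual benefit over the expected life of the cluster; every figure shown is a placeholder, not a recommendation.

```python
# Minimal sketch: net benefit of a requirement over the cluster's life cycle.
# All figures are hypothetical placeholders for illustration only.
def net_benefit(implementation_cost, annual_support_cost, annual_benefit, lifecycle_years):
    """Return total benefit minus total cost over the expected life cycle."""
    total_cost = implementation_cost + annual_support_cost * lifecycle_years
    total_benefit = annual_benefit * lifecycle_years
    return total_benefit - total_cost

# Example: a requirement costing 100k to implement and 20k/year to support,
# expected to return 60k/year over a five-year life cycle.
print(net_benefit(100_000, 20_000, 60_000, 5))   # prints 100000: worth pursuing
```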

Today's clusters are complex solutions involving many staff, hardware components, software components and requirements. These must all be assembled so that they accomplish the goals of a given project, yet remain fluid enough to rapidly adjust both to problems encountered and to a changing business landscape. By taking a systematic approach to defining the management processes for a given cluster, a company can ensure that all business objectives are being met and that staff are keenly aware of the goals of the cluster.