Monday, July 28, 2008

Defining Effective Cluster Management Processes


Today's high performance compute clusters are more complex than ever; they are an intricate set of hardware, middleware and processes that provide a robust, reliable platform for companies to conduct even their most critical business processing. When designing the processes to manage these systems, companies must factor in today's needs as well as tomorrow's possibilities. This paper works towards addressing the complex issue of defining the processes that will ultimately be used to manage these clusters as they are implemented and grow over time.


I want to begin by defining some terms that I use throughout the document; this will ensure a shared understanding of these terms as I use them:

system – A host that contains a single operating system image for multiple processor sockets, with all memory addressed from that single operating system image.

cluster – More than one system interconnected and managed through a common fabric.

enterprise – A class of system that supports the operations of a business; this could include systems running Oracle DB, application servers, SAP, etc.

scheduler – A resource manager used within a cluster to ensure maximum effective use of all resources.

jobs – Submissions by individual users to the scheduler to accomplish a task on the cluster.

resources – Capabilities of the cluster, including processors, memory, storage and interconnects.

interconnect – The fabric that a cluster uses for communication. Can be Gigabit Ethernet, InfiniBand, Quadrics or others.


Today's clusters are much larger than they have been at any time in the past. As grid and cloud computing continue to increase in popularity, the number of systems that must be managed in a single environment will only continue to grow. As that number grows, administrators will continue to struggle with keeping software consistent across all systems, recovering failed systems, and managing the location and versions of installed applications and data.

As the number of systems grows, companies and administrators will need more refined tools and processes to ensure that systems are configured as expected and that problems are reported in a way that can be tracked meaningfully for upgrades, rotations and maintenance. Tools must report where problems are located and how to correct them; this will ensure companies are not relying on senior staff for on-call duties and general troubleshooting.

When designing and managing these complex clusters, process must be the number one item considered. The more clearly defined the process for managing the cluster, including upgrades, changes, failures and testing, the more reliable the system will be over time, and the fewer unexpected problems will result from failed processes, a lack of process or unintended consequences of changes.

Second only to process is metrics. It must be clearly defined how these complex clusters will be monitored and measured for success. These metrics can encompass many things, including uptime, job completion time, jobs completed, staff metrics and scalability metrics. The process of defining the metrics to gauge success must begin with an evaluation of the business goals that are to be met by utilizing a cluster for company workloads. These metrics must accurately capture the factors that show success in migrating existing applications to a cluster, as well as in implementing new tools now that the capabilities are in place.

Another key aspect of these processes and tools is that they must be designed to scale as the customer's cluster grows. Any designed solution must factor in not only today's systems to manage, but also the expected growth in the coming years. This will ensure that all processes and tools are scalable and do not need to be replaced or upgraded as the cluster grows.

Overall, this problem is two-fold. The proper tools must be in place to support clearly defined and tested processes. There is a plethora of tools available today that provide change management, process tracking and cluster monitoring. It is important that companies understand the benefits and tradeoffs of each available tool when deciding how to implement these processes for their environment. Some companies will find the currently available tools are more than sufficient to meet their business needs, while others will find that developing new tools in house will better suit their needs.

The realm of high performance computing is no longer the island it once was. Today many staff, processes and tools can be, and are being, shared between departments. As time goes on, high performance computing and clusters will be just another set of systems that must be maintained within a company, and no longer a separate department as they are now.

Defining Business Goals
Now we will explore defining our change management processes and associated support processes for a large clustered environment. This begins with defining the business goals. Some questions to ask when defining these goals are:
  1. What is the maximum downtime that can be tolerated for this cluster?
  2. How much time per month will the support staff need to handle routine maintenance?
  3. What recurring events might impact performance on the cluster? This could include end of quarter financial processing, data warehousing activities and compliance reporting activities that are given priority over standard users.
  4. How will users be grouped on the system in relation to job function, priorities and load types?

Our business goals are a key component of the information that will later be used when documenting all processes in detail. These will serve as the pseudo-goals that must be hit, and will shape the metrics we define in the next section. These business goals should be aligned closely with the mission and vision of the company, as well as with the specific teams that will be utilizing this cluster.

This step should be mostly a business discussion. By avoiding technical architecture discussions, I believe a better set of business-aligned goals can be achieved, without yet having to discuss technology and cost tradeoffs. The cost and technology tradeoffs can be discussed and factored in after the goals and the metrics for success have been defined. This will ensure that tradeoffs are fully understood in their own context, and not as part of this business goals discussion.

When defining these business goals, an honest assessment of both minimal and optimal goals must be done. By defining both optimal and minimal goals, we will be able to have a proper discussion about tradeoffs later in the process, while really understanding what levels of compromise are acceptable and what are not. The optimal goals are what management would like to see accomplished if time and money were no object. The minimal goals should reflect the minimum outcome from the project over its life cycle such that the company still receives a financial benefit from the system, though not necessarily all features and possible results.

Scoping and Metrics
Second, we must ask a variety of questions to define the scope of system management and the metrics used to gauge success for each component. The scope can include a variety of components, such as:
  1. File systems
  2. Hardware
  3. Interconnect
  4. Facilities – This can include the data center that houses the cluster, the offices that the users reside in and any facilities where data is stored and managed in relation to this cluster
  5. User training and support

Some examples of metrics that can be gathered and tracked are:
  1. Average job completion time
  2. Maximum job completion time
  3. Users accessing the system over time
  4. Support requests logged by users
  5. Number of supported applications on the cluster
  6. User data volume and churn per month
  7. Measured MTBF versus vendor expected MTBF
  8. Any company specific information that will later be used to gauge success
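
A few of the metrics listed above can be sketched in code. The following is a minimal sketch, not a definitive implementation: the `JobRecord` structure, its field names, and the choice to measure completion time from submission to finish are all assumptions made for illustration, and a real site would pull these values from its own scheduler's accounting records.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """Hypothetical accounting record for one finished job."""
    job_id: str
    submit_time: float  # seconds since epoch (assumed unit)
    end_time: float     # seconds since epoch (assumed unit)


def summarize_jobs(jobs):
    """Return job count, average and maximum completion time in seconds.

    Completion time is assumed here to run from submission to finish.
    Returns None when there are no jobs to summarize.
    """
    if not jobs:
        return None
    durations = [job.end_time - job.submit_time for job in jobs]
    return {
        "jobs_completed": len(durations),
        "average_completion_s": mean(durations),
        "maximum_completion_s": max(durations),
    }


def measured_mtbf_hours(total_operating_hours, failure_count):
    """Measured MTBF = total operating hours / number of failures."""
    if failure_count == 0:
        return None  # no failures observed yet, so MTBF is undefined
    return total_operating_hours / failure_count


def mtbf_ratio(measured_hours, vendor_expected_hours):
    """Ratio below 1.0 means hardware fails more often than the vendor expects."""
    return measured_hours / vendor_expected_hours
```

Keeping the measured-versus-vendor comparison as a ratio makes it easy to track over time alongside the other metrics, whatever the underlying hardware.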

Metrics are key to ensuring that the business goals are objectively measured and monitored over time. These metrics must be defined before implementing any new cluster management processes so that the proper usage and user information is tracked and kept for future analysis. Metrics can be defined in a variety of ways, including along company management lines, along functional organization lines, or across lines of business or customers. By accurately assessing all the possible structures, and providing metrics for each, the company will have a usable set of metrics that can also evolve as the company's structure does.

Metrics are also a constantly evolving item that should be kept up to date to match evolving company goals and structures. As the company's management structure changes, or its goals and missions change, the metrics should be updated to reflect these changes so that both costs and benefits are accurately tracked.

Scoping is the process of defining the boundaries for these metrics and associated processes. Scoping is important to ensure that we do not try to touch too much at the same time, while still ensuring that all relevant information and teams are included in discussions. Scoping is part of the metrics section because the metrics are directly related to scoping and vice versa. To properly understand how we are going to assess progress, we must fully understand what we are assessing. When defining scope it is important to have readily available charts of staff alignment, project alignment and budgeting information so that lines can be drawn between what will be included and what will be handled as a separate project.

Defining Process
Third, we will define these processes around the answers to the above questions. Some items to consider when taking these into account are:
  1. How often does the site anticipate new software to be installed?
  2. How often does the site anticipate upgrading the host operating systems of the cluster?
  3. What interactions exist between libraries and applications?
  4. How will we document and track installed applications, libraries and versions?
  5. What shared file systems will be utilized on this cluster?
  6. What dependencies are in place for cluster monitoring?

After defining the business goals and the metrics for tracking those goals, we can begin to define the process that will be followed to meet those objectives. The process must be flexible enough to evolve with the cluster and applications, but also rigid enough to ensure that all business goals are met and metrics are correctly tracked and reported.
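
The question above about documenting and tracking installed applications, libraries and versions lends itself to a small example. This is a minimal sketch under stated assumptions: the inventory is a plain dictionary mapping host names to package versions, and the "expected" version of a package is simply the one installed on the most hosts. Both the data layout and the function name `find_version_drift` are hypothetical; a real site might instead compare against a version pinned in its change management records.

```python
from collections import Counter


def find_version_drift(inventory):
    """Report hosts whose package versions differ from the most common one.

    inventory: dict of host name -> dict of package name -> version string.
    Returns: dict of package name -> dict of host name -> deviating version,
    where a version of None means the package is missing on that host.
    """
    packages = set()
    for installed in inventory.values():
        packages.update(installed)

    drift = {}
    for package in sorted(packages):
        versions = {host: installed.get(package)
                    for host, installed in inventory.items()}
        counts = Counter(versions.values())
        if len(counts) <= 1:
            continue  # every host agrees on this package
        # Assumption: the most widely installed version is the baseline.
        baseline, _ = counts.most_common(1)[0]
        drift[package] = {host: version
                          for host, version in versions.items()
                          if version != baseline}
    return drift


# Hypothetical inventory for three nodes; names and versions are made up.
example_inventory = {
    "node01": {"openmpi": "1.2.6", "fftw": "3.1.2"},
    "node02": {"openmpi": "1.2.6"},
    "node03": {"openmpi": "1.2.5", "fftw": "3.1.2"},
}
```

Run against `example_inventory`, the function flags `node03` for its older `openmpi` and `node02` for its missing `fftw`, which is the kind of report that lets junior staff locate a consistency problem without senior help.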

Defining process involves two major concerns: managing the cluster when things are optimal, and responding properly when things fail to work as designed and expected. To properly address both, it is critical to define the process as three separate components.

The three components of the final process will be:
  1. Change management – How to effectively assess changes and plan for maintenance so that impact and risk are minimal and well understood.
  2. Failure response – Having a detailed process to handle all types of failures, including technical response, escalation process, documentation process and user notification process. This process handles only known types of failures, and is updated through the next process whenever an unknown or new type of failure occurs.
  3. Unknown situation – Finally, it is important to have a general process for handling unknown situations. This should include how to contact and include the appropriate staff for resolution, how to escalate problems that span multiple teams and how to document the correct process to respond should the problem occur again.
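
The relationship between the failure response and unknown situation processes can be sketched as a simple lookup. This is a hedged illustration only: the `Runbook` and `ResponsePlan` structures, the step wording and the function names are all hypothetical, chosen to show how a known failure type routes to a documented response, how anything else falls through to the general process, and how a resolved unknown situation becomes a known one.

```python
from dataclasses import dataclass


@dataclass
class Runbook:
    """Hypothetical documented response for one known failure type."""
    technical_response: str
    escalation_contact: str
    notify_users: bool


@dataclass
class ResponsePlan:
    process: str  # "failure_response" or "unknown_situation"
    steps: list


def plan_response(failure_type, runbooks):
    """Route a failure to its runbook, or to the unknown-situation process."""
    runbook = runbooks.get(failure_type)
    if runbook is None:
        # No documented response: fall back to the general process.
        return ResponsePlan("unknown_situation", [
            "contact the appropriate staff",
            "escalate across teams as needed",
            "document the resolution as a new runbook",
        ])
    steps = [
        runbook.technical_response,
        f"escalate to {runbook.escalation_contact} if unresolved",
        "document the incident",
    ]
    if runbook.notify_users:
        steps.append("notify affected users")
    return ResponsePlan("failure_response", steps)


def record_new_runbook(runbooks, failure_type, runbook):
    """Once an unknown situation is resolved, make it a known failure type."""
    runbooks[failure_type] = runbook
```

The design choice worth noting is that the unknown-situation path always ends in documentation, so the set of known failures grows over time and fewer incidents depend on senior staff.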

Ultimately, most organizations are also going to have to discuss tradeoffs. It would be optimal to track all possible service-related details, report on them and provide 24x7 levels of support, but this is often cost prohibitive. After a firm has completed defining its business goals, its metrics for success and the policies to meet those goals, a tradeoffs discussion is next. This is to evaluate the cost of meeting not only the best-case scenarios for business goals, but also the minimum acceptable for success.

Tradeoffs are a difficult component because each team that is involved in discussions will have goals they cannot change, or processes that must be kept intact. By ensuring that all teams affected by these decisions are also at the table for the tradeoffs discussion, a company can ensure that all relevant voices are heard and tradeoffs are fully understood when decisions are made. The tradeoff discussion does not necessarily need to be a decision of what must be given up; it can also be a decision of what can be put off until a later date, or done by a different team for better efficiency.

Tradeoffs must also balance both costs and benefits: an increased cost must be fully justified by increased benefits, just as a decrease in benefits must be balanced by an appropriate cut in costs. By assessing each business requirement against the cost to implement versus the long term benefits, a company will be able to accurately assess if the benefit is worth the cost.
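
One way to make this balancing concrete is to score each requirement by cost and benefit. The sketch below is an illustration under several assumptions, not a prescribed method: costs and benefits are expressed in the same unit, minimal goals are always funded, and optional requirements are added greedily by benefit-to-cost ratio until the budget runs out. The `Requirement` structure and `plan_tradeoffs` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical business requirement with estimated cost and benefit."""
    name: str
    cost: float
    benefit: float
    minimal: bool  # True if it belongs to the minimal goals


def plan_tradeoffs(requirements, budget):
    """Split requirements into those funded now and those deferred.

    Minimal goals are always funded. Optional requirements are funded in
    order of benefit-to-cost ratio (a simple greedy heuristic), and only if
    the benefit exceeds the cost and the remaining budget covers them.
    """
    selected = [r for r in requirements if r.minimal]
    remaining = budget - sum(r.cost for r in selected)
    if remaining < 0:
        raise ValueError("budget does not cover the minimal goals")

    optional = [r for r in requirements if not r.minimal]
    optional.sort(
        key=lambda r: r.benefit / r.cost if r.cost > 0 else float("inf"),
        reverse=True,
    )

    deferred = []
    for requirement in optional:
        if requirement.benefit > requirement.cost and requirement.cost <= remaining:
            selected.append(requirement)
            remaining -= requirement.cost
        else:
            deferred.append(requirement)  # put off until a later date
    return selected, deferred
```

The deferred list matters as much as the selected one: it records what was put off rather than given up, so it can be revisited when the budget or the business landscape changes.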

Today's clusters are complex solutions involving many staff, hardware components, software components and requirements. These must all be assembled in a way that accomplishes the goals for a given project, yet remains fluid enough to adjust rapidly to encountered problems and to a changing business landscape. By taking a systematic approach to defining the management processes for a given cluster, a company can ensure that all business objectives are being met and staff are keenly aware of the goals of the cluster.
