Wednesday, August 27, 2008

Risk Management in HPC

Risk management is a very broad topic within the project management space. It covers planning for and understanding even the most unlikely of possibilities within a project so that a plan is in place to respond to these situations and mitigate risk across the project. I will focus on risk management specifically in High Performance Computing (HPC) deployments. HPC, like any other specialty area, has its own specific risks and possibilities. Within HPC these risks are both procedural and technical in nature, but they have equal implications for the overall delivery of a successful solution.

Risk management in any project begins with a risk assessment, which includes both identifying risks and possible risk mitigation techniques. This can be done through a variety of methods, including brainstorming, the Delphi technique, or referencing internal documentation about similar, previous projects. This initial assessment phase is critical to ensure that both risks and responses are captured. Capturing both up front allows for better communication around the known risks and better preparation for managing unknown risks. The risk assessment produces a risk matrix: the documented list of possible risks to a project and their mitigation, or response, plans. The risk matrix becomes part of the overall project delivery plan.

Risk Matrix
When beginning any HPC project, either an initial deployment or an upgrade, it is important to develop a risk matrix. This can include both known risks (late delivery, poor performance, failed hardware) as well as unknown risks. The unknown risks category is much more difficult to define for just that reason, but a common approach is to define levels of severity and responses. These responses can include procedural details, escalation details, communication information, and documentation about the problem to prevent a recurrence.

This matrix should include a variety of information, including (a sketch of one possible record structure follows the list):
  1. Risk name – Should be unique within the company to facilitate communication between groups and departments.
  2. Risk type – Minimal, Moderate, Severe, Extreme, etc.
  3. Cost if the risk occurs – This can be measured in time, money, or loss of reputation, or all of the above.
  4. Process to recover – It is important to document early on how to respond to the risk and correct any problems that develop because of it.
  5. Risk owner – Often a specific individual has additional experience dealing with a specific risk and can act as a Subject Matter Expert (SME) for the project team.
  6. Outcome documentation – Clearly defining what should be documented should the risk occur so that it can be responded to.
  7. Communication channels – Different risks require that different staff and management become engaged; it is important to document who should be involved should a risk occur.
  8. Time component – Every risk has a response, and every response has a time component associated with it. It is important to understand these time components up front so that project management staff can adjust schedules accordingly should a risk occur.
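
As a minimal sketch, assuming hypothetical field names and an illustrative entry (none of this is a standard format), one such record might be captured like this:

  # A sketch of one risk matrix entry as a Python record.
  # Field names and the example values are illustrative only.
  from dataclasses import dataclass, field

  @dataclass
  class Risk:
      name: str                  # 1. unique within the company
      risk_type: str             # 2. Minimal, Moderate, Severe, Extreme
      cost: str                  # 3. time, money, and/or reputation
      recovery_process: str      # 4. documented response steps
      owner: str                 # 5. the SME for this risk
      outcome_docs: str          # 6. what to record if the risk occurs
      contacts: list = field(default_factory=list)  # 7. who to engage
      response_time_days: int = 0                   # 8. schedule impact

  risk_matrix = [
      Risk(name="HW-DELIVERY-LATE",
           risk_type="Moderate",
           cost="two-week schedule slip",
           recovery_process="escalate to vendor logistics, re-baseline schedule",
           owner="logistics lead",
           outcome_docs="revised delivery dates, root cause of delay",
           contacts=["project manager", "customer site contact"],
           response_time_days=14),
  ]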

Known Risks
Oftentimes, known risks are the easiest to plan for, but very difficult to handle. Understanding and anticipating a risk up front can fool us into believing we know the best response to the problem, when often the only way to truly understand how to respond to a problem is to respond incorrectly one or more times.

Let's explore some common risks that are specific to HPC deployments, and the most common mitigation strategies used to combat them:

Application Scaling
A fundamental premise of HPC is that applications should scale so that more hardware produces more accurate results and/or more efficient production of data. Because of this, an application is often expected to perform with the same scalability on 64 nodes as it does on 128, and often many more. This type of scalability must be architected into the application as it is written and improved upon as hardware performance evolves over time. Every time a newer, faster, or bigger cluster is installed, there is an inherent risk that the applications previously used will not properly scale on the new platform.

Often the best mitigation strategy for this risk is proper planning, testing, and benchmarking before system deployment. The most difficult time to manage an application scaling problem is after a customer's hardware has been delivered and installed. By benchmarking and testing the application prior to shipment, expectations can be properly set with the customer. It also allows proper time for working with any development teams to troubleshoot scaling problems and correct them before presenting results and completing acceptance testing with the customer. A simple pre-shipment scaling check is sketched below.
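
As a minimal sketch, assuming hypothetical node counts, measured runtimes, and a project-specific 80% efficiency target, such a check might look like:

  # Compare measured speedup against ideal linear speedup and flag
  # poor parallel efficiency before the system ships. All numbers
  # here are illustrative.
  baseline_nodes, baseline_time = 64, 1000.0   # seconds on the baseline run
  runs = {128: 520.0, 256: 350.0}              # nodes -> measured runtime (s)

  for nodes, runtime in sorted(runs.items()):
      speedup = baseline_time / runtime
      ideal = nodes / baseline_nodes
      efficiency = speedup / ideal
      print(f"{nodes} nodes: {speedup:.2f}x speedup of {ideal:.0f}x ideal "
            f"({efficiency:.0%} efficiency)")
      if efficiency < 0.80:    # acceptance threshold is an assumption
          print("  WARNING: below scaling target; investigate before shipment")

With these sample numbers, the 256-node run comes in at roughly 71% efficiency and would be flagged for investigation before the cluster ever leaves the factory.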

Facility Limitations
HPC solutions often use large amounts of power, cooling, and space within a data center compared to a company's business support systems or database-centric systems. Because of the large facility needs of HPC, it is very common for customers to underestimate the facility requirements, or for the numbers to be poorly communicated from vendor to customer. The power and cooling requirements can also vary widely based upon the customer's final use and intended application of the cluster.

All facility design issues should be managed and planned for before hardware is shipped or systems are assembled. To ensure a smooth cluster delivery, it is critical that site planning and assessment be done as part of the system design. This site planning should ensure there is enough power, cooling, and space to accommodate the cluster. It should additionally ensure the power and cooling are in the proper places and can be directed to the cluster in the recommended fashion. The arithmetic involved is straightforward, as sketched below.
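
As a rough sketch with illustrative numbers (actual per-node draw should come from measurement under load, not nameplate ratings), the site-planning math looks like:

  # Back-of-the-envelope facility estimate for a hypothetical cluster.
  nodes = 256
  watts_per_node = 350       # assumed draw under full load
  nodes_per_rack = 32

  total_watts = nodes * watts_per_node
  cooling_btu_hr = total_watts * 3.412     # 1 W of heat ~ 3.412 BTU/hr
  racks = -(-nodes // nodes_per_rack)      # ceiling division

  print(f"Power:   {total_watts / 1000:.1f} kW")
  print(f"Cooling: {cooling_btu_hr:,.0f} BTU/hr "
        f"(~{cooling_btu_hr / 12000:.1f} tons of cooling)")
  print(f"Space:   {racks} racks")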

Mean Time between Failure (MTBF)
MTBF is a calculation used to understand how often components across a single node or cluster will fail. It averages the known, designed life cycle of all individual components to provide an expected time between individual component failures. These component failures can either be severe enough to impact the whole system, or affect just portions of a cluster, depending on the cluster's design. Oftentimes a completed cluster will fail in unexpected ways because of the MTBF characteristics of putting large numbers of compute nodes in a single fabric: failure rates add, so the more components a cluster contains, the shorter the expected time between failures somewhere in the system. If proper redundancy is not built into critical systems of the cluster, a customer satisfaction issue can develop because of prolonged and unplanned outages.

By properly assessing all uptime requirements from the customer, a system can be designed that will provide the uptime necessary to conduct business regardless of the collective MTBF across all components. Each individual service and capability of the cluster should be assessed to ensure that the proper level of redundancy, including clustered nodes, redundant power, and redundant disks, is included with the complete solution. A rough estimate of this collective MTBF is sketched below.
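
As a minimal sketch, assuming independent component failures and illustrative counts and MTBF figures, the collective failure rate can be estimated like this:

  # Failure rates of independent components add together, so the
  # expected time between failures anywhere in the cluster shrinks
  # as the cluster grows. All figures below are illustrative.
  components = {                  # component -> (count, MTBF in hours)
      "compute node": (512, 50_000),
      "switch":       (16, 200_000),
      "disk":         (64, 100_000),
  }

  failure_rate = sum(count / mtbf for count, mtbf in components.values())
  cluster_mtbf_hours = 1 / failure_rate

  print(f"Expected time between component failures: "
        f"~{cluster_mtbf_hours:.0f} hours ({cluster_mtbf_hours / 24:.1f} days)")

Even with generous per-component MTBF figures, this hypothetical cluster sees a component failure roughly every four days, which is exactly why redundancy in critical services matters.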

Performance Targets, I/O and Compute
Performance guarantees are often included in HPC proposals to give customers a level of comfort when planning job completion times and doing capacity planning for the organization. These numbers can become sources of concern as a system is brought online if compute capacity is not as promised or I/O is not operating as fast as expected.

There are often misunderstandings in complete cluster deployments about a cluster's capability for sustained versus peak performance. Sustained is most often the number used as a representative test of how the system will perform over its life cycle, whereas peak is the theoretical maximum potential of a given cluster, a level of performance often stated for bragging rights.
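
As a sketch with assumed hardware figures, peak is pure arithmetic, while sustained must come from an actual benchmark run such as HPL; the measured result below is hypothetical:

  # Theoretical peak vs. measured sustained performance. The hardware
  # figures and the benchmark result are assumptions for illustration.
  nodes = 128
  cores_per_node = 8
  flops_per_cycle = 4        # double-precision FLOPs per core per cycle
  clock_ghz = 2.5

  peak_tflops = nodes * cores_per_node * flops_per_cycle * clock_ghz / 1000
  sustained_tflops = 7.4     # hypothetical measured benchmark result

  print(f"Theoretical peak: {peak_tflops:.2f} TFLOPS")
  print(f"Sustained:        {sustained_tflops:.2f} TFLOPS "
        f"({sustained_tflops / peak_tflops:.0%} of peak)")

A proposal that quotes the 10.24 TFLOPS peak when the customer's workload will only ever see roughly 72% of it is exactly the kind of misunderstanding this section describes.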

There is very little that can be done after delivery of a system if this type of risk comes up, other than giving the customer additional hardware to pull the sustained performance number up toward the peak performance number. This can be a very expensive response. This is the reason that the staff doing HPC architecture must fully understand application benchmarking and performance when designing new clusters. All numbers should also be reviewed by multiple people to ensure that errors in math or testing methodology do not go unnoticed.

Unknown Risks
Oftentimes planning for unknown risks can be the most stressful, but it can yield the greatest gains when actually responding. This is because of the lack of prior perceptions and the ability to be very creative with responses and future mitigation strategies. Risk planning for unknown risks is often an exercise in understanding the levels of severity that could occur with a problem and associating them with the appropriate level of response and future prevention.

When defining response strategies for unknown risks, often the first step is to define levels of severity that could develop from any given problem. A common list is (one way to encode these levels is sketched after the list):
  1. Most severe level of risk – requires an executive-management-level response to the customer and carries a high percentage cost to the project (greater than 50% of project revenue is at risk).
  2. Severe level of risk – requires an executive-level response and carries a medium level of financial risk (less than 50% of project revenue is at risk).
  3. Medium level of project risk – requires a senior management response; may or may not have a financial impact on the project, but does have a deliverable and schedule component.
  4. Lower level risk – has an impact on the project schedule, but no negative impact on project financials.
  5. The lowest level of project risk – often just a communication issue with a customer or a potential misunderstanding, with no schedule or financial impact to the project.
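
As a minimal sketch (the escalation contacts and exposure descriptions are placeholders, not a standard), these levels might be encoded like this:

  # Illustrative mapping of unknown-risk severity levels to escalation paths.
  SEVERITY_RESPONSES = {
      1: ("executive management", "greater than 50% of project revenue"),
      2: ("executive management", "less than 50% of project revenue"),
      3: ("senior management",    "possible financial impact"),
      4: ("project management",   "schedule impact only"),
      5: ("project management",   "communication issue only"),
  }

  def escalation_plan(level: int) -> str:
      contact, exposure = SEVERITY_RESPONSES[level]
      return f"Level {level}: engage {contact}; exposure: {exposure}"

  print(escalation_plan(3))
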
The next step, after defining a list of possible problem levels, is to define mitigation strategies for each. Each mitigation strategy should include the following:
  1. Steps to research and understand the problem.
  2. Communication channels – who needs to be informed for a problem of this magnitude, and how. This needs to include both customer and company contacts that will be necessary to correct the problem.
  3. Flow chart for responding – the path to determining the appropriate response and deciding if more resources, either financial or staffing, are needed to correct the risk.
  4. Documentation to prevent future occurrences – ensure that any information about the problem is gathered and documented for in-house use to prevent future occurrences of the same risk.
  5. Risk closure document – a checklist to document that all protocol was followed and the risk was corrected. This should include verification that the risk will not return on the same project because mitigation techniques have been implemented.
The mitigation strategies for the various levels of unknown risk can all be the same, or each level can have its own mitigation strategy. Often a different strategy is used for each level because different executives or financial analysts will need to be involved in the response, as the problems can differ from level to level. These mitigation strategies are the company's last line of defense within a project, ensuring that all problems, no matter the level, can be resolved for a smooth project delivery.

Path Forward
The most important component of risk management is skill and experience development. It is important to ensure that, as a company, you have processes to document all experience gained while managing risk within your projects. This knowledge must be documented so that other teams, new teams, and new staff can learn from the company's previous experience.

The more efficiently risk responses and lessons learned are documented, the more efficiently companies can scope future projects. This allows companies to much more accurately assess costs for future projects, as well as the risk-versus-reward tradeoffs of large, complex projects. Ultimately, the best way to manage risk is to understand it before beginning actual deployment and implementation on a project. This comes from a combination of utilizing all data collected on previous projects and techniques like brainstorming and the Delphi technique to ensure as many risks as possible are documented with appropriate response plans.
