I was in a planning meeting with a customer recently, assessing the customer's security plan. We had two major topics to discuss. The first was data management and compliance. The second, and the one we discussed at length, was their previous policies around what they called “the edge”: the former end of their network and the beginning of the people and systems they could not trust. The discussion went on for a while as we worked towards consensus on how to define “the edge.” I believe we made the right decisions for their needs, but I wanted to continue the discussion here. I imagine most companies have this discussion at some point, and will keep having it as new technologies evolve.
At one time “the edge” of any given network was easy to find: it was the last router between you and the upstream access provider. Today “the edge” is increasingly difficult to find, and this has implications for the fundamentals of Information Technology (IT), including patching and password policies, as well as for the most complicated questions, including privacy, monitoring and forensics. We now have to evaluate many details to determine where “the edge” truly is, including PDAs, company laptops with VPN access, employees' home systems, thumb drives, and outside vendors and contractors.
The most important implication of defining what constitutes “the edge” is that it determines how customers and staff will be able to access servers, services and storage. By clearly defining “the edge” we can then define which services will be publicly accessible and which will be restricted by VPN access, firewalls, or other mechanisms. Defining “the edge” also gives us a baseline for policies on information management, information tracking and information retention. These are critical areas in today's world of compliance; being able to say precisely who accessed and stored what, and when, is almost a necessity.
When defining “the edge”, I start by listing all possible devices (laptop, desktop, thumb drive, PDA, cell phone, etc.) that an employee or partner could use to access data that is not publicly available. The list should cover both the devices currently allowed and the technologies that could be adopted later. The data in question could include sales presentations, engineering documents, support forums, or any other data that is intentionally kept private to provide a competitive edge in your industry.
Second, I list where those devices could possibly be used (office, Starbucks, employee's home, airport, restaurant, etc.). This is important for understanding the exposure those devices create, including being lost or stolen, or a staff member's conversation being overheard by an outside party. This list should include the associated risks and the likelihood of each occurring at each location. The chance of a desktop system being stolen from the office is relatively low compared to a laptop being stolen at the coffee shop. This does not imply that less security should be used to protect data on office systems, only that different techniques should be employed to do so.
The final component of defining “the edge” is defining appropriate policies for each device based on the risk to the device and its associated data. This includes a cost-benefit tradeoff analysis of which devices should be allowed and which should not because of the level of risk they pose. These policies should take into account technologies like full disk encryption, passwords and non-reusable password generators, Virtual Private Network (VPN) technologies, and physical security like cable locks for laptops. Each potential technology is a tool to lower the risk and increase the reward of offering various tools and capabilities to employees.
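One way to make the device and location lists comparable is to score each pairing as likelihood times impact. The sketch below is only an illustration: the likelihood values, the 1 to 5 impact scale, and the names `LIKELIHOOD`, `IMPACT`, `risk_score` and `ranked_risks` are all assumptions, not figures from any real assessment.

```python
# Hypothetical likelihood (0 to 1) that a device is lost, stolen, or
# overheard at a given location. Real values would come from your own
# assessment; these are placeholders.
LIKELIHOOD = {
    ("desktop", "office"): 0.01,
    ("laptop", "office"): 0.02,
    ("laptop", "coffee shop"): 0.10,
    ("thumb drive", "airport"): 0.20,
}

# Hypothetical impact score (1 = minor, 5 = severe) for the data that
# each device type typically carries.
IMPACT = {"desktop": 4, "laptop": 4, "thumb drive": 3}


def risk_score(device, location):
    """Return likelihood x impact for one device/location pair."""
    return LIKELIHOOD[(device, location)] * IMPACT[device]


def ranked_risks():
    """Return (device, location, score) tuples, highest score first."""
    pairs = [(d, loc, risk_score(d, loc)) for (d, loc) in LIKELIHOOD]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    for device, location, score in ranked_risks():
        print(f"{device:12s} {location:12s} {score:.2f}")
```

The ranking is a starting point for the policy discussion, not a substitute for it: a low score for the office desktop still calls for protection, just with different techniques.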
Ultimately, this is a discussion about which risks are outweighed by their benefits in a business setting. Staff can often gain significant productivity by having access to laptops, PDAs and other mobile devices. The company must weigh that additional productivity against the risk of a company device becoming compromised.
The concept of “the edge” is always going to be present for a company's IT infrastructure. As Web 2.0 and associated architectures grow, the ability to present more and more tools and capabilities to staff is only going to increase. By properly laying the groundwork for how staff securely access these systems, a company can ensure that new tools increase productivity without increasing the risk to the company.
Thursday, August 28, 2008
Wednesday, August 27, 2008
Risk Management in HPC
Risk management is a very broad topic within the project management space. It covers planning for and understanding even the most unlikely possibilities within a project, so that a plan is in place to respond to these situations and mitigate risk across the project. I will focus on risk management specifically in High Performance Computing (HPC) deployments. HPC, like any other specialty area, has its own specific risks and possibilities. Within HPC these risks are both procedural and technical in nature, and both kinds have equal implications for the overall delivery of a successful solution.
Risk management in any project begins with a risk assessment, which includes identifying both the risks and the possible risk mitigation techniques. This can be done through a variety of methods, including brainstorming, the Delphi technique, or referencing internal documentation about similar, previous projects. This initial assessment phase is critical to ensure that both risks and responses are captured. Capturing both up front allows for better communication around the known risks and better preparation for managing unknown risks. The risk assessment will produce a risk matrix: the documented list of possible risks to a project and their mitigation or response plans. The risk matrix will become part of the overall project delivery plan.
Risk Matrix
When beginning any HPC project, either an initial deployment or an upgrade, it is important to develop a risk matrix. This can include both known risks (late delivery, poor performance, failed hardware) and unknown risks. The unknown risks category is, by its nature, much more difficult to define, but a common approach is to define levels of severity and responses. These responses can include procedural details, escalation details, communication information and documentation about the problem to prevent a recurrence.
This matrix should include a variety of information, including:
- Risk Name – Should be unique within the company to facilitate communication between groups and departments.
- Risk Type – Minimal, Moderate, Severe, Extreme, etc.
- Cost if this risk occurs – This can be in time, money or loss of reputation, or all of the above.
- Process to recovery – It is important to document early on how to respond to the risk and correct any problems that have developed because of the risk.
- Risk Owner – Often a specific individual has additional experience dealing with a specific risk and can work as a Subject Matter Expert (SME) for the project team.
- Outcome documentation – Clearly defining what should be documented should the risk occur, so that it can be responded to.
- Communication Channels – Different risks require that different staff and management become engaged. It is important to document who should be involved should a risk occur.
- Time Component – Every risk has a response, and every response has a time component associated with it. Understanding these time components up front allows project management staff to adjust schedules accordingly should a risk occur.
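The fields above map naturally onto a small record type. The following is a minimal sketch, assuming a Python dataclass with one attribute per item in the list; the class names `RiskEntry` and `RiskMatrix`, the field names, and the choice of days as the time unit are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RiskEntry:
    """One row of the risk matrix; field names are illustrative."""
    name: str                   # unique within the company
    risk_type: str              # e.g. "Minimal", "Moderate", "Severe", "Extreme"
    cost: str                   # time, money, reputation, or a combination
    recovery_process: str       # how to respond and correct the problem
    owner: str                  # subject matter expert for this risk
    outcome_documentation: str  # what to record if the risk occurs
    communication_channels: list = field(default_factory=list)
    response_days: float = 0.0  # time component of the response


class RiskMatrix:
    """Keeps entries keyed by name so that names stay unique."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        if entry.name in self._entries:
            raise ValueError(f"duplicate risk name: {entry.name}")
        self._entries[entry.name] = entry

    def schedule_impact_days(self, occurred):
        """Sum the response time of every risk named in `occurred`."""
        return sum(self._entries[name].response_days for name in occurred)
```

Keying the matrix by name enforces the uniqueness the first bullet asks for, and the time component gives project management a simple number to add to the schedule when a risk occurs.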
Known Risks
Known risks are often the easiest for people to plan for, but very difficult to handle. Understanding and anticipating the risk or problem up front can fool us into believing we know the best response, when often the only way to truly understand how to respond to a problem is to get it wrong one or more times.
Let's explore some common risks that are specific to HPC deployments, and the most common mitigation strategies to combat them:
Application Scaling
A fundamental premise of HPC is that applications should scale so that more hardware produces more accurate results and/or more efficient production of data. Because of this, an application is often expected to scale as well on 128 nodes, and often many more, as it does on 64. This type of scalability must be architected into the application as it is written and improved on as hardware performance evolves over time. Every time a newer, faster or bigger cluster is installed, there is an inherent risk that the applications previously used will not scale properly on the new platform.
Often the best mitigation strategy for this risk is proper planning, testing and benchmarking before system deployment. The most difficult time to manage an application scaling problem is after a customer's hardware has been delivered and installed. By benchmarking and testing the application prior to shipment, expectations with the customer can be properly set. It also allows time to work with any development teams to troubleshoot scaling problems and correct them before presenting results and completing acceptance testing with the customer.
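When benchmarking before shipment, a simple number to track is parallel efficiency: the measured speedup divided by the ideal speedup for the added nodes. The sketch below assumes you have wall-clock timings for the same job at two node counts; the function names and the 0.8 threshold are hypothetical choices, not an industry standard.

```python
def parallel_efficiency(base_nodes, base_seconds, nodes, seconds):
    """Measured speedup divided by ideal (linear) speedup.

    1.0 means perfect scaling from base_nodes to nodes; lower values
    mean the extra hardware is not being fully used.
    """
    speedup = base_seconds / seconds
    ideal_speedup = nodes / base_nodes
    return speedup / ideal_speedup


def scales_acceptably(base_nodes, base_seconds, nodes, seconds, threshold=0.8):
    """Flag a run whose efficiency falls below a chosen threshold.

    The 0.8 default is an arbitrary placeholder; the acceptable value
    should be agreed with the customer before acceptance testing.
    """
    return parallel_efficiency(base_nodes, base_seconds, nodes, seconds) >= threshold


if __name__ == "__main__":
    # Hypothetical timings: 1000 s on 64 nodes, 550 s on 128 nodes.
    eff = parallel_efficiency(64, 1000.0, 128, 550.0)
    print(f"efficiency going from 64 to 128 nodes: {eff:.2f}")
```

Recording this figure for each node count during benchmarking gives both sides a concrete expectation to check against during acceptance testing.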
Facility Limitations
HPC solutions often use large amounts of power, cooling and space within a data center compared to a company's business support systems or database-centric systems. Because the facility needs of HPC are so large, it is very common for customers to underestimate them, or for the numbers to be poorly communicated from a vendor to a customer. The power and cooling requirements can also vary widely based upon the customer's final use and intended application of the cluster.
All facility design issues should be managed and planned for before hardware is shipped or systems are assembled. To ensure a smooth cluster delivery, it is critical that site planning and assessment be done as part of the system design. This site planning should ensure there is enough power, cooling and space to accommodate the cluster. It should also ensure the power and cooling are in the proper places and can be directed to the cluster in the recommended fashion.
Mean Time Between Failures (MTBF)
MTBF is a calculation used to understand how often components across a single node or cluster will fail. It combines the designed life expectancy of every individual component into an expected time between component failures. These component failures can either be severe enough to impact the whole system, or affect just portions of a cluster, depending on the cluster's design. A completed cluster will often fail in unexpected ways because of the MTBF characteristics of putting large numbers of compute nodes in a single fabric. If proper redundancy is not built into critical systems of the cluster, a customer satisfaction issue can develop because of prolonged and unplanned outages.
By properly assessing all uptime requirements from the customer, a system can be designed that will provide the uptime necessary to conduct business regardless of the collective MTBF across all components. Each individual service and capability of the cluster should be assessed to ensure that the proper level of redundancy, including clustered nodes, redundant power, and redundant disks, is included with the complete solution.
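To see why large clusters fail more often than any single node would suggest, consider the standard reliability approximation for independent components with constant failure rates: failure rates add, so the MTBF of the whole is the reciprocal of the summed rates. The sketch below applies that approximation; the function names and the 100,000-hour node MTBF are hypothetical, and real components will not follow the model exactly.

```python
HOURS_PER_YEAR = 8760


def system_mtbf_hours(component_mtbf_hours):
    """Expected hours between failures anywhere in the system.

    Assumes independent components with constant failure rates, where
    any single component failure counts as an event.
    """
    total_failure_rate = sum(1.0 / m for m in component_mtbf_hours)
    return 1.0 / total_failure_rate


def expected_failures_per_year(component_mtbf_hours):
    """Average number of component failures to plan for each year."""
    return HOURS_PER_YEAR / system_mtbf_hours(component_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical cluster: 1000 nodes, each with a 100,000-hour MTBF.
    nodes = [100_000.0] * 1000
    print(f"cluster MTBF: {system_mtbf_hours(nodes):.0f} hours")
    print(f"failures per year: {expected_failures_per_year(nodes):.1f}")
```

Under these assumptions, a node that fails once in more than a decade becomes, at a thousand nodes, a cluster that sees a failure roughly every four days, which is why redundancy has to be designed around the customer's uptime requirements rather than around per-component figures.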
Performance Targets, I/O and Compute
Performance guarantees are often included in HPC proposals to give customers a level of comfort when estimating job completion times and doing capacity planning for the organization. These numbers can become a source of concern as a system is brought online if compute capacity or I/O throughput falls short of what was promised.
There are often misunderstandings with complete cluster deployments about a cluster's capability for sustained versus peak performance. Sustained performance is most often the number used for a representative test of how the system will perform over its life cycle, whereas peak performance is the number often stated for bragging rights, because it is the theoretical maximum potential of a given cluster.
There is very little that can be done after delivery of a system if this type of risk comes up, other than giving the customer additional hardware to pull the sustained performance number up to the peak performance number. This can be a very expensive response. This is the reason that the staff doing HPC architecture must fully understand application benchmarking and performance when designing new clusters. All numbers should also be reviewed by multiple people to ensure that errors in math or testing methodology do not go unnoticed.
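The gap between peak and sustained performance, and the cost of closing it with hardware, can be estimated with simple arithmetic. This sketch assumes the usual definition of theoretical peak (nodes x cores x clock x floating-point operations per cycle) and assumes sustained performance scales linearly with node count, which is optimistic. The function names and all figures are hypothetical.

```python
import math


def peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak of the cluster in GFLOPS."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def nodes_for_target(target_gflops, measured_gflops, measured_nodes):
    """Nodes needed to sustain target_gflops.

    Assumes sustained performance per node stays constant as nodes are
    added (linear scaling), so the result is a lower bound.
    """
    per_node = measured_gflops / measured_nodes
    return math.ceil(target_gflops / per_node)


if __name__ == "__main__":
    # Hypothetical cluster: 128 nodes, 8 cores each, 2.5 GHz, 4 flops/cycle.
    peak = peak_gflops(128, 8, 2.5, 4)
    sustained = 7168.0  # hypothetical benchmark result in GFLOPS
    print(f"peak {peak:.0f} GFLOPS, efficiency {sustained / peak:.0%}")
    print(f"nodes to sustain the peak figure: {nodes_for_target(peak, sustained, 128)}")
```

Running this kind of check before a proposal goes out makes it clear which number is being promised and how many extra nodes a shortfall would cost.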
Unknown Risks
Planning for unknown risks can often be the most stressful, but can yield the most gains when actually responding. This is because there are no preconceptions, which leaves room to be very creative with responses and future mitigation strategies. Planning for unknown risks is often an exercise in understanding the levels of severity that could occur with a problem, and associating each with the appropriate level of response and future prevention.
When defining response strategies for unknown risks, often the first step is to define levels of severity that could develop from any given problem. A common list is:
- Most severe level of risk, requires an executive management level response to the customer and has a high percentage cost to the project (greater than 50% of project revenue is at risk).
- Severe level of risk, requires an executive level response and carries a medium level of financial risk (less than 50% of project revenue is at risk).
- Medium level project risk, requires a senior management response, may or may not have a financial impact on the project, but does have a deliverable and schedule component.
- Lower level risk, has an impact on the project schedule, but no negative impact on project financials.
- The lowest level of project risk, often just a communication issue with a customer or a potential misunderstanding. Often no schedule or financial impact to the project.
For each level of severity, the response plan should then cover:
- Steps to research and understand the problem.
- Communication channels: who needs to be communicated with for a problem of this magnitude, and how. This needs to include both the customer and company contacts that will be necessary to correct the problem.
- A flow chart for responding. This is the path to determining the appropriate response and deciding if more resources, either financial or staffing, are needed to correct the risk.
- Documentation to prevent future occurrences. It is important to ensure that any information about the problem is gathered and documented for in-house use to prevent future occurrences of the same risk.
- A risk closure document. This is a checklist to document that all protocols were followed and the risk was corrected. It should include confirmation that the risk will not return on the same project because mitigation techniques have been implemented.
Path Forward
The most important component of risk management is skill and experience development. It is important to ensure that, as a company, you have processes to document all experience gained through risk management on your projects. This knowledge must be documented so that other teams, new teams and new staff can learn from the company's previous experience.
The better the job done documenting risk responses and lessons learned, the more efficiently companies can scope future projects. This allows companies to assess costs for future projects much more accurately, as well as risk versus reward tradeoffs for large, complex projects. Ultimately the best way to manage risk is to understand it before beginning actual deployment and implementation on a project. This comes from combining all data collected on previous projects with techniques like brainstorming and the Delphi technique to ensure as many risks as possible are documented with appropriate response plans.
Monday, August 11, 2008
Platform Decisions - OS Choices
Following up on my previous discussion around platform decisions and solution architecture, I wanted to dive into Operating System (OS) choices. This is a difficult choice for many companies because of competing priorities, experience, training levels, costs and ultimately loyalty to one OS or another. Choosing an OS for production use is also difficult because the options change so frequently, and the applications you will ultimately use may not work with your preferred OS.
With most companies, this is rarely a discussion around which OS will be used exclusively. More often it is a discussion around which OS will be added to or eliminated from the infrastructure to lower administration costs, lower maintenance costs or increase capabilities. Often, companies also break down an OS choice into groups, either by server and desktop class systems or by departmental needs. This separation can be very beneficial when discussing any changes, because it allows the people doing the assessment to clearly define needs and balance them within focused groups. Trying to balance needs across a large company can often prove difficult to impossible.
Primary Reasons for Change
First, let's explore the primary reasons a company would change the mix of OSs already in use within its IT organization.
Lower Training Costs: Today Linux is the predominant OS within the education community. As a result, new staff entering the workforce are experienced and knowledgeable in working with Linux-based systems. This matters because using an OS that potential staff already know limits the training required to get and keep them proficient at system administration. Companies will often eliminate an OS because staff skills in it have declined and the cost of keeping them trained continues to rise as the OS's dominance fades. This was seen primarily with the large UNIX variants (Solaris, HP-UX, AIX). Over time companies limited their use because students were no longer arriving from college with these skills, and existing staff were spending more and more time keeping up with training on these platforms.
Increase Performance: Performance is often a primary reason to evaluate a new OS. Most application vendors today support a very narrow subset of the available OSs on the market, and they focus their tuning and performance work on those OSs. At times companies can get a 20% to 30% improvement in application performance by moving the application to a better supported and tuned OS.
Lower Maintenance Costs: In today's world, where Open Source is becoming more and more dominant in business, companies are reviewing their traditional support and licensing models. There is a multitude of options available today, from OSs that have no cost to use to OSs that charge for every instance in use. A company with legacy OSs in place has an opportunity to review how it negotiates support contracts in light of these new models, so that it is paying for an appropriate level of support for all systems.
Increase Capabilities: Being able to provide a capability that was previously not available is a major reason companies look at adding new OSs to their existing enterprise. Today's application vendors rarely support all possible operating systems. More often than not, they choose a subset of OSs that they feel will best cover their potential market. Companies are constantly evaluating new applications for potential benefits to the company's bottom line, and as part of this, a new OS must often be brought in for the administration team to manage so that those applications can be added to the enterprise.
Assessment Questions
Second, let's look at some questions that can be asked when assessing possible OSs for use in your environment:
In addition to the financial questions for each OS, a company must consider the life cycle of the OS. Most OSs have formal release schedules for patches, upgrades and subsequent versions. It is important to evaluate any new OSs with these details in mind. It can end up being quite costly if an OS hits its end of life and you have to rapidly stop using it and migrate the work load to another platform, where as be carefully evaluating the roadmaps for the OS, you can make an informed decision that will successfully work with your in-house processes for support and upgrades.
Making a choice to add or eliminate an OS within a company can be a difficult one, both because of personal territorial issues, as well as complicated technical needs. It is important to focus on the true costs to the company related to the decision, this will ensure that training, implementation, licensing and support are factored in and staff fully understand the costs and ultimate decision.
With most companies, this is rarely a discussion around which OS will be used exclusively. More often it is a discussion around which OS will be added to or eliminated from the infrastructure to either lower administration costs, lower maintenance costs or increase capabilities. Often, companies also break down an OS choice into groups, either server and desktop class systems, or by departmental needs. This separation can be very beneficial when discussing any changes; it allows the folks doing the assessment to clearly define needs and balance them based on focused groups, trying to balance needs across a large company can often prove to be difficult to impossible.
Primary Reasons for Change
First, let's explore the primary reasons a company would change the mix of OSs already in use within its IT organization.
Lower Training Costs: Today Linux is the predominant OS within the education community. This creates an environment where new staff entering the work force are experienced and knowledgeable in working with Linux-based systems. This is important because using an OS that potential staff are experienced with limits the training that is required to get and keep them proficient at system administration. Companies will often eliminate an OS from use because staff skills are not at a peak for it, and costs to keep them trained at appropriate levels continue to rise as an OS's dominance disappears. This was primarily seen with the large UNIX variants (Solaris, HP-UX, AIX). Over time companies have limited their use because students were no longer coming out of college with these skills, and existing staff were spending more and more time keeping up with training on these platforms.
Increase Performance: Performance is often a primary reason to evaluate a new OS. Most application vendors today support a very narrow subset of the available OSs on the market, and they focus their tuning and performance enhancement resources on that subset. At times companies can get a 20% to 30% improvement in application performance by moving the application to a better supported and tuned OS.
Lower Maintenance Costs: In today's world, where Open Source is becoming more and more dominant in business, companies are reviewing their traditional support and licensing models. There is a multitude of options available today, from OSs that have no cost to use to OSs that charge for every instance in use. A company with legacy OSs in place has an opportunity to review how it negotiates support contracts in light of these new models so that it is paying for an appropriate level of support for all systems.
Increase Capabilities: Being able to provide a capability that was previously not available is a large reason companies look at adding new OSs to their existing enterprise. Today's application vendors rarely support all possible operating systems; more often than not, they choose a subset of OSs that they feel will best cover their potential market. Companies are constantly evaluating new applications for potential benefits to the company's bottom line. As part of this, a new OS must often be brought in for the administration team to manage so that the new applications, and the capabilities they provide, can be added to the enterprise.
Assessment Questions
Second, let's look at some questions that can be asked when assessing possible OSs for use in your environment:
- Why am I assessing my current install base of OSs? What is the goal of any changes?
- What is the current cost, both in licensing and training, for all current OSs we have deployed?
- What OSs are our staff skilled at administering? Include both currently utilized and non-utilized OSs.
- For any new OSs we are assessing, what will be the training cost to get staff proficient at maintaining them? What is the yearly cost to keep our staff's skills up to date?
- What level of OS support can be provided by in-house resources and what will need to be included with any purchased support agreements?
- What tools currently in place will need changes or license upgrades to support a new OS?
- Will this OS introduce security vulnerabilities that will be unreasonably difficult to manage?
- Is this system mission critical? Can the system utilize an OS with just community support and no formal SLAs?
- What is the yearly cost in support for this OS by itself? In relation to other OSs within the company?
- What percentage of staff in house are proficient on this versus other OSs?
- What is the support cycle for this OS? How much longer will the vendor provide patches without additional support contract costs being incurred?
- Does the vendor, both OS and application, provide a supported upgrade path to a newer version?
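The cost questions above can be put side by side with a small calculation. What follows is a minimal Python sketch, assuming hypothetical OS names, figures and field names (none of them come from a real assessment); it only sums yearly licensing, support and training costs per OS and sorts the candidates from cheapest to most expensive.

```python
def yearly_total(os_costs):
    """Sum the yearly cost components for one OS.

    `os_costs` is a dict with hypothetical keys: licensing, support, training.
    """
    return os_costs["licensing"] + os_costs["support"] + os_costs["training"]


def rank_by_cost(candidates):
    """Return (name, yearly total) pairs sorted from lowest to highest cost."""
    totals = [(name, yearly_total(costs)) for name, costs in candidates.items()]
    return sorted(totals, key=lambda pair: pair[1])


# Illustrative figures only.
candidates = {
    "os_a": {"licensing": 0.0, "support": 40000.0, "training": 15000.0},
    "os_b": {"licensing": 60000.0, "support": 20000.0, "training": 30000.0},
}

for name, total in rank_by_cost(candidates):
    print(f"{name}: {total:.0f} per year")
```

A ranking like this is only a starting point; the non-financial questions in the list, such as staff proficiency and support cycle, still have to be weighed alongside it.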
In addition to the financial questions for each OS, a company must consider the life cycle of the OS. Most OSs have formal release schedules for patches, upgrades and subsequent versions. It is important to evaluate any new OSs with these details in mind. It can end up being quite costly if an OS hits its end of life and you have to rapidly stop using it and migrate the work load to another platform, whereas by carefully evaluating the roadmaps for the OS, you can make an informed decision that will work with your in-house processes for support and upgrades.
Making a choice to add or eliminate an OS within a company can be a difficult one, both because of personal territorial issues and because of complicated technical needs. It is important to focus on the true costs to the company related to the decision. This will ensure that training, implementation, licensing and support are factored in and that staff fully understand the costs and the ultimate decision.
Tuesday, August 5, 2008
Tools for Effective Cluster Management
To continue my previous post on cluster management, I wanted to focus on the tools that are available for implementing and monitoring cluster health including process, hardware and configuration management.
There are two primary ways that one can go about building a change management and cluster management system. The first is going with a complete Linux stack solution that is integrated with a scheduler, monitoring utilities and OS deployment tools. The second is to build a suite of tools from the commercial or open source tools available in the field. Both have their benefits and tradeoffs; ultimately most firms use a combination of the two.
Types of Tools
There are several types of tools that are necessary to manage any cluster, large or small. The tools are categorized by the need they fill in the overall management of a cluster, including request tracking, change management, availability monitoring, performance monitoring and operating system deployment.
It is important when evaluating an HPC software stack, either complete or built from individual pieces, to ensure that each of these components is included, and evaluated for the capability they will provide versus similar, competing products.
Complete Stacks
Complete HPC stacks are becoming more common because of their ease of integration and integrated support models. Complete stacks usually consist of all the base software that is needed to deploy and manage a cluster, as well as the libraries needed for parallel job execution. These stacks significantly cut the time needed to deploy new clusters, and ensure that all initial software on the system is compatible and fully tested.
The difficulty with stacks is their set versions of libraries and smaller compatibility matrices. These stacks are very tightly integrated solutions that ensure they are compatible and stable. They can present a challenge for sites that have outside requirements for different versions of libraries and compilers than the complete stack provides. While this is a challenge for some complex installations, this standard set of tested and integrated libraries provides a much easier solution for companies just using mainstream ISV applications. The developers of the primary stacks on the market work to ensure their kernel and library versions are within the framework that the primary ISVs support and expect.
Individual Tools
Even in environments where a complete HPC stack solution has been deployed, there could be the need for additional tools to meet all operational requirements. The individual tools mentioned below can be used to fill some of these needs, as well as be used as a starting point for companies that decide not to use an integrated stack solution, but instead roll their own.
The primary benefit to rolling your own stack based on these and other tools is that it will much more closely meet your company's needs. The integrated stacks are meant as a solution to meet very broad HPC needs within a given customer base, but by developing a custom stack, a company can ensure all its specific needs are met and integrate the stack with existing company platforms. This integration can include management APIs that are similar to existing platforms, as well as data integration to ensure reporting, authentication and logging meet company standards.
Specific Tools
Sun HPC Software, Linux Edition (http://www.sun.com/software/products/hpcsoftware/index.xml) – The Sun Linux HPC Stack is an integrated solution of open source software for deploying and managing the compute resources within an HPC environment. It includes a variety of tools for performance and availability monitoring, OS deployment and management, troubleshooting and necessary libraries to support the primary interconnects on the market.
Rocks (http://www.rocksclusters.org/wordpress/) - Rocks is an open source, community driven integrated solution for deploying and managing clusters. It is based on a concept of rolls; each roll is specific to an application or set of tools that could be needed in an HPC environment. This modularity allows users to add the components they need as their needs evolve.
Trac (http://trac.edgewall.org/wiki/TracDownload) – Trac is a toolkit originally designed to be used in software development organizations. It has integrated capabilities for tracking bugs, release cycles, source code and a wiki for documenting notes and process information. These may all seem like software development specific capabilities, but they can all be used in very effective ways to better manage and document the associated processes for a cluster.
Request Tracker (http://bestpractical.com/rt/) - Request Tracker is an integrated tool for tracking, responding to and reporting on support requests. It is heavily used in call center environments, and works very well for HPC environments to track customer requests for support, requests for upgrades and other system changes.
RASilience (http://sourceforge.net/projects/rasilience/) - RASilience is built around Request Tracker with the Asset Tracker and Event Tracker add-ons. It is an interface and general-purpose engine for gathering, filtering, and dispatching system events. It can be used to provide event correlation across all nodes and other components within a cluster.
Nagios (http://www.nagios.org/) – Nagios is an open source monitoring solution built on the idea of plugins. Plugins can be developed to monitor a wide variety of platforms and applications, while reporting back to a central interface for notification management, escalation and reporting.
Ganglia (http://ganglia.info/) - Ganglia is a highly scalable, distributed monitoring tool for clusters. It is capable of providing historical information on node utilization rates and performance information via XML feeds from individual nodes, which can subsequently be aggregated for centralized viewing and reporting.
OneSIS (http://www.onesis.org/) - OneSIS is a tool for managing system images, both diskless and diskfull. OneSIS is an effective tool for ensuring that all images within a cluster are stored in a central repository, and integrated with the appropriate tools to utilize kickstart for installing new operating system images, as well as booting nodes in a diskless environment.
Sun Grid Engine (http://gridengine.sunsource.net/) - SGE is a distributed resource manager which has proven scalability to 38,000 cores within a Grid environment. SGE is rapidly being updated by Sun to more efficiently handle multi-threading and to improve launch times for jobs, as well as tty output for non-interactive jobs.
Cluster Administration Package (http://www.capforge.org/cgi-bin/trac.cgi) – CAP is a set of tools for integrating clusters. It is designed and tested to accomplish three main objectives: Information Management, Control and Installation. CAP is a proven tool for deploying and managing a centralized set of configuration files within a cluster, and ensuring that any changes to master configuration files are correctly propagated to all nodes within the cluster.
Cbench (http://cbench.sourceforge.net/) – Cbench is a set of tools for benchmarking and characterizing performance on clusters. Cbench can be used for both initial bring up of new systems, as well as testing of hardware that has been upgraded, modified or repaired.
ConMan (http://home.gna.org/conman/) - ConMan is a console management utility. It is most often used as an aggregator for a large number of serial console outputs within clusters. It can be used both to take console output and redirect it to a file for later reference, and to allow administrators to redirect output to a console in ReadWrite mode.
Netdump (http://www.redhat.com/support/wpapers/redhat/netdump/) - Netdump is a crash dump logging utility from Red Hat. The purpose of Netdump is to ensure that if a node with no console attached crashes, administrators have a reference point within logs to catch the crash and debug output.
Logsurfer (http://www.crypt.gen.nz/logsurfer/) - Logsurfer is a regular expression driven utility for matching incoming log entries and taking action based upon matches. Logsurfer can take a variety of actions based upon a match, including running an external script or counting the number of entries until a threshold is met.
Specific Tool Integration Techniques
These are some specific methods that some colleagues and I have used to integrate these tools into larger frameworks used for change management and monitoring within Enterprise Environments. These are meant as a way to show how the different tools, used in combination, can simplify cluster management and lower administration costs. All of these methods have also been tested at scales well beyond typical HPC systems today, including OneSIS and Cbench, which have been tested at scales of up to 4500 nodes.
OneSIS
OneSIS can be used in two primary ways within a cluster; each can be used independently or in combination. The first and most common is to assemble an image that is then deployed to all compute nodes and installed locally. OneSIS can also be used to distribute that image to all compute nodes so they can run in a diskless fashion, using the image from a central management server.
These methods can also be used in combination when preparing to upgrade a cluster. A new image can be developed and booted into a diskless mode on a subset of a cluster's nodes. Those nodes can then be used to test all applications and cluster uses to ensure the image is correct. Once that testing is complete, OneSIS can be used to ensure an exact copy of the tested image is installed on all compute nodes. This method ensures that no bad images are installed on the cluster, and that the majority of the cluster nodes can be left in place for production users while the new image is tested.
Nagios
Nagios is a very dynamic tool because of its ability to use plugins for monitoring and response. Plugins can be written for any variety of hardware within a cluster to ensure that it is online, is not showing excessive physical errors and does not need proactive attention. Nagios's dynamic nature also allows plugins that communicate with centralized databases of node information and report any hardware or node problems to RT for proper tracking and attention.
Nagios plugins can easily be used to remotely execute health check scripts on compute nodes. These health check scripts can check to ensure nodes are operating and responding correctly, that there are no hung processes that might affect future jobs, and that the node's configuration files and libraries are the expected versions. If Nagios does detect an error on a given node, it can easily be configured to automatically open an RT ticket for staff to repair the node, and mark the node offline in the job scheduler until such time as the node is repaired.
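A health check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the library name and version numbers are made up, and the inputs are passed in directly rather than gathered from a real node. What it does rely on is the standard Nagios plugin convention of printing one status line and returning exit code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_node(hung_processes, installed_versions, expected_versions):
    """Return (exit_code, one-line status message) for a single node."""
    # Any library or config file whose version differs from the expected one.
    mismatched = sorted(
        name for name, version in expected_versions.items()
        if installed_versions.get(name) != version
    )
    if mismatched:
        return CRITICAL, "CRITICAL - unexpected versions: " + ", ".join(mismatched)
    if hung_processes > 0:
        return WARNING, f"WARNING - {hung_processes} hung process(es)"
    return OK, "OK - node healthy"


def main():
    # Hypothetical inputs; a real check would collect these on the node itself.
    code, message = evaluate_node(
        hung_processes=0,
        installed_versions={"libexample": "1.2"},
        expected_versions={"libexample": "1.2"},
    )
    print(message)
    return code
```

To run this as an actual plugin, the script would end by passing the return value of main() to sys.exit(), since Nagios reads the exit code. Opening the RT ticket and marking the node offline in the scheduler would then be configured on the Nagios side in response to a non-OK result; that part is site specific and is not shown here.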
Cbench
Cbench is a wonderful tool for automating the process of both bringing up new clusters and testing hardware that has been repaired or replaced to ensure it meets the same benchmarks as other hardware in the cluster. Cbench has a collection of benchmarks that can be used to benchmark a new cluster to ensure that the system, storage, memory and attached file systems perform as designed. This can be a valuable tool in locating issues that were introduced during deployment and would ultimately cause performance decreases for users.
Cbench can also be used to ensure that all repaired hardware was repaired correctly before being reintroduced into the cluster. Properly benchmarking a cluster at installation time allows support staff to run identical benchmarks on nodes that have subsequently been repaired. These new results can be compared to the initial results from the cluster to ensure that the node is now operating at peak, expected performance.
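The comparison step itself is simple to automate. The sketch below is plain Python and does not use Cbench's own tooling or output format; the benchmark names, scores and the 5% tolerance are assumptions chosen for illustration. It flags any benchmark where a repaired node falls more than the tolerance below the baseline recorded at installation time.

```python
def find_regressions(baseline, repaired, tolerance=0.05):
    """Return benchmarks where the repaired node underperforms the baseline.

    Both arguments map benchmark name -> score, where higher is better.
    `tolerance` is the allowed shortfall as a fraction of the baseline score.
    A benchmark missing from `repaired` is also reported.
    """
    regressions = {}
    for name, expected in baseline.items():
        actual = repaired.get(name)
        if actual is None or actual < expected * (1.0 - tolerance):
            regressions[name] = (expected, actual)
    return regressions


# Illustrative scores only.
baseline = {"memory_bandwidth": 100.0, "disk_io": 50.0}
repaired = {"memory_bandwidth": 96.0, "disk_io": 40.0}
print(find_regressions(baseline, repaired))
```

A node with an empty result can be returned to production; anything else goes back to the repair staff with the specific benchmark that failed.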
Logsurfer
Logsurfer is best used as an aggregator and automated response mechanism within a cluster. Having all nodes send their logs to a central log host enables the cluster administrators to configure a single Logsurfer daemon to monitor and respond to the appropriate log entries.
Many sites will subsequently configure Logsurfer to proactively mark nodes in the scheduler offline if an error is found in the logs relating to that node. This ensures that no future jobs are run on the node until repair staff are able to verify the node is operating correctly and repair the reason for the initial error.
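The match-and-threshold idea can be illustrated in Python. This is not Logsurfer's configuration syntax; it is a sketch of the same logic, assuming a hypothetical log format of "hostname message" and made-up error strings. It counts matching entries per node and reports the nodes that reach a threshold, which is the point at which a site would call its scheduler to mark the node offline.

```python
import re
from collections import Counter

# Hypothetical log format: "<hostname> <message>"; the error strings are illustrative.
ERROR_PATTERN = re.compile(r"^(?P<node>\S+) .*(?:ECC error|I/O error)")


def nodes_to_offline(log_lines, threshold=3):
    """Return the sorted list of nodes whose error count reaches the threshold."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_PATTERN.match(line)
        if match:
            counts[match.group("node")] += 1
    return sorted(node for node, n in counts.items() if n >= threshold)


lines = [
    "n1 kernel: ECC error detected",
    "n1 kernel: ECC error detected",
    "n2 kernel: I/O error on sda",
    "n1 kernel: ECC error detected",
    "n3 sshd: session opened",
]
print(nodes_to_offline(lines))
```

Using a threshold rather than acting on the first match keeps a single transient error from pulling a node out of production.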
Final Thoughts
Clusters are complex mixes of hardware and software; the more effectively the tools are picked and integrated early in system design, the more efficiently the system can be managed. There are many tools available, both commercial and open source, that can be used in cluster environments. It is critical that each one's benefits, tradeoffs and scalability be weighed when picking the tools for your environment.
As a final thought, clusters are complex solutions that often require customization at every level. This can also be extended to the applications used to manage the cluster, although that was not mentioned previously in this document. It is always an option to develop a tool in house for your needs; chances are, if you have a need, so does someone else. The majority of the tools above were developed because a single company had a need, developed a tool to meet that need and put the tool back into the community for everyone else to use. This is a wonderful way not only to continue improving the capabilities we as a community have around clusters, but also to get company recognition in a rapidly growing field.
Friday, August 1, 2008
Enterprise Architecture versus Solution Architecture
Recently I have been in several customer meetings where the customer's newly hired Enterprise Architect (EA) joined to listen in and provide feedback. Most of these meetings were to discuss an individual cluster or system that was being implemented, and it seemed that most EAs these days are still too focused on systems, solutions and details, and pay too little attention to the true activities I see as relevant for an Enterprise Architect. I decided to throw my own comments out there about where an EA falls within an organization, and how that differs from what I call Solution Architects.
The way I see it, a Solution Architect is more closely associated with what I see as technical sales people. They focus on the individual system or application: the details of what software packages will work, what a good support model is, and how to implement it within the company's framework that is defined by the Enterprise Architect.
I then see the Enterprise Architect as a pathway between the company's business goals and the IT personnel that must deliver tools to meet and track those goals. The EA's goal is to define a set of policies at the company-wide level that ensure things like legal compliance, consistent identity management and company-wide reporting capabilities.
If an EA gets too involved in the Solution Architect level details, the company suffers because those higher level activities are not being managed appropriately. A successful EA has both the willingness and capability to work with the company executives and turn their business vision into a technology vision and push that down to the Solution Architects and IT staff.