Monday, November 17, 2008

Security Planning in HPC

Today's high performance computing (HPC) solutions have many components, including compute nodes, shared storage systems, high capacity tape archiving systems and shared interconnects such as Ethernet and InfiniBand. One primary reason companies are turning to HPC solutions is the cost benefit of shared infrastructure that can be leveraged across many different projects and teams. While this shared usage model allows for managed, cost effective growth, it also introduces new security risks and requires policies and tools to ensure that previously separate data is managed properly in a shared environment.

The shared infrastructure model common in HPC raises several data security concerns that should be addressed before deploying a shared solution. Companies often have some departments working on sensitive projects while others work on very public ones; some firms work with their customers' proprietary data; and most companies face the threat of outside competitors trying to gain access to confidential work. All of these issues must be addressed in a shared HPC solution to ensure that data is always secure, that a reliable audit platform is in place and that security policies can change rapidly as company needs and policies change.

When evaluating an HPC solution to ensure data access is managed within company policy, there are several components within the cluster that should be reviewed separately:

Shared file systems – Today's HPC solutions owe much of their rapid success to the availability of massively parallel file systems. These are scalable solutions for very high speed I/O and are often available on every node within a cluster.

Databases – More than ever, companies are using databases to organize massive amounts of both transactional and reporting data. These databases are often paired with HPC solutions to analyze the data in a scalable and reliable way, and they frequently contain a variety of data including sales, forecasting, payroll, procurement and scheduling information, just to name a few.

Local disk – More often than not, compute nodes include local disks that hold the operating system and swap space. Swap space and any temporary file systems give users a place to store data while jobs are running, but they are also locations that must be assessed to ensure access is limited to those who need it.

Compute node memory – Compute nodes also have local physical memory that could be exploited through software flaws to allow unexpected access.

Interconnects – Today's HPC systems often use a high speed interconnect such as InfiniBand or 10 Gigabit Ethernet. Like any other network connection, these present the opportunity for sniffing or otherwise monitoring traffic.


Policies
Today's companies often work for a variety of customers as well as on internal projects, and it can be a complicated balancing act to ensure that data access policies properly handle all of those cases. Some data will require very restrictive policies, while other data will require a very open policy around usage and access. Often, separate filesystems can be used to store data in manageable locations with access granted according to company policy.

There are two primary steps in developing these security policies. The first is to assess the risk associated with each component of the system; this risk assessment can include costs in dollars, costs in time and the damage to public perception if data were handled contrary to industry best practices or legal guidelines. Policies can then be developed to mitigate that risk to acceptable levels.
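
To make this concrete, risk is often quantified as expected loss: the cost of an incident times its likelihood. The Python sketch below computes a simple annualized loss expectancy per cluster component; every component name, dollar figure and incident rate in it is an illustrative assumption, not a recommendation.

    # Minimal sketch of a quantitative risk score for cluster components.
    # All figures below are illustrative assumptions.
    components = {
        # name: (estimated cost per incident in dollars, expected incidents per year)
        "shared filesystem": (250_000, 0.10),
        "database":          (500_000, 0.05),
        "local disk":        (50_000,  0.20),
        "interconnect":      (100_000, 0.02),
    }

    for name, (cost_per_incident, incidents_per_year) in components.items():
        # Annualized loss expectancy: per-incident cost times expected frequency.
        ale = cost_per_incident * incidents_per_year
        print(f"{name:20s} annualized loss expectancy: ${ale:,.0f}")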

Some common methods to mitigate risk across the above components are:

Data Isolation – Within a shared computing environment, data can be isolated in a variety of ways, including physical isolation using separate storage arrays, logical isolation using technologies like VLANs and access restrictions like file permissions.

Audit Trails – It is important to consider audit trails and how to implement them. They provide a path to isolating and resolving problems and help ensure that legal compliance regulations are met. Audit trails can include system log files, authentication log files, resource manager logs and many others, together providing end to end documentation of a user and their activities.
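
As a small illustration of mining one of those audit trails, the sketch below summarizes failed SSH logins from an authentication log. It assumes the common OpenSSH syslog message format and a Debian-style log path; adjust both for your distribution.

    # Sketch: scan an authentication log for failed SSH logins and
    # summarize attempts per user and source address.
    import re
    from collections import Counter

    LOG_PATH = "/var/log/auth.log"  # assumption; often /var/log/secure on Red Hat
    PATTERN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

    failures = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                user, addr = match.groups()
                failures[(user, addr)] += 1

    for (user, addr), count in failures.most_common(10):
        print(f"{count:5d} failed logins for {user!r} from {addr}")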

Consistent Identity Management – To ensure that data is accessed by the correct individuals and that audit trails are consistent, it is important to have identity management solutions in place that handle HPC environments and other enterprise computing resources in a consistent manner. Identity management can be provided by tools like LDAP and Kerberos, as well as by more advanced authentication and authorization systems.
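
One practical consequence of consistent identity management is that a user should resolve to the same UID and groups on every node. The sketch below checks that through the standard name service (which may be backed by LDAP or Kerberos behind the scenes); the username and expected UID are illustrative assumptions.

    # Sketch: verify a user's identity resolves consistently on this node.
    import pwd, grp

    def check_identity(username, expected_uid):
        try:
            entry = pwd.getpwnam(username)
        except KeyError:
            return f"{username}: not found in the identity store"
        if entry.pw_uid != expected_uid:
            return f"{username}: UID {entry.pw_uid} does not match expected {expected_uid}"
        groups = [g.gr_name for g in grp.getgrall() if username in g.gr_mem]
        return f"{username}: UID {entry.pw_uid}, groups: {', '.join(groups) or '(none)'}"

    # "hpcuser" and UID 5001 are placeholders for illustration.
    print(check_identity("hpcuser", 5001))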

Notifications – Notifications are an important part of the change management process within an HPC environment. They can include alerts to security staff, administrators or management that portions of the cluster are out of company compliance or that attempts to access restricted resources have occurred. Notifications can come from a variety of tools within an HPC environment, but they should be uniform in format and content so that staff can respond rapidly to unexpected cluster issues.

Data Cleanup – Jobs within an HPC environment often create temporary files on individual nodes as well as on shared filesystems. These files affect a system's risk assessment and should be cleaned up once they are no longer needed. Removing all data that is not needed limits the data that must be accounted for, as well as the potential exposure if a single system is compromised.
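
A minimal cleanup pass might look like the sketch below, which removes scratch files older than a retention window. The scratch path and seven day retention are assumptions; in practice this is usually wired into the scheduler's job epilogue rather than run blind.

    # Sketch: purge scratch files older than a retention window.
    import os, time

    SCRATCH_DIR = "/tmp/hpc-scratch"   # assumed node-local scratch location
    MAX_AGE_SECONDS = 7 * 24 * 3600    # assumed 7-day retention policy

    now = time.time()
    for dirpath, dirnames, filenames in os.walk(SCRATCH_DIR, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
                    os.remove(path)
                    print("removed", path)
            except OSError:
                pass  # file vanished or permission denied; skip it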

We have just finished reviewing risk assessments within an HPC environment. These allow management and administrators of HPC systems to understand the costs (political, financial, time) of any failure in security plans or processes. Beyond understanding risk, there is the added complication of enforcing these policies in a way that is consistent across the cluster and the company and that provides a proper audit trail. The most common ways to implement these security policies in software are:

File System Permissions – File system permissions are the most common place to implement security controls, and one of the easiest to put in place and verify. These permissions allow administrators to grant and deny access to data at the lowest level, based on need. By themselves they do not block every avenue of unauthorized access, but they do help ensure that day to day operation of the system is reliable and secure.
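
Verifying compliance with a permissions policy can be as simple as walking a directory tree and flagging anything readable or writable by "other". The sketch below does exactly that; the project directory is an illustrative assumption.

    # Sketch: flag files in a restricted area that are accessible to "other".
    import os, stat

    PROJECT_DIR = "/shared/projects/confidential"  # assumed restricted area

    for dirpath, dirnames, filenames in os.walk(PROJECT_DIR):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.stat(path).st_mode
            except OSError:
                continue  # unreadable or vanished; skip
            if mode & (stat.S_IROTH | stat.S_IWOTH):
                print(f"world-accessible: {path} ({stat.filemode(mode)})")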

Centralized Monitoring – Centralized monitoring and policy management are key to ensuring consistent security and minimizing human error. A central repository for all log entries allows staff to build tools that rapidly catch unauthorized or unexpected activity and respond with the proper speed. Centralized policy management, through tools like identity management, allows staff to quickly add or remove access based on business needs. By centralizing policy management, a company can automate the often manual process of removing access and put proper checks in place to ensure access changes are applied accordingly.
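
Getting log entries off individual nodes and into that central repository can be done with plain syslog forwarding. The sketch below uses Python's standard logging module to send audit events to a central collector; the collector hostname is an assumption.

    # Sketch: forward audit events from a compute node to a central
    # syslog collector so they land in one searchable place.
    import logging
    from logging.handlers import SysLogHandler

    handler = SysLogHandler(address=("loghost.example.com", 514))  # assumed collector
    handler.setFormatter(logging.Formatter("hpc-audit: %(levelname)s %(message)s"))

    logger = logging.getLogger("hpc-audit")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    logger.warning("user alice denied access to /shared/projects/confidential")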

Resource Manager – Most modern clusters use a job scheduler, or resource manager, to allocate nodes and other resources to individual users' jobs. Most schedulers allow resources to be grouped and those groups restricted to particular users. By extending this functionality it is possible to restrict users' jobs to systems that hold data they are allowed to see, and to ensure they cannot access nodes with filesystems they do not have permission to use. The resource manager is a centralized tool that provides great flexibility in ensuring users have access to the resources they need, and nothing more.
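
Schedulers typically expose this through their own hooks (submit filters, plugins or access lists), but the core check is simple: does the submitting user belong to a group permitted on the requested partition? The sketch below shows that logic in Python; the partition-to-group map and the example user are assumptions.

    # Sketch of a job-submission check: allow a job only if the user
    # belongs to a Unix group permitted on the requested partition.
    import grp, pwd

    # Assumed mapping of partitions to the groups allowed to use them.
    PARTITION_GROUPS = {
        "restricted": {"projecta"},
        "general":    {"users", "projecta", "projectb"},
    }

    def may_submit(username, partition):
        allowed = PARTITION_GROUPS.get(partition, set())
        primary = grp.getgrgid(pwd.getpwnam(username).pw_gid).gr_name
        supplementary = {g.gr_name for g in grp.getgrall() if username in g.gr_mem}
        return bool(allowed & (supplementary | {primary}))

    # Example: would "alice" (a placeholder user) be allowed on "restricted"?
    try:
        print(may_submit("alice", "restricted"))
    except KeyError:
        print("no such user on this node")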

Mounted File Systems – HPC environments often use automated tools to unmount and remount filesystems based on user access needs. Unmounting a filesystem that a given user does not require adds a level of access protection above file permissions, helping ensure that only authorized users can reach the data on that filesystem.
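
A simplified version of such a tool is sketched below: it mounts a project filesystem only after confirming the requesting user belongs to the authorizing group. The export, mount point and group name are assumptions, and real deployments usually drive this from a scheduler prologue or an automounter.

    # Sketch: mount a project filesystem only for authorized users,
    # adding a second gate on top of file permissions. Requires root.
    import grp, subprocess

    EXPORT = "fileserver:/exports/projecta"   # assumed NFS export
    MOUNT_POINT = "/mnt/projecta"             # assumed mount point
    AUTHORIZED_GROUP = "projecta"             # assumed authorizing group

    def mount_for(username):
        members = grp.getgrnam(AUTHORIZED_GROUP).gr_mem
        if username not in members:
            raise PermissionError(f"{username} is not in {AUTHORIZED_GROUP}")
        # mount(8) returns nonzero on failure; check=True raises in that case.
        subprocess.run(["mount", EXPORT, MOUNT_POINT], check=True)

    def unmount():
        subprocess.run(["umount", MOUNT_POINT], check=True)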


Shared infrastructure is a challenge in any environment when assessing security solutions. It means additional precautions must be taken in implementation and in security policies to ensure that data and resources are used as expected and only by authorized individuals. Planning for a shared environment should begin with a risk assessment to understand which components of the solution could be exploited and what the costs in time and money would be if that occurred. That risk assessment can then be used to ensure the proper safeguards are implemented with available technologies, reducing the risk to a manageable and acceptable level for the company. Ultimately, all safeguards should be implemented in a way that limits the potential for accidental failures and reduces the need for manual administration and intervention. Shared resources are a challenge, but when properly managed they can deliver better overall utilization for a company without sacrificing security.
