Merging Business and IT: November 2008

Today's high-performance computing (HPC) solutions have many components, including compute nodes, shared storage systems, high-capacity tape archiving systems and shared interconnects such as Ethernet and InfiniBand. One primary reason companies are turning to HPC solutions is the cost benefit of shared infrastructure that can be leveraged across many different projects and teams. While this shared usage model allows for managed, cost-effective growth, it also introduces new security risks and requires policies and tools to ensure that previously separate data is managed properly in a shared environment.

The shared infrastructure model often used in HPC raises several data security concerns that should be addressed before deploying shared solutions. Companies often have some departments doing sensitive work while others work on very public projects; other firms work with their customers' proprietary data; and most companies face a threat from outside competitors trying to gain access to confidential work. All of these issues must be addressed in shared HPC solutions to ensure that data is always secure, that a reliable audit platform is in place and that security policies can be changed rapidly as company needs and policies change.

When evaluating an HPC solution to ensure data access is managed within company policy, there are several components within the cluster that should be reviewed separately:

Shared file systems – Today's HPC solutions have rapidly become successful because of the availability of massively parallel file systems. These are scalable solutions for very high-speed I/O and are often available on all nodes within a cluster.

Databases – More often than ever, companies are using databases to organize massive amounts of both transactional and reporting data. These databases are often paired with HPC solutions to evaluate the data in a scalable and reliable way. They often contain a variety of data including sales, forecasting, payroll, procurement and scheduling, to name a few.

Local disk – More often than not, compute nodes have local disks that provide a local operating system and swap space. This swap space, and possibly temporary file systems, can give users a place to store data while jobs are running, but it is also a location that must be assessed to ensure access is provided only to those who need it.

Compute node memory – Compute nodes also have local physical memory, which software flaws could expose to unexpected access.

Interconnects – Today's HPC systems often use a high-speed interconnect such as InfiniBand or 10 Gbit Ethernet. These, like any other type of network connection, present the opportunity for sniffing or otherwise monitoring traffic.

Policies
Today's companies often work for a variety of customers as well as on internal projects. Ensuring that data access policies properly handle those cases can be a complicated balancing act. Some data will require very restrictive policies, while other data will require a very open policy around usage and access. Often, separate filesystems can be used to ensure data is stored in manageable locations and access is granted pursuant to company policies.

There are two primary components to developing these security policies. The first is to assess the risk associated with each component of the system; this risk assessment can include costs in dollars, costs in time and the damage to public perception if data were handled in a way that violates industry best practices or legal guidelines. The second is to develop policies that mitigate that risk to acceptable levels.

Some common methods to mitigate risk across the above components are:

Data Isolation – Within a shared computing environment, data can be isolated in a variety of ways, including physical isolation using different storage arrays, logical isolation using technology like VLANs, and access restrictions like file permissions.

Audit Trails – Considering audit trails and how to implement them is important. This ensures both that there is a path to isolating and resolving problems and that legal compliance regulations are met. Audit trails can include system log files, authentication log files, resource manager logs and many others to provide end-to-end documentation of a user and their activities.

Consistent Identity Management – To ensure that data is accessed by the correct individuals and that audit trails are consistent, it is important to have identity management solutions in place that handle HPC environments and other enterprise computing resources in a consistent way. Identity management can be provided by tools like LDAP and Kerberos, as well as more advanced authentication and authorization systems.

Notifications – Notifications are an important part of the change management process within an HPC environment. Notifications can include alerts to security staff, administrators or management that portions of the cluster are out of company compliance, or that attempts to access restricted resources have occurred. Notifications can come from a variety of tools within an HPC environment, but should be uniform in format and information so that staff can respond rapidly to unexpected cluster issues.

Data Cleanup – Jobs within an HPC environment will often create temporary files on individual nodes, as well as on shared filesystems. These files affect a system's risk assessment and should be cleaned up once they are no longer needed. Removing all data that is not needed limits the data that must be accounted for, as well as the potential exposure if a single system is compromised.
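
As a rough illustration of the data cleanup item, the following Python sketch finds temporary files under a scratch directory that have not been modified for a given period and optionally removes them. The function names, the age threshold and the dry-run default are assumptions made for this example; a real cluster would more likely tie cleanup to the end of each job through its resource manager.

```python
import os
import time


def find_stale_files(scratch_dir, max_age_seconds, now=None):
    """Return sorted paths of files under scratch_dir older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for root, _dirs, files in os.walk(scratch_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                # lstat judges a symlink by its own timestamp, not its target's
                mtime = os.lstat(path).st_mtime
            except FileNotFoundError:
                continue  # the file vanished while we were scanning
            if now - mtime > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def remove_stale_files(scratch_dir, max_age_seconds, dry_run=True, now=None):
    """Delete stale files; with dry_run=True only report what would be removed."""
    stale = find_stale_files(scratch_dir, max_age_seconds, now=now)
    if not dry_run:
        for path in stale:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # already gone, nothing left to clean up
    return stale
```

Defaulting to a dry run means the list of candidate files can be reviewed, and logged for the audit trail, before anything is actually deleted.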

We have just finished reviewing risk assessments within an HPC environment. These allow management and administrators of HPC systems to understand the costs (political, financial, time) of any failure in security plans or processes. In addition to understanding risk, there is the added complication of enforcing these policies in a way that is consistent across the cluster, consistent across the company and provides a proper audit trail. The most common methods of software implementation for these security policies are:

File System Permissions – File system permissions are the most common place to implement security controls, as well as one of the easiest items to complete and verify for compliance. These permissions allow administrators at the lowest level to grant and deny access to data based on need. They do not help restrict back-door access by unauthorized individuals, but they do contribute to ensuring that day-to-day operation of the system is done reliably and securely.

Centralized Monitoring – Centralized monitoring and policy management are key to ensuring consistent security and minimizing human error. A central repository for all log entries allows staff to implement tools that rapidly catch any unauthorized or unexpected activity and respond with the proper speed. Centralized policy management through tools like identity management systems allows staff to quickly add or remove access based on business needs. By centralizing this policy management, a company can ensure that the often manual process of removing access is automated and that proper checks are in place to ensure access changes are applied accordingly.

Resource Manager – Most modern clusters make use of a job scheduler, or resource manager, to allocate nodes and other resources to individual users to complete jobs. Most schedulers allow the allocation of resource groups, and restrictions on those groups, to an individual user or users. By extending this functionality it is possible to restrict users' jobs to systems that hold data they are allowed to see, and to ensure they cannot access nodes with filesystems they do not have permission to use. The resource manager is a centralized tool that provides great flexibility in ensuring users have access to the resources they need, but no other resources or data.

Mounted File Systems – HPC environments will often use a variety of automated tools to unmount and remount filesystems based on user access needs. Unmounting a filesystem that is not required for a given user adds a level of access protection above file permissions, ensuring that only authorized users access the data contained on a given filesystem.
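
To make the file system permissions item a little more concrete, here is a minimal Python sketch of a compliance check that walks a directory tree and reports anything that grants access to users outside the owner and group. It assumes POSIX-style permission bits, and the function name is hypothetical; a policy could just as well check group ownership or ACLs instead.

```python
import os
import stat


def find_world_accessible(top):
    """Yield (path, mode_string) for entries under top that grant access to 'other'."""
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            path = os.path.join(root, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # unreadable or vanished entry; skipped in this sketch
            if stat.S_ISLNK(mode):
                continue  # symlink permission bits are generally not meaningful
            if mode & stat.S_IRWXO:
                # any of read/write/execute is set for users outside owner and group
                yield path, stat.filemode(mode)
```

The output of a check like this could be sent to the central log repository described above, so that permission drift shows up alongside other audit data.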

Shared infrastructure is a challenge in all environments when assessing security solutions. A shared infrastructure means that additional precautions must be taken in implementation and security policies to ensure that data and resources are used only when expected and only by authorized individuals. When planning a shared environment, the initial process should begin with a risk assessment to understand what components of the solution could be exploited and what the costs in time and money would be if that were to occur. That risk assessment can then be used to ensure the proper safeguards are implemented with available technologies to reduce the risk to a manageable and acceptable level for the company. Ultimately, all safeguards should be implemented in a way that limits the potential for accidental failures and reduces the need for manual administration and intervention. Shared resources are a challenge but, when properly managed, can ensure better overall utilization for a company without sacrificing security.

Risk management is an important component of a complete security plan for any company. In the area of cyber security this often has two fronts: assessing security threats and documenting responses. Both are equally valuable and, if planned correctly, can ensure that no matter what threat a company faces, there are processes in place to properly manage, communicate and eliminate it. In today's security environment, a threat can mean a variety of things, including viruses, data compromises, lost laptops, network intrusion attempts, insider threats and physical compromises.

This risk planning also has purposes beyond planning for and responding to threats. The information, once gathered, can also be used as a basis for understanding the risk around different types of threats. Lower-level threats often carry so little risk that responding to all of them would be a waste of company resources, yet more complex attacks require a faster, more urgent response. These risk assessments can also help staff plan the appropriate solutions around patch management, firewalls, network controls and other tools meant to stop intrusion. By properly understanding threats and their potential impact on production services and staff time, a proper mitigation plan can also be developed.

Threat Matrix
Planning for threats can often be a daunting task, even for the most seasoned of security professionals. The challenge comes from the inability to know exactly what threats are in the wild every day, and from the new threats that are constantly emerging. Many details need to be documented and considered when planning for the various known threats: who is causing the threat and who is the target, how the attack is being carried out, what safeguards are affected, what changes will be needed to eliminate the threat, and what the cost of the threat is, both in responding to it and if it is successful.

By planning and carefully documenting the process for responding to known threats, we develop experience that can then be used for responding to unknown threats. The following questions and information gathering can assist in developing this threat matrix. This matrix serves as a starting point when responding to known threats, and will be added to as new threats are encountered.

Who is causing the threat?
This question can have multiple components. It looks at whether the threat is being caused by someone internal or external, and whether it is caused by a person or by rogue software. This is an important point to assess for all threats, to ensure that the safeguards added are in the correct place to mitigate the threat and that resources are in the appropriate places to respond to it.

Who is the target?
A proper understanding of the target is important so that the impact of the threat is understood. From the target we can ascertain whether sensitive customer information is at risk, whether availability of public services is at risk or whether we are at risk of a legal compliance issue. By "who", I specifically mean which server, host, application, database, router, firewall or other device could be attacked.

This information can also be used to track developing patterns of attack. As these threats are rolled into response plans, a plan of documentation can ensure that patterns of threats are tracked and managed properly.

Avenue of attack?
This question evaluates how the threat is affecting a company's infrastructure. This could be through technical avenues, such as outside network connections or email, but it can also be physical, such as a person inside or outside your building. Understanding the avenue of attack is critical so that an attack response does not cause undue outages to other portions of the infrastructure or unnecessary outages to customer-facing services.

Safeguards affected?
It is important to understand what safeguards are potentially compromised by a given threat. This could include firewalls, application validation checks, database encryption or filesystem encryption. Understanding the affected safeguards will later allow processes to be developed that mitigate the threat as quickly and efficiently as possible by understanding how best to stop the threat.

Changes to stop the threat?
This is a detailed list of the configuration or other changes that will need to be implemented to stop the threat. These are used to develop the process in the Response Matrix to slow and eliminate the threat. Understanding these responses is also important so that a risk versus reward analysis can be done. Often, the change needed to eliminate the threat is so drastic that other problems with services result. By understanding what changes are required, management can make informed decisions about whether to ignore the threat or which response to use against it.

Cost of responding?
Understanding the implications of a threat is important when developing an appropriate level of response. The cost of responding can be communicated in multiple ways, including cost in dollars or cost in time. All are important factors when deciding on response plans and the level of risk associated with each plan.

Cost if threat is successful?
The other important cost associated with responding to a threat is the cost if the threat is successful. This could mean many things depending on the type of threat; it could be an outage in customer facing services, a loss of customer data or the financial impact of not providing the services to customers.

Current mitigation plan in place?
This item assesses the safeguards that are in place to mitigate a given threat. This can include firewalls, security patches, passwords, identity management solutions or a host of application layer safeguards related to data scrubbing and input validation.

By no means are these all the items that should be assessed in a threat matrix. These are the most common ones that most companies will have documented for all threats. Additional information can be included in the threat matrix for specific applications, operating systems, network types and levels of data that a company processes. When developing a threat matrix, a company should evaluate all applications, hosts, networks, network connections and associated tools. By evaluating these items a list can be developed that includes the items that could be attacked, and what methods could be used in an attack.
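
One way to picture the threat matrix is as a list of records with one field per question above. The Python sketch below is only an illustration under that assumption: the class name, field names and sample entries are all made up for the example, and costs are shown in dollars although staff hours would work equally well.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreatEntry:
    """One row of a threat matrix; fields mirror the questions above."""
    name: str
    source: str                 # who is causing the threat
    target: str                 # server, host, application, database, router, ...
    avenue: str                 # how the attack is carried out
    safeguards_affected: List[str]
    changes_to_stop: List[str]
    cost_of_response: float     # dollars in this sketch
    cost_if_successful: float
    current_mitigation: List[str] = field(default_factory=list)

    def response_exceeds_loss(self) -> bool:
        """True when responding would cost more than the loss it prevents."""
        return self.cost_of_response > self.cost_if_successful


def rank_by_exposure(entries):
    """Order entries so the most costly successful threats come first."""
    return sorted(entries, key=lambda e: e.cost_if_successful, reverse=True)


# Illustrative entries only; every value below is invented for the example.
matrix = [
    ThreatEntry("port scan", "external software", "public web server", "network",
                ["firewall"], ["tighten firewall rules"], 500.0, 200.0,
                ["firewall"]),
    ThreatEntry("stolen laptop", "external person", "staff laptop", "physical",
                ["filesystem encryption"], ["revoke credentials"], 2000.0, 50000.0),
]
```

Keeping the two cost fields side by side supports the risk versus reward analysis mentioned earlier: entries where `response_exceeds_loss()` is true are candidates for a lighter response.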

Response Matrix
After defining the known threats, a response matrix can be developed specific to the company's needs and risks. These needs and risks can be calculated from the cost portions of the threat matrix developed above. They can be used as a basis for planning resources around responses, legal obligations when threats are encountered and documentation policies around responding to threats. This response matrix should contain detailed procedures for responding to several categories of threats.

The initial component of all response matrices is a list of types of incidents. This can usually be broken down into the following categories, known as threat types:

Known incidents that have previously been experienced and have a documented response plan
Known incidents that have not been experienced, but have a response plan in place
Unknown incidents that do not have an associated response plan

Responding to threats and attacks must be a methodical process that encompasses many challenges, including speed, communication, documentation and follow-up. All of these must be managed while ensuring that customers and staff are impacted as little as possible. Responding must be a coordinated effort between the various teams within a company that are responsible for system administration, data security, compliance, network administration and enterprise architecture.

Threat Type 1
Threat type one is the category that a company will know the most about and be able to plan a detailed response for. This category covers threats that have previously been responded to and resolved. Each time one has occurred, a follow-up meeting should have been held to revise and improve the process for the specific threat.

Threat Type 2
The threats that will be listed in the second category will include well known exploits and attack avenues as well as threats that other companies have actively faced. These threats will also be listed in the threat matrix, although with a side note that the company has not previously had to respond to them, but does anticipate the threat.

Threat Type 3
This category is often the most complex to respond to because the actual threats are unknown. The process for responding to unknown threats must be dynamic enough to handle a wide range of threats properly, but rigid enough that legal implications are handled and communication channels do not break down in the face of unknown threats.
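
The three threat types reduce to two yes/no questions: is there a response plan, and has the incident been experienced before? A small Python sketch of that decision follows. The enum and function names are hypothetical, and treating an incident that was experienced but never documented as type 3 is an assumption of this example rather than something the categories spell out.

```python
from enum import Enum


class ThreatType(Enum):
    KNOWN_EXPERIENCED = 1   # seen before, documented response plan
    KNOWN_ANTICIPATED = 2   # not yet seen, but a response plan is in place
    UNKNOWN = 3             # no associated response plan


def classify(has_response_plan: bool, previously_experienced: bool) -> ThreatType:
    """Map an incident onto one of the three threat types."""
    if not has_response_plan:
        # Assumption: without a plan the incident is handled as unknown,
        # even if something similar has been seen before.
        return ThreatType.UNKNOWN
    if previously_experienced:
        return ThreatType.KNOWN_EXPERIENCED
    return ThreatType.KNOWN_ANTICIPATED
```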

Internal versus External
One important component of all response plans is understanding whether the threat you are planning for is internal or external. Internal threats come from staff who are intentionally out to cause the company harm, or from systems that are set up in a way that allows staff access that was not intended and that subsequently has negative consequences.

Whether the threat is internal or external often plays an important role in how the company responds. If the threat is internal, it is often important to bring in outside resources to assess the problems and develop a mitigation plan, ensuring that the company is not vulnerable to future insider threats.

Response Teams
Another key component of all response matrices is a carefully planned list of individuals and teams to be included in response activities. As a company, you should evaluate whether you have the appropriate level of technical capability in house to respond to known and unknown threats, as well as what additional capabilities need to be brought in when responding. Outside resources could be technical staff specialized in security, marketing staff focused on public relations issues, or even law enforcement to track down the source of threats. Today many companies also ensure that, when responding to major threats, legal counsel is brought in to ensure compliance with data handling and reporting requirements.

Response Training
Each type of threat that might be responded to should have associated required training for staff. This could include computer forensics, data analysis, legal implications or technical skills. This training, done at regular intervals, ensures that staff have both a process and the proper training to respond to threats effectively.

While these are not all the possible categories for each response within the response matrix, they begin to provide a basis for response planning. Additional items can be included in the response plan based on company specifics, industry legalities and management preferences. All response plans should be detailed enough that staff have clear directions to follow in possibly chaotic situations. Response plans should also have regular reviews of the process to ensure they are updated to reflect changes in company management structure, changes in technology and changes in industry trends.

Response Methods
The majority of this document has focused on defining a matrix of threats that a company's information systems face. These threats can then be used as a basis for defining responses in a coordinated fashion. While most of the writing was about defining manual processes, this is only the first step toward automating the responses and processes. After defining the manual processes for responding to threats, a system of automation can be put in place for the responses where it makes sense. Automated solutions work very well for defined threats that have clear responses. A good example is a service on the company network being attacked by an outside system: an automated system that blocks the source of the attack and notifies staff ensures that the threat is immediately contained and that staff are notified in a timely manner.
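
The block-and-notify example can be sketched in a few lines of Python. This is a simplified model, not a working intrusion prevention system: the class name, the failure threshold and the `notify` callback are assumptions for illustration, the blocking step only records the source in memory, and a real deployment would update a firewall and send mail or pages instead.

```python
from collections import Counter


class AutoResponder:
    """Block a source after repeated failed attempts and notify staff once."""

    def __init__(self, threshold, notify):
        self.threshold = threshold   # failures allowed before blocking
        self.notify = notify         # callable taking a message string
        self.failures = Counter()
        self.blocked = set()

    def record_failure(self, source):
        """Record one failed attempt; return True if this call triggered a block."""
        if source in self.blocked:
            return False             # already contained, no duplicate alert
        self.failures[source] += 1
        if self.failures[source] >= self.threshold:
            self.blocked.add(source)  # placeholder for a real firewall update
            self.notify(f"blocked {source} after "
                        f"{self.failures[source]} failed attempts")
            return True
        return False

    def is_blocked(self, source):
        return source in self.blocked


# Usage: collect notifications in a list instead of emailing or paging staff.
alerts = []
responder = AutoResponder(threshold=3, notify=alerts.append)
for _ in range(3):
    responder.record_failure("203.0.113.7")  # address from a documentation range
```

Sending exactly one notification per blocked source keeps the alert stream readable, which matters when staff are also expected to handle threats that have no automated response.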

Automating the response plan can also aid in the communication of threats and coordination of activities. A variety of tools are available for tracking incidents; most include the ability to automatically notify the correct staff about status changes of an incident, and provide automated methods to escalate issues between groups and individuals. These tools ensure not only smooth communication during a normally chaotic time, but also a good audit trail for reviewing an incident afterward and planning better responses for the next time.

Responding to security threats within a company can often be a chaotic time. The more time that is spent up front identifying threats and developing response processes, the more effectively a company can both understand and respond to threats. A clearly defined process for threat response can ensure that no steps are missed, lessons are documented for future use and communication between teams is effective and efficient. The constantly changing security threats in today's environments mean that process is critical to ensure staff are prepared and respond accordingly to all threats, known and unknown.

Monday, November 17, 2008

Security Planning in HPC

Monday, November 3, 2008

Security Threats and Response Plans

Joey Jablonski
