Merging Business and IT: 2009

Wednesday, October 28, 2009

Risk Workshops

Risk workshops are an important part of managing risk for all projects. They are typically done at the beginning of a project and any time a major change is made to the requirements or acceptance documents for the project. Risk workshops are brainstorming sessions to develop a list of risks for a project and determine the factors associated with those risks so that they can be mitigated.

Throughout this posting I use the word project a lot. In this context, a project is a defined set of activities with a given beginning and end. Companies commonly use projects to take a unit of work and manage it properly to completion.

A Risk Workshop is a sit-down, face-to-face, thorough review of all aspects of a project. This commonly includes delivery schedules, acceptance plans, project financials and contracts associated with project delivery. The primary purpose of the risk workshop is to allow both project participants and outside observers to brainstorm all possible risks that could come up and come to agreement on risk level and mitigation strategies.

There are two priorities for all risk workshops. All participants should enter with these at the top of their mind as they look for mitigation strategies and project plan changes:

Protecting the company from financial loss. This can include penalties for delays or missed features, or having to redo work because of low quality.
Delivering the project on time. All risk mitigation strategies should be worked out together with the project plan so that estimates are realistic.

The primary deliverable for all risk workshops will be a risk workbook. The most common categories to track for each risk in this workbook are:

Risk Number – A unique identifier for each risk within the project so that team members can clearly communicate about each individual risk.
Raised By – The name of the individual who first brought up the risk. This should be tracked in the event clarification is needed about the risk or impact.
Date raised – The date that the risk was first discussed. All notes associated with the risk should have associated dates as well to track the progression of the risk discussions.
E/B/C – Engagement/Business/Customer – This category defines the risk type. Engagement risks are associated with contractual details or the relationship between customer and vendor. Business risks are risks that a product lifecycle may change or priorities may shift for a team. Customer risks are associated with delays on the customer side, either because of pre-requisites not being met or changes to the customer's requirements.
Description – This is the detailed description of the risk, what it affects and any supporting details to what could trigger it.
Risk Cost – This is the monetary cost of correcting the risk should it become a problem during the project. This includes time, facilities, and all associated resources needed to resolve the risk should it become a problem.
Risk % – This is the chance that the risk will occur during the project. It is paired with the risk cost to determine a risk budget for the project.
Mitigation Strategy – This category defines the solution for mitigating the risk. This should define what steps will be taken to lessen the chance of the risk turning into a problem. This could include additional staff on the project, earlier testing, or a change in architecture.
Mitigation Cost – Mitigation cost documents the cost to minimize the chance of a risk occurring. This cost is then compared to the chance of the risk occurring and the cost of the risk occurring to determine whether the mitigation cost should be spent, or whether the project should continue and manage the risk if it does occur.
Risk Owner – The risk owner is the individual that best understands the risk and associated mitigation strategies. This is most commonly the person responsible for monitoring for the risk occurring and documenting the risk mitigation strategies.
Risk Trigger – Not all risks will become problems and impact a project. The risk trigger is what defines when the risk does become a problem so that staff can take steps to address the problem.
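
To make the relationship between Risk Cost, Risk % and Mitigation Cost concrete, here is a minimal Python sketch of one workbook row. The field names, the sample numbers, and the idea of recording an estimated chance of occurrence after mitigation (residual_probability) are my own assumptions for illustration, not part of any standard workbook format.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One row of a hypothetical risk workbook (field names are illustrative)."""
    number: int
    description: str
    risk_cost: float             # cost to correct the problem if it occurs
    probability: float           # Risk %, expressed as a fraction from 0.0 to 1.0
    mitigation_cost: float       # cost of carrying out the mitigation strategy
    residual_probability: float  # assumed chance of occurrence after mitigation

    def expected_loss(self) -> float:
        """Probability-weighted cost if nothing is done."""
        return self.risk_cost * self.probability

    def expected_loss_if_mitigated(self) -> float:
        """Mitigation spend plus the probability-weighted cost that remains."""
        return self.mitigation_cost + self.risk_cost * self.residual_probability

    def mitigation_worthwhile(self) -> bool:
        """True when mitigating is expected to cost less than accepting the risk."""
        return self.expected_loss_if_mitigated() < self.expected_loss()


# Sample values are made up for the example.
late_hardware = Risk(
    number=1,
    description="Customer hardware arrives late",
    risk_cost=40_000,
    probability=0.25,
    mitigation_cost=3_000,
    residual_probability=0.05,
)

print(late_hardware.mitigation_worthwhile())
```

A comparison like this is only one input to the decision; the workshop participants still need to agree on the estimates themselves.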

The first part of any risk workshop is to discuss the objective and purpose with the participants. All risk workshops should begin with a discussion of why the team has come together and what deliverable is expected at the end of the meeting. This deliverable will most often be a risk workbook containing all risks and their associated risk level, potential cost and mitigation strategies. All risk workshops should set time limits so that a long discussion of one risk does not overwhelm the meeting time. The time limit also ensures everyone has a chance to speak on the topic. If consensus is not reached in that time frame, someone delegated by management should be responsible for getting input from all parties and making a decision on the risk level and other details.

I cannot remember a project that I have worked on that had zero risk, or a list of zero risks. All projects have some level of risk, and the purpose of a risk workshop is to clearly define those risks and the plan for avoiding delays because of them. A long list of risks coming out of the risk workshop shows that the team was successful in thinking of possible pitfalls and mitigating them. The purpose of the meeting should not be to have a list of zero risks; that is not the same thing as a zero risk project.

As part of the risk mitigation portion of the risk workshop, there are two primary strategies for handling high risk components of a project:

Redesign – Often a design can be redone to limit or minimize the risk of a project. The redesign may have other impacts, including cost of delivery or schedule impact, that must be weighed against the potential risk.
Risk Mitigation – Mitigation is the most common strategy for managing risk. This is the early planning of how to handle a risk, should it become a problem. Mitigation often involves having clearly defined paths for escalation to other teams or additional resources available.

Ultimately the risk workbook will be used to develop a risk budget. This risk budget will be built into the project financials to ensure adequate resources are available to respond to risks if they do become problems, as well as providing funding to cover risk mitigation as necessary.
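
One simple way to turn a workbook into a risk budget is to add up the probability-weighted cost of every risk, plus the mitigation spending the team has agreed to. The Python sketch below assumes that approach; the dictionary keys, the "mitigate" flag and the sample figures are hypothetical, and a real project may well weight or pad the numbers differently.

```python
def risk_budget(risks):
    """Return contingency (risk cost x probability) plus approved mitigation spend."""
    # For a risk being mitigated, "probability" is assumed to already
    # reflect the reduced chance of occurrence after mitigation.
    contingency = sum(r["risk_cost"] * r["probability"] for r in risks)
    mitigation = sum(r["mitigation_cost"] for r in risks if r["mitigate"])
    return contingency + mitigation


# Made-up workbook rows for illustration.
workbook = [
    {"number": 1, "risk_cost": 40_000, "probability": 0.25,
     "mitigation_cost": 3_000, "mitigate": True},
    {"number": 2, "risk_cost": 10_000, "probability": 0.5,
     "mitigation_cost": 8_000, "mitigate": False},
]

print(risk_budget(workbook))  # 10000 + 5000 + 3000 = 18000.0
```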

Risk workshops are a critical component of all successful projects. A risk workshop allows all interested parties to express any risks they foresee and discuss how to properly plan for and mitigate these risks. Risk workshops should not consume an unlimited amount of time, but should allow everyone to express an opinion on risk levels and allow that to be documented in the risk workbook for the project. Risk workbooks are living documents for the duration of a project and provide a single reference for developing the risk budget and showing mitigation strategies for a project.

Saturday, October 3, 2009

Time scheduling for IT Staff

Information Technology (IT) staff often must juggle the daily demands of user requests and repair activities with long term projects like upgrade testing, capacity planning and new feature evaluation. These two distinct types of work are difficult to juggle, in addition to a never ending array of meetings, office interruptions and service outages. Many IT jobs today are high stress, both because of the level of work to be completed and because of the chronic mismanagement of time, which creates both higher stress levels and lower productivity levels.

As with all professions, the goal of time management, for both staff and management, should be to minimize context switching. A context switch occurs each time a person must change from one task to another; this can include changing project focus, phone calls, office interruptions or stopping a task to go to a meeting. By limiting context switching, IT management can allow more time for staff to focus and provide them clearer blocks of time to complete their work in a more efficient way.

It is quite common within the IT space to schedule meetings mid-day as well as pull staff into meetings during the day. This is quite disruptive and often not necessary. It is important that managers within IT organizations clearly define what constitutes an emergency and how to properly justify pulling staff away from their daily work load versus planning for a meeting in the future.

Suggestions for minimizing interruptions and increasing time utilization:

Meeting Free Days – Blocking out days specifically for meetings will allow the remaining days to be used by staff to focus, free of interruptions, on long term projects, research and other work that is more efficiently completed during a focused period of time.

Set Aside Time for Ticket Based Work – It is very common for IT organizations to have a ticket tracking system to handle incoming requests and common tasks. This should be monitored by a dedicated person; if that is not possible, time should be set aside for other staff to monitor it. Tracking and managing many small requests in the middle of project based work is very disruptive and negatively affects productivity on the long term projects.

Clearly Defined Office Hours – Clearly defining staff's office hours can set a stage for limiting interruptions to minimal times within the day and giving staff dedicated time for focusing on ticket based work and project based work. This will ensure that staff are available for drop in discussions, but that these do not dominate their available time.

Staff Privacy – One method to ensure IT staff can focus and use their time properly is giving IT staff a private office and workspace. All IT jobs require some level of collaboration, but they also require time to focus on projects and work as an individual. This focus requires a place free of interruptions like ringing phones, conference calls, others talking in the hallway and side discussions.

Within IT, time management is important to ensure staff can properly focus on both daily needs and long term projects and goals. By minimizing context switching through the use of dedicated blocks of time, staff can have better focus and concentration on their projects, helping ensure on-time completion with minimal delay and interruptions.

Saturday, September 19, 2009

Importance of Code Reviews

Code reviews are an important part of the software development process. They are the period during development where a more senior team member reviews the code written by another team member, prior to submission into a company's version control system. Code reviews are a formal process both to improve the quality of submitted code and to allow for mentoring of all developers on the team.

Any time a piece of code is being submitted for eventual inclusion in an application, a code review should be part of the process prior to formal inclusion. This ensures that a minimum of two people review all changes to the software to check for defects. This code review process also ensures that knowledge is duplicated within the enterprise to better manage project transition and long term support responsibilities for all applications.

There are several primary areas that should be of focus for all code reviews:

Company Coding Standards
All companies should have standards for software development. These should include the libraries used during development, the documentation of the code base and the languages used for development. This is the first item that should be reviewed during all code reviews. Reviewing all code for adherence to company standards ensures that all team members not only follow the standards, but also have a chance to learn any standards that they may not be aware of or that may have changed.

Company Enterprise Architecture Standards
In addition to company coding standards, all firms should have a formal set of Enterprise Architecture (EA) standards. These often include how data is stored, managed, tagged, backed up and secured during transport and manipulation. All code reviews should ensure that new code being submitted follows existing company EA standards for ease of interoperability, as well as long term software life cycle management.

Mentoring
Mentoring is a key component of all code reviews. Code reviews allow senior staff to review the code of their teammates and provide them suggestions for improvement based on experience. This mentoring is key to ensuring better long term quality from all produced code, as well as to providing staff a path for development. Each staff member that is having their code reviewed could potentially be reviewing code in the future, so it is key that this mentoring process be official, and an important part of the software development team's culture.

Security
In today's IT environments, security is a critical component of software development. All code reviews should include a portion of time for reviewing security to ensure that input and variables are handled securely, that temp data is cleaned up properly and that host to host communication is handled in a secure fashion, just to name a few.

Security is a complex topic, especially in the software development arena because of the wide range of attacks, challenges and threats. Code reviews allow for a formal process to ensure common mistakes are not made, previous mistakes are not made again and that staff have a forum for discussion of implementation details.

Scalability
Today, many applications are scaling to levels of usage never envisioned when the application was first written. This causes many problems for both the administrators of these applications and the users. Code reviews should ensure that applications are properly handling resources like CPU time, system memory and disk bandwidth so that the application can properly scale over time. Scalability is a combination of many components, some the responsibility of the developer and some of other IT administrators; code reviews should ensure that all code written is properly prepared to scale over time and handle even the most extreme loads on the system.

Coding Quality
Ultimately, the final key of all code reviews is ensuring quality. Quality can come from many aspects of the code base including documentation, ease of understanding of the code and the maintainability of the code. These are all key aspects that if properly addressed and corrected during code reviews can ensure not only better developers, but more manageable code over time.

Code reviews are an important process component for all companies developing software, either for internal use or external sale. Code reviews ensure that staff are formally mentored on the code they contribute, allowing them to increase their skills and experience as developers and become more valuable to the organization over time. A side effect of this mentoring is higher quality code submissions, with fewer defects and better long term manageability of the code base.

Sunday, September 6, 2009

Migrating Applications between OS Platforms

At some point in time most Information Technology (IT) departments have had to migrate an application or service from one platform to another; in this case, by platform I mean a different operating system. This is most often driven by a cost savings that can be obtained on the new platform, either through lower hardware maintenance costs or lower support costs for the software on the new platform. The challenge with these migrations is that often the application is stable on the existing platform, and any migration carries the risk of introducing instability.

The points of review documented below are not specific to any operating system (OS) on the market, but rather are a guide for migrating from any single OS to a different OS. Currently the IT world is seeing the largest percentage of these types of application migrations from UNIX-based platforms to Linux-based platforms. Just because this is occurring now does not mean this will always be the most common migration path; in time a new OS could come on the market providing advantages not currently available.

Many modern programming languages are portable in the sense that applications written in them can very easily be migrated from one host OS to another. This is not true for some legacy programming languages; this framework is meant to cover both of these cases. Even with modern programming languages, some underlying libraries can vary from OS to OS and will require detailed migration planning.

Below is a framework for the process for reviewing the application being migrated and developing a plan for the migration. This framework is structured to ensure that the same steps can be used, regardless of the original and future OS.

Application Source Code
When initially reviewing an application to migrate from one OS platform to another, the source code must be checked from a process, availability and legal standpoint. This is the first phase to determine if the application can even be migrated to a new platform.

Is the source code available?
This is often an overlooked component of legacy applications. Often the source code is not available, either because it was lost or because the intellectual property for the application has been transferred to another party. This is an important part of porting an application, and a lack of source code can force the team to look at alternate applications or to develop the application from scratch.

Legal obligations?
As part of reviewing the availability of the source code, it is also important to review legal obligations around that source code. Specifically open source applications often have requirements for submitting changes to the community, depending on the usage model of the application. These legal obligations are also important regarding trademarks, copyrights, and their implications on staff that previously worked on the application being reviewed.

Review of Application Source Code
After determining whether the application source code is available, what changes can be made, and how those changes must be communicated to any external parties, it is time to review the source code technically to develop a plan for the migration and porting activities later.

What language?
Looking at what language the application is developed in is a critical first step. This will enable the planning team to determine if the company has the necessary skills to port the application, or if external resources will be needed for the migration. Knowing the language can also assist with planning supportability on the new OS based on how well the language is supported and used in the community.

What libraries?
As part of reviewing the source code, a review of the libraries used should be done. This review should be done to ensure that the libraries will properly work on the new OS, that they are still available, and that they are compatible with other libraries that will need to be installed. This is the time to ensure no dependency problems are found later in the migration.

Deprecated calls?
The source code review should also include an assessment of what calls and functions are now deprecated; this can include external libraries, kernel functions and other external resources. Any section of the application code that references deprecated functions should be reviewed to determine the best supportable path forward to ensure that functionality is not compromised.
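
A first pass at finding deprecated calls can be automated with a simple text search before anyone reads the code by hand. The Python sketch below assumes a C code base and uses a short, illustrative list of function names; the list, the function names find_deprecated_calls and scan_tree, and the file suffixes are all placeholders that would need to be replaced with what is actually deprecated on the target OS and library versions. Because it is a plain regular expression search, it will also flag matches inside comments and strings.

```python
import re
from pathlib import Path

# Illustrative list only: substitute the calls deprecated on your target platform.
DEPRECATED_CALLS = {"gets", "tmpnam", "bzero"}

# Match a listed name as a whole word followed by an opening parenthesis.
CALL_PATTERN = re.compile(r"\b(" + "|".join(sorted(DEPRECATED_CALLS)) + r")\s*\(")


def find_deprecated_calls(source_text):
    """Return (line_number, call_name) pairs for each listed call found in the text."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits


def scan_tree(root, suffixes=(".c", ".h")):
    """Scan every matching file under root and map file path to its list of hits."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_deprecated_calls(path.read_text(errors="replace"))
            if hits:
                report[str(path)] = hits
    return report


sample = "char buf[8];\ngets(buf);\nfgets(buf, 8, stdin);\nbzero(buf, 8);"
print(find_deprecated_calls(sample))
```

The output of a scan like this is a starting list for the review, not a replacement for it; each hit still needs a person to decide on the supportable replacement.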

Define Testing and Roll out Strategy
Now that the source code has been reviewed, it is time to define success for the migration. This component of the process is to ensure that relevant metrics are clearly defined for the time period of the migration, and after the migration so that staff using the application are not negatively impacted by the migration.

Data Integrity
Defining data integrity standards should be the first metric for all migrations from OS to OS. This is critical to ensure that data is consistent both during the migration, and handled in the proper way after the migration. A migration of an application from one OS to another should not ever require the compromise of data integrity standards.

Functionality
Second to data integrity is functionality. Staff become used to the tools they use on a daily basis, and any change in the capability or functionality of those tools can cause a significant drop in productivity. All migrations should include reviews to ensure that all utilized features will continue to be available for staff to utilize.

Performance
Performance is an important metric to define prior to migrating an application from one OS to another. Performance can change dramatically between OS platforms, so the migration plan should include both testing and proper application tuning. Performance can include many metrics including response time, report generation time and response time under heavy loads.

Security
While one OS is not necessarily more or less secure than another, each has its own methods for setting permissions, logging system activity and patching against known vulnerabilities. The migration plan should include a proper review of these differences to ensure that staff are properly trained to handle securing the application once it is running on the new OS.

Stability
Stability is commonly defined as uptime or availability of an application. Introducing a new OS to an environment can change the availability characteristics, either because of new, unfamiliar processes, or because of a misplaced expectation about an OS's capability. A plan should be developed to define what availability is required of the application and to document how those metrics will be monitored.
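
When defining the availability requirement, it can help to translate a percentage target into the amount of downtime it actually permits. The following Python sketch does that arithmetic; the function names and the sample targets are my own, and the default period assumes a 365 day year.

```python
def allowed_downtime_hours(availability_percent, period_hours=365 * 24):
    """Hours of downtime permitted in the period for a given availability target."""
    return period_hours * (1 - availability_percent / 100)


def measured_availability(period_hours, outage_hours):
    """Availability actually achieved, as a percentage of the period."""
    return 100 * (period_hours - outage_hours) / period_hours


# Example targets chosen for illustration.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% allows about {allowed_downtime_hours(target):.2f} hours per year")
```

Comparing the measured availability on the old platform with the target for the new one gives a concrete pass or fail metric for the migration.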

Porting of Code
After defining the above metrics, we can begin the longest portion of any application migration: the actual porting and testing of the application on the new OS platform. This phase will include both making modifications to the code base to ensure it works on the new OS platform and testing the application on the new OS platform to ensure it properly meets the metrics defined above for success.

Maintenance Cycle Definition
During the porting of the application, data can be gathered about necessary maintenance that will need to be done regularly on the new platform. This maintenance cycle will need to include time both to apply patches to the underlying OS and to do maintenance on the data supporting the application. This maintenance cycle should be defined prior to roll out so that staff can be properly trained on it and end users can be prepared for a possible change in availability policies for the application.

Update DR Processes and Tools
Disaster Recovery (DR) is an important component of all application migrations, ensuring that a proper plan is in place to recover from catastrophic failures and ensure the data and application are available for use. As part of the application migration, the DR processes should be reviewed to adequately reflect the changes in how the application is hosted and what precautions should be taken for backup, replication and training for recovery.

Training
Training is a two part activity: both the administrators for the application and the end users will need to be trained on the changes in administering and using the application. Training should be provided to the appropriate staff prior to migrating the application; this will ensure that staff are ready for all changes that come as part of the migration. Training material should additionally be made available for staff to reference after the migration to answer questions that come up about the migration.

Application Roll out
After the above metrics for success are defined, the code is ported and tested and staff are trained, the application migration can be completed. This migration will include the migration of any necessary data for the application, as well as the application delivery infrastructure. This migration can be done in phases if the architecture of the application will support it, or may require an extended outage to properly migrate and test all components.

Migrating an application from one hosting OS to another is a common practice, yet very often it is done with very little planning. As IT continues to evolve, it is inevitable that new OSs with innovative features will become available, creating the need to migrate applications between them. Keeping a solid process that is followed each and every time will ensure stability in the migration, integrity of the data and continued productivity of the end users.

Monday, August 10, 2009

Importance of IT Audits

In today's business world, many companies think that audits are constrained to company financial reports only. Companies often overlook external audits as a way to increase security and productivity and reduce costs for other company operations, most notably information technology (IT). Financial audits are required by law for many types of companies, particularly publicly traded firms. This formal audit done on a regular basis ensures that end consumers of this financial information can be confident that the results are consistently reported according to industry standards.

IT is often overlooked in audits; most commonly the only IT components reviewed are the specific applications for housing financial data and the software for reporting that data. All companies can benefit from a change in this mindset and begin to utilize regular external IT audits as a way to provide a neutral, third-party opinion about the controls and safeguards in place for the IT systems that a company relies on to conduct business.

Audits can provide a variety of useful information to an organization, but most importantly they reduce the risk associated with unknowns within an IT environment. Audits allow a company to say with confidence that their controls and safeguards meet industry standards. Regular audits ensure that each year controls and safeguards are updated to accommodate changes in industry standards for IT operations.

Thorough audits cover a variety of components of an IT environment, both technical and procedural. The most important part of an IT audit is not the validation of the processes themselves, but the thorough testing of the environment to determine that everything is configured per the policies, and that everything is configured per industry standards. The second check ensures an IT environment that can be compliant with legal requirements, and safe from the most common and expected threats.

When looking for an audit firm to complete an IT audit, here are the most common items you should ask about, including how the firm handles, reviews and reports on each:

Staff Competencies

The technical skills and experience of the audit staff are the most important part of an external IT audit. The external firm should be reviewed to ensure they provide the highest quality of staff, with a diverse background relevant to your organization's needs.

Audit Firm's Reputation

Ultimately, your company is going to rely on the reputation of the audit firm if any part of the audit ever comes into question by partner companies or other organizations. It is important to choose a firm with a solid reputation of quality work, quality reports and the willingness to follow up on questions after the audit.

Security

Security has several angles that must be considered when choosing an external audit firm. The first is the security they will provide for your company's confidential data, both protecting the data they collect while conducting an audit and keeping the audit itself confidential.

Second, the firm must provide a solid review of security within your organization as part of the audit. This audit should include reviewing physical security, security policies, off site storage, data in transit and penetration testing of the network from an internal and external perspective. All audits should cover these aspects of security at a minimum, and use them as a basis for reviewing the rest of the enterprise for compliance with industry standards around encryption, authentication, logging, monitoring, alerting and incident response.

Current Controls

A complete audit will include a thorough review of all controls around access of data, change management, upgrades and staff responsibilities.

Controls include all aspects of change management. Ensuring that a proper plan is in place to approve and track changes will ensure that consequences are fully planned for and recovery plans are in place prior to upgrades, changes or migrations. Outside audit firms can provide experienced third-party recommendations about the level of process and its adequacy within your organization.

Controls also include staff responsibilities and how responsibilities are delegated and enforced through both process and technical safeguards. An experienced audit firm will review these for accuracy as well as implementation details to ensure controls work as designed and are implemented where necessary.

Suggested Controls

As part of the controls review, an experienced audit firm will document controls that are needed, but not currently in place. The recommendations come from experience in the industry, as well as solid knowledge of compliance regulations.

Staff Training

An important part of all technical audits is a review of staff skill sets. Most external audit firms will do a review of current staff and their skill sets; this information will then be used when reviewing recommendations for additional technologies or controls within the organization. It is important that all suggested changes include a required list of skills so that your organization can properly train and equip your staff to implement an outside firm's suggestions.

Company Culture

Often, a company's culture is the reason for non-compliance with accepted industry standards, particularly in IT. External IT audits provide your company an opportunity to have external, experienced professionals observe how your staff operate. The external perspective is often very useful in isolating unanticipated challenges that may come because of a specific culture within your company.

Data Protection

External audits should include a detailed review of how data within your organization is categorized, and subsequently protected from loss and disclosure. This review will be both technical and procedural to ensure that gaps are not present in the current solutions. This portion of the audit should include not only how data is managed on a daily basis within your company, but should also include how data is backed up, replicated and protected from loss in the event of a serious facility failure or loss.

Legal and Compliance

This is often the most difficult portion of an audit because of the highly specialized skills needed to complete a compliance review. Reputable audit firms will be able to provide the necessary legal knowledge as part of the audit to ensure that policies are in accordance with legal requirements. These regulatory requirements are most common in financial services and health care industries.

Cost Analysis

All findings from an IT audit will have specific costs associated with them. These costs could include both the cost to fix the problem with additional training, hardware or software, and the potential cost to the company if the problem is not corrected. Audit firms should be able to work with your organization to determine and document these costs for use in determining a remediation plan and prioritizing the findings from the audit.

Penetration Testing

Most IT audits will include penetration testing of your organization's network, applications, servers and data storage facilities. This is an important part of all audits because it tests the active controls in place and helps identify additional controls that are needed. It is important to find an audit firm with experience with these types of audits; this experience will both increase the potential for findings and limit the chances of adverse consequences during the testing process.

In a perfect world, an external IT audit will cover an entire company, not just specific departments. This provides the most thorough results because an external entity is reviewing all departments and organizations in a consistent manner and providing documentation to senior management of how the various organizations interact and affect one another. Oftentimes, companies will do focused audits, only looking at a specific department or subset of the IT infrastructure. While these can yield important information, they should be used with caution because they can miss other important areas for improvement.

Finally, be open-minded at the end of any audit when reviewing the results from the external firm. It is possible that you will be shocked after the first audit at the sheer number of findings. This is not necessarily bad. A long list of recommendations could show that the firm doing your audit was very thorough, and it provides you with a solid basis for improvement. The most important thing to watch for when reviewing audit results is repetition: you want to make sure that a long list of recommendations is not repeated in subsequent years. Use the list as a chance to improve so that the audit firm is not finding the same problems year after year.

Monday, August 3, 2009

Lustre 1.8 and Pools

Beginning with Lustre 1.8, the concept of pools was introduced. Pools are a method for isolating groups of OSTs based on common characteristics. This is most commonly used to group OSTs based on similar hardware type or RAID configuration. An example would be to have a pool of very high performance SAS disks and a lower performance set of SATA disks within the same filesystem. Pools allow users to specify which pool their files are read from and written to.

Next to each section of commands is the system they must be run from.

For these commands, 'lusfs01' is the name of the Lustre file system. The example pools are named pool1 and pool2, and we have a total of 10 OSTs within this file system.

Creating a new pool (MGS)
# lctl pool_new lusfs01.pool1
# lctl pool_new lusfs01.pool2

Assigning OSTs to a pool (MGS)
# lctl pool_add lusfs01.pool1 lusfs01-OST000[0-3]_UUID
# lctl pool_add lusfs01.pool2 lusfs01-OST000[4-7]_UUID

Listing Available pools (MGS)
# lfs pool_list lusfs01

List OSTs in a given pool (MGS)
# lfs pool_list lusfs01.pool1
# lfs pool_list lusfs01.pool2

Setting a file/directory stripe to use a specific pool (Client)
# lfs setstripe -p pool1 /lusfs01/dir1
# lfs setstripe -p pool1 /lusfs01/dir1/file1
# lfs setstripe -p pool2 /lusfs01/dir2
# lfs setstripe -p pool2 /lusfs01/dir2/file1
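
To make the pool membership from the commands above easier to see, the following Python sketch expands the bracketed OST ranges into individual names and lists which of the 10 OSTs end up in no pool. It is only an illustration: the helper expand_ost_range is hypothetical (it is not part of Lustre), it handles a single one-digit range, and it assumes OST names carry the file system name as a prefix and are numbered 0000 through 0009.

```python
import re


def expand_ost_range(pattern):
    """Expand one bracketed range such as 'fs-OST000[0-3]_UUID' into individual names."""
    match = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracketed range, so the name stands for itself
    prefix, first, last, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(first), int(last) + 1)]


fsname = "lusfs01"  # file system name used in the examples above

# Pool membership as assigned by the two pool_add commands.
pools = {
    "pool1": expand_ost_range(f"{fsname}-OST000[0-3]_UUID"),
    "pool2": expand_ost_range(f"{fsname}-OST000[4-7]_UUID"),
}

# Assumption: the 10 OSTs are indexed 0 through 9.
all_osts = [f"{fsname}-OST{i:04x}_UUID" for i in range(10)]
assigned = {ost for members in pools.values() for ost in members}
unassigned = [ost for ost in all_osts if ost not in assigned]

for name, members in pools.items():
    print(name, members)
print("not in any pool:", unassigned)
```

Under these assumptions, the last two OSTs belong to neither pool, so only files striped without a pool can land on them.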

Tuesday, July 7, 2009

Interviewing in IT – Finding Solid Candidates

One challenge within all IT organizations is finding and interviewing candidates to ensure that those hired into an organization will not only bring necessary skills, but also allow the organization to grow and evolve. Interviewing methods vary from company to company based on a variety of factors including culture, past experiences and human resource department influences. I hope to explore some factors I believe lead to successful interviewing of candidates.

In my mind, there are four primary purposes of interviews:

Determine if the candidate has the proper technical skills to be successful. This includes a combination of technical knowledge and past experiences.
Determine if the candidate's personality is compatible with both existing team members and the company culture.
Determine if the candidate has the necessary willingness to learn new skills and learn from their team to evolve and grow in the position.
(Sometimes) Determine if the candidate has specific domain knowledge or industry connections that can only be obtained by hiring from outside the organization.

Here are the key items I keep in mind when interviewing, as well as encourage others to think about when interviewing candidates for my team and other teams I work with.

How Long?
Don't focus on how long the individual will be at the company; assigning a time frame will divert focus from the more critical aspects of the candidate's place within the organization. Focus instead on what the individual's career goals are. Can those be met at this company? What types of technology do they want to work on and learn? If these are technologies your company uses and will be using, the chances of a long-term candidate increase greatly.

Technology is not the only factor; focus on other aspects of the individual's career goals as well. Are they expecting promotions into management or technical lead positions? These types of advancement opportunities will determine how long the candidate will be at the company.

2-way Street
I have been to many interviews in which the individual conducting the interview forgot that interviews are 2-way streets. While it is important for a company to determine if an individual will perform well within the company structure and culture, it is equally important that the individual get a realistic feel for the company and see the company in a good light. You don't want to risk losing a strong candidate because the focus was too much on asking the individual questions, and not enough on allowing them to see the possibilities your company has to offer.

Brain teasers are fine, but.....
Many organizations today use brain-teaser type questions to understand how a candidate solves problems and how they think and respond in stressful situations. These tests have a valid place within interviews, but need to be used properly to ensure you are getting valid results from them. It is important to only ask brain teasers that the individual will have the domain knowledge to approach answering. I have been in many Network Architect interviews and been asked questions about algorithms for managing large datasets. This was not a good use of time for me or the interviewer. It is important to ensure that even if the candidate cannot answer the brain teasers, they have enough base knowledge to share their thought process for solving the puzzle.

Personality Tests
Personality Tests tend to fade in and out of popularity as an interviewing tool. Many companies argue that they provide a glimpse into the candidate's tendencies and habits and can allow the human resources department to determine their compatibility with existing personality types at the company. The problem is that many of these tests are quite easy to “study” for and this can heavily skew the results. My feeling is that taking the candidate to lunch or coffee with a small group of team members will show much more than a standard test ever will. The key with an interview is to make sure both parties are comfortable enough that they communicate as they would in any standard work situation.

By conducting the interview in places other than your standard conference room, you have a better chance of understanding what the individual's work habits are and how they converse with others. It is easy to watch how a candidate talks with folks at the office; it is an entirely different view to see how they interact with the wait staff at a restaurant, and a very useful data point on their personality.

Work Location
“Remote” working is all the rage these days, especially in IT. You know my thoughts on it from previous postings: I think that it has its place, but not all positions can function effectively while being isolated at home. Deciding if a position is going to be “remote” should be done before interviewing potential candidates. This decision should be based on the type of work this position will be doing, as well as that of the team, and how well that work can be done in a distributed fashion.

Not only does work location include being “remote” or not, it also includes possible relocation of the candidate. It is important to set expectations up front on the company's policies for relocation. If there is no relocation budget and the candidate is three states away, it is probably not worth pursuing.

Finally, it is important to understand an individual's constraints around relocation, both time frames and potential locations. If a candidate does not like cold, yet your company is based in northern Alaska, there is probably little need to continue the interview process.

Technical Questions
I do not suggest jumping immediately into complex technical questions, no matter what job you are interviewing a candidate for. Rather, I suggest working up to the level they appear to be at from their resume. This means checking for base knowledge and experience first, which is also a good way to see how the candidate responds. Do they seem passionate about the work? Do they speak more about knowledge or experiences? Do they speak about where they go to stay current on the industry?

One of my favorite interview tactics is to ask a technical question about a problem we recently encountered. This is a great opportunity to see not only the candidate's thought process, but also what level of detail they go into for solving the problem and what tools they would use to approach it. Continuing to ask questions about the problem will also show a lot about how they respond under pressure and how they communicate with other team members in the face of a stressful situation.

Technical questions also provide a good basis for seeing what knowledge the candidate has about the industry as a whole. Do they understand the benefits of one vendor or another? Or do they focus on using what they know and are comfortable with?

Domain/Vertical Knowledge
Most information technology professionals are specialized; that is, they work in a specific industry such as Financial Services, Oil & Gas or High Performance Computing. Each of these domains has specific tools and applications, as well as industry-accepted methods of accomplishing tasks.

When interviewing it is important to decide up front if you are looking for a candidate that is experienced in your specific domain, or if you are looking for a strong IT candidate that could bring an outside perspective to your organization. This will drive what questions you ask, as well as where you go to look for potential candidates to fill the position.

Interviewing is an art; it takes time to develop a process that works for you and your organization. Starting with hard technical questions will not help you determine if a candidate is qualified, because there is much more to how a candidate will succeed or fail in your organization. You must treat all interviews as 2-way streets; this allows both parties to get an accurate picture of all aspects of the position, including location, expectations, team dynamics and technical knowledge, as well as desire to learn and evolve.

In my mind, technical skills are secondary for today's information technology positions. More important than all the technical knowledge in the world is how well the candidate communicates with team members, responds under pressure and understands trade-offs and benefits analysis, and ultimately the candidate's desire to learn and grow as IT evolves. No matter how technical someone is, if they cannot get along with the team they will be a detriment to the organization.

Monday, June 1, 2009

Talent Management in Information Technology

The Information Technology (IT) sector is evolving very rapidly. Over time it has developed a reputation as a high-stress career field with low personal rewards, little chance of advancement and ultimately low morale. While this is not the case within all IT departments, it has become a common perception of the industry as a whole. This perception has been driven by a variety of factors including fewer staff available to complete tasks and the use of outsourcing for lower cost resources. We as an industry need to take more ownership of our staff and skills and work to develop them internally in a way that provides companies maximum value from their IT organization, while ensuring staff are successful, stable and ultimately happy with their work environments.

First, we should define the “standard” IT employee, and I am not talking about the system administrator that works all hours of the night while drinking Red Bull. I am talking about those traits that make IT staff want to work in the technology field. While this description will not cover all IT employees, there are certain traits that stand out more in the IT space than in other professions:

Curious by nature
Detail focused
Quest for understanding
Prefer the technology to the business
Like to build things
Opinionated
Seeking Recognition

We now have a better understanding of what IT staff do and expect by nature. Let's explore the primary drivers that go into their ability to focus on their job and deliver successfully to the company's bottom line. This is by no means an exhaustive list, but in my experience these are the top items within an IT department that determine whether employees truly enjoy their job, just stay for the paycheck, or look to move on to other opportunities.

Pay – There is a common misconception in IT that folks will work at a given company because it is interesting and exciting work, but this is only true to a point. We all have bills to pay for rent, utilities, food, entertainment, and student loans. An employee paid market average for a given area will often stay at a company if the work is exciting. But no level of excitement and interest is enough to make up for below average pay when an individual is struggling to pay their bills. Companies should strive to ensure their salaries are consistent with the local market for the level of skills an employee is utilizing at work each day.

There are also other costs that go directly with pay. It is very expensive to lose an employee; they take a lot of company experience and knowledge with them. Transferring that knowledge to a new employee is costly, because while they are learning the business they are contributing less to the bottom line. Companies should always weigh the cost of raises that keep staff at the area average against the cost of losing one employee and having to hire another.

Opportunities – Staff within IT organizations are curious by nature, and with that trait they constantly want to expand their knowledge and experience. Some of the most successful staff I have worked with in IT move on a regular basis, not always up the corporate ladder, but most often laterally to other jobs that are of interest to them. This benefits them by increasing their skill sets, and benefits the company because corporate knowledge is not lost when an individual moves within the company. IT staff should be provided opportunities to move both within the IT department and within the company. This movement and change of jobs is often what IT staff need to avoid burnout and stay engaged with their jobs.

Interesting Work – Very few people are content doing the same activity every day. Because IT staff are generally on a constant quest for understanding, most are always looking for new and exciting projects. These allow them to be creative and develop new solutions to problems. Sadly, there are always going to be some tasks that are more interesting than others within IT departments. Effort should be made to ensure that any less-than-desirable tasks are evenly spread across available team members, and that team members understand that while they may have been assigned a less-than-desirable project, so were their teammates.

Staff should be encouraged to not only take on projects assigned to them, but to come forward with ideas they have for improvement within the organization. This encourages all team members to have a stake in the organization and feel ownership of not only their projects, but other tasks that they may see a need for completing.

Flexibility – Employees appreciate when their management allows them the flexibility to work when they are most productive. I am not necessarily talking about allowing staff to work in their pajamas from home, but rather about ensuring employees do not feel tied down to a specific schedule that makes them less productive.

Working from home has become very common in many organizations, especially IT. I believe that a lot of organizations have taken it a step too far and staff are beginning to feel the isolation of working by themselves each and every day. I believe that staff should be provided the tools and flexibility so that if they choose to work remotely for an afternoon, they can. I also believe that the majority of a 40-hour work week should be spent in the office; it encourages staff to communicate with their coworkers and take a vested ownership in the daily operations of the business, and it ensures the company develops a culture of its own.

No two people are alike when it comes to sleeping schedules. This has a very negative effect when employees are asked to begin their day at a time that is not natural for them. I believe that staff should be given the flexibility to arrive and begin their day when they will be most productive. This does not mean everyone should sleep in until noon and begin work at 1PM. It does mean that should a staff member prefer to work later in the evening because that is when they are most productive, the company should encourage this behavior.

Ultimately an organization is only as strong as the communication between its team members. Above I mentioned that working from home all the time is a suboptimal choice; the primary reason for that belief is that communication can be challenging when everyone is so spread out. The ability to quickly gather team members in the office and discuss a topic can ensure minimal time is wasted when a decision must be made. I believe that office hours, also called core hours, are an optimal method to ensure staff have the flexibility to work remotely, while encouraging team communication. Having all staff in the office for certain set periods, often 10 AM to noon and 2 PM to 4 PM, ensures that if a staff member is needed for a discussion, they can be found. This use of core hours, I believe, provides a good balance between allowing flexibility for staff and ensuring a solid team dynamic takes hold.

Work Space – One common perception within IT over the years has been that putting staff in shared space will allow better collaboration. The challenge is that very little IT work is collaborative in nature. The bulk of the work that must be accomplished is individuals working on their pieces of a project, and this type of activity requires that staff be able to focus. Shared spaces have a lot of benefits in terms of quick access to others, but at the cost of decreased focus due to noise and other distractions. I believe that companies need two primary types of space available for their IT staff: private offices to allow for focus and concentration, and shared collaboration areas to allow for quick meetings and discussions.

Private office space allows staff to have an area that is their own to focus on their work and not be distracted by outside noise, phone calls, hallway discussions or other projects. Each staff member should have an office that allows them to close the door and focus free of distractions.

Common areas should be available to encourage team discussions and impromptu meetings. Very rarely can a decision be made faster than by pulling the team together for a quick discussion in the hallway. These common areas within an office space should have enough whiteboard space that design ideas and other notes from the discussion can be kept. These common areas will also encourage inclusion of all project members, and not just a subset that may discuss the matter in a private office or on a conference call.

Loyalty – Companies often expect a certain level of loyalty from all staff, but do not necessarily show that level of loyalty back to their staff. Having the CEO walk around at the company holiday party and shake hands only goes so far toward telling employees that they are valued. I encourage all managers within IT to regularly call out the accomplishments of their staff to the rest of the organization. IT staff strive for recognition; it is what encourages them to do their best every day. When managers publicly acknowledge a job well done, it tells the employee and all their coworkers that the effort and work are appreciated.

Now that we have those out of the way, let's explore the deep dark truth of IT. Even if a company does each of those perfectly, some staff are going to leave. This is just the nature of the business. No matter how hard a company and its managers try, there will always be staff that are looking for something that the company cannot provide. When this occurs, and it will, it is important that the employee and company both act as professionally as possible. There is an old saying in HR, “don't burn your bridges,” and it applies to both the employee and the employer. IT is such a rapidly evolving industry that even if the match between a company and employee is not right now, it is very possible that a match will be made down the road after either the company or the individual evolves. Even if a staff member leaves, they still have institutional knowledge that could be of value later.

Someone I have worked with many times over the years has a very clear way to sum up the relationship between employee and employer: “Pay me well, Treat me well, Wish me well.” If you pay your employees fairly and treat them wonderfully, they will do quality work. If you pay them exceptionally well, they will do quality work even with higher stress levels. But should an employee not be treated fairly and not be paid well enough to compensate, the company should “Wish them well” in new opportunities.

Thursday, May 21, 2009

Understanding Lustre Internals

Lustre can be a complex package to manage and understand. The folks at ORNL, with assistance from the Lustre Center of Excellence, have put out a wonderful paper on Understanding Lustre Internals.

I recommend that all Lustre administrators read it; it is very useful for understanding how all the Lustre pieces plug together.

Tuesday, May 5, 2009

"Cloud" and HPC?, Huh?

I have tried for the most part to not post on this phenomenon known as "cloud computing." "Cloud" is still evolving and as such has many different meanings. The reason this whitepaper caught my attention is its attempt at connecting high performance computing (HPC) with "cloud computing." The way I see it, "cloud" is still more of an evolving idea than a true product. True, many companies are offering "cloud" products, but the standards are still evolving, as is the true meaning of "cloud computing."

In my mind "cloud" is the next logical evolution of computing - better resource management through enabling applications to better communicate with their supporting infrastructures (servers, storage, network, CPU and memory resources) to allow applications to have the intelligence to scale up and down based on demand. "Cloud Computing" also has a valid connection to outsourcing in the sense that shared infrastructures will at some point overtake the privately managed information technology (IT) infrastructures that are common today.

There are several points about the above listed whitepaper from UnivaUD that caught my attention:

MPI was only mentioned once. The Message Passing Interface (MPI) is the standard on which most HPC applications and platforms are built. For a paper to truly look at the potential of outsourcing HPC to a "cloud" environment, an in-depth review of MPI will need to be done to ensure the proper updates are made to handle the additional physical layer errors that could occur in a shared environment, as well as the added challenges of communication in an unknown environment.
There was very little mention of the actual applications that are common in HPC. Applications like Fluent, NAMD, NWChem, Gaussian, and FFTW are commonly used on clusters built in house to meet the specific needs of a given community. Moving those applications from these small, in-house environments will take time and review to ensure they are able to scale in shared environments, as well as properly handle the increased variation possible in hardware and configurations.
There was no mention of parallel file systems. This is a fundamental requirement of modern HPC environments. To truly move common HPC environments into the "cloud" a solution will be needed for data management and transfer at the high speeds required of today's applications.

In short, the above linked whitepaper is typical of what I am seeing in the "cloud" space: lots of talk of the possible benefits around the use of shared environments. What we need to stop doing as a community is trying to associate all things IT with "cloud." I have no doubt that in time we will evolve to more use of shared resources - this has been occurring for quite a while with the migration to larger clusters within universities and national laboratories, as well as the ongoing outsourcing of email and specific applications - but as a community we need to ensure that each time we change how we do things for a given area of IT it is with specific goals in mind. Without those clearly defined goals we will not know if we were successful.

As time allows I hope to explore the above issues, particularly looking at alternatives for parallel file systems in environments that may have varying latency, and are distributed over various data centers.

Monday, May 4, 2009

Balancing Security and Productivity – Part 4 of 4

Proxy Internet Connections

Companies often look to proxy servers as a method to monitor and block harmful traffic from their networks. Proxy servers provide a gateway between company networks and outside networks to ensure that all connections are logged and then filtered or denied per company policies. Proxy servers can present a challenge because they can often slow access for staff, and inadvertently limit access to sites that are authorized, but may initially appear unauthorized to the automated tools limiting access.

Open Internet Access – Open internet access is allowing staff unrestricted connections from a corporate network to the outside world; these connections are free from any proxy servers, bandwidth restrictions or other traffic filters. While this can allow for maximum ability for the staff to conduct their jobs, the question must be asked, is this too much access? When a network allows that level of connectivity going out, there is inevitable risk that confidential information could be transmitted out of the company with little or no record of the event.
Limited Internet Access – Outside access can be limited by a variety of methods including blocking specific ports, utilizing proxy servers or utilizing other network traffic monitoring solutions. When used correctly, these tools can not only prevent company confidential information from being inappropriately transmitted outside the company, but they can also provide a solid audit trail in the event an investigation is needed. The trade-off is that staff performance will be affected by possible slowdowns due to the overhead of the tools, and that some of the traffic being blocked or targeted may actually be required for conducting business, in which case an employee's productivity will be adversely affected.
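
The limited-access approach can be sketched in a few lines of Python. This is a toy illustration rather than a real proxy: the blocked ports, the blocked host name and the check_request function are all hypothetical, and a production filter would sit in the network path instead of being called by hand. It does show the two properties discussed above, a policy decision per connection and an audit trail of every attempt.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Hypothetical policy values, chosen only for this example.
BLOCKED_PORTS = {21, 23}                 # e.g. FTP and Telnet
BLOCKED_HOSTS = {"files.example.com"}    # e.g. an unapproved file-sharing site
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

audit_log = []  # every request is recorded, whether allowed or denied


def check_request(user, url):
    """Return the policy decision for one outbound request and log it."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if parts.hostname in BLOCKED_HOSTS:
        decision = "denied: blocked host"
    elif port in BLOCKED_PORTS:
        decision = "denied: blocked port"
    else:
        decision = "allowed"
    timestamp = datetime.now(timezone.utc).isoformat()
    audit_log.append((timestamp, user, url, decision))
    return decision


print(check_request("alice", "https://www.example.org/"))
print(check_request("bob", "https://files.example.com/upload"))
print(check_request("carol", "ftp://partner.example.org/report.pdf"))
```

The third request also shows the trade-off: if transferring files to that partner is a legitimate business need, the port rule blocks real work until someone grants an exception.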

In part 1 of this discussion we asked the question: how do you balance allowing employees to access company data with a personal device that connects to proprietary company information? The answer will ultimately be different for every company, but there are some common criteria that will be consistent across all solutions:

Consistency of security policies - It is critical that security policies are not compromised just because a staff member is using a personal laptop. This means that personal systems must adhere to the same policies for storage of company data, use of virus scanning applications and use and storage of company passwords.
Centralization of storage – Utilizing central, company-controlled storage allows the information technology (IT) department to ensure all company data is regularly backed up, archived and available in the event of laptop or mobile device loss. There are many tools on the market that can automatically replicate data from remote devices to a company managed data center. This ensures data is always available, regardless of the type of device connecting or ownership of the device.

Finding the proper balance of security and productivity is a complicated, dynamic process for both the end users and those forming company policies. Any company today must ensure that its staff have the proper IT resources at their disposal to do their jobs, and that those tools are open enough for staff to use in the most efficient way, but closed enough that proprietary or otherwise confidential data is not put at unnecessary risk. All risks have a potential downside and all functionality has a potential benefit, both of which can be expressed in dollars. It is important to ensure that the balance of that risk and benefit is on the side of benefits, and that the risk is not so great as to cause harm to your company.

Friday, May 1, 2009

Balancing Security and Productivity – Part 3 of 4

Database Encryption

Often companies will encrypt data stored within a database. This ensures that data is secure from simple eavesdropping by requiring a key to manipulate or view the data.

Encrypted Databases – Encrypted databases are becoming more common, either encrypted in their entirety, or only the portions of the database that are particularly sensitive. While encrypted databases do provide a lot of protection against unauthorized users, they potentially provide slower access because of the additional CPU time needed to decrypt the data for use. Encrypted databases also pose a hazard for data loss in the event the keys necessary for data encryption and decryption are lost or otherwise must be regenerated.
Non-Encrypted Database – Standard databases are most common today, essentially databases that store the data in traditional ways without encryption. The risk they pose is that if the clients of the database or backups of the database are compromised, it is quite trivial to read the data contained in that database, which could include personal information like user names, passwords and addresses. While traditional, non-encrypted databases can scale much larger because of the lower CPU usage, they do carry a significant risk of data compromise.

Device Ownership

Device ownership is often a big topic of discussion, especially within companies hiring younger workers right out of college. Individuals will often get very comfortable with a platform while in school and expect to be using that same platform when they enter the workforce. When they later find out that their employer has a different OS or brand of laptop, employees will often use their personal devices for company business.

Company Devices – From a security standpoint, company owned devices are the most secure option, but at a cost. Employees will be less productive if they are forced to use a platform they are uncomfortable with or new to using. Company owned devices ensure that the company can recover the device should an employee leave, and that all software being used is licensed, virus free and properly monitored by corporate IT staff.
Personal Devices – While personal devices can allow workers to be more productive and comfortable with their operating environment, it comes at the cost of very decentralized IT management. Personal devices may not necessarily be covered by corporate software licensing agreements, and may not be kept up to date for security patches per company policy.
Combination – Most firms have settled on a combination of allowing personal hardware, but putting policies and tools in place to ensure it is managed by a centralized IT organization. This ensures that staff can have the tools they are most familiar with, while data integrity, security and virus scanning are kept up to date as company policies evolve.

File Transfer Policies

All companies have the need to transfer files, both internally and externally for review, collaboration and company communication. These documents present a risk to the company because confidential information could inadvertently be sent to unauthorized parties.

File Attachments to Email – Attaching files to email has several risks, including a large need for capacity in the mail servers to handle the volume of traffic, as well as the potential that files could be inadvertently sent outside the company. While some modern email systems have the ability to scan outgoing email for specific content, this is often time consuming and can slow down the flow of communication.
Collaboration Tools – Limiting employees' ability to send files via email attachments is becoming much more common. As a solution to the need to share files, many companies are beginning to use collaboration tools like Trac, TWiki or SharePoint. These solutions allow files to be stored internally, access to be restricted, and proper versions of files to be available for those that need them, without the risk of email and attachments being inadvertently forwarded to outsiders.

Wednesday, April 22, 2009

Balancing Security and Productivity – Part 2 of 4

Chat Applications and Boundaries

Many companies are looking to real-time communication tools like instant messenger and other chat applications to enable staff to communicate in real time, either internally or with external customers or partners. These tools can enable staff to be very efficient at communication and issue escalations, but the possibility of information being shared incorrectly or not properly archived presents a risk that should be evaluated.

Internal-only – Internal-only chat solutions provide staff the ability to quickly communicate internally, while limiting the chance of accidental exposure of customer data outside the company. What internal-only chat solutions lack is the ability to communicate in real time with customers or partners. Without this capability, staff may have to use other, more time consuming solutions for external communication.
Internal and external – By providing staff with the ability to chat real time both internally, and externally they are enabled to communicate real time with customers, partners and other outside groups that contribute to the bottom line. The potential risk is a staff member could send an incorrect file, or cut/paste incorrect text into a chat window and reveal company proprietary data to an external entity.
No-chat – At one extreme is blocking all real-time chat communication, limiting staff to communication using standard email or phone conversations. While this can ensure no company sponsored tools are used for external communication, today's tech-savvy employees will often attempt to circumvent this limitation and use their own tools, potentially creating larger security implications because of non-centralized management. While eliminating chat applications can contribute to a more secure environment, the potential effect on employee productivity can be negative.
Compliance – Compliance is the other large factor for chat and other instant messenger type applications. Compliance can include a variety of items, including detailed record keeping, legal documentation of discussions and industry-standard policies for data storage and handling. Most chat applications offer the option of storing an archive of all discussions; this feature should be evaluated against compliance requirements to ensure that necessary records are kept and unnecessary information is purged.

File Storage Locations

Storing of company files, including email archives, customer communications and other company documents, must be done in a way that files can be recovered if lost, but also in a way that ensures access to those files is only granted to those requiring access to complete their assigned job. Few companies have a consistent method for file storage and sharing; most companies have differing policies for each department. It is important that a company have a defined policy that becomes part of the corporate culture to ensure collaboration and exchange of ideas, as well as compliance for document storage.

Local – Local file storage is individual employees storing company documents on the computers and other devices they use for conducting company business. Local file storage presents a challenge in all facets of security because of a lack of an audit trail for file access, a lack of recovery capabilities if an employee accidentally deletes a file, a lack of a recovery mechanism for lost laptops and ultimately a lack of recoverability if an employee were to leave and take their laptop with them. While local-only storage provides an individual employee with the easiest access to the files they work with regularly, the company as a whole has very limited visibility into that employee's archive of company data.
Network Shares – Network shares provide a loosely controlled environment for storing files that individual staff members have worked on or created. Network shares provide minimal levels of recoverability because they can be backed up more easily than individual laptops and desktops, and they can also do minimal revision control. They do lack real audit capabilities for file access and updates and do not provide staff a formal method for communicating who is working on any given document at any given time. Because of the lack of real auditing paired with the lack of real capability around access controls, network shares are not a good long term strategy for a company that could have many documents to manage.
Shared Collaboration Sites – Shared collaboration sites are the most common method in companies today to share files and documents internally. They provide a very robust method for storing documents, managing multiple revisions and managing access controls for documents based on a variety of factors including need-to-know, manager approval, project participation and department ownership.

Operating System Usage

Many companies will evaluate a given operating system (OS) as part of a security review, when the actual OS in use is a very minor component of the equation. At some point in time a security vulnerability has been found in all major operating systems. The risk posed by these various vulnerabilities has much more to do with how the vulnerability is responded to than the actual OS with the vulnerability.

Staff Skill Level – Probably the most important topic when addressing what operating systems (OS) to use in any environment is the skill set of the system administration team, yet it is often not looked at in depth. Staff are most efficient at administering operating systems that they are familiar with and have experience with. If new operating systems are introduced, the initial ramp up time for staff to become proficient can be on the order of months. During this time there is risk that best practices will not be followed and work could potentially have to be redone. When evaluating operating systems for a given environment, the time needed to train staff in the necessary skills must be considered.
Patch Process – The process to install performance, security and feature upgrade packages differs very widely from OS to OS. This has significant implications for the security of a system: the longer it takes the administration team to install patches, the longer a vulnerability could be exploited. When reviewing new operating systems, the tools they offer for installing and managing patches should be reviewed to ensure that patches can be installed and tested in a timely manner.
Vendor Relationship and commitment – A vendor's commitment to a particular OS and application stack is critical to ensuring a secure environment. When reviewing operating systems for use in your environment, it is important to understand the vendor's commitment to the platform; this has implications for the speed of patches being released, as well as the capabilities a vendor has for developing patches in a timely manner.

Tuesday, April 21, 2009

Lustre Users Group 2009

Last week we held the 2009 Lustre Users Group. It was a success; we had the largest user turnout ever.

All slides can be found here.

I did a presentation on Best Practices for the Sun Lustre Storage System; those slides can be found here.

Friday, April 17, 2009

Balancing Security and Productivity – Part 1 of 4

This is the first part of an ongoing discussion. The additional parts will be posted in the coming weeks.

An often challenging debate in any IT organization is the proper balance of security and productivity. Most organizations struggle to balance a loss in productivity for staff due to tighter security restrictions around passwords, data access, allowed applications, automated monitoring and threat detection. People at various levels within an organization will have differing solutions for balancing risk and ease of completing work for various staff. Every risk that must be understood for security changes has an associated cost, either in the cost of lost data, lost capability or bad publicity. On the flip side, every change made in the name of security and lowering risk could potentially lower employee productivity which can both affect output and have a cost, as well as affect morale if tasks become more difficult to complete.

In addition to evaluating the risk addressed by security policies and their impact on staff productivity, that impact must be assessed across different staff with different duties at the company. Often, the impact is easier to limit for staff with more tightly controlled tasks than for staff who have a larger range of duties that may require off-hours work, remote work or constantly changing duties and tasks.

Any activity within an enterprise, be it adding an application, adding a new mobile device or adding a new network connection, poses a level of risk. That risk must be weighed against the benefits gained by the activity. Take one of the most common tasks for an IT department: adding a new active network connection to someone's office within a company facility. This activity has little risk associated with it because most often only staff will be in the area and able to physically use the connection. The benefit of this can be great by allowing an additional productive staff member, an additional printer for staff use or faster network access than existing connections would allow. In this case the risk to reward balance is reasonable. Now take an activity that is just as common: installing VPN software on a laptop so that a staff member can connect to the company network remotely. What if this laptop is then lost and has company data on it? What if this laptop is infected with a virus that could infect other corporate machines? I intend to explore various trade-offs that must often be reviewed by IT departments and the associated risks and rewards that go with each.

Passwords versus Tokens

One of the most common methods for increasing security within a computing environment is eliminating reusable passwords and replacing them with a token based approach that generates one-time passwords. In this forum I treat any authentication solution that uses a challenge-response exchange or requires an external token as the alternative to standard passwords. There are several trade-offs that must be considered for this approach to provide a high level of assurance that accounts are only used by their designated owners:

Login Speed – Using tokens or other 2-factor methods for logins has the potential to slow down staff's ability to log in. If a staff member cannot find their token, that will slow down their ability to complete tasks. Additionally, the time needed to use a token is often longer than the time required to enter a traditional password from memory and be authenticated.

Seamless Integration – Integration company wide can pose a challenge for tokens and 2-factor authentication solutions. While much improvement has been made on this level with modern identity management tools, most firms still have a diverse range of applications and integration with all of them is often not possible. This leaves companies in a situation where they must decide which applications and tools make sense for token based authentication and which should remain password based.
Ease of Memory – Tokens often use a PIN that is shorter than common passwords. This shorter PIN paired with a token code that is time specific creates a combination of information that is easier to remember, and thus less likely to be written down by staff. This ease of memory of necessary login information can ensure a situation where staff passwords are
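
As an illustration of how a time specific token code can be derived, the sketch below follows the publicly documented HOTP and TOTP algorithms (RFC 4226 and RFC 6238). It is not tied to any particular vendor's token. The function names, the 30 second step and the one-step drift window are assumptions chosen for this example.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Return the HOTP code (RFC 4226) for a shared secret and counter."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # low 4 bits of the last byte pick the offset
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % (10 ** digits)).zfill(digits)


def totp(secret: bytes, now=None, step: int = 30, digits: int = 6) -> str:
    """Return the time based code: HOTP over the number of elapsed steps."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits)


def verify(secret: bytes, candidate: str, now=None, step: int = 30,
           window: int = 1) -> bool:
    """Accept a code from the current step or `window` steps either side.

    The window is an assumption to tolerate small clock drift on the token.
    """
    if now is None:
        now = time.time()
    counter = int(now) // step
    for drift in range(-window, window + 1):
        if counter + drift < 0:
            continue  # no valid step before time zero
        if hmac.compare_digest(hotp(secret, counter + drift), candidate):
            return True
    return False
```

A server would call verify with the secret stored for the account and the code the user typed. A wider window is more forgiving of clock drift but lengthens the time a captured code stays usable, which is the same security versus productivity trade-off discussed above.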

VPN versus Public Secure Web Sites

There are two primary methods for ensuring that company data is secure when being accessed by employees and authorized personnel. The primary method is to use web based applications that run over encrypted channels, with HTTPS being the most common protocol. Often, companies will also implement a virtual private network (VPN) solution to further ensure that all data transmitted is secure.

The primary issue being discussed here is providing access to company applications to staff that are located in remote locations, whether they are working from home, traveling or using remote devices.

VPN Assurances – VPNs, when properly used, can ensure compliance with a variety of company security policies around virus protection, password length and expiration and a system's patch status. These policies can ensure all hosts connected to the company's network are secure. The trade-off is that VPNs are often difficult for users to utilize because of the time necessary to connect and the technical challenge in ensuring users can always connect to the VPN when necessary.
VPN Restrictions – While VPNs ensure that systems connected to the network meet compliance, they restrict an employee's ability to log in quickly and complete a task. If an employee needs access but does not have a company computer, a VPN-only approach may limit their ability to use nearby computers to complete the task.
Availability of Web Based Applications – Web based applications that are encrypted and outside of company VPN infrastructure allow staff to connect in a secure fashion, regardless of whose computer they are using. While this does enable productive work to be done in more locations, it increases the potential that data or passwords could be compromised by keystroke loggers on non-company controlled machines.

Wednesday, April 1, 2009

Security considerations in a virtualized environment

Virtualization is becoming the standard method for consolidating large information technology (IT) environments down to less hardware than was once required. Because of the rapid increase in both processor performance and memory density, paired with increased disk capacities, a single server can handle the load that used to take many servers to accomplish.

This consolidation effort has presented multiple challenges, including:

Increased complexity of IT environments
Increased requirements for system administrators' skill sets
Unknown quantities around security within virtualized environments
Increased need for processes to ensure compliance with applicable industry regulations
Increased need for executives to understand resource utilization and allocation across the environment(s)
Increased need for disaster recovery planning so that single hardware outages do not cripple an environment

I am going to talk primarily about the security aspect, and some mitigation techniques used with virtualization. Security is a difficult subject within virtualization because the topic is in its infancy, and we are still learning the processes needed to secure virtual environments to the same level as our traditional physical infrastructures. The introduction of hypervisors within an IT environment adds a level of complexity to the environment, and creates an entirely new tier where data access, user authorization and monitoring must be implemented to ensure security.

Let's also talk about the boundaries for our discussion and the definition of security I will use for the remainder of this posting. Security can mean many things to many different people. The boundaries for what falls within the realm of a security team within a company will also vary greatly from firm to firm. Security as I describe it is the actions and processes that ensure an individual can only access and modify data that management has approved them access to. This includes ensuring permissions and other configuration settings are only changed by those authorized, and private information is only accessed by those that management feels have a valid reason to access it.

Definitions

Physical Host – A physical server running a hypervisor and having one or more virtual machines active on it
Virtual Machine – A single running instance of an operating system (OS) sharing physical resources with other running OS instances
Hypervisor – The software layer that resides on a physical host and allows multiple concurrent virtual machines to effectively share the same physical resources
System Administrator – An individual with root or administrative level rights on one or more physical or virtual hosts
SAN Administrator – An individual with the ability to manipulate shared storage devices or switch configuration between shared storage and servers using that storage
VLANs – Virtual Local Area Networks, a method to logically partition a single physical network into multiple logical networks
LUNs – Logical Unit Numbers, units of storage exported from a shared storage device to one or more hosts

Now, let's discuss some scenarios that are specific to virtualization, and some techniques to mitigate these threats.

Administrators with full access to hypervisors
Probably the best known and most thought about security vulnerability within virtualized environments is the hypervisor and its inherent access to the virtual machines above it. Most current virtualization solutions have a single root user at the hypervisor level with access to power virtual machines up and down, modify virtual machine (VM) boot parameters and gain console access to those VMs.

This type of model requires both a high level of trust for system administrators and good processes to ensure all changes are approved, properly tested and periodically reviewed by staff other than those responsible for making them. All administrators within a virtual environment should only have access privileges on systems required to complete their job, and systems that contain data they are authorized to see and handle. Management should implement audit policies to periodically review logs and ensure that all changes were approved, properly tested and meet all IT policies.

Console access to VMs
Most hypervisors by default will allow anyone with administrative rights on the host system to access the console for all VMs hosted on that system. This creates a situation where an unauthorized party could access the console of a system and perform password recovery activities, or see system output to the console.

Ensuring that administrators have the least amount of access to successfully complete their job is key to ensuring that console access is limited to those that need it. Often times, administrators will rarely need to access the console of a system because of technologies like remote desktop and remote shells for managing a virtual system. Modern hypervisors will allow permissions to be set so that console access is only given to those that are authorized. It is suggested this be enabled so that an administrator can only access the console for systems they are immediately responsible for.

Patches at the hypervisor level
The hypervisor within a virtual environment creates a single tier with essentially administrator level access to many more systems than the administrator would have had before virtualization. This hypervisor layer has access to all VM data, the ability to power VMs up and down and the ability to see the console for all VMs on a single physical server. This hypervisor layer adds a single tier of access that, if compromised, could create a path to easy compromise of many additional systems.

Ensuring security now requires additional levels of testing during the phase that was traditionally penetration testing. New applications must also include load testing from a security standpoint to ensure that new applications, if compromised, would not affect the performance or response time of remaining applications. This all means that a security patch at the hypervisor level has much more severe implications than patches on individual VMs because of the increased threat.

Ultimately, the most important aspect of hypervisor security is ensuring that only those who require access to it can connect to management tools. This means using host based and network based firewalls to explicitly allow approved traffic and deny all other connections to the hypervisor for VM management. In addition to restricting access, companies should have an efficient process to test patches when they are released from the vendor to ensure they are implemented, particularly at the hypervisor, as quickly as possible to limit any windows of opportunity.
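
The "explicitly allow, deny everything else" policy can be sketched as a small rule check. This is only an illustration of the default-deny logic, not a real firewall configuration; the networks, ports and the is_allowed function are hypothetical values chosen for the example.

```python
import ipaddress

# Hypothetical allow rules: (source network, destination port on the hypervisor).
ALLOW_RULES = [
    (ipaddress.ip_network("10.20.0.0/24"), 443),  # admin workstations, web console
    (ipaddress.ip_network("10.20.0.0/24"), 22),   # admin workstations, remote shell
]


def is_allowed(source_ip: str, dest_port: int, rules=ALLOW_RULES) -> bool:
    """Return True only if a rule explicitly allows the connection.

    Anything that does not match a rule falls through to the final
    return, which is the default-deny behavior.
    """
    address = ipaddress.ip_address(source_ip)
    for network, port in rules:
        if dest_port == port and address in network:
            return True
    return False
```

The important design choice is that there is no "deny list" to maintain: a connection from a new network or to a new port is refused until someone deliberately adds a rule for it.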

Complexity
Any addition of new technology, tools or features has the potential to add more complexity to an already complex IT environment. Complexity creates a variety of long term problems, including making upgrades harder to manage, creating the potential for mistakes and configuration errors, creating the potential for one change to adversely affect other aspects of the environment, and most notably putting a higher workload on IT staff.

As virtual environments grow, testing and validating all processes becomes only more critical. The best defense to complexity is careful documentation that has been tested, and is continually updated to reflect changes in the environment or methods of management around that environment or the company as a whole. The more carefully things are documented, the more efficiently actions can then be automated, ensuring that the potential for human error is further removed. By automating processes around auditing, patch testing, software deployment and VM creation, IT staff can be left to focus on operational efficiencies, while ensuring that all systems will operate within the boundaries of company policy with minimal intervention.

LUNs Zoned to Hypervisor
It is common to utilize a SAN in today's virtualized environments to simplify management of data growth, ease movement of virtual machines and increase performance of backups. This use of a SAN means that anyone with administrative access to the hypervisor can manipulate the LUNs destined for virtual machines. This creates the potential not only for people to access data they have no need to access, but also for data to be manipulated without proper authorization.

Properly encrypting data at the file system level will ensure that data is only accessed by authorized applications and users. Encrypting data ensures that only the authorized application and administrators can manipulate production data. This level of assurance also means that if any physical disks were to become unaccounted for, management can be assured the data will not be read by unauthorized parties.

Ability to power VMs up and down
Virtual machines share an underlying management infrastructure and physical machine infrastructure. This creates the potential that a rogue system administrator or staff member can cause harm to one segment of the infrastructure, simply because they have access to another. Having a shared hypervisor creates the potential that if the administrator account is abused, systems can be stopped, started and rebooted at unexpected times.

Critical services should not be hosted in virtual environments. This will ensure an added layer of protection for things like LDAP, Kerberos, Active Directory, DNS and critical web servers. By hosting these critical services on dedicated physical machines, you ensure that security problems within the hypervisor environment, or rogue staff, do not cause harm to the services that are most critical to the stability of your enterprise.

Staff accounts with permissions to power up and down VMs should be closely monitored and restricted to only allow access to the systems an administrator needs to access to complete their job. This limiting of access will ensure that if an account is abused, the damage it can incur is limited in scope.
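
A minimal sketch of this kind of scoping is shown below, assuming a simple table of which administrator may perform which action on which VM. The account names, VM names, action names and the authorize function are all hypothetical; real hypervisors expose their own role and permission models.

```python
import logging

logger = logging.getLogger("vm_access")

# Hypothetical assignments: administrator -> VM -> actions they may perform.
ASSIGNMENTS = {
    "alice": {"web01": {"power", "console"}, "web02": {"power"}},
    "bob": {"db01": {"console"}},
}


def authorize(admin: str, vm: str, action: str, assignments=ASSIGNMENTS) -> bool:
    """Allow an action only if it was explicitly assigned; log every decision."""
    allowed = action in assignments.get(admin, {}).get(vm, set())
    if allowed:
        logger.info("ALLOW %s %s on %s", admin, action, vm)
    else:
        # Denied attempts are worth reviewing; they may indicate account abuse.
        logger.warning("DENY %s %s on %s", admin, action, vm)
    return allowed
```

Because every decision is logged, the same check supports both goals in the paragraph above: limiting what an abused account can do, and giving reviewers a record to monitor.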

Shared networks on physical machines
Companies often times will use VLANs as a way to separate systems based on usage, security risk, data type and physical site. This reliance on VLANs often times extends as far out as the firewalls at the edge of a corporate network. When using virtual machines, there is the added risk of having multiple virtual machines on a single physical machine that require separate VLANs to function and adhere to existing network policies. Mistakes with initial virtual machine setup, as well as system compromises can create a situation where VMs add unexpected paths between networks.

When initially planning the use of virtual machines, it is vital to include the staff responsible for both security, as well as network routing and switching implementation. They can provide valuable insight into the reasons for using VLANs or other network separation techniques. By including them, you can review what physical systems will house what virtual machines, and if network changes will be required to ensure security is not compromised and unexpected paths are not created between separate networks.

Implementing a new VM
Implementing new virtual machines carries inherent risk, both from the threats posed by any new applications and from the necessity to manage and patch an additional host within the environment. Every new virtual machine is a full OS that could potentially be compromised, or otherwise used to launch attacks on your network or others' networks.

A toolkit should be implemented before any virtual machines are activated that is used for two primary purposes:

Penetration Testing on new systems – All new hosts should be properly tested to ensure they meet company security policies. This testing process should include a review of running services, a review of host level firewall policies, a review of active system accounts and passwords and, finally, confirmation that the system is integrated with corporate monitoring and patch management tools.
Patch management and monitoring on all systems – A corporate wide patch management suite should be used and be inclusive of all virtual machines. This centralization will ensure staff are aware of all virtual machines that are active, and aware of systems that are not up to date on security patches. More advanced tools can also provide staff with the ability to quickly audit systems for other security policies like password length, password expiration and firewall policies.
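
The kind of report such a suite produces can be sketched as follows. This is an illustration only: the VirtualMachine record, the 30 day threshold and the find_patch_gaps function are assumptions for the example, and a real environment would pull the inventory from the hypervisor and the patch dates from the patch management tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class VirtualMachine:
    name: str
    # None means the VM is not known to the patch management tool at all.
    last_patched: Optional[date]


def find_patch_gaps(vms, today: date, max_age_days: int = 30) -> dict:
    """Split the inventory into unmanaged VMs and VMs with stale patches."""
    gaps = {"unmanaged": [], "stale": []}
    for vm in vms:
        if vm.last_patched is None:
            gaps["unmanaged"].append(vm.name)
        elif (today - vm.last_patched).days > max_age_days:
            gaps["stale"].append(vm.name)
    return gaps
```

Running a check like this on a schedule surfaces both problems named above: virtual machines that are active but unknown to staff, and systems that have fallen behind on security patches.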

All virtual machines should be retired as soon as they are no longer needed. This removes the overhead on staff of managing the system, and removes the risk of having the system sit potentially unmonitored and used. Virtual machines should be considered the same as the sprawl of old, unused physical servers, and removed as soon as practically possible.

Application layer vulnerabilities
Ultimately a server is only as strong as its weakest active service, and most often servers are compromised not because of a lack of OS patches, but because of failed application implementations or configuration errors. VMs are vulnerable to this same risk around application level security problems. Virtual machines carry the added risk that if a compromised VM's load increases, it puts other virtual machines on the same physical infrastructure at risk.

Boundaries should be enforced across all tiers of an infrastructure: storage, physical systems, network connections, management tools and applications. An application is an extension of the OS from a security perspective, and applications residing on a single physical system via virtual machines should have similar security characteristics, including risk, data classification and company policies.

Externally facing VMs
The location and use of VMs must be closely tracked. If a physical host has VMs with both internal access and access from external users, the threat of outside attacks affecting internal resources increases dramatically. Any VM on a single physical host is vulnerable to a host of threats because of the other VMs it shares physical resources with.

By working with the networking and security teams before implementing virtual machines, system administrators can ensure that physical hosts only host common virtual machines, grouped by access levels, data classification and risk. Most companies do not cross network boundaries with virtual machines. Separate physical machines will be placed in each separate security environment to host virtual machines for that security and access level.
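
A placement review along these lines can be sketched as a check for physical hosts that carry VMs from more than one security zone. The zone labels, host names and the find_mixed_hosts function are hypothetical and only illustrate the grouping rule.

```python
from collections import defaultdict


def find_mixed_hosts(placements) -> dict:
    """Return hosts whose VMs span more than one security zone.

    `placements` is an iterable of (physical_host, vm_name, zone) tuples.
    """
    zones_by_host = defaultdict(set)
    for host, _vm, zone in placements:
        zones_by_host[host].add(zone)
    return {
        host: sorted(zones)
        for host, zones in zones_by_host.items()
        if len(zones) > 1
    }
```

An empty result means every physical host stays within a single zone; any host that is returned is a candidate for moving VMs before it becomes a path between environments.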

Audits and Tools
Auditing is a critical function in all IT environments. By properly auditing an environment, administrators can be notified of problems before they become serious or data is potentially compromised. A solid audit trail is often required by outside firms that may certify a company's ability to house or process certain types of data. Auditing is an entire topic on its own, but some common items to monitor and alert on in a consistent fashion are:
System level logs from all hosts, both physical and virtual
Monitoring network traffic for unexpected changes to typical traffic patterns
Logging of all manipulation of VMs including console usage, powering on and off of systems, installation of patches and changes to configuration files
Changes to storage configuration that could include LUNs, zoning or encryption characteristics
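
As one small example of alerting on an unexpected change to a typical pattern, the sketch below flags a traffic sample that sits far from a recorded baseline. The three standard deviation threshold and the is_unusual function are assumptions for illustration; production monitoring tools use considerably richer models.

```python
import statistics


def is_unusual(baseline, sample: float, threshold: float = 3.0) -> bool:
    """Flag a sample more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(baseline)
    spread = statistics.pstdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any different value is a change.
        return sample != mean
    return abs(sample - mean) / spread > threshold
```

The baseline would typically be recent measurements for the same host and time of day, for example megabits per second over the last several intervals.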

Security within a virtual environment has the same underlying principles as the traditional physical environments we are accustomed to. Least access must be ensured so that compromised accounts or rogue staff have a limited amount of damage that can be caused. Process is the most important way to ensure access is limited in a way that staff can successfully complete their job, yet not access resources they do not have an immediate need to work with. Clear process can ensure new systems are thoroughly tested, reviewed and put into service, and then managed for the life of the application or host. Staff are more effective at overall administration if consistency is ensured across an environment.
