There is no doubt we are seeing a challenging time in the economy, and it is trickling down to the Information Technology (IT) sector. One of the most important parts of weathering a storm like this in the IT industry is ensuring that your network of coworkers, alumni and friends is strong. By developing a strong network, you have a team of folks you can turn to for advice, recommendations, job postings and the inside track on potential job leads.
The other benefit of networking, beyond looking for a new position, is developing professionally in your current position. By interacting with others in your field and similar fields you can build your personal toolbox by learning from other people's experiences and skills. By regularly working with others you can see what methodologies they use to be successful and what tools they have developed or found to stay efficient in their roles.
Here are some common methods others and I have used to build a community within the IT space:
User Groups – Most cities today have multiple user groups, including Linux, Oracle, MySQL, Dell and DB2, just to name a few. These organizations are always looking for speakers and folks to hold lab sessions. Volunteer to present, volunteer to organize meetings and volunteer to recruit other speakers. It is a wonderful way to meet folks in similar roles as well as share your knowledge and experience with others.
Brown Bag Events – Host a brown bag at your office, invite your coworkers and do a short talk about a topic that interests you or you think would be of relevance in your environment. This gives you publicity within the company, and allows members of other teams to see the skill and experience they have available when new projects come up.
Operate your company's test bed – Companies will often have a test and quality assurance environment for testing new software deployments, completing software builds and testing new vendor hardware. Often this environment does not fall to corporate IT or the quality departments to manage, but sits somewhere in the middle. Volunteer to manage this environment and take real ownership of it. This will give you a great forum to meet people in other departments, as well as share your ideas in a way that will allow them to be utilized in production for the company.
Blogging – Blogging is a simple, effective method to put your ideas out in public for comment and development, and to show your level of expertise in a field. Blogging allows you to share ideas and findings as you write them. While a blog is not a peer-reviewed forum, others can comment on your postings and post additional follow-up information.
Conferences – Presenting at conferences is a wonderful way to show both your level of experience and the new ideas and methodologies you can bring to your field. Conferences provide a peer-reviewed environment where you can submit papers and give talks. These types of environments show not only your level of expertise, but that others in your field value your contributions and capabilities.
Wednesday, December 10, 2008
Thursday, December 4, 2008
Defining High Availability
In today's business computing environments a wide variety of terms are used to describe systems management, systems performance and system availability. One commonly used term is High Availability (HA). This is a very broad term that can encompass many different levels of availability and the costs associated with each level. The term is open to quite a bit of interpretation, and this often leads to confusion about exactly what level of HA an application, device or service provides. Below are the items to factor in when assessing the actual availability of a given service to ensure that it meets your specific interpretation of HA.
Level-setting Expectations
High Availability can mean something different for each person who says or hears the term. It is important to level set expectations about HA and its meaning before having an in-depth discussion about how to meet the objectives laid out in an HA environment. Properly defining HA and calculating the costs associated with implementing HA has four components:
Time to recovery – It is important to understand how long a failure will take to recover from. This will allow you to properly choose solutions that can identify and recover from a failure within a given time frame. A failure can be a hardware problem, a software malfunction or a human error that causes the specific service to act in a way other than it was designed. There are many valid cases where recovery can take on the order of minutes or hours; there are other valid cases where recovery should be near-instantaneous.
Method of recovery – Method of recovery is an important component of planning an HA solution and its associated cost. Recovery from failures is often automated, but it is not uncommon to have an error that requires manual intervention to clear the problem. Manual recovery is often used for categories of problems that are not critical to the operation of a business and do not impact customers.
Data Loss and Corruption – Data loss and corruption is an important part of developing a strategy for HA. Data loss and corruption can occur during a failover of services between nodes, while the network works to get into a state of equilibrium after a change, or during periods when a given service is down. All data has a value associated with it, and when calculating the maximum allowable downtime for a service, data value should be factored in as well.
Performance Impact – Often a failure of a service component will cause a degradation in service, yet leave the service online for users. This degradation is often acceptable, assuming it lasts for a short, limited period of time. Understanding how users will use the service will enable you to understand what level of performance loss is acceptable.
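To make the time-to-recovery discussion more concrete, it can help to translate an availability target into a yearly downtime budget. The sketch below is only an illustration: it assumes a 365-day year (8760 hours), and the function name and the sample targets are my own choices rather than part of any HA product.

```shell
#!/bin/sh
# Convert an availability target (in percent) into allowed downtime per year.
# Assumes a 365-day year (8760 hours); the function name is illustrative.
downtime_hours_per_year() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 8760 }'
}

# Print the yearly downtime budget for a few sample targets.
for target in 99 99.9 99.99; do
    printf '%s%% available -> %s hours of downtime per year\n' \
        "$target" "$(downtime_hours_per_year "$target")"
done
```

A budget like this gives the recovery-time discussion a number to aim at: a service that needs minutes of manual intervention per failure fits some targets comfortably and others not at all.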
A Perfect World
Before we continue into a discussion about how to achieve a given level of High Availability, I want to define my expectation when I hear the term High Availability. When I use the term HA I expect an application or service that can transparently handle failures from a user perspective. I expect an application that, despite a failure on the back-end including a server, disk, network connection or otherwise, will automatically fail over in a way that the end user does not see a disruption in how they are used to interacting with the application. The user should see no degradation in service or loss of data because of the failure.
My definition of HA is assuming a perfect world and adequate funding to architect and implement such a solution. But as we know, IT is not always funded with the necessary money to make dreams into reality. In these cases we must refer back to the first list of components that make up HA to determine which items can be compromised on.
Defining HA for your Environment
Now that we have covered what items are used to define HA, and my definition of HA in a perfect IT world, let's discuss the process for defining a level of HA appropriate for your needs and balancing that with the costs associated with a given level of HA. First is to understand your user base and what their expectations are around application performance, response time and recovery. Things to consider are when your users use the application, how they enter data and what response time they are used to when interacting with the application.
Second is to define what the technical solution will look like for the above customer requirements. This stage is where you will evaluate various levels of redundancy and capability in any database servers, network components, data centers and application capabilities. This stage should include an evaluation of both vendor packaged solutions, and home grown solutions that will meet your needs. This assessment should also include a review of staff capabilities to determine if training will be needed for staff when implementing new technologies.
Third we will define the cost for each component of the above developed architecture. This cost is the cost for an optimal solution, broken down by each individual component. This cost should include all hardware, implementation and software licensing costs for a given period of time. A three year costing is standard within IT and is a good basis to compare several different solutions in an equal fashion.
Finally, we must evaluate the potential cost savings for each component of the solution if we were to cut back from an optimal solution to a more cost effective one. This evaluation should show the portions of the solution that can be implemented via multiple methods, and the associated costs for each method. This information is then used for comparison to balance the required level of HA with the budget available for the project. By properly understanding how much each component of the solution will cost, you can properly evaluate what the possible level of HA will be with each potential increase or decrease in project funding.
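As a rough sketch of the comparison described above, the per-component costs for each option can be kept in a simple list and totaled over the three-year period. Everything in this example is hypothetical: the option names, the components and every figure are placeholders, not real pricing.

```shell
#!/bin/sh
# Sum a three-year cost per option from lines of "option component cost".
# Option names, components and all figures are made-up placeholders.
three_year_totals() {
    awk '{ total[$1] += $3 }
         END { for (o in total) printf "%s %d\n", o, total[o] }' | sort
}

three_year_totals <<'EOF'
optimal storage 60000
optimal cluster_software 45000
optimal training 15000
reduced storage 30000
reduced cluster_software 0
reduced training 5000
EOF
```

Keeping one line per component makes it easy to swap a single component between its optimal and reduced variant and see how the total moves.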
Methods for implementing HA
For most of this document I have avoided discussing actual technical solutions available on the market for implementing HA. This omission was to ensure that HA was defined per your specific needs before defining possible hardware and software solutions. Now I am going to dive into several popular options on the market that help make applications HA capable.
Linux-HA – Linux-HA is an open source solution for managing services across multiple nodes within a cluster to provide a basic high availability solution. Linux-HA is often used to provide automated failover for applications like JBoss, Apache, Lustre or FTP. While Linux-HA will not provide the sub-second failover that some environments need, it will allow administrators to easily set up a pair of servers to act as hot-standbys for one another.
Redundant Switch Fabrics – Modern Ethernet switches offer multiple levels of redundancy, including redundant controllers within a switch, redundant power supplies and, at the high end, redundant switch fabrics: should one complete set of switches and routers fail, a second will seamlessly take over the network traffic. Technologies like OSPF help ensure that routing of IP traffic continues uninterrupted, and protocols like spanning tree keep switches with multiple paths loop-free while allowing an alternate path to take over when a link fails.
RAID – Redundant Arrays of Independent Disks (RAID) is a common method of ensuring that a single disk failure within a server does not cause data loss or corruption. RAID capability can be added through specialized hardware solutions or via low cost software solutions. Both provide a level of protection above standard disks, while keeping total solution costs low.
Oracle RAC – Oracle's Real Application Clusters (RAC) is a clustering solution for Oracle's database products that provides both high availability functionality and a platform to scale a database's performance. While Oracle RAC is often more expensive than clustering solutions from MySQL, it provides a very scalable and reliable platform for ensuring very high levels of availability for applications and their associated databases.
Fiber Channel – Fiber Channel solutions for attaching storage to servers often implement redundancy via dual, redundant fiber channel fabrics. These are often implemented utilizing completely separate switches, cables and power connections. This type of solution can ensure that common failures like cables and PCI cards will not cause a server to lose access to its storage or suffer data corruption.
High Availability is often taken to mean something different for each person. Ultimately, HA is ensuring that customer and end user expectations are met for how an application performs and recovers in the event of a failure. When setting up an application, you must first define HA for your specific needs; you can then properly develop a solution that will meet those expectations. As with most projects within Information Technology, you will then have to assess each component of the solution and make possible tradeoffs to ensure the project's budget is met. Ensuring an application is available and properly recovers is a part of all major Information Technology projects, and today there are many possible technical solutions to ensure your customers' expectations of HA are met.
Building a Lustre Patchless Client
One common need within Lustre environments is the requirement to build Lustre clients using standard Linux kernels. Lustre servers commonly have a custom kernel with specific patches to optimize performance, but clients do not always require these kernel patches.
These directions will enable you to build the RPMs necessary to install the Lustre client bits on a system with a standard Red Hat kernel.
1) Unmount all Lustre clients and comment out their entries in /etc/fstab
2) Reboot a node into the standard Red Hat kernel you would like to build the client for. These directions assume RHEL kernel 2.6.18-92.1.13 on x86_64.
3) Install the full kernel source tree for the running kernel
- cd ~
- yum install rpm-build redhat-rpm-config unifdef
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- rpm -i http://mirror.centos.org/centos/5/updates/SRPMS/kernel-2.6.18-92.1.13.el5.src.rpm
4) Unpack the Lustre bits
- Download from http://www.sun.com/software/products/lustre/get.jsp
- mv lustre-1.6.6.tar.gz /usr/src
- cd /usr/src
- gunzip lustre-1.6.6.tar.gz
- tar -xvf lustre-1.6.6.tar
5) Prep the kernel tree for building Lustre
- cd /usr/src/linux
- cp /boot/config-`uname -r` .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts
6) Configure the build - configure will detect an unpatched kernel and only build the client
- cd /usr/src/lustre-1.6.6
- ./configure --with-linux=/usr/src/linux
7) Create RPMs
- make rpms
8) You should get a set of Lustre RPMs in the build directory.
- ls ~/rpmbuild/RPMS
9) Remove any previously installed Lustre packages
- rpm -e lustre*
10) Install new client bits
- rpm -ivh lustre-client-1.6.6-2.6.18_92.1.13.el5.x86_64.rpm
- rpm -ivh lustre-modules-1.6.6-2.6.18_92.1.13.el5.x86_64.rpm
11) Remount all Lustre mounts
- vi /etc/fstab
uncomment lustre lines
- mount -a
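Steps 1 and 11 both involve editing /etc/fstab by hand. As a minimal sketch, the same edits could be scripted with sed. The function names here are my own, the file path is passed as an argument so the helpers can be tried on a copy first, and the uncomment helper assumes commented entries look like "#/dev/..." with no space after the "#".

```shell
#!/bin/sh
# Comment out or restore lustre entries in an fstab-style file.
# Function names are illustrative; a backup is left next to the file as FILE.bak.

# Prefix '#' to uncommented lines that contain a whitespace-delimited "lustre" field.
comment_lustre() {
    sed -i.bak 's/^\([^#].*[[:space:]]lustre[[:space:]].*\)$/#\1/' "$1"
}

# Strip a leading '#' from lines of the form "#<entry> ... lustre ...".
# Lines starting with "# " (ordinary comments) are left alone.
uncomment_lustre() {
    sed -i.bak 's/^#\([^[:space:]#].*[[:space:]]lustre[[:space:]].*\)$/\1/' "$1"
}

# Example, run against a copy rather than the live file:
#   cp /etc/fstab /tmp/fstab.test
#   comment_lustre /tmp/fstab.test
#   uncomment_lustre /tmp/fstab.test
```

I would still review the resulting file before running mount -a, since a pattern match is no substitute for reading the entries.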
Monday, December 1, 2008
Implementing Lustre Failover
Linux-HA, also referred to as Heartbeat, is an open source tool for managing services across multiple nodes within a cluster. Linux-HA ensures that a given service or disk is only running or mounted on a single server within the cluster at a given time. Linux-HA ensures that if a server within the cluster were to fail, the other server would become active for the service automatically, minimizing downtime for the users.
A default install, as I will document today, only catches problems with a server in the cluster not responding to Linux-HA communication. If a node were to have other problems like failed disks, failed auxiliary network connections or errors in I/O access, Heartbeat would not catch and respond to those failures without additional instrumentation.
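As one small example of what additional instrumentation might look like, the sketch below checks whether a Lustre target is actually mounted where it is expected. This is not part of Linux-HA; the function name is my own, and the mount point in the example comes from the setup described later in this post.

```shell
#!/bin/sh
# Report whether a Lustre target is mounted at the given mount point.
# The second argument is the mounts table to read (defaults to /proc/mounts),
# which makes the check easy to try against a sample file.
lustre_is_mounted() {
    awk -v mnt="$1" '
        $2 == mnt && $3 == "lustre" { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

# Example:
#   lustre_is_mounted /mnt/lustre/ost00 || echo "ost00 is not mounted"
```

A check like this only reports a condition; deciding how to react to it, and how to hook it into the cluster, is a separate design question.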
The directions below show how to implement Linux-HA to provide more automated failover of Lustre services. These directions were developed and tested with Lustre version 1.6.5.1 and Linux-HA version 2.1.4.
Assumptions
1) Install Linux-HA
# yum -y install heartbeat
2) Comment out all Lustre mounts from /etc/fstab and umount existing Lustre server and client filesystems. This will ensure no data corruption or contention issues when starting Heartbeat.
MGS/MDS Pair
mgs # cat /etc/fstab | grep lus
#/dev/MGTDISK /mnt/lustre/mgs lustre defaults,_netdev 0 0
mds # cat /etc/fstab | grep lus
#/dev/MDTDISK /mnt/lustre/mds lustre defaults,_netdev 0 0
OSS Pair
oss01 # cat /etc/fstab | grep lus
#/dev/OST00DISK /mnt/lustre/oss00 lustre defaults,_netdev 0 0
#/dev/OST02DISK /mnt/lustre/oss02 lustre defaults,_netdev 0 0
oss02 # cat /etc/fstab | grep lus
#/dev/OST01DISK /mnt/lustre/oss01 lustre defaults,_netdev 0 0
#/dev/OST03DISK /mnt/lustre/oss03 lustre defaults,_netdev 0 0
3) Create all mount points on both nodes in each node-pair
MGS/MDS Pair
# mkdir /mnt/lustre/mgt
# mkdir /mnt/lustre/mdt
OSS Pair
# mkdir /mnt/lustre/ost00
# mkdir /mnt/lustre/ost01
# mkdir /mnt/lustre/ost02
# mkdir /mnt/lustre/ost03
4) Execute '/sbin/chkconfig --level 345 heartbeat on' on all 4 nodes
5) /etc/ha.d/ha.cf changes
MGS/MDS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 10100
auto_failback off
stonith_host mgs external/ipmi mds 10.0.1.100 admin adminpassword
stonith_host mds external/ipmi mgs 10.0.1.101 admin adminpassword
node mgs
node mds
OSS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
# different from MGS/MDS node-pair
udpport 10101
auto_failback off
stonith_host oss01 external/ipmi oss02 10.0.1.102 admin adminpassword
stonith_host oss02 external/ipmi oss01 10.0.1.103 admin adminpassword
node oss01
node oss02
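The two ha.cf files above differ only in the node names, the UDP port and the stonith_host addresses, so one way to keep them consistent is to render both from a single template. This is only a sketch: the function name and argument order are my own, and the IPMI user and password are the placeholders from the listings above.

```shell
#!/bin/sh
# Print an ha.cf for one node pair, following the settings listed above.
# Arguments: node1 node2 udpport address1 address2, where address1 and
# address2 are the addresses used on the first and second stonith_host lines.
# The function name is illustrative; "admin adminpassword" are placeholders.
render_ha_cf() {
    cat <<EOF
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport $3
auto_failback off
stonith_host $1 external/ipmi $2 $4 admin adminpassword
stonith_host $2 external/ipmi $1 $5 admin adminpassword
node $1
node $2
EOF
}

# Example: render_ha_cf oss01 oss02 10101 10.0.1.102 10.0.1.103
```

Redirecting the output to a scratch file and diffing it against the existing /etc/ha.d/ha.cf is a quick way to spot drift between the pairs.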
6) /etc/ha.d/authkeys changes
MGS/MDS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourMGSMDSPassphraseHere
OSS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourOSSPassphraseHere
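Rather than typing a passphrase by hand, the key can be generated randomly. The sketch below writes an authkeys file in the same format as above; the function name is my own, and the target path is an argument so it can be tried outside /etc/ha.d. Heartbeat expects authkeys to be readable only by its owner (mode 600), and both nodes in a pair need the same file.

```shell
#!/bin/sh
# Write an authkeys file with a random sha1 key and owner-only permissions.
# The function name is illustrative; pass the destination path as $1.
write_authkeys() {
    # Hash a block of random data to get a 40-character hex key.
    key=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | sha1sum | cut -d' ' -f1)
    # Create the file under a restrictive umask, then set mode 600 explicitly.
    ( umask 077; printf 'auth 2\n2 sha1 %s\n' "$key" > "$1" )
    chmod 600 "$1"
}

# Example:
#   write_authkeys /tmp/authkeys.new
#   then copy the result to /etc/ha.d/authkeys on both nodes of the pair
```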
7) /etc/ha.d/haresources changes
MGS/MDS Pair
# cat haresources | grep -v '#'
mgs Filesystem::/dev/MGTDISK::/mnt/lustre/mgt::lustre
mds Filesystem::/dev/MDTDISK::/mnt/lustre/mdt::lustre
OSS Pair
# cat haresources | grep -v '#'
oss01 Filesystem::/dev/OST00DISK::/mnt/lustre/ost00::lustre
oss02 Filesystem::/dev/OST01DISK::/mnt/lustre/ost01::lustre
oss01 Filesystem::/dev/OST02DISK::/mnt/lustre/ost02::lustre
oss02 Filesystem::/dev/OST03DISK::/mnt/lustre/ost03::lustre
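Since every mount point named in haresources has to exist on both nodes of a pair (step 3), it can be handy to pull the node, device and mount point back out of the file. The helper below is a small sketch with a name of my own choosing; it only understands Filesystem entries in the format shown above.

```shell
#!/bin/sh
# List node, device and mount point for each Filesystem entry in a
# haresources-style file. Comment lines are skipped.
# The function name is illustrative.
list_ha_filesystems() {
    grep -v '^[[:space:]]*#' "$1" | awk '
        $2 ~ /^Filesystem::/ {
            split($2, f, "::")      # f[2] = device, f[3] = mount point
            print $1, f[2], f[3]
        }'
}

# Example: warn about mount points that are missing on this node.
#   list_ha_filesystems /etc/ha.d/haresources | while read node dev mnt; do
#       [ -d "$mnt" ] || echo "missing mount point: $mnt"
#   done
```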
8) Specify the address of the failover MGS node for all Lustre filesystem components
mds # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/MDTDISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST00DISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST01DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST01
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST02DISK
9) Execute 'service heartbeat start' on MGS/MDS pair
10) Execute 'service heartbeat start' on OSS pair
11) Mount the Lustre filesystem on all clients
client # mount -t lustre 10.0.0.101@tcp0,10.0.0.100@tcp0:/lustre /mnt/lustre
client # cat /etc/fstab | grep lustre
10.0.0.101@tcp0,10.0.0.100@tcp0:/lustre /mnt/lustre lustre defaults 0 0
With the above setup, if a single node within each pair (MGS/MDS and OSS01/OSS02) were to fail, after the specified timeout period the clients would be able to successfully recover and continue their I/O operations. Linux-HA is not designed for immediate failover, and a recovery can often take on the order of minutes when resources need to move from one node in a pair to another. While this solution will not provide immediate failover, it will allow administrators to set up an inexpensive system that will automatically recover from hardware failures without lengthy downtimes and impacts to users.
Assumptions
- 4 total nodes (2 node-pairs)
- 1 MGS (Lustre Management Servers)
- 1 MDS (Lustre Metadata Server)
- 1 MDT (Metadata Target) on the MDS
- 2 OSSs (Lustre Object Storage Servers) (OSS01 and OSS02)
- 2 OSTs (Object Storage Targets) per OSS (OST00-OST03)
- The MGS and MDS will be on a pair of clustered servers
- Nodes MGS and MDS have access to the same shared physical disks
- Nodes OSS01 and OSS02 have access to the same shared physical disks
- The name of the filesystem is 'lustre'
- STONITH method is IPMI and the IPMI interface is configured for remote access
- No software RAID, all RAID is implemented via hardware solutions
1) Install Linux-HA
# yum -y install heartbeat
2) Comment out all Lustre mounts from /etc/fstab and umount existing Lustre server and client filesystems. This will ensure no data corruption or contention issues when starting Heartbeat.
MGS/MDS Pair
mgs # cat /etc/fstab | grep lus
#/dev/MGTDISK /mnt/lustre/mgs lustre defaults,_netdev 0 0
mds # cat /etc/fstab | grep lus
#/dev/MDTDISK /mnt/lustre/mds lustre defaults,_netdev 0 0
OSS Pair
oss01 # cat /etc/fstab | grep lus
#/dev/OST00DISK /mnt/lustre/oss00 lustre defaults,_netdev 0 0
#/dev/OST02DISK /mnt/lustre/oss02 lustre defaults,_netdev 0 0
oss02 # cat /etc/fstab | grep lus
#/dev/OST01DISK /mnt/lustre/oss01 lustre defaults,_netdev 0 0
#/dev/OST03DISK /mnt/lustre/oss03 lustre defaults,_netdev 0 0
3) Create all mount points on both nodes in each node-pair
MGS/MDS Pair
# mkdir /mnt/lustre/mgt
# mkdir /mnt/lustre/mdt
OSS Pair
# mkdir /mnt/lustre/ost00
# mkdir /mnt/lustre/ost01
# mkdir /mnt/lustre/ost02
# mkdir /mnt/lustre/ost03
4) Execute '/sbin/chkconfig –level 345 heartbeat on' on all 4 nodes
5) /etc/ha.d/ha.cf changes
MGS/MDS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 10100
auto_failback off
stonith_host mgs external/ipmi mds 10.0.1.100 admin adminpassword
stonith_host mds external/ipmi mgs 10.0.1.101 admin adminpassword
node mgs
node mds
OSS Pair
# cat ha.cf | grep -v '#'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
initdead 120
# different from MGS/MDS node-pair
udpport 10101
auto_failback off
stonith_host oss01 external/ipmi oss02 10.0.1.102 admin adminpassword
stonith_host oss02 external/ipmi oss01 10.0.1.103 admin adminpassword
node oss01
node oss02
6) /etc/ha.d/authkeys changes
MGS/MDS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourMGSMDSPassphraseHere
OSS Pair
# cat authkeys | grep -v '#'
auth 2
2 sha1 SetYourOSSPassphraseHere
7) /etc/ha.d/haresources changes
MGS/MDS Pair
# cat haresources | grep -v '#'
mgs Filesystem::/dev/MGTDISK::/mnt/lustre/mgt::lustre
mds Filesystem::/dev/MDTDISK::/mnt/lustre/mdt::lustre
OSS Pair
# cat haresources | grep -v '#'
oss01 Filesystem::/dev/OST00DISK::/mnt/lustre/ost00::lustre
oss02 Filesystem::/dev/OST01DISK::/mnt/lustre/ost01::lustre
oss01 Filesystem::/dev/OST02DISK::/mnt/lustre/ost02::lustre
oss02 Filesystem::/dev/OST03DISK::/mnt/lustre/ost03::lustre
8) Specify the address of the failover MGS node for all Lustre filesystem components
mds # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/MDTDISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST00DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST01DISK
oss01 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST02DISK
oss02 # tunefs.lustre --writeconf --erase-params --fsname=lustre --failnode=10.0.0.101 /dev/OST03DISK
9) Execute 'service heartbeat start' on MGS/MDS pair
10) Execute 'service heartbeat start' on OSS pair
11) Mount the Lustre filesystem on all clients
client # mount -t lustre 10.0.0.101@tcp0,10.0.0.100@tcp0:/lustre /mnt/lustre
client # cat /etc/fstab | grep lustre
10.0.0.101@tcp0,10.0.0.100@tcp0:/lustre /mnt/lustre lustre defaults 0 0
With the above setup, if a single node within each pair (MGS/MDS and OSS01/OSS02) were to fail, the clients would be able to successfully recover and continue their I/O operations after the specified timeout period. Linux-HA is not designed for immediate failover, and a recovery can often take on the order of minutes when resources need to move from one node in a pair to another. While this solution will not provide immediate failover, it will allow administrators to set up an inexpensive system that automatically recovers from hardware failures without lengthy downtime or impact to users.
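Before starting Heartbeat it can help to sanity-check the two ha.cf files from step 5 against each other. The following is a minimal sketch, not part of Linux-HA: the function names are hypothetical, the parser only understands simple "directive arguments" lines, and the two checks (distinct udpport per node-pair, one stonith_host line starting with each node name) simply mirror the layout used above. It assumes Heartbeat falls back to port 694 when no udpport line is present.

```python
DEFAULT_UDPPORT = 694  # assumed Heartbeat default when ha.cf has no udpport line

MGS_MDS_CF = """\
udpport 10100
auto_failback off
stonith_host mgs external/ipmi mds 10.0.1.100 admin adminpassword
stonith_host mds external/ipmi mgs 10.0.1.101 admin adminpassword
node mgs
node mds
"""

OSS_CF = """\
# different from MGS/MDS node-pair
udpport 10101
auto_failback off
stonith_host oss01 external/ipmi oss02 10.0.1.102 admin adminpassword
stonith_host oss02 external/ipmi oss01 10.0.1.103 admin adminpassword
node oss01
node oss02
"""


def parse_ha_cf(text):
    """Split ha.cf-style text into (directive, [arguments]) pairs, dropping comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            parts = line.split()
            entries.append((parts[0], parts[1:]))
    return entries


def udp_port(text):
    """Return the configured udpport, or the assumed default if none is set."""
    for name, args in parse_ha_cf(text):
        if name == "udpport" and args:
            return int(args[0])
    return DEFAULT_UDPPORT


def check_pairs(first_cf, second_cf):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    if udp_port(first_cf) == udp_port(second_cf):
        problems.append(f"both node-pairs use udpport {udp_port(first_cf)}")
    for label, text in (("first pair", first_cf), ("second pair", second_cf)):
        entries = parse_ha_cf(text)
        nodes = {args[0] for name, args in entries if name == "node" and args}
        stonith = {args[0] for name, args in entries if name == "stonith_host" and args}
        for node in sorted(nodes - stonith):
            problems.append(f"{label}: no stonith_host line starts with {node}")
    return problems


print(check_pairs(MGS_MDS_CF, OSS_CF))
```

An empty list means the two files agree with the conventions used in this walkthrough; it says nothing about whether the IPMI addresses or credentials themselves are correct.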
Monday, November 17, 2008
Security Planning in HPC
Today's high performance computing (HPC) solutions have many components, including compute nodes, shared storage systems, high capacity tape archiving systems and shared interconnects such as Ethernet and InfiniBand. One primary reason companies are turning to HPC solutions is the cost benefit of shared infrastructure that can be leveraged across many different projects and teams. While this shared usage model can allow for managed, cost effective growth, it also introduces new security risks and requires policies and tools to ensure previously separate data is managed properly in a shared environment.
The shared infrastructure model often used in HPC raises several data security concerns that should be addressed prior to deploying shared solutions. Companies will often have some departments working on sensitive projects while others work on very public projects. Other firms may be working with their customers' proprietary data, and most companies face a threat from outside competitors trying to gain access to confidential work. All of these issues must be addressed in shared HPC solutions to ensure data is always secure, a reliable audit platform is in place and security policies can be changed rapidly as company needs change.
When evaluating an HPC solution to ensure data access is managed within company policy, there are several components within the cluster that should be reviewed separately:
Shared file systems – Today's HPC solutions have rapidly become successful because of the availability of massively parallel file systems. These are scalable solutions for very high speed I/O and are often available on all nodes within a cluster.
Databases – More often than ever, companies are utilizing databases to organize massive amounts of both transactional and reporting data. These databases are often paired with HPC solutions to evaluate the data in a scalable and reliable way. They often contain a variety of data including sales, forecasting, payroll, procurement and scheduling, just to name a few.
Local disk – More often than not, compute nodes have local disks to provide a local operating system and swap space. This swap space, and possibly temporary file systems, can provide a place for users to store data while jobs are running, but it is also a location that must be assessed to ensure access is provided only to those who need it.
Compute node memory – Compute nodes also have local physical memory that could be exploited by software flaws to allow unexpected access.
Interconnects – Today's HPC systems often use a high speed interconnect like InfiniBand or 10Gbit Ethernet. These, like any other type of network connection, present the opportunity for sniffing or otherwise monitoring traffic.
Policies
Today's companies often work for a variety of customers as well as on internal projects. Ensuring that data access policies are in place to properly handle those cases can be a complicated balancing act. Some data will require very restrictive policies, while other data will require a very open policy around usage and access. Often, separate filesystems can be used to ensure data is stored in manageable locations and access is granted pursuant to company policies.
There are two primary components to developing these security policies. The first is to assess the risk associated with each component of the system; this risk assessment can include costs in dollars, costs in time and public perception if data were to be handled incorrectly per industry best practices or legal guidelines. The second is to develop policies that mitigate that risk to acceptable levels.
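One way to make such a risk assessment comparable across components is a small risk register. The sketch below is only an illustration under stated assumptions: the 1 to 5 rating scales, the likelihood times worst-impact score, the acceptable threshold of 8 and every rating in the sample data are invented for the example, not taken from any standard.

```python
from dataclasses import dataclass


@dataclass
class ComponentRisk:
    component: str
    likelihood: int         # 1 (rare) to 5 (frequent), illustrative scale
    cost_dollars: int       # impact ratings, each 1 (minor) to 5 (severe)
    cost_time: int
    public_perception: int

    def score(self) -> int:
        # Take the worst impact dimension so one severe cost is not averaged away.
        impact = max(self.cost_dollars, self.cost_time, self.public_perception)
        return self.likelihood * impact


def needing_mitigation(risks, acceptable=8):
    """Return components whose score exceeds the acceptable level, worst first."""
    flagged = [r for r in risks if r.score() > acceptable]
    return sorted(flagged, key=lambda r: r.score(), reverse=True)


# Hypothetical ratings for the components reviewed above.
REGISTER = [
    ComponentRisk("Shared file systems", 3, 4, 3, 5),
    ComponentRisk("Databases", 2, 5, 4, 5),
    ComponentRisk("Local disk", 2, 2, 2, 2),
    ComponentRisk("Interconnects", 1, 3, 2, 3),
]

for risk in needing_mitigation(REGISTER):
    print(risk.component, risk.score())
```

Policies would then be written for the flagged components first, and the register rescored once the mitigations are in place.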
Some common methods to mitigate risk across the above components are:
Data Isolation – Within a shared computing environment, data can be isolated in a variety of ways, including physical isolation using different storage arrays, logical isolation using technology like VLANs, and access restrictions like file permissions.
Audit Trails – Considering audit trails and how to implement them is important. Audit trails ensure both that there is a path to isolating and resolving problems and that legal compliance regulations are met. They can include system log files, authentication log files, resource manager logs and many others to provide end to end documentation of a user and their activities.
Consistent Identity Management – To ensure that data is accessed by the correct individuals and that audit trails are consistent, identity management solutions should be in place that handle HPC environments and other enterprise computing resources in a consistent way. Identity management can be provided by tools like LDAP and Kerberos, as well as more advanced authentication and authorization systems.
Notifications – Notifications are an important part of the change management process within an HPC environment. They can include alerts to security staff, administrators or management that portions of the cluster are out of company compliance, or that attempts to access restricted resources have occurred. Notifications can come from a variety of tools within an HPC environment, but should be uniform in format and information so that staff can respond rapidly to unexpected cluster issues.
Data Cleanup – Jobs within an HPC environment will often create temporary files on individual nodes as well as on shared filesystems. These files affect a system's risk assessment and should be cleaned up once they are no longer needed. Removing all data that is not needed limits the data that needs to be accounted for, as well as the potential exposure if a single system is compromised.
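The data cleanup item lends itself to automation. As a rough sketch, assuming scratch files can be judged stale purely by modification time, the following reports (and optionally removes) regular files older than a cutoff. The function names, the one hour cutoff and the file names in the demo are all hypothetical; a real cluster would more likely tie cleanup to job completion in the resource manager.

```python
import os
import stat
import tempfile
import time


def stale_files(root, max_age_seconds, now=None):
    """Return regular files under root whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat so symlinks are never followed
            except FileNotFoundError:
                continue  # the job or another cleaner removed it first
            if stat.S_ISREG(info.st_mode) and now - info.st_mtime > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def clean_scratch(root, max_age_seconds, dry_run=True):
    """Report stale files and delete them unless dry_run is set."""
    stale = stale_files(root, max_age_seconds)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale


def demo():
    """Build a throwaway scratch directory and clean files older than one hour."""
    with tempfile.TemporaryDirectory() as scratch:
        old = os.path.join(scratch, "job-1234.tmp")
        new = os.path.join(scratch, "job-5678.tmp")
        for path in (old, new):
            open(path, "w").close()
        os.utime(old, (0, 0))  # pretend the first file was last touched in 1970
        removed = clean_scratch(scratch, 3600, dry_run=False)
        return [os.path.basename(p) for p in removed], sorted(os.listdir(scratch))


print(demo())
```

Defaulting to a dry run is deliberate: the list of candidate files can be reviewed, or written to the audit trail, before anything is deleted.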
We have just finished reviewing risk assessments within an HPC environment. These allow management and administrators of HPC systems to understand the costs (political, financial, time) of any failure in security plans or processes. In addition to understanding risk, there is the added complication of enforcing these policies in a way that is consistent across the cluster, consistent across the company and provides a proper audit trail. The most common methods of software implementation for these security policies are:
File System Permissions – File system permissions are the most common place to implement security controls, as well as one of the easiest items to complete and ensure compliance with. These permissions allow administrators at the lowest level to grant and deny access to data based on need. They do not help restrict back-door access by unauthorized individuals, but they do contribute to ensuring that day to day operation of the system is done reliably and securely.
Centralized Monitoring – Centralized monitoring and policy management are key to ensuring consistent security and minimizing human error. A central repository for all log entries allows staff to implement tools that rapidly catch any unauthorized or unexpected activity and respond with the proper speed. Centralized policy management through tools like identity management allows staff to quickly add or remove access based on business needs. By centralizing this policy management, a company can ensure that the often manual process of removing access is eliminated and that proper checks are in place to ensure access changes are applied accordingly.
Resource Manager – Most modern clusters make use of a job scheduler, or resource manager, to allocate nodes and other resources to individual users to complete jobs. Most schedulers allow the allocation of resource groups, and restrictions on those groups, to an individual user or users. By extending this functionality it is possible to restrict users' jobs to run on systems that have data they are allowed to see, and to ensure they cannot access nodes with filesystems they do not have permission to use. The resource manager is a centralized tool that provides great flexibility in ensuring users have access to the resources they need, but no other resources or data.
Mounted File Systems – Often times HPC environments will utilize a variety of automated tools to unmount and remount filesystems based on user access needs. Unmounting a filesystem that is not required for a given user adds a level of access protection above file permissions, ensuring only authorized users access the data contained on a given filesystem.
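File system permissions are also the easiest of these controls to audit automatically. The sketch below assumes a POSIX system and treats any permission bit granted to "other" as a finding; that policy, the function names and the file names in the demo are assumptions made for illustration only.

```python
import os
import stat
import tempfile


def world_accessible(root):
    """Return (path, mode string) for entries under root that grant any access to 'other'."""
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if stat.S_ISLNK(info.st_mode):
                continue  # skip symlinks; the link's own mode bits are not what grants access
            if info.st_mode & stat.S_IRWXO:
                findings.append((path, stat.filemode(info.st_mode)))
    return sorted(findings)


def demo():
    """Create one group-only file and one world-readable file, then audit them."""
    with tempfile.TemporaryDirectory() as project:
        restricted = os.path.join(project, "results.dat")
        exposed = os.path.join(project, "notes.txt")
        for path, mode in ((restricted, 0o640), (exposed, 0o644)):
            open(path, "w").close()
            os.chmod(path, mode)
        return [(os.path.basename(p), m) for p, m in world_accessible(project)]


print(demo())
```

Run on a schedule against project directories, a report like this could feed the centralized monitoring described above rather than being read by hand.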
Shared infrastructure is a challenge in all environments when assessing security solutions. A shared infrastructure means that additional precautions must be taken in implementation and security policies to ensure that data and resources are used when expected and only by authorized individuals. When planning a shared environment, the process should begin with a risk assessment to understand what components of the solution could be exploited and what the costs in time and money would be if that were to occur. That risk assessment can then be used to ensure the proper safeguards are implemented with available technologies to reduce the risk to a manageable and acceptable level for the company. Ultimately, all safeguards should be implemented in a way that limits the potential for accidental failures and reduces the need for manual administration and intervention. Shared resources are a challenge, but when properly managed, can ensure better overall utilization for a company without sacrificing security.
Monday, November 3, 2008
Security Threats and Response Plans
Risk management is an important component of a complete security plan for any company. In the area of cyber security this often has two fronts: assessing security threats and documenting responses. Both are equally valuable and, if planned correctly, can ensure that no matter what threat a company faces, there are processes in place to properly manage, communicate and eliminate it. In today's security environment, a threat can mean a variety of things including viruses, data compromises, lost laptops, network intrusion attempts, insider threats and physical compromises.
This risk planning also has purposes beyond planning for and responding to threats. Once gathered, this information can also be used as a basis for understanding the risk around different types of threats. Lower level threats often carry so little risk that responding to all of them would be a waste of company resources, while more complex attacks require a faster, more urgent response. These risk assessments can also help staff plan the appropriate solutions around patch management, firewalls, network controls and other tools meant to stop intrusion. By properly understanding threats and their potential impact on production services and staff time, a proper mitigation plan can also be developed.
Threat Matrix
Planning for threats can often be a daunting task, even for the most seasoned security professionals. The challenge comes from the inability to know exactly what threats are in the wild every day, and from the new threats that are constantly emerging. Many details need to be documented and considered when planning for the known threats: who is causing the threat and who is the target, how the attack is being carried out, what safeguards are affected, what changes will be needed to eliminate the threat, and what the cost of responding is, both in prevention and if the threat is successful.
By planning and carefully documenting the process for responding to known threats, we develop experience that can then be used when responding to unknown threats. The following questions and information gathering can assist in developing this threat matrix. The matrix serves as a starting point when responding to known threats, and will be added to as new threats are encountered.
Who is causing the threat?
This question has multiple components. It looks at whether the threat is being caused by someone internal or external, and whether it is caused by a person or by rogue software. This is an important point to assess for all threats, to ensure that the safeguards added are in the correct place to mitigate the threat and that resources are in the appropriate places to respond to it.
Who is the target?
A proper understanding of the target is important so that the impact of the threat is understood. From the target we can ascertain whether sensitive customer information is at risk, whether availability of public services is at risk or whether we are at risk of a legal compliance issue. By "who", I specifically mean which server, host, application, database, router, firewall or other device could be attacked.
This information can also be used to track developing patterns of attack. As these threats are rolled into response plans, a plan of documentation can ensure that patterns of threats are tracked and managed properly.
Avenue of attack?
This question evaluates how the threat is affecting a company's infrastructure. This could be a technical avenue, such as outside network connections or email, but it can also be physical, such as a person inside or outside your building. Understanding the avenue of attack is critical so that an attack response does not cause undue outages to other portions of the infrastructure or unnecessary outages to customer facing services.
Safeguards affected?
It is important to understand what safeguards are potentially compromised by a given threat. This could include firewalls, application validation checks, database encryption or filesystem encryption. Understanding the affected safeguards will later allow processes to be developed that mitigate the threat as quickly and efficiently as possible by understanding how best to stop the threat.
Changes to stop the threat?
This is a detailed list of the configuration changes, or other changes, that will need to be implemented to stop the threat. These are used to develop the process in the Response Matrix to slow and eliminate the threat. Understanding these responses is also important so that a risk versus reward analysis can be done. Often, the change needed to eliminate the threat is so drastic that it causes other problems with services. By understanding what changes are required, management can make informed decisions about whether to ignore the threat or which response to use against it.
Cost of responding?
Understanding the implications of a threat is important when developing an appropriate level of response to it. The cost of responding can be communicated in multiple ways, including cost in dollars or cost in time. Both are important factors when deciding on response plans and the level of risk associated with each plan.
Cost if threat is successful?
The other important cost associated with responding to a threat is the cost if the threat is successful. This could mean many things depending on the type of threat; it could be an outage in customer facing services, a loss of customer data or the financial impact of not providing the services to customers.
Current mitigation plan in place?
This item assesses the safeguards that are in place to mitigate a given threat. This can include firewalls, security patches, passwords, identity management solutions or a host of application layer safeguards related to data scrubbing and input validation.
By no means are these all the items that should be assessed in a threat matrix. These are the most common ones that most companies will have documented for all threats. Additional information can be included in the threat matrix for specific applications, operating systems, network types and levels of data that a company processes. When developing a threat matrix, a company should evaluate all applications, hosts, networks, network connections and associated tools. By evaluating these items a list can be developed that includes the items that could be attacked, and what methods could be used in an attack.
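The questions above map naturally onto one record per threat. The following is a minimal sketch of such a threat matrix, with one field per question; the class name, field names and all three sample entries are hypothetical, and a real matrix would carry whatever additional columns the company needs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreatEntry:
    name: str
    source: str                      # who is causing the threat
    target: str                      # server, host, application or device at risk
    avenue: str                      # avenue of attack
    safeguards_affected: List[str]
    changes_to_stop: List[str]
    cost_of_responding: str
    cost_if_successful: str
    current_mitigation: List[str] = field(default_factory=list)

    def unmitigated(self) -> bool:
        """True when no safeguard is currently in place for this threat."""
        return not self.current_mitigation


def by_target(matrix):
    """Group threat names by target so patterns of attack on one system stand out."""
    grouped = {}
    for entry in matrix:
        grouped.setdefault(entry.target, []).append(entry.name)
    return grouped


# Hypothetical entries for illustration only.
MATRIX = [
    ThreatEntry("Password guessing", "external, automated software", "customer database",
                "outside network connection", ["firewall", "passwords"],
                ["block the source address"], "minutes of staff time",
                "loss of customer data", ["firewall rules"]),
    ThreatEntry("Unvalidated form input", "external, person", "customer database",
                "public web application", ["application validation checks"],
                ["patch the input handling"], "days of developer time",
                "loss of customer data", ["input validation"]),
    ThreatEntry("Lost laptop", "internal, person", "staff laptop",
                "physical", ["filesystem encryption"],
                ["revoke stored credentials"], "hours of staff time",
                "exposure of confidential work"),
]

print(by_target(MATRIX))
print([entry.name for entry in MATRIX if entry.unmitigated()])
```

Keeping the matrix as structured data rather than free text makes it straightforward to add entries as new threats are encountered and to query it when planning responses.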
Response Matrix
After defining the known threats, a response matrix can be developed specific to the company's needs and risks. These needs and risks can be calculated from the cost portion of the threat matrix developed above. They can be used as a basis for planning resources around responses, legal obligations when threats are encountered and documentation policies around responding to threats. This response matrix should contain detailed procedures for responding to several categories of threats.
The initial component of all response matrices is a list of types of incidents. These can usually be broken down into the categories described below, known as threat types.
Responding to threats and attacks must be a methodical process that encompasses many challenges including speed, communication, documentation and follow up. All of these must be managed while ensuring that customers and staff are impacted as little as necessary. Responding must be a coordinated effort between the various teams within a company that are responsible for system administration, data security, compliance, network administration and enterprise architecture.
Threat Type 1
Threat type one is the category that a company will know the most about and will be able to plan responses for in detail. This category covers threats that have previously been responded to and resolved. Each time one has occurred, a follow up meeting should have been held to revise and improve the process for that specific threat.
Threat Type 2
The threats that will be listed in the second category will include well known exploits and attack avenues as well as threats that other companies have actively faced. These threats will also be listed in the threat matrix, although with a side note that the company has not previously had to respond to them, but does anticipate the threat.
Threat Type 3
This category is often the most complex to respond to because the actual threats are unknown. The process for responding to unknown threats must be dynamic enough to handle a wide range of threats properly, but rigid enough that legal implications are handled and communication channels do not break down in the face of unknown threats.
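The three threat types can be expressed as a simple lookup against the threat matrix. This sketch assumes the company keeps two sets of threat names, those it has already responded to and those it only anticipates; the names used here are hypothetical.

```python
from enum import Enum


class ThreatType(Enum):
    TYPE_1 = 1  # previously responded to and resolved
    TYPE_2 = 2  # well known or anticipated, but not yet faced by the company
    TYPE_3 = 3  # unknown


def classify(threat_name, responded_before, anticipated):
    """Pick the threat type from the company's own history and threat matrix."""
    if threat_name in responded_before:
        return ThreatType.TYPE_1
    if threat_name in anticipated:
        return ThreatType.TYPE_2
    return ThreatType.TYPE_3


# Hypothetical history for illustration.
RESPONDED_BEFORE = {"email virus", "lost laptop"}
ANTICIPATED = {"database injection"}

for name in ("lost laptop", "database injection", "unrecognized traffic pattern"):
    print(name, classify(name, RESPONDED_BEFORE, ANTICIPATED).name)
```

After each incident and its follow up meeting, the threat name would move into the first set, so the same threat is handled as type one the next time it appears.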
Internal versus External
One important component of all response plans is understanding whether the threat you are planning for is internal or external. Internal threats come from staff who are intentionally out to cause the company harm, or from systems that are set up in a way that allows staff unintended access, which subsequently has negative consequences.
Whether the threat is internal or external often plays an important role in how the company responds. If the threat is internal, it is often important to bring in outside resources to assess the problems and develop a mitigation plan, ensuring that the company is not vulnerable to future insider threats.
Response Teams
Another key component of all response matrices is a carefully planned list of individuals and teams to be included in response activities. As a company, you should evaluate whether you have the appropriate level of technical capability in house to respond to known and unknown threats, as well as what additional capabilities need to be brought in when responding to threats. Outside resources could be technical staff specialized in security, marketing staff focused on public relations issues, or even law enforcement to track down the source of threats. Today many companies also ensure that, when responding to major threats, legal counsel is brought in to ensure compliance with data handling and reporting requirements.
Response Training
Each type of threat that might be responded to should have associated required training for staff. This could include computer forensics, data analysis, legal implications or technical skills. This training, done at regular intervals, ensures that staff have both a process and the proper training to effectively respond to threats.
While these are not all the possible categories for each response within the response matrix, they begin to provide a basis for response planning. Additional items can be included in the response plan based on company specifics, industry legalities and management preferences. All response plans should be detailed enough that staff have clear directions to follow in possibly chaotic situations. Response plans should also have regular reviews of the process to ensure they are updated to reflect changes in company management structure, changes in technology and changes in industry trends.
Response Methods
The majority of this document has been focused at defining a matrix of threats that a companies information systems face. These threats can then used as a basis when defining responses in a coordinated fashion, while most of the writing was about defining manual processes, this is only the first step to automating the responses and processes. After defining the manual processes for responding to threats, a system of automation can be put in place for the responses that make sense. Automated solutions work very well for defined threats that have clear responses; a good example is a service on the company network being attacked by an outside system, an automated system that blocks the source of the attack and notifies staff ensures that the threat is immediately contained and staff notified in a timely manner.
Automating the response plan can also aid in the communication of threats and coordination of activities. There are a variety of tools available for tracking incidents, most include the ability to automatically notify the correct staff about status changes of an incident, and provide automated methods to escalate issues between groups and individuals. These tools ensure not only smooth communication during a normally chaotic time, but also a good audit trail after the fact to review an incident to plan for better responses the next time.
Responding to security threats within a company can often be a chaotic time. The more time that is spent up front identifying threats, and developing response processes, the more effectively a company can both understand and respond to threats. A clearly defined process for threat response can ensure that no steps are missed, lessons are documented for future use and communication between teams is effective and efficient. The constantly changing security threats in todays environments means that process is critical to ensure staff are prepared and respond accordingly to all threats, known and unknown.
This risk planning also has other purposes outside of planning for and responding to threats. This information, once gathered, can also be used as a basis for understanding risk around different types of threats. Often lower-level threats carry so little risk that responding to all of them would be a waste of company resources, yet more complex attacks require a faster, more urgent response. These risk assessments can also help staff plan the appropriate solutions around patch management, firewalls, network controls and other tools meant to stop intrusion. By properly understanding threats and their potential impact on production services and staff time, a proper mitigation plan can also be developed.
Threat Matrix
Planning for threats can often be a daunting task, even for the most seasoned of security professionals. The challenge comes from the inability to know exactly what threats are in the wild every day, and from the new threats that are constantly emerging. There are many details that will need to be documented and considered when planning for the various known threats that are in the wild. These include who is causing the threat and who is the target, how the attack is being carried out, what safeguards are affected by the threat, what changes will be needed to eliminate the threat, and what the cost of responding to the threat is, both in prevention and if the threat is successful.
By planning and carefully documenting the process for responding to known threats, we develop experience that can then be used for responding to unknown threats. The following questions and information gathering can assist in developing this threat matrix. This matrix serves as a starting point when responding to known threats, and will be added to as new threats are encountered.
Who is causing the threat?
This question can have multiple components. It looks at whether the threat is being caused by someone internal or external, and whether it is caused by a person or by rogue software. This is an important point to assess for all threats to ensure that the safeguards added are in the correct place to mitigate the threat, and that resources are in the appropriate places to respond to the threat.
Who is the target?
A proper understanding of the target is important so that the impact of the threat is understood. From the target we can ascertain if sensitive customer information is at risk, if availability of public services is at risk, or if we are at risk of a legal compliance issue. By "who", I specifically mean which server, host, application, database, router, firewall or any other device could be attacked.
This information can also be used to track developing patterns of attack. As these threats are rolled into response plans, a plan of documentation can ensure that patterns of threats are tracked and managed properly.
Avenue of attack?
This question evaluates how the threat is affecting a company's infrastructure. This could be through technical avenues like outside network connections or email, but it can also be physical, like a person inside or outside your building. Understanding the avenue of attack is critical so that an attack response does not cause undue outages to other portions of the infrastructure or unnecessary outages to customer-facing services.
Safeguards affected?
It is important to understand what safeguards are potentially compromised by a given threat. This could include firewalls, application validation checks, database encryption or filesystem encryption. Understanding the affected safeguards will later allow processes to be developed that mitigate the threat as quickly and efficiently as possible by understanding how best to stop the threat.
Changes to stop the threat?
This is a detailed list of the configuration or other changes that will need to be implemented to stop the threat. These are used to develop the process in the Response Matrix to slow and eliminate the threat. Understanding these responses is also important so that a risk versus reward analysis can be done. Often the change needed to eliminate the threat is so drastic that other problems with services result. By understanding what changes are required, management can make informed decisions about whether to ignore the threat, or which of the various responses to use against it.
Cost of responding?
Understanding the implications of a threat is important when developing an appropriate level of response to the threat. The cost of responding can be communicated in multiple ways, including cost in dollars or cost in time. Both are important factors to use when deciding on response plans to threats, and on the level of risk associated with various response plans.
Cost if threat is successful?
The other important cost associated with responding to a threat is the cost if the threat is successful. This could mean many things depending on the type of threat; it could be an outage in customer facing services, a loss of customer data or the financial impact of not providing the services to customers.
Current mitigation plan in place?
This item assesses the safeguards that are in place to mitigate a given threat. This can include firewalls, security patches, passwords, identity management solutions or a host of application layer safeguards related to data scrubbing and input validation.
By no means are these all the items that should be assessed in a threat matrix. These are the most common ones that most companies will have documented for all threats. Additional information can be included in the threat matrix for specific applications, operating systems, network types and levels of data that a company processes. When developing a threat matrix, a company should evaluate all applications, hosts, networks, network connections and associated tools. By evaluating these items a list can be developed that includes the items that could be attacked, and what methods could be used in an attack.
Response Matrix
After defining the known threats, a response matrix can be developed specific to the company's needs and risks. These needs and risks can be calculated from the cost portion of the threat matrix developed above. They can be used as a basis for planning resources around responses, legal obligations when threats are encountered and documentation policies around responding to threats. This response matrix should contain detailed procedures for responding to several categories of threats.
The initial component of all response matrices is a list of types of incidents. This can usually be broken down into the following categories, known as threat types:
- Known incidents that have previously been experienced and have a documented response plan
- Known incidents that have not been experienced, but have a response plan in place
- Unknown incidents that do not have an associated response plan
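The three threat types above come down to two yes/no questions: is there a documented response plan, and has the incident been experienced before? A minimal shell sketch of that decision follows. The function name threat_type and its yes/no arguments are illustrative assumptions, not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: map an incident to threat type 1, 2 or 3.
#   $1 = "yes" if a response plan is documented for the incident
#   $2 = "yes" if the company has experienced the incident before
threat_type() {
    has_plan=$1
    seen_before=$2
    if [ "$has_plan" != "yes" ]; then
        echo 3    # unknown incident, no associated response plan
    elif [ "$seen_before" = "yes" ]; then
        echo 1    # known, previously experienced, documented plan
    else
        echo 2    # known, not yet experienced, plan in place
    fi
}
```

For example, threat_type yes no prints 2, meaning an anticipated threat the company has planned for but not yet faced.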
Responding to threats and attacks must be a methodical process that encompasses many challenges including speed, communication, documentation and follow-up. All of these must be managed while ensuring that customers and staff are impacted as little as possible. Responding must be a coordinated effort between the various teams within a company that are responsible for system administration, data security, compliance, network administration and enterprise architecture.
Threat Type 1
Threat type one is the category that a company will know the most about and be able to plan for in the most detail. This category covers threats that have previously been responded to and resolved. Each time one has occurred, a follow-up meeting should have been held to revise and improve the process for the specific threat.
Threat Type 2
The threats that will be listed in the second category will include well known exploits and attack avenues as well as threats that other companies have actively faced. These threats will also be listed in the threat matrix, although with a side note that the company has not previously had to respond to them, but does anticipate the threat.
Threat Type 3
This category is often the most complex to respond to because the actual threats are unknown. The process for responding to unknown threats must be dynamic enough to handle a wide range of threats properly, but rigid enough that legal implications are handled and communication channels do not break down in the face of unknown threats.
Internal versus External
One important component of all response plans is understanding whether the threat you are planning for is internal or external. Internal threats come from staff who are intentionally out to cause the company harm, or from systems that are set up in a way that allows staff access that was not intended and subsequently has negative consequences.
Often the source of the threat, be it internal or external, plays an important role in how the company responds. If the threat is internal, it is often important to bring in outside resources to assess the problems and develop a mitigation plan, ensuring that the company is not vulnerable to future insider threats.
Response Teams
Another key component of all response matrices is a carefully planned list of individuals and teams to be included in response activities. As a company, you should evaluate whether you have the appropriate level of technical capability in house to respond to known and unknown threats, as well as what additional capabilities need to be brought in when responding to threats. Outside resources could be technical staff specialized in security, or marketing staff focused on public relations issues; it could even be law enforcement to track down the source of threats. Today a lot of companies also ensure that when responding to major threats, legal counsel is brought in to ensure compliance with data handling and reporting requirements.
Response Training
Each type of threat that might be responded to should have associated required training for staff. This could include computer forensics, data analysis, legal implications or technical skills. This training, done at regular intervals, ensures that staff have both a process and the proper training to effectively respond to threats.
While these are not all the possible categories for each response within the response matrix, they begin to provide a basis for response planning. Additional items can be included in the response plan based on company specifics, industry legalities and management preferences. All response plans should be detailed enough that staff have clear directions to follow in possibly chaotic situations. Response plans should also have regular reviews of the process to ensure they are updated to reflect changes in company management structure, changes in technology and changes in industry trends.
Response Methods
The majority of this document has focused on defining a matrix of threats that a company's information systems face. These threats can then be used as a basis for defining responses in a coordinated fashion. While most of the writing was about defining manual processes, this is only the first step toward automating the responses and processes. After defining the manual processes for responding to threats, a system of automation can be put in place for the responses where it makes sense. Automated solutions work very well for defined threats that have clear responses. A good example is a service on the company network being attacked by an outside system: an automated system that blocks the source of the attack and notifies staff ensures that the threat is immediately contained and that staff are notified in a timely manner.
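The block-and-notify example could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name block_and_notify, the DRY_RUN switch and the threat-response syslog tag are invented here, and a real deployment would use whatever firewall and notification tooling the company already has. By default the sketch only prints what it would do.

```shell
#!/bin/sh
# Sketch of an automated "block and notify" response.
# DRY_RUN=1 (the default) only prints the actions; set DRY_RUN=0 to
# actually insert the firewall rule (requires root) and log a notice.
DRY_RUN=${DRY_RUN:-1}

block_and_notify() {
    src_ip=$1      # address of the attacking system
    service=$2     # name of the service under attack
    rule="iptables -I INPUT -s $src_ip -j DROP"
    msg="Blocked $src_ip after attack on $service"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $rule"
        echo "would notify: $msg"
    else
        # Contain the threat first, then record it for staff via syslog
        $rule && logger -t threat-response "$msg"
    fi
}
```

Running block_and_notify 203.0.113.7 ssh in dry-run mode prints the rule and the notice without touching the firewall, which makes it possible to review the automated response before trusting it in production.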
Automating the response plan can also aid in the communication of threats and coordination of activities. There are a variety of tools available for tracking incidents; most include the ability to automatically notify the correct staff about status changes of an incident, and provide automated methods to escalate issues between groups and individuals. These tools ensure not only smooth communication during a normally chaotic time, but also a good audit trail for reviewing an incident afterwards and planning better responses the next time.
Responding to security threats within a company can often be a chaotic time. The more time that is spent up front identifying threats and developing response processes, the more effectively a company can both understand and respond to threats. A clearly defined process for threat response can ensure that no steps are missed, lessons are documented for future use and communication between teams is effective and efficient. The constantly changing security threats in today's environments mean that process is critical to ensure staff are prepared and respond accordingly to all threats, known and unknown.
Friday, October 17, 2008
Building a new Lustre Filesystem
Here are the quick and dirty steps to create a new Lustre filesystem for testing purposes. I use this at times to test out commands and test benchmarking tools, not to test performance, but to ensure they operate correctly.
This is a simple test environment on a single system with a single physical disk. Lustre is designed for scalability, so these commands can be run on multiple machines and across many disks to ensure that a bottleneck does not occur in larger environments. The purpose of this is to generate a working Lustre filesystem for testing and sandbox work.
This set of directions assumes you have compiled and installed both the Lustre kernel and the Lustre userspace bits. Check my previous blog posting for how to complete those items if necessary. This also assumes that you have a spare physical disk that can be partitioned to create the various components of the filesystem. In the example case below I created the filesystem within a Xen virtual machine.
1) Create a script to partition the disk that will be used for testing (using /dev/xvdb for example purposes)
#!/bin/sh
# Create four Linux partitions on /dev/xvdb: MGS, MDT and two OSTs
sfdisk /dev/xvdb << EOF
,1000,L
,1000,L
,2000,L
,2000,L
EOF
2) Format the MGS Partition
- mkfs.lustre --mgs --reformat /dev/xvdb1
- mkdir -p /mnt/mgs
- mount -t lustre /dev/xvdb1 /mnt/mgs
3) Format the MDT Partition
- mkfs.lustre --mdt --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb2
- mkdir -p /mnt/lusfs01/mdt
- mount -t lustre /dev/xvdb2 /mnt/lusfs01/mdt
4) Format the First OST Partition
- mkfs.lustre --ost --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb3
- mkdir -p /mnt/lusfs01/ost00
- mount -t lustre /dev/xvdb3 /mnt/lusfs01/ost00
5) Format the Second OST Partition
- mkfs.lustre --ost --reformat --mgsnid=127.0.0.1 --fsname=lusfs01 /dev/xvdb4
- mkdir -p /mnt/lusfs01/ost01
- mount -t lustre /dev/xvdb4 /mnt/lusfs01/ost01
6) Mount the client view of the filesystem
- mkdir -p /mnt/lusfs01/client
- mount -t lustre 127.0.0.1@tcp0:/lusfs01 /mnt/lusfs01/client
At this point you should be able to do an ls, touch, rm or any other standard file manipulation command on files in /mnt/lusfs01/client.
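The ls, touch and rm check can be wrapped in a small script so it is easy to repeat after each rebuild. This is a minimal sketch; the function name smoke_test and the temporary file name are assumptions made for illustration.

```shell
#!/bin/sh
# Smoke test: create, list and remove a file under a mount point.
# Pass the client mount as the argument, e.g. /mnt/lusfs01/client.
smoke_test() {
    dir=$1
    f="$dir/smoke.$$"               # $$ keeps the file name unique per run
    touch "$f" || return 1          # can we create a file?
    ls "$f" > /dev/null || return 1 # is it visible?
    rm "$f" || return 1             # can we remove it again?
    echo "ok: $dir"
}
```

Calling smoke_test /mnt/lusfs01/client should print an ok line when the client mount is usable, and return a non-zero status at the first step that fails.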
Thursday, October 16, 2008
Building Lustre 1.6.5.1 against the latest Redhat Kernel
I was at a customer site this week and had the need to build Lustre 1.6.5.1 against the latest kernel from Redhat, 2.6.18-92.1.13. Since this process has multiple steps, I thought I would document it so that others do not have to reinvent the wheel.
1) Prep a build environment
- cd ~
- yum install rpm-build redhat-rpm-config unifdef
- mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
- echo '%_topdir %(echo $HOME)/rpmbuild' > .rpmmacros
- rpm -i http://mirror.centos.org/centos/5/updates/SRPMS/kernel-2.6.18-92.1.13.el5.src.rpm
- cd ~/rpmbuild/SPECS
- rpmbuild -bp --target=`uname -m` kernel-2.6.spec 2> prep-err.log | tee prep-out.log
2) Download and install quilt (quilt is used for applying kernel patches from a series file)
- cd ~
- wget http://download.savannah.gnu.org/releases/quilt/quilt-0.47.tar.gz
- gunzip quilt-0.47.tar.gz
- tar -xvf quilt-0.47.tar
- cd quilt-0.47
- ./configure
- make
- make install
3) Prepare the Lustre source code
- Download from http://www.sun.com/software/products/lustre/get.jsp
- mv lustre-1.6.5.1.tar.gz /usr/src
- cd /usr/src
- gunzip lustre-1.6.5.1.tar.gz
- tar -xvf lustre-1.6.5.1.tar
4) Apply the Lustre kernel-space patches to the kernel source tree
- cd /root/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/
- ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-rhel5.series series (there are several different series files in the series dir, choose the one closest to your environment)
- ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/patches patches
- quilt push -av
5) Compile a new kernel from source
- make distclean
- make oldconfig dep bzImage modules
- cp /boot/config-`uname -r` .config
- make oldconfig || make menuconfig
- make include/asm
- make include/linux/version.h
- make SUBDIRS=scripts
- make rpm
- rpm -ivh ~/rpmbuild/RPMS/kernel-2.6.18prep-1.x86_64.rpm
- mkinitrd /boot/initrd-2.6.18-prep.img 2.6.18-prep
- Update /etc/grub.conf with new kernel boot information
6) Reboot system with new, patched kernel
7) Compile Lustre with the new kernel running
- cd /usr/src/lustre-1.6.5.1
- ./configure --with-linux=/root/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18.x86_64
- make rpms (Built RPMs will be in ~/rpmbuild/RPMS)
8) Install the appropriate RPMs for your environment
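Before compiling Lustre in step 7, it is worth confirming that the system really did boot into the patched kernel rather than the stock one. A minimal sketch of such a check follows. The function name check_kernel is an assumption, and the release string 2.6.18-prep is taken from the mkinitrd step above; adjust it if your build reports something different.

```shell
#!/bin/sh
# Compare the running kernel release with the one we expect to have booted.
#   $1 = expected release, e.g. 2.6.18-prep
#   $2 = release to check (optional, defaults to the output of uname -r)
check_kernel() {
    expected=$1
    running=${2:-$(uname -r)}
    if [ "$running" = "$expected" ]; then
        echo "running patched kernel $running"
    else
        echo "expected $expected, running $running" >&2
        return 1
    fi
}
```

Run check_kernel 2.6.18-prep after the reboot. If it reports a mismatch, review the /etc/grub.conf entry before going any further.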
Wednesday, October 8, 2008
Getting Ahead in IT
I regularly speak with folks in Information Technology (IT) who tell me how hard it is to get ahead in IT and that there are too few opportunities for promotion. I hear this from folks at all levels, including architects, developers, administrators and team managers. While getting ahead in IT can be difficult, it is not impossible. IT is often the last organization within a company that Human Resources (HR) considers when developing career paths, career training and mentoring plans. There are a variety of things that people in IT can do to ensure they get noticed by management and advance as a result.
Have a career path
First and foremost in IT, it is important to have a career path: know what you want to get out of your career in the immediate future, in 5 years, and beyond. This will allow you to strategically pick projects that increase your ability to meet these goals. This plan will also allow you to speak with your manager, HR and other company leaders about training, mentoring, and other activities to increase your skill set.
A career path is not necessarily a goal to be promoted or to obtain a title that you would like. While it can be those things, it more often consists of targets for growth in technical capability, a goal to move into management, or a goal to develop skills in one role that will enable you to deliver more efficiently in another role. Each of these goals requires a different focus on skills development, but all require the same open communication with your current management about your goals.
Focus on Business Needs
Second, being in IT does not preclude you from participating in the business. IT is an integral part of any business, and the IT folks that excel are the ones that understand how IT can help the business grow and be more efficient. Working to establish an understanding of the company's core principles will allow you to suggest improvements in IT, as well as establish your value within the company.
Very seldom do IT folks participate in business-focused meetings. This is a shame. These meetings are a very good opportunity for IT folks not only to learn about a company's operations, but also to suggest new and better methods that IT can provide for the core business of a company. The IT staff that take the initiative and participate in the business discussions will be the ones most noticed as new opportunities within the company open up.
Bring Forward Ideas
Being noticed is important in all careers, especially in IT where management often forgets how critical IT is to the success of a company. The easiest way to be noticed is to speak up. If you have an idea for improvements, be it a new tool, an improved process or a new piece of hardware, suggest it to management. If your immediate manager does not see the value, mention it next time you are talking to other managers. Management at all companies appreciates individuals that are proactive enough to suggest ways for improvement before being asked.
Set Boundaries
At high-stress times when projects are due and deadlines are tight it may not seem like it, but managers will notice and respect you for holding to your principles. Setting boundaries is important in all jobs, especially in IT where long hours are often the norm and expected. Work with your management to let them know about outside obligations, whether organizations or family. This will ensure your management is aware of other things you are involved in. Most importantly, hold to those boundaries. It is normal to have a long evening when a project is due or the systems are down. The important thing is to stick to your boundaries and not allow one late evening to morph into constantly working excessively long days.
Setting boundaries will also help keep stress within a reasonable level. If you can keep your stress lower, you will not only be a more efficient employee, but you will be able to more effectively assist others and complete your own projects.
Foster an environment of personal development
Folks often say that a work environment is created by management, but this is only partially true. A dedicated staff member can also create an environment where others share, feel comfortable asking questions and learn. By letting others know you are available for questions or conversation, you are letting them know that you are willing to help them develop their skills and experience. This attitude can be used to influence management at all levels to formalize personal development plans.
Staying positive is an important component of developing an environment people want to work in. No one wants to be known as the angry employee. It can be tough to keep your cool at all times, but being the level-headed employee goes a long way to making yourself approachable and creating a comfortable working environment.
Be a Team Player
Everyone is told from childhood to be a team player, but what does that mean in today's business environment? Being a team player is a combination of sharing the work load, accepting projects that benefit the team as a whole and ensuring information is shared for the benefit of the team.
Most importantly, especially for folks that manage teams and projects, do not pass along a task because you do not want to do it. The quickest way to get noticed is to step up and complete the work that needs to be done, but would otherwise fall by the wayside because folks do not find it interesting. You certainly do not want to make a career out of working on uninteresting projects, but picking up one now and again will not be a career killer and will get you some recognition.
The most efficient way to show you are a team player is to not work in a vacuum. As you and your team work on projects, solicit input from other teams. Use their input to carefully evaluate your team's assumptions and project goals, and make adjustments as necessary. This shows that you value their input and experience, and will ultimately enable you to create a better product for the company.
Document, document, document. In today's world, where people regularly move roles and companies, it is critical to ensure that all tasks, no matter how trivial, are documented so that others can complete them. If your company has a wiki, use it. If your company does not have a wiki, put one online for others to use. The fastest way to develop mindshare within a company is to be the person who has contributed the most to internal repositories. By ensuring your tasks are carefully documented, you are telling company management that you are not trying to force them into keeping you; you are telling them you care about the company's long-term success and letting them know you can be moved to other roles while new staff take over your tasks.
Moving On
Ultimately, some combinations of employees and companies will not mesh well. In times like that it is appropriate to look at opportunities elsewhere. There are a lot of companies in IT today, and often a different one will provide you the opportunities you are looking for. But any time you are looking to move companies, make sure you closely assess why your current company is not providing what you need to meet your goals, and work to find a place that will assist in meeting those goals.
When looking for a new role or starting a new role, remember that these things take time. It can often take months or years to feel at home at a new company and really feel like you are a highly contributing member of the team. When looking for a new role, discuss with your potential manager how long staff have been at the company, what types of development opportunities they offer and how the team dynamics work. This will ensure that any job change is meaningful and a path to newer and better things.
IT is not the dead end that it is often made out to be. It is also not a simple process to make yourself known in a large pool of people and advance in IT. To succeed in IT you must have a clear set of goals for your career, and use those to develop a clear list of activities to meet those goals. Moving up is not an immediate process, but by committing the time to development and communication, you can let management know you are willing and capable to take on new challenges and meet your career goals in the process.
Have a career path
First and foremost, in IT it is important to have a career path: know what you want to get out of your career in the immediate future, in five years, and beyond. This will allow you to strategically pick projects that increase your ability to meet these goals. This plan will also allow you to speak with your manager, HR and other company leaders about training, mentoring, and other activities to increase your skill set.
A career path is not necessarily a goal to be promoted or to obtain a title that you would like. While it can be those things, it is more often targets for growth in technical capability, a goal to become management, or a goal to develop skills in one role that will enable you to deliver more efficiently in another role. Each of these goals requires a different focus on skills development, but all require the same open communication with your current management about your goals.
Focus on Business Needs
Second, being in IT does not preclude you from participating in the business. IT is an integral part of any business, and the IT folks that excel are the ones that understand how IT can help the business grow and be more efficient. Working to establish an understanding of the company's core principles will allow you to suggest improvements in IT, as well as establish your value within the company.
Very seldom do IT folks participate in business-focused meetings. This is a shame. These meetings are a very good opportunity for IT folks not only to learn about a company's operations, but also to suggest new and better methods that IT can provide for the core business of the company. The IT staff that take the initiative and participate in the business discussions will be the ones most noticed as new opportunities within the company open up.
Bring Forward Ideas
Being noticed is important in all careers, especially in IT, where management often forgets how critical IT is to the success of a company. The easiest way to be noticed is to speak up. If you have an idea for an improvement, be it a new tool, an improved process or a new piece of hardware, suggest it to management. If your immediate manager does not see the value, mention it the next time you are talking to other managers. Management at all companies appreciates individuals that are proactive enough to suggest improvements before being asked.
Set Boundaries
At high-stress times, when projects are due and deadlines are tight, it may not seem like it, but managers will notice and respect you for holding to your principles. Setting boundaries is important in all jobs, especially in IT, where long hours are often the norm and expected. Work with your management to let them know about outside obligations, whether organizations or family. This will ensure your management is aware of other things you are involved in. Most importantly, hold to those boundaries. It is normal to have a long evening when a project is due or the systems are down. The important thing is to stick to your boundaries and not allow one late evening to morph into constantly working excessively long days.
Setting boundaries will also help keep stress within a reasonable level. If you can keep your stress lower, you will not only be a more efficient employee, but you will be able to more effectively assist others and complete your own projects.
Foster an environment of personal development
Folks often say that a work environment is created by management, but this is only partially true. A dedicated staff member can also create an environment where others share, feel comfortable asking questions and learn. By letting others know you are available for questions or conversation, you are showing that you are willing to help them develop their skills and experience. This attitude can be used to influence management at all levels to formalize personal development plans.
Staying positive is an important component of developing an environment people want to work in. No one wants to be known as the angry employee. It can be tough to keep your cool at all times, but being the level-headed employee goes a long way toward making yourself approachable and creating a comfortable working environment.
Be a Team Player
Everyone is told from childhood to be a team player, but what does that mean in today's business environment? Being a team player is a combination of sharing the workload, accepting projects that benefit the team as a whole and ensuring information is shared for the benefit of the team.
Most importantly, especially for folks that manage teams and projects, do not pass along a task because you do not want to do it. The quickest way to get noticed is to step up and complete the work that needs to be done, but would otherwise fall by the wayside because folks do not find it interesting. You certainly do not want to make a career out of working on uninteresting projects, but picking up one now and again will not be a career killer and will get you some recognition.
The most efficient way to show you are a team player is to not work in a vacuum. As you and your team work on projects, solicit input from other teams. Use their input to carefully evaluate your team's assumptions and project goals, and make adjustments as necessary. This shows that you value their input and experience, and will ultimately enable you to create a better product for the company.
Sunday, October 5, 2008
Succeeding in today's services-driven IT market
The Information Technology (IT) space is undergoing a dramatic shift. This shift away from hardware- and software-based sales processes will have a dramatic impact on the executives who manage IT vendors, those who sell IT solutions and those who market them. As hardware prices have fallen and hardware has become more of a commodity, companies are spending less time on purchasing hardware, and more on ensuring that their business needs are being met by their IT systems. Companies are beginning to realize that IT can be an enabler to ensure that their primary business is run as efficiently as possible. This realization is opening new markets around services, primarily highly complex consulting and integration-focused services.
Service Definitions
To enable organizations to successfully meet these new services-driven customer needs, a separation is occurring within a lot of services organizations. This separation is usually along operational lines, to enable each type of services delivery organization to effectively deliver value to its customers in a scalable manner. Organizations can be broken into four distinct services delivery teams: Professional Services, Consulting Services, Managed Services and Support Services.
There are a variety of interpretations today about what types of offerings are available from a services organization, and how they are branded publicly. When speaking about Professional Services, I envision an organization that is focused on product delivery and integration. Professional Services is often the organization a vendor uses for deploying its hardware and software in customer environments. Professional Services personnel are often experts in a company's portfolio, as well as the products' integration with other offerings on the market.
Consulting Services are a higher caliber of Professional Services in my experience. Consulting Services are offerings around custom integration or custom development, either system or software. Consulting Services tend to be more complex deals that run longer, and do not necessarily have a hardware or commercial software component as Professional Services would.
Managed Services are often used as a way to ensure a vendor has a long-term presence at a customer site. Managed Services offerings are often provided to manage the on-site hardware and software that a customer has purchased, but does not have the staff to operate day-to-day. Managed Services are often long-term agreements for a company to ensure a customer's IT operations are stable and managed per industry best practices.
Finally, Support Services. Support Services are typically the contracts that are purchased with hardware and software to entitle the owner to a clear path for product assistance. This is most often phone support and access to patches for the product for bugs and security vulnerabilities.
Each of these four offerings is distinct; each has its own lifecycle, associated costs and required skill levels for delivery. It is important to distinguish the various service offerings when developing sales strategies, as well as delivery methodologies. Each one is a different type of purchase for the customer, and has different implications for the cost/benefit trade-off analysis that customers do when purchasing services.
Hardware is cheap
One large influence on the drive towards services, and specifically Consulting Services, is the drive towards cheaper hardware. Hardware today is based on standards and commodity parts that enable a larger number of vendors to sell the same capabilities and components. Because of this commonality around features, customers look mostly to price when comparing two similar pieces of hardware. The companies that thrive are the ones that realize hardware is only a platform for running a business; the real value to companies in today's fast-paced market is putting highly capable solutions on this common hardware to enable a customer to be more successful.
Many companies today rely on a predictable, regular refresh cycle for all hardware. This enables companies to position themselves to deliver solutions around managing services on top of this regular refresh. This refresh requires companies to ensure that data and applications are implemented in such a way that when the next hardware refresh comes, the data and applications can be easily migrated. Often customers do not possess the necessary staff in house to implement this type of software provisioning, so they will turn to consulting organizations to implement integrated solutions around these refresh cycles.
Fixing Business Problems
Today's customers are looking more and more to IT as a way to run their core business more efficiently. Customers are using data warehouses to process vast amounts of data to ensure the business is being managed correctly. They are using customer relationship management systems to ensure customer requests are handled efficiently and correctly the first time, and they are using mobile devices to connect remote workers to the office and get them information as soon as possible.
Often companies do not have the necessary staff in house to both implement and manage today's complex solutions. Customers must balance having too many versus too few staff, and will often lean towards fewer staff while contracting out the complex implementations and projects that require more time than staff immediately have.
Customers today will also look to outside services for guidance on business inefficiencies. Often customers see value in outside input when reviewing legacy processes. This outside input can ensure that new processes are developed with an understanding of currently available technology and tools that can assist with driving productivity. These business inefficiencies have a multitude of possible solutions, including business intelligence tools with an associated data warehouse, a formal enterprise architecture program, automated provisioning of new services or automated software development assistance.
Business Intelligence
Today's businesses rely on information for making decisions, as well as for reviewing previous decisions in a systematic way. This information must be organized and have associated tools for reporting. Often companies will look to outside firms to assist them with managing their sales information, forecasting, process assessment, manufacturing data and purchasing. Today's business intelligence solutions rely on expertise in these areas, in data management, data mining, data cleansing and ultimately reporting using accurate and proven methods.
Enterprise Architecture
Enterprise Architecture is growing in popularity as companies look to formalize how business processes and company visions become IT systems. Today, TOGAF and the Zachman Framework are used by many organizations to formally document the architecture that IT will follow for implementing systems, tools, software and support services. Few companies today have the expertise in house to develop a formal Enterprise Architecture program, and because of that they will look to outside companies that have expert-level knowledge of and experience with these frameworks.
Automated Provisioning
Speed is an important factor when doing business today. Companies that can rapidly adapt to change are more successful in meeting customer demands and needs. A company's IT systems are a critical component of every adjustment to the market environment. By being able to more rapidly provision new services, or capacity for existing services, companies can ensure they are ready for this change. Automated provisioning ensures that minimal staff intervention is needed when bringing new services online; this lowers both the time to market and the costs associated with bringing new capabilities and capacity to market. Companies will often look for experienced outside assistance when developing automated provisioning systems. This outside experience can ensure that new services are brought online both efficiently and correctly.
Software Development
Software development can be a complicated orchestra including requirements gathering, architecture, development, internal testing, and finally customer testing. Companies will often look for external assistance with developing unit tests and automated regression testing environments. Outside resources can provide a unique perspective on the development and testing process because they are disconnected from the rest of the development process; they are able to focus all efforts on testing for defects and usability.
Selling to the decision makers
This shift in IT purchases from hardware to services has a dramatic impact on the sales process for vendors, particularly those that sell both hardware and services. As more and more IT solutions are purchased that are directly tied to company objectives, fewer purchases will be made by the managers and staff implementing the solutions. More and more large IT purchases are being made by a company's executives, including the CIO, CFO, CTO and COO. These individuals are no longer focusing on the technology behind the products; they are looking to vendor solutions and offerings as a way to increase productivity, increase output, and better understand and manage their business.
When selling services and solutions today, sales teams must articulate to potential customers the immediate and long-term costs of solutions, and how those costs will directly affect the bottom line of the business. The cost of individual servers, licenses and data centers is no longer looked at with the level of scrutiny it once was. Today purchasers are looking at the total cost of a solution in implementation costs, recurring costs, and upgrade costs, then comparing those costs to the measurable benefits once the solution is in place.
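The comparison described above can be sketched as a small calculation. This is only an illustrative sketch: the class, the function names, the five-year period and all of the figures are assumptions made up for the example, not numbers from any real purchase.

```python
from dataclasses import dataclass


@dataclass
class SolutionCosts:
    """Hypothetical cost categories for one solution."""
    implementation: float      # one-time cost to deploy the solution
    recurring_per_year: float  # yearly support, licenses, managed services
    upgrade: float             # cost of each planned upgrade
    upgrades_planned: int      # number of upgrades over the period


def total_cost(costs: SolutionCosts, years: int) -> float:
    """Total cost of the solution over the evaluation period."""
    return (costs.implementation
            + costs.recurring_per_year * years
            + costs.upgrade * costs.upgrades_planned)


def net_benefit(costs: SolutionCosts, benefit_per_year: float, years: int) -> float:
    """Measurable benefit minus total cost over the same period."""
    return benefit_per_year * years - total_cost(costs, years)


# Illustrative figures only (assumed for this example).
example = SolutionCosts(implementation=200_000, recurring_per_year=50_000,
                        upgrade=40_000, upgrades_planned=1)
print(total_cost(example, years=5))            # 490000
print(net_benefit(example, 120_000, years=5))  # 110000
```

Presenting both numbers side by side gives an executive audience the total cost and the expected return in the same terms, rather than a price list of servers and licenses.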
Considerations for solution-based purchases
Now that we know who is making the decisions around big IT purchases, and that hardware is a small piece of the equation, we need to understand what items influence these decisions. Understanding these items will enable solution developers and sales teams to properly position services within a customer organization to deliver effective business solutions.
Life-cycle Support
Customers look to vendors to provide them solutions that will meet their immediate needs as well as provide paths as the company grows over time. The paths can be upgrades, accommodation of new legal requirements, growth and manageability. As part of the sales process it is critical to communicate to the customer a vendor's capabilities around not only solution implementation, but also upgrades, changes and solution support.
Solution Ownership
Today's companies look to do business with vendors that will own solutions from end to end. This ownership requires vendors to have solid methodologies around product development, delivery, support and upgrades. This does not mean a vendor needs to develop all products in house, or have a software package for every customer, but vendors should be able to provide their customers a single point of contact for all phases of complex projects.
Solution Flexibility
Companies today want to ensure they are not locked in to any specific solution, be it hardware, software or a specific consultant. A consultant who puts in the extra effort to ensure a solution is properly documented and communicated to the customer shows dedication to the customer's business and to helping it succeed.
Cost Management
When purchasing services-related IT solutions, companies today are looking at more than the initial cost of the contract or the options. They are looking at the benefits the company will see because of the solutions. These benefits can be more efficient operations, more customers, or simplified growth paths. It is critical that, as part of the sales process, the costs and benefits are both understood and communicated to the customer.
To be successful in today's IT market, vendors must focus on correcting customers' business problems, and work to become a trusted adviser in their business operations. Customers today are looking for long-term solutions to their IT needs that will ensure they are competitive and able not only to grow, but to change as the market demands. This has caused a dramatic shift away from purchases focused strictly on servers and storage, and toward purchases of solutions. These solutions must have definitive cost returns over time that allow management to clearly understand how their business will be positively impacted.
Monday, September 15, 2008
Preparing for an IPv6 Deployment
IPv6 is the talk of the Internet. Predictions vary in urgency: some state that we will run out of existing IPv4 space within 2 years, while others say there is enough IPv4 space left for 10 years. No matter which prediction is correct, IPv4 space will eventually be exhausted and companies will have to begin migrating to IPv6 to ensure the availability of publicly routable IP space. Few companies have begun to evaluate the problem in detail; the sooner companies begin to evaluate their infrastructure, the more smoothly they can plan a migration from IPv4 to IPv6 and the longer the period over which they can amortize the costs.
IPv6 is an upgrade to the most basic elements of the internet and the networks that connect companies, individuals and the devices we have become so accustomed to using, like our BlackBerrys, iPhones and laptops. Making changes to the basis of all connectivity is not an easy thing to accomplish, or even to begin planning for. The dependencies are unique and well established over many years of additions, improvements and research around IP connectivity. In this paper I intend to break down the process for companies to begin evaluating this upgrade to their infrastructure so that a strategic plan for IPv6 deployment can be developed.
Deployment Options
When deploying IPv6 there are a variety of options to ensure that no services will be interrupted during the time that both IPv4 and IPv6 are operational, both at your company and across the global internet. The two options below are both important to consider because each can work in specific cases as a bridge to IPv6-enable a system or application.
Parallel Stacks
When evaluating options for implementing IPv6 without impacting existing IPv4 traffic, most companies are looking to vendors to provide parallel stack capability, also called dual stack in some cases. By utilizing a parallel stack solution, companies can bring up IPv6 capability in parallel to their existing IPv4 deployments. This ensures that services can be migrated as they are fully tested and validated on IPv6. This parallel stack solution does come at a cost because of the overhead of administering two separate logical networks within a single physical network.
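As a rough illustration of what tracking a parallel stack rollout might look like, the sketch below classifies hosts by which protocols they have addresses for. The host names and addresses are hypothetical (they use the reserved documentation ranges), and the helper name classify_stack is my own, not part of any vendor tool.

```python
import ipaddress


def classify_stack(addresses):
    """Classify a host by the IP versions present in its address list."""
    versions = {ipaddress.ip_address(a).version for a in addresses}
    if versions == {4, 6}:
        return "dual-stack"
    if versions == {4}:
        return "ipv4-only"
    if versions == {6}:
        return "ipv6-only"
    return "none"


# Hypothetical hosts; addresses come from the documentation ranges
# 192.0.2.0/24 and 2001:db8::/32, not from a real network.
hosts = {
    "web01": ["192.0.2.10", "2001:db8::10"],
    "db01": ["192.0.2.20"],
    "lab01": ["2001:db8::30"],
}

report = {name: classify_stack(addrs) for name, addrs in hosts.items()}
for name, status in report.items():
    print(f"{name}: {status}")
```

A report like this makes the administrative overhead visible: every host listed as dual-stack is one that has to be managed on both logical networks until its services are fully validated on IPv6.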
IPv6 NAT-PT
Today's modern routers also offer the capability to perform NAT for traffic between IPv4 and IPv6, and vice versa. With IPv6 NAT-PT, devices in your network that have both IPv4 and IPv6 addresses assigned can be used as gateways, giving other devices a connection point to newer IPv6 devices. IPv6 NAT-PT was designed to provide a step from IPv4 to IPv6. It is a very specific use of the above-mentioned parallel stack, ensuring that devices that speak only one protocol can access devices speaking the other protocol.
Assessment
Overall Questions
The first step in evaluating the impact of an IPv6 update is reviewing the high-level components of your Information Technology (IT) systems. This evaluation begins by looking at vendor commitments, the capability and simplicity of system upgrades, and regulatory impacts:
1) Inventory all vendors you currently use and document their current and future support plans for IPv6. What assistance can they provide, either through documentation or consulting services, to assist with a migration?
2) Review all applications. Which ones are developed in house and which are commercial software? Are all the commercial software vendors still in business?
3) Do you have any legacy systems that are no longer covered by support agreements?
4) Do you have any systems that are covered by federal laws for data consistency? What legal rules are in place governing how these systems are tested and maintained?
Infrastructure Tiers
The next step is to systematically evaluate each component that contributes to the operational capability of your IT systems:
Network – Today's networks are complex sets of routers, switches, Intrusion Detection Systems (IDS) and physical links between sites. To properly assess this portion for an IPv6 upgrade, an audit must be done for each device. It should assess what types of Deployment Options the device supports, how the vendor plans to support IPv6 on this platform, and what upgrades, either software or hardware, will be required for IPv6 support.
VPN Infrastructure – Enterprises are increasingly reliant on VPNs to secure traffic in today's mobile workforce. The software and hardware supporting these VPN sessions need to be tested and evaluated to ensure they will support future IPv6 connections and traffic, as well as a mix of traffic during any transition periods.
Applications – Applications will be the most complex and time-consuming component of the evaluation process. Most companies have many dozens of applications in place, if not more, that must be evaluated to ensure that they will properly migrate to IPv6. The assessment for each application will need to include outside dependencies like license servers, database servers or client software on users' individual machines.
Monitoring Tools – Today's enterprises have a diverse collection of tools used for monitoring network usage, network performance, application usage, application availability and user connections. All this information is critical to both developing and ensuring compliance with SLAs. As part of a complete IPv6 assessment, all monitoring tools, both performance and availability, should be evaluated to ensure they can provide the same level of detail in monitoring, as well as properly store and report data that could be IPv6 or IPv4 specific.
Core services – Core services, including DNS, DHCP and file sharing, are some of the most critical components of an IPv6 migration. These services form the basis for all user experiences and, if implemented correctly, will ensure that a transition to IPv6 is seamless to the users.
Mobile Devices – Mobile devices are becoming a standard for doing business in today's mobile workforce. As you begin to develop your IPv6 transition plan, it is important to include these as part of the assessment to ensure they will continue to operate through the transition and after the transition is complete. You should begin by speaking with your mobile device vendors to understand whether the carrier's network, as well as your employees' handheld devices, will support your IPv6 plans. This assessment will allow you to develop a cost associated with upgrading or replacing units in the field.
End Users Systems – Today's mobile workforce means that many staff have a laptop and a desktop system at a minimum, with more than one laptop per person in a lot of cases. All these devices need to be evaluated to see whether they will support the proposed changes and what upgrades, if any, will be needed for full IPv6 support. This will have an impact on both the schedule and cost of an IPv6 deployment.
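One way to keep the results of this tier-by-tier assessment organized is a simple inventory that records each asset, its tier and its IPv6 status. The sketch below is only a minimal illustration: the Asset fields, the three status values and the sample entries are all assumptions of mine, and a real audit would likely track far more detail (vendor, firmware version, support contract and so on).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    tier: str          # e.g. "Network", "Applications", "Core services"
    ipv6_status: str   # assumed values: "ready", "needs-upgrade", "unsupported"


def summarize(assets):
    """Count assets per tier, broken down by IPv6 status."""
    summary = defaultdict(lambda: defaultdict(int))
    for asset in assets:
        summary[asset.tier][asset.ipv6_status] += 1
    return {tier: dict(counts) for tier, counts in summary.items()}


def blockers(assets):
    """Names of assets that cannot support IPv6 and must be replaced."""
    return sorted(a.name for a in assets if a.ipv6_status == "unsupported")


# Hypothetical inventory entries for illustration only.
inventory = [
    Asset("core-router-1", "Network", "needs-upgrade"),
    Asset("edge-switch-4", "Network", "ready"),
    Asset("legacy-billing", "Applications", "unsupported"),
    Asset("dns-primary", "Core services", "ready"),
]

for tier, counts in summarize(inventory).items():
    print(tier, counts)
print("Blockers:", blockers(inventory))
```

The list of blockers is usually the most useful output, since those are the items that drive replacement cost and schedule rather than a routine upgrade.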
Migration Timeline
Now that we have developed a list of what will need to be upgraded, and paired that with a list of what upgrades our vendors will support, we can use that to develop a process and timeline to test all necessary changes, upgrade appropriate systems and eventually move to an environment where IPv6 is fully operational across all IT systems. This planning stage, taking what we know will need to be updated and planning how to update and test it, is the most important part of an IPv6 migration. This stage is our best opportunity to ensure we understand the time commitments for this project, the costs this project will incur and the potential challenges we will run into.
As we develop a complete migration process, there are many angles that must be included to ensure all services rolled out are ready for prime time and allow your staff to be as proficient as they were in an all IPv4 world. We must ensure that we understand what software will need to be upgraded, what software re-written, and how to test those changes so that we do not introduce complications.
After we have a plan for making the appropriate software updates and testing them, we can develop a detailed plan for how to implement IPv6. This plan should include which services will be upgraded first, second and so on. It should also include which groups of users will be the first to migrate, so that they can be made aware of the plans and provide input during the migration process. This input can then be used to make each subsequent step smoother than the previous one.
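Deciding which services are upgraded first, second and so on is largely a question of dependencies. As a hedged sketch, one way to derive an order is a topological sort, shown here with Python's standard graphlib module (Python 3.9 or later). The service names and the dependency map are hypothetical, and a real plan would also weigh business priority and testing capacity, which this sketch ignores.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each service lists the services that must be
# migrated to IPv6 before it can be migrated itself.
depends_on = {
    "network": set(),
    "dns": {"network"},
    "dhcp": {"network"},
    "vpn": {"network", "dns"},
    "intranet-app": {"dns", "dhcp"},
}


def migration_order(dependencies):
    """Return one valid order in which every service follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = migration_order(depends_on)
for step, service in enumerate(order, start=1):
    print(f"{step}. {service}")
```

Several orders can satisfy the same dependencies, so the result is one acceptable sequence rather than the only one. If the dependency map contains a loop, static_order raises graphlib.CycleError, which is itself a useful finding during planning.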
In addition to the migration plans and testing plans, plans for disaster recovery and maintenance will need to be updated. Because IPv6 is such a radical change from current technologies, most maintenance plans and disaster recovery plans will need to be updated to handle the varied techniques that will need to be used once IPv6 is in place and operational.
Industry Commitments
The last big question lingering after a thorough assessment of your infrastructure is: what about the rest of the industry and our vendors? That is currently a point of contention in the vendor space. Vendors are hesitant to implement IPv6 capability until customers demand it, and customers are hesitant to implement IPv6 until vendors provide a fully supported capability in their products. The most visible example of this contention is today's firewall products. Very few fully support the RFCs around IPv6, yet as most companies look to IPv6, this is an early capability that must be in place before a rollout can continue.
As time continues, more and more vendors will adopt and support IPv6 in the same ways they support IPv4. IPv4 has taken decades to grow to its current level of adoption, and along that path many, many enhancements have been made to the routers, switches and servers that power our enterprise environments. As time progresses, more and more customers will push vendors to add complete IPv6 capabilities, and as they do, it will enable a larger portion of companies to begin to fully embrace IPv6.
Thursday, September 4, 2008
Hardware TCO – Predictable Planning with Refresh Cycles
Often, the most expensive investment any Information Technology (IT) organization will make is its base infrastructure: servers, storage and other various hardware. Yet these hardware purchases are often given much less thought than software or services purchases, and are assumed to be routine and just a cost of doing business. Hardware typically has several phases that should be evaluated as part of the purchase. These include the initial purchase price, the cost of maintaining it, and the ultimate cost of refreshing the hardware at the end of its useful life. All should be weighed equally when evaluating new platforms, refresh cycles and testing new solutions for introduction to a company.
When a company begins to assess the total cost of ownership (TCO) around its IT assets, it often must involve teams not traditionally involved in IT planning. These teams can include facilities, engineering, building managers, application developers and database administrators. Each of these groups can provide valuable input on how the servers and other infrastructure affect their environments and costs on a yearly basis.
There are three primary phases to all hardware purchases:
Initial Hardware Purchase
The initial hardware purchase is often thought to be the most expensive phase, but in reality after factoring in the support costs for a piece of hardware it turns out to be about one-third to one-fourth of the TCO. The initial price is often the easiest to evaluate, but should carefully be weighed against the long term costs of purchasing a specific brand or type of hardware.
Often vendors will allow additional years of warranty coverage, or higher levels of support, to be purchased when the system is first bought. These are often a wise investment if the hardware will be used longer than the initial warranty period. The increased level of support can often mean that your staff will spend less time supporting the system, and more time working on more beneficial tasks.
Support
Support is often the most expensive phase of hardware ownership. The support costs include patching the operating system, supplying power to the system, cooling the system and managing the applications hosted on the system. These costs are amortized over the life of the system, and over time can add up to be the most expensive part of the TCO formula. Often, these support costs are also where the most efficiencies can be gained to lower the TCO of the system.
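To make the relationship between purchase price and support costs concrete, here is a small worked sketch. The dollar figures and the four-year service life are hypothetical numbers chosen only to land in the one-third to one-fourth range mentioned above, and the formula deliberately ignores refresh and disposal costs.

```python
def total_cost_of_ownership(purchase_price, annual_support_cost, years_in_service):
    """Simplified TCO: purchase price plus support costs over the service life."""
    return purchase_price + annual_support_cost * years_in_service


def purchase_share(purchase_price, annual_support_cost, years_in_service):
    """Fraction of the simplified TCO that the initial purchase represents."""
    tco = total_cost_of_ownership(purchase_price, annual_support_cost, years_in_service)
    return purchase_price / tco


# Hypothetical server: $6,000 purchase price, kept in service for 4 years.
for annual_support in (3000, 4500):
    tco = total_cost_of_ownership(6000, annual_support, 4)
    share = purchase_share(6000, annual_support, 4)
    print(f"support ${annual_support}/yr: TCO ${tco}, purchase share {share:.0%}")
```

With these assumed numbers, annual support of $3,000 puts the purchase at one-third of the total, and $4,500 puts it at one-fourth, which is why lowering the yearly support cost has so much leverage on the overall figure.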
There are many things that can be done to lower the support costs around hardware. Most involve improved processes to cut back the amount of time staff have to spend manually managing each specific server. The most notable of these is automation of patch management. By utilizing tools to automate patch deployments and status monitoring, staff can cut significant manual administration time from each specific server. Proactive monitoring of system and application health can also play an important role in cutting down TCO for hardware and associated services. There are many packages available today to assist system administrators in proactively correcting both hardware and software faults before they cause a failure for the end users. These tools help staff isolate and correct problems as early as possible, minimizing the time needed to correct faults.
Utilization is another area where the TCO for your servers can be lowered. By ensuring that servers do not sit idle for long periods of time, you can ensure that any power being used by the servers is being used efficiently. Often, a single server can handle the load that multiple servers used to handle. It is much more efficient to power and cool a single server than multiple servers in this case, and the savings scale very well as you begin looking at utilization rates across dozens or hundreds of servers.
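The power and cooling effect of consolidation can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions: the wattages, the cooling overhead (modeled as a flat multiplier on server draw) and the electricity price are all hypothetical values, and real figures should come from your facilities team.

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(server_count, watts_per_server, cooling_overhead, price_per_kwh):
    """Yearly electricity cost, with cooling modeled as a multiplier on server draw."""
    kwh = server_count * watts_per_server * HOURS_PER_YEAR / 1000
    return kwh * (1 + cooling_overhead) * price_per_kwh


# Hypothetical scenario: ten lightly loaded servers consolidated onto
# two larger, busier ones. Cooling is assumed to add 50% on top of
# server draw, and power is assumed to cost $0.10 per kWh.
before = annual_power_cost(10, 300, 0.5, 0.10)
after = annual_power_cost(2, 450, 0.5, 0.10)

print(f"before: ${before:,.2f} per year")
print(f"after:  ${after:,.2f} per year")
print(f"saved:  ${before - after:,.2f} per year")
```

The same function can be run across a whole fleet to compare scenarios, which is where the savings from consolidation start to become significant.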
Refresh
The final phase to evaluate for all hardware purchases is the refresh cycle. All hardware has a finite lifecycle, at the end of which it will need to be replaced because it is either obsolete or no longer cost effective to maintain. Obsolete in this context can mean two things: either the hardware is so old it can no longer operate with the current operating systems, tools and patches available, or it no longer meets the business needs of your company.
There are two common methods that companies use to replace aging hardware. The first and most common is simply purchasing new hardware when a system is no longer under warranty or has gotten too slow to use in the IT environment. More and more, though, firms are implementing a rolling refresh cycle to add a level of predictability to all hardware purchases. A rolling refresh cycle allows a company to plan its capital outlay for IT investments more clearly, and to better plan long-term cycles for purchases, upgrades and replacements. Typically, a rolling refresh schedule is based on the standard warranty period for newer server hardware, 3-5 years. A rolling refresh cycle also allows IT staff to better plan workloads by knowing ahead of time that new servers will need to be configured, tested and put into production.
Refresh cycle planning should also include an assessment of upcoming technologies and how they will affect purchases two and three years down the line. Every year hardware is faster than before, and provides new possibilities for the amount of data that can be processed. In addition, new technologies around virtualization are changing the dynamic of how system administrators provision new systems. No longer do system administrators add a single new server because of a single new application. Today many different applications can be run on a single piece of hardware and kept separate from each other by using virtualization technologies.
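A rolling refresh schedule can be sketched as spreading the fleet evenly across the years of the cycle, replacing the oldest systems first. The code below is a minimal sketch under that assumption. The server names, purchase years, four-year cycle and 2009 start year are all hypothetical.

```python
def rolling_refresh_schedule(servers, cycle_years, start_year):
    """Assign replacement years so about 1/cycle_years of the fleet is replaced each year.

    servers is a list of (name, purchase_year) tuples; oldest systems go first.
    """
    ordered = sorted(servers, key=lambda s: s[1])
    per_year = -(-len(ordered) // cycle_years)  # ceiling division
    schedule = {}
    for i, (name, _purchase_year) in enumerate(ordered):
        schedule.setdefault(start_year + i // per_year, []).append(name)
    return schedule


# Hypothetical fleet of eight servers on an assumed four-year cycle.
fleet = [
    ("srv1", 2004), ("srv2", 2004), ("srv3", 2005), ("srv4", 2005),
    ("srv5", 2006), ("srv6", 2006), ("srv7", 2007), ("srv8", 2007),
]

for year, names in rolling_refresh_schedule(fleet, 4, 2009).items():
    print(year, names)
```

Because the same number of systems is replaced each year, both the capital budget and the staff workload for configuring and testing new servers become predictable.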
As you assess your refresh cycle an important part of the TCO calculation is determining what applications can run within virtualized environments, and which will need separate hardware to run on. This will determine what level of consolidation can occur from year to year as the refresh occurs. As you look toward implementing a rolling refresh cycle, a first step is to understand how many existing servers are in place and how many existing applications. That data can then be used to develop a matrix of how things would look if virtualization were employed, and how a rolling refresh cycle could be utilized over time to ensure that all pieces of the infrastructure are upgraded in an expected period of time.
Minimizing the TCO of IT hardware is a key component of ensuring that the long term costs of owning the hardware are predictable and manageable. A rolling refresh cycle, paired with newer technologies like monitoring tools, automation tools and virtualization can allow IT staff to clearly plan how hardware will be used from purchase to end of life and how it will then be replaced. This cycle ensures staff can plan for future upgrades and migrations, as well as avoid last minute unexpected upgrades.
Often times when a company begins to assess the total cost of ownership (TCO) around its IT assets, it must involve teams not traditionally involved in IT planning. These teams can include facilities, engineering, building managers, application developers and data base administrators. Each of these groups can provide valuable input on how the servers and other infrastructure affect there environments and costs on a yearly basis.
There are three primary phases to all hardware purchases:
Initial Hardware Purchase
The initial hardware purchase is often thought to be the most expensive phase, but in reality after factoring in the support costs for a piece of hardware it turns out to be about one-third to one-fourth of the TCO. The initial price is often the easiest to evaluate, but should carefully be weighed against the long term costs of purchasing a specific brand or type of hardware.
Often vendors will allow for additional years of warranty coverage, or higher levels of support to be purchased when the system is first bought. These are often a wise investment if the hardware will be used longer then the initial warranty period. The increased level of support can often mean that your staff will spend less time supporting the system, and more time working on more beneficial tasks.
Support
The support costs are often the most expensive phases of hardware ownership. The support costs include patching the operating system, supplying power to the system, cooling the system and managing the applications hosted on the system. These costs are amortized over the life of the system, and over time can add up to be the most expensive part of the TCO formula. Often, these support costs are also where the most efficiencies can be gained to lower the TCO of the system.
There are many things that can be done to lower the support costs around hardware, most involve improved processes to cut back the amount of time staff have to spend manually managing each specific server. The most notable of these is automation of patch management. By utilizing tools to automate patch deployments and status monitoring, staff can cut significant manual administration time from each specific server. Proactive monitoring of system and application health can also play an important role in cutting down TCO for hardware and associated services. There are many packages available today to assist system administrators in proactively correcting both hardware and software faults before they cause a failure for the end users. These apps can ensure that staff isolate and correct problems as soon as possible to minimize the necessary time to correct faults.
Utilization is another space where the TCO for your servers can be lowered. By ensuring that servers do not run at idle for long periods of time, you can ensure that any power being used by the servers is being used efficiently. Often times, a single server can handle the load that multiple servers used to handle. It is much more efficient to power and cool a single server then multiple servers in this case, and scales very well as you begin looking at utilization rates across dozens or hundreds of servers.
Refresh
The final phase to evaluate for all hardware purchases is the refresh cycle. All hardware has a finite lifecycle in which at the end it will need to be replaced because it is either obsolete and not cost effective to maintain any longer. Obsolete in this context can be used in two ways, first to mean the hardware is so old is can no longer operate with the current operating systems, tools and patches available, or it no longer meets the business needs of your company.
There are often two methods that companies use to replace aging hardware, the first and most common is just purchasing new hardware when a system is no longer under warranty or has gotten too slow to use in the IT environment. More and more though, firms are implementing a rolling refresh cycle to add a level of predictability to all hardware purchases. A rolling refresh cycle allows a company to more clearly outlay capitol for IT investments, and better plan long term cycles for purchases, upgrades and replacements. Typically, a rolling refresh schedule is based on the standard warranty with newer server hardware, 3-5 years. A rolling refresh cycle also allows IT staff to better plan work loads by knowing ahead of time that new servers will need to be configured, tested and put into production.
Refresh cycle planning should also include an assessment of upcoming technologies and how that will affect purchases two and three years down the line. Every year hardware is faster and faster then before, and provides new possibilities for the amount of data that can be processed. In addition new technologies around virtualization are changing the dynamic of how system administrators provision new systems. No longer do system administrators add a single new server because of a single new application, today many different apps can be run on a single piece of hardware and kept separate from each other by using virtualization technologies.
As you assess your refresh cycle an important part of the TCO calculation is determining what applications can run within virtualized environments, and which will need separate hardware to run on. This will determine what level of consolidation can occur from year to year as the refresh occurs. As you look toward implementing a rolling refresh cycle, a first step is to understand how many existing servers are in place and how many existing applications. That data can then be used to develop a matrix of how things would look if virtualization were employed, and how a rolling refresh cycle could be utilized over time to ensure that all pieces of the infrastructure are upgraded in an expected period of time.
Minimizing the TCO of IT hardware is a key component of ensuring that the long term costs of owning the hardware are predictable and manageable. A rolling refresh cycle, paired with newer technologies like monitoring tools, automation tools and virtualization, can allow IT staff to clearly plan how hardware will be used from purchase to end of life and how it will then be replaced. This cycle ensures staff can plan for future upgrades and migrations, as well as avoid unexpected, last-minute upgrades.
Thursday, August 28, 2008
Defining "the edge"
I was in a planning meeting with a customer recently and we were assessing the customer's security plan. We had two major topics to discuss. The first concerned data management and compliance. The second, and the one we discussed at length, concerned previous policies they had around what they called “the edge”: the previous end of their network and the beginning of the people and systems they could not trust. The discussion went on for a while, with us working towards consensus on how to define “the edge.” I believe we made the right decisions for their needs, but I wanted to continue the discussion here. I imagine most companies have this discussion at some point and will continue to have it as new technologies evolve.
At one time "the edge" of any given network was easy to find; it was the last router between you and the upstream access provider. But today, "the edge" is getting increasingly difficult to find, and this has implications for the fundamentals of Information Technology (IT), including patching and password policies, and for the most complicated of questions, including privacy, monitoring and forensics. Today we have to evaluate many different details to determine where “the edge” truly is. These include PDAs, company laptops with VPN access, employees' home systems, thumb drives, and outside vendors and contractors.
The most important implication of defining what constitutes “the edge” is determining how customers and staff will be able to access servers, services and storage. By clearly defining “the edge” we can then work to define which services will be publicly accessible, and which ones will be restricted by VPN access, firewalls, or other mechanisms. By defining “the edge” we also have a baseline to use when defining policies for information management, information tracking and information retention. These are critical areas in today's world of compliance, where being able to say precisely who accessed and stored what data, and when, is almost a necessity.
When defining “the edge”, I start by listing all possible devices (laptop, desktop, thumb drive, PDA, cell phone, etc.) that an employee or partner could use to access data that is not publicly available. This should be a list of devices currently allowed and technologies that might be adopted. The data in question could include sales presentations, engineering documents, support forums, or any other data that is intentionally kept private to provide a competitive edge in your industry.
Second, I work to list where those devices could possibly be used (office, Starbucks, an employee's home, airport, restaurant, etc.). This is important for understanding what risks those devices carry, including being lost or stolen, or a staff member having a conversation overheard by an outside party. This list should include the associated risks and the likelihood of each occurring at each location. The chance of a desktop system being stolen from the office is relatively low compared to a laptop being stolen at the coffee shop. This does not imply that less security should be used to protect data on office systems, but that different techniques should be employed to do so.
The final component of defining “the edge” is defining appropriate policies for each device based on the risk to the device and its associated data, along with a cost-benefit tradeoff analysis of which devices should be allowed and which should not because of the level of risk they pose. These policies should take into account technologies like full disk encryption, passwords and non-reusable password generators, Virtual Private Network (VPN) technologies, and physical security like cable locks for laptops. Each potential technology is a tool to lower the risk and increase the reward of offering various tools and capabilities to employees.
Ultimately, this is a discussion about which risks are outweighed by their benefits in a business setting. Often staff can gain a significant level of productivity by having access to laptops, PDAs and other mobile devices. The company must weigh that additional productivity against the risk of a company device becoming compromised.
The concept of “the edge” is always going to be present for a company's IT infrastructure. As Web 2.0 and associated architectures grow, the ability to present more and more tools and capabilities to staff is only going to increase. By properly laying the groundwork for how staff securely access these systems, a company can ensure that new tools increase productivity without increasing the risk to the company.
Wednesday, August 27, 2008
Risk Management in HPC
Risk management is a very broad topic within the project management space. It covers planning for and understanding even the most unimaginable of possibilities within a project, so that a plan is in place to respond to these situations and mitigate risk across the project. I will focus on risk management specifically in High Performance Computing (HPC) deployments. HPC, like any other specialty area, has its own specific risks and possibilities. Within HPC these risks are both procedural and technical in nature, but have equal implications for the overall delivery of a successful solution.
Risk management in any project begins with a risk assessment, which includes identifying both risks and possible risk mitigation techniques. This can be done through a variety of methods including brainstorming, the Delphi technique, or referencing internal documentation about similar, previous projects. This initial assessment phase is critical to ensure that both risks and responses are captured. Capturing both of these up front allows for better communication around the known risks, and better preparation for managing unknown risks. The risk assessment will produce a risk matrix, which is the documented list of possible risks to a project and their mitigation or response plans. The risk matrix will become part of the overall project delivery plan.
Risk Matrix
When beginning any HPC project, either an initial deployment or an upgrade, it is important to develop a risk matrix. This can include both known risks (late delivery, poor performance, failed hardware) and unknown risks. The unknown risks category is much more difficult to define for just that reason, but a common approach is to define levels of severity and responses. These responses can include procedural details, escalation details, communication information and documentation about the problem to prevent a recurrence.
This matrix should include a variety of information including:
- Risk name – Should be unique within the company to facilitate communication between groups and departments
- Risk Type – Minimal, Moderate, Severe, Extreme, etc.
- Cost if this risk occurs – This can be in time, money or loss of reputation, or all of the above
- Process to recovery – It is important to document early on how to respond to the risk and correct any problems that have developed because of the risk
- Risk Owner – Often a specific individual has additional experience dealing with a specific risk and can work as a Subject Matter Expert (SME) for the project team
- Outcome documentation – Clearly defining what should be documented should the risk occur so that it can be responded to
- Communication Channels – Different risks require that different staff and management become engaged, so it is important to document who should be involved should a risk occur
- Time Component – Every risk has a response, and every response has a time component associated with it. Understanding these time components up front allows project management staff to adjust schedules accordingly should a risk occur
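One way to keep the matrix consistent between teams is to record each entry in a fixed structure. The sketch below is a minimal example of that idea; the field names mirror the list above, while the `RiskEntry` class, the `schedule_slip` helper and the sample entries are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RiskEntry:
    name: str                   # unique within the company
    risk_type: str              # e.g. "Minimal", "Moderate", "Severe", "Extreme"
    cost: str                   # time, money, reputation, or all of the above
    recovery_process: str       # how to respond and correct resulting problems
    owner: str                  # subject matter expert for this risk
    outcome_documentation: str  # what to record if the risk occurs
    communication_channels: list = field(default_factory=list)
    response_time_days: float = 0.0  # time component of the response


def schedule_slip(matrix, occurred):
    """Total response time, in days, for the named risks that occurred."""
    return sum(e.response_time_days for e in matrix if e.name in occurred)


# Hypothetical entries; every value here is invented for the example.
matrix = [
    RiskEntry("late-delivery", "Moderate", "schedule", "re-plan install dates",
              "project manager", "vendor ship dates", ["customer", "vendor"], 10.0),
    RiskEntry("failed-hardware", "Minimal", "time", "swap in spare nodes",
              "hardware lead", "serial numbers and failure mode", ["vendor"], 2.0),
]

print(schedule_slip(matrix, {"late-delivery", "failed-hardware"}))
```

Summing the time components this way is a simplification, since responses to different risks can overlap, but it gives project management staff a starting point for adjusting the schedule.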
Known Risks
Known risks are often the easiest for people to plan for, but can be very difficult to handle. Understanding and anticipating the risk or problem up front can fool us into believing we know the best response to it, when often the only way to truly understand how to respond to a problem is to do it incorrectly one or more times.
Let's explore some common risks that are specific to HPC deployments, and the most common mitigation strategies to combat them:
Application Scaling
A fundamental premise of HPC is that applications should scale so that more hardware produces more accurate results, more efficient production of data, or both. Because of this, an application is often expected to scale as well on 128 nodes, and often many more, as it does on 64. This type of scalability must be architected into the application as it is written and improved on as hardware performance evolves over time. Every time a newer, faster or bigger cluster is installed, there is an inherent risk that the applications previously used will not properly scale on the new platform.
Often the best mitigation strategy for this risk is proper planning, testing and benchmarking before system deployment. The most difficult time to manage an application scaling problem is after a customer's hardware has been delivered and installed. By benchmarking and testing the application prior to shipment, expectations can be properly set with the customer. It also allows time to work with any development teams to troubleshoot scaling problems and correct them before presenting results and completing acceptance testing with the customer.
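When reviewing benchmark results, a common way to summarize scaling is parallel efficiency: the speedup actually measured divided by the speedup that the added nodes would ideally provide. The sketch below shows that calculation; the run times are invented numbers for a hypothetical application, not measurements from a real system.

```python
def parallel_efficiency(base_nodes, base_seconds, nodes, seconds):
    """Measured speedup relative to the baseline, divided by the ideal speedup."""
    speedup = base_seconds / seconds
    ideal_speedup = nodes / base_nodes
    return speedup / ideal_speedup


# Hypothetical wall-clock times (seconds) for the same job at each node count.
runs = {64: 1000.0, 128: 625.0, 256: 500.0}

for nodes, seconds in runs.items():
    eff = parallel_efficiency(64, runs[64], nodes, seconds)
    print(f"{nodes} nodes: efficiency {eff:.2f}")
```

An efficiency that drops quickly as nodes are added, as in this made-up data, is the kind of result worth discussing with the development team before the hardware ships.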
Facility Limitations
HPC solutions often use large amounts of power, cooling and space within a data center compared to a company's business support systems or database-centric systems. Because the facility needs of HPC are so large, it is very common for customers to underestimate them, or for the numbers to be poorly communicated from a vendor to a customer. The power and cooling requirements can also vary widely based upon the customer's final use and intended application of the cluster.
All facility design issues should be managed and planned for before hardware is shipped or systems are assembled. To ensure a smooth cluster delivery, it is critical that site planning and assessment be done as part of the system design. This site planning should ensure there is enough power, cooling and space to accommodate the cluster. It should additionally ensure the power and cooling are in the proper places and can be directed to the cluster in the recommended fashion.
Mean Time Between Failures (MTBF)
MTBF is a calculation used to estimate how often components across a single node or cluster will fail. It combines the known and designed life cycles of the individual components to estimate the time between component failures. These component failures can be severe enough to impact the whole system, or can affect just portions of a cluster, depending on the cluster's design. Often a completed cluster will fail in unexpected ways because putting large numbers of compute nodes in a single fabric multiplies the opportunities for a component to fail, which shortens the time between failures for the cluster as a whole. If proper redundancy is not built into critical systems of the cluster, a customer satisfaction issue can develop because of prolonged and unplanned outages.
By properly assessing all uptime requirements from the customer, a system can be designed that will provide the uptime necessary to conduct business regardless of the collective MTBF across all components. Each individual service and capability of the cluster should be assessed to ensure that the proper level of redundancy, including clustered nodes, redundant power and redundant disks, is included with the complete solution.
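To see why large clusters fail more often than any single node would suggest, here is a minimal sketch of a collective MTBF estimate. It assumes components fail independently at a constant rate, so their failure rates simply add; real hardware only approximates that. The node count and the per-node MTBF are hypothetical numbers.

```python
def cluster_mtbf_hours(component_mtbfs):
    """Collective MTBF assuming independent, constant failure rates that add."""
    failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs)  # failures per hour
    return 1.0 / failure_rate


# Hypothetical cluster: 1,000 nodes, each with an MTBF of 50,000 hours.
nodes = [50_000.0] * 1000
print(f"{cluster_mtbf_hours(nodes):.1f} hours between node failures")
```

Under these assumptions a node that would run for years on its own becomes, at cluster scale, a failure somewhere every couple of days, which is why redundancy in the critical services matters more than the MTBF of any one part.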
Performance Targets, I/O and Compute
Performance guarantees are often included in HPC proposals to give customers a level of comfort when planning job completion times and organizational capacity. These numbers can become sources of concern as a system is brought online if compute capacity is not as promised or I/O is not operating as fast as expected.
There are often misunderstandings in complete cluster deployments about a cluster's capability for sustained versus peak performance. Sustained performance is most often the number used as a representative test of how the system will perform over its life cycle, whereas peak is the level of performance often stated for bragging rights because it is the theoretical maximum potential of a given cluster.
There is very little that can be done after delivery of a system if this type of risk comes up, other than giving the customer additional hardware to pull the sustained performance number up to the peak performance number. This can be a very expensive response. This is the reason that the staff doing HPC architecture must fully understand application benchmarking and performance when designing new clusters. All numbers should also be reviewed by multiple people to ensure errors in math or testing methodologies do not go unnoticed.
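The gap between peak and sustained numbers, and the cost of closing it with hardware, can be estimated with simple arithmetic. The sketch below is an illustration only: the node count, clock speed, operations per cycle and the measured sustained figure are all hypothetical, and it assumes sustained performance grows linearly with node count, which real applications rarely achieve.

```python
import math


def peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak: every core retiring its maximum operations every cycle."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def extra_nodes_needed(nodes, sustained_gflops, promised_gflops):
    """Additional nodes to reach the promised figure, assuming linear scaling."""
    per_node = sustained_gflops / nodes
    needed = math.ceil(promised_gflops / per_node)
    return max(0, needed - nodes)


# Hypothetical cluster and benchmark result.
peak = peak_gflops(nodes=128, cores_per_node=8, clock_ghz=2.5, flops_per_cycle=4)
sustained = 7680.0  # GFLOPS measured on a representative benchmark (invented)

print(f"peak {peak:.0f} GFLOPS, efficiency {sustained / peak:.0%}")
print(extra_nodes_needed(128, sustained, promised_gflops=peak), "extra nodes")
```

Having a second person repeat a calculation like this during design is a cheap way to catch the math errors mentioned above.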
Unknown Risks
Planning for unknown risks can often be the most stressful, but can yield the most gains when actually responding. This is because there are no prior perceptions to get in the way, and there is room to be very creative with responses and future mitigation strategies. Planning for unknown risks is often an exercise in understanding the levels of severity that could occur with a problem, and associating each with the appropriate level of response and future prevention.
When defining response strategies for unknown risks, often the first step is to define levels of severity that could develop from any given problem. A common list is:
- Most severe level of risk, requires an executive management level response to the customer and has a high percentage cost to the project (greater than 50% of project revenue is at risk).
- Severe level of risk, requires an executive level of response and carries a medium level of financial risk (less than 50% of project revenue is at risk).
- Medium level project risk, requires a senior management response, may or may not have a financial impact on the project, but does have a deliverable and schedule component.
- Lower level risk, has an impact on project schedule, but no negative impact on project financials.
- The lowest level of project risk, often just a communication issue with a customer or a potential misunderstanding. Often no schedule impact or financial impact to the project.
Each level of severity should then have a documented response plan that includes:
- Steps to research and understand the problem.
- Communication channels: who needs to be communicated with for a problem of this magnitude, and how they are communicated with. This needs to include both the customer and company contacts that will be necessary to correct the problem.
- Flow chart for responding: the path to determining the appropriate response and deciding if more resources, either financial or staffing, are needed to correct the risk.
- Documentation to prevent future occurrences. It is important to ensure that any information about the problem is gathered and documented so it can be used in house to prevent future occurrences of the same risk.
- Risk closure document: a checklist to document that all protocol was followed and the risk was corrected. This should include confirmation that the risk will not return on the same project because mitigation techniques have been implemented.
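The severity levels above can be expressed as a simple decision rule so that everyone on a project classifies a new problem the same way. The sketch below is one possible reading, not a standard: the numbering of the levels from 1 to 5, the function name and the inputs are assumptions, any financial exposure is treated as at least level 2 even though the text allows the medium level to carry some, and exactly 50% is treated as level 2 because the text leaves that case undefined.

```python
def severity_level(revenue_at_risk, schedule_impact, deliverable_impact):
    """Map a problem to a level from 1 (most severe) to 5 (lowest).

    revenue_at_risk is the fraction of project revenue exposed, from 0.0 to 1.0.
    """
    if revenue_at_risk > 0.5:
        return 1  # executive management response, high cost to the project
    if revenue_at_risk > 0:
        return 2  # executive response, medium financial risk
    if schedule_impact and deliverable_impact:
        return 3  # senior management response
    if schedule_impact:
        return 4  # schedule slips, project financials unaffected
    return 5      # communication issue or misunderstanding


print(severity_level(0.6, True, True))     # most severe
print(severity_level(0.0, True, False))    # schedule only
print(severity_level(0.0, False, False))   # lowest level
```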
Path Forward
The most important component of risk management is skill and experience development. As a company, you need processes to document all of the experience gained through risk management on your projects. This knowledge must be documented so that other teams, new teams and new staff can learn from the company's previous experience.
The better the job done documenting risk responses and lessons learned, the more efficiently companies can scope future projects. This allows companies to assess costs for future projects much more accurately, as well as the risk versus reward tradeoffs for large, complex projects. Ultimately the best way to manage risk is to understand it before beginning actual deployment and implementation on a project. This comes from combining all of the data collected on previous projects with techniques like brainstorming and the Delphi technique to ensure as many risks as possible are documented with appropriate response plans.