Thursday, November 29, 2012

Big Data, Analytics and Hadoop


Three of the biggest buzzwords in technology today are Big Data, Hadoop and Analytics.  Often they are used interchangeably, when in reality they are three distinct capabilities and categories.  Each solves a different, but related, set of challenges within IT and the businesses that leverage IT for unique advantages.

Start with Big Data.  Big Data is a term we hear more and more often, describing the growing struggle at many companies to cope with increasing volumes of data and changing varieties of data.  There is no clear threshold where Big Data begins; rather, it is the point of dramatic change at which an organization must look at new tools, technologies, processes and skill sets to address the challenges it faces.  Big Data is focused on the efficient storage, movement and organization of these evolving data types.

Analytics is related to Big Data, but with a different emphasis.  Whereas Big Data focuses on the data itself and the infrastructure to manage that data, Analytics is about putting the data to use and enhancing the capabilities of the organization.  Analytics is about taking that data and turning it into actionable information that the company can execute on.

Finally, there is Hadoop, one of the most talked-about technologies today.  Hadoop is a technology that can be used to enable organizations to manage their Big Data while building a platform for more advanced analytic capabilities.  It is a rapidly growing open source project that has been adopted by many mainstream organizations as a standard method for the storage and processing of complex data sets.  Hadoop is one of many tools that you can use to enable Big Data and begin to implement an Analytics strategy.
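
To make the processing side concrete, below is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API.  The input and output paths come from the command line, and the class names are illustrative rather than taken from any particular deployment.

// Minimal word-count sketch using Hadoop's MapReduce Java API.
// The mapper emits (word, 1) pairs and the reducer sums them per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}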

Tuesday, September 18, 2012

Adopting Hadoop in the Enterprise


Apache Hadoop is one of the hottest technologies today, garnering attention from small startups to the largest corporations and government agencies.  Hadoop provides the middleware needed to store and analyze very large data sets, enabling businesses to make better data-driven decisions.

Hadoop started as an internal project at Yahoo and was eventually released as an open source project called Apache Hadoop.  Other prominent technology companies including Facebook, eBay and Nokia quickly adopted Hadoop and began contributing back to the Hadoop community.

Because of Hadoop's origins, many of its features and usage models are targeted at web-scale companies with highly differentiated operational models, like Facebook.  These features and design decisions, while worthwhile for those web-scale firms, are not always a good fit for the varied operational models used in traditional enterprise IT environments.

Traditional enterprise IT environments are characterized by a mix of many different software packages, hardware platforms and storage platforms that must integrate and coexist.  Enterprise environments often have much different requirements for monitoring, lifecycle management and security than single, highly integrated platforms like Facebook and Amazon.

Many web-scale firms have the luxury of building out internal monitoring and orchestration frameworks that are tightly coupled across the environment.  Compare that with the often fractured, legacy-laden deployments common in enterprise computing shops, and you see that enterprise computing environments have a unique set of needs when deploying scalable, open source software.

To ensure Hadoop is successful in your enterprise, start by evaluating which features and functionality are priorities for your deployment; that evaluation can then be used to determine the optimal Hadoop distribution, additional tools or custom development required to run Hadoop in a production environment.

Some common areas of consideration for running Hadoop within an integrated enterprise are:
  • Access Controls – With any consolidation of data, access to that data becomes a primary concern for the organization.  Hadoop provides only minimal capabilities for limiting access to data, and does not come close to the cell-level granularity commonly expected of enterprise software.  Several projects associated with Hadoop look to overcome this category of problem, including Accumulo, Zettaset and Sqrrl.
  • IDM Integration – Integration with outside authentication mechanisms is important to ensure that a user's identity is tracked across all interconnected applications.  Hadoop has the ability to leverage outside systems, including LDAP and Kerberos, for user authentication (a minimal configuration sketch follows this list).
  • Monitoring/Auditing/Alerting – Understanding what is occurring within a Hadoop environment is key to ensuring stability, a managed lifecycle and the ability to act on user feedback.  The tools deployed for managing Hadoop should cover the entire lifecycle of the cluster and provide an integrated view of the users, applications, Hadoop core and hardware, enabling administrators to quickly make changes to the environment and assess their impact.
  • Skills & Expertise – Hadoop is a new technology, and as a result the market of job candidates has not caught up to the demand for Hadoop skills.  When developing a team, a two-angle approach is recommended.  First, give existing staff time to obtain training and hands-on experience with Hadoop and its supporting technologies.  Second, leverage outside consulting expertise to help train and assist the organization as it deploys new technologies.  These two methods balance the need to ensure skills are available in the organization long term with the immediate need to deploy new technologies in a low-risk, proven architecture.
  • Legacy System Connectivity – Hadoop is rarely deployed as a standalone island within IT; more often it is a connection point between other data repositories, BI tools and user access technologies.  When defining a Hadoop deployment strategy, it is key to account for the end-to-end flow of data in the organization and to ensure the right tools are in place to facilitate this movement and any transformation of the data.  Some proven tools for this are Pentaho Data Integration, Informatica and Syncsort.
  • Process Modification – As with any new technology, organizational changes are going to occur around how people execute daily tasks.  As Hadoop is deployed, it is important to plan for process changes across the organization to ensure that value is realized from this new tool and that the new types of information Hadoop provides help to drive decisions within the organization.
  • User Empowerment – As with any new technology, not all users will be able to utilize Hadoop on day one.  Some users will prefer more graphical interfaces, while others will prefer a software development interface.  As IT departments deploy Hadoop, all user types should be considered to ensure they have access to tools that meet their usage models, their skill sets and the flow of their daily jobs.  Some common tools to deploy along with Hadoop are Pentaho, Datameer and Karmasphere.
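
As a concrete illustration of the IDM Integration point above, the sketch below shows a Java client authenticating to a Kerberos-secured cluster through Hadoop's UserGroupInformation API.  The principal and keytab path are hypothetical placeholders, and the property values assume the cluster has already been set up for Kerberos; treat it as a starting point rather than a drop-in implementation.

// Illustrative sketch of Kerberos authentication against a secured Hadoop cluster.
// The principal and keytab path below are placeholders, not values from a real deployment.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClientLogin {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // These properties mirror what would normally live in core-site.xml
    // on a cluster with Kerberos enabled.
    conf.set("hadoop.security.authentication", "kerberos");
    conf.set("hadoop.security.authorization", "true");

    UserGroupInformation.setConfiguration(conf);

    // Log in with a keytab rather than an interactive password, so the
    // user's identity flows through to HDFS and MapReduce audit trails.
    UserGroupInformation.loginUserFromKeytab(
        "etl-service@EXAMPLE.COM",           // hypothetical Kerberos principal
        "/etc/security/keytabs/etl.keytab"); // hypothetical keytab location

    System.out.println("Authenticated as: "
        + UserGroupInformation.getCurrentUser().getUserName());
  }
}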


Hadoop is not easy.  That is a function both of new capabilities on the market that are still maturing and of the flexibility that gives Hadoop its power.  Both can be overcome with careful planning, slow, methodical rollouts and an upfront investment in the expertise to assist and drive Hadoop deployments in your environment.

Tuesday, September 11, 2012

Types of Analytic Users


Analytics is becoming more important for today's companies.  As companies look to make data-driven decisions, it is paramount that the data driving those decisions is accessible to all the necessary users.  Each user has a unique set of needs based on their role within the organization, their skill set and their knowledge of corporate decision-making processes.

Analytical environments should be built to enable a variety of user types from a single set of data and resources.  Each user should be enabled with an interface that meets their individual job needs, and integrates with any necessary business processes.

Within Analytics, we see three primary types of users:

  • Empowered – Empowered users are those who gain real value from analytic technologies and tools, but who neither know about the underlying technology nor focus on it in their primary role.  These users are typically experts in a business process and the inputs and outputs of that process.  They commonly use a tool that presents the data in a pre-packaged format, while allowing them the flexibility to modify reports to meet their specific process needs.
  • Aware – Aware users are educated about the underlying technology and understand its power, but are not responsible for the daily operations of the analytic environment.  Aware users commonly consume analytical technologies, have a background in computer science or information technology, and focus on creating tools for Empowered users to consume.  They are the connection point between information technology operations and the Empowered users who consume the data stored within analytical environments.
  • Enabled – Enabled users are the staff who deeply understand the workings of the analytical environment and are responsible for its operations.  Enabled users are commonly deeply technical staff with some understanding of the business processes, but they defer deep process and data-value questions to Aware users.


When building analytic environments and deploying new technologies, it is key to map those technologies to the user types and ensure all users have the tools they require for the job they perform.

Saturday, August 18, 2012

Vectors of Information Security


Within the realm of information security, a lot of focus is paid to the vectors of attack: essentially, how an attacker can go after your networks, systems, people and information. These vectors describe how attacks can occur and how to detect and respond to them. But they address only part of the challenge in securing today's complex information technology (IT) environments.
Vectors of Information Security (VIS) start with a definition of what behavior is allowed and then monitor and react to anything outside of that defined criteria. Most information security policies are stated in the form of “Administrators will deny access to those not allowed.” Under VIS, we instead say that “Active employees are allowed access” and respond to all access outside that definition. This is a variation on security models built around denying access, and it is a change in mindset for many security professionals.
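
As an illustrative sketch only (the employee identifiers and resource names are hypothetical), the allow-list mindset can be expressed in a few lines of Java: enumerate who is permitted, grant only those requests, and treat everything else as an event to log and respond to.

// Illustrative allow-list access check: permit only defined identities,
// and treat every other request as an event to log and respond to.
import java.util.Set;

public class AllowListAccessCheck {

  // In practice this list would come from an identity store such as LDAP;
  // here it is a hard-coded placeholder.
  private static final Set<String> ACTIVE_EMPLOYEES = Set.of("asmith", "bjones");

  public static boolean requestAccess(String userId, String resource) {
    if (ACTIVE_EMPLOYEES.contains(userId)) {
      System.out.println("ALLOW: " + userId + " -> " + resource);
      return true;
    }
    // Anything outside the defined criteria is denied and flagged for response.
    System.out.println("DENY AND ALERT: " + userId + " -> " + resource);
    return false;
  }

  public static void main(String[] args) {
    requestAccess("asmith", "hr-reports");      // allowed: active employee
    requestAccess("contractor7", "hr-reports"); // denied and alerted: not on the allow list
  }
}
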
More critical than the individual vectors of attack are the overarching Vectors of Information Security. These correlate to the overall usage of information and allow Architects, Administrators and IT Leadership to plan accordingly for information access and risk management around expected usage patterns. The three Vectors of Information Security are:
  • Paths of access – This category focuses on all the tools, technologies and applications that allow access to a corporation’s data. This includes both data in transit and data at rest.
  • Paths of change – This avenue is for documenting and understanding how information changes; information can include access logs, configurations, customer information and financial information, just to name a few.
  • Paths of risk – This is the category that the vectors of attack become part of. Paths of risk capture the likelihood that an unknown, unacceptable or unanticipated event will occur, and the associated cost to the organization of the incident.

Information security is about risk management and mitigation. The Vectors of Information Security enable organizations to outline clear policies for understanding, managing and responding to the risk that is inherent in today's interconnected systems.

Tuesday, August 7, 2012

Understanding Security versus Compliance


I have been on a variety of projects over the years that mixed the use of the terms security and compliance, often using them interchangeably. While some implementation details are common to both, the end goals of security and compliance are very different.
I have worked in a variety of environments that combine security and compliance from a functional and operational standpoint, and while this often makes sense from a resource perspective, it is critical to ensure staff understand the differences between security and compliance. Plainly put:
  • Security is about ensuring that only those authorized can obtain access to resources, and that there are mechanisms in place to alert when events outside the norm occur.
  • Compliance is about ensuring that implementation and operation of the environment follows all corporate policies, industry standards, and regulations; and that exceptions are clearly documented.

The short answer is that you can be compliant and non-secure; you can also be secure but non-compliant. This is the tension that must be balanced by the staff tasked with implementing and monitoring corporate policies. When planning your corporate security standards, it is important to ensure that compliance teams have a seat at the table, and vice versa for compliance planning. This cross-team support will ensure alignment, avoid duplication of effort and build an understanding of what each team is trying to accomplish.
The best case is that a company's compliance policies mirror the IT policies for security and access controls; this ensures that monitoring and implementation of the environment is as lightweight for staff as possible. By ensuring a level of consistency in the security implementation, compliance with regulations can be demonstrated more quickly, with fewer resources and with less rework of the environment as requirements and policies evolve.