Tuesday, September 18, 2012

Adopting Hadoop in the Enterprise


Apache Hadoop is one of the hottest technologies today, garnering attention from small startups to the largest corporations and government agencies.  Hadoop provides the middleware needed to store and analyze very large data sets, enabling businesses to make better data-driven decisions.

Hadoop grew out of heavy early investment at Yahoo and matured into the open source Apache Hadoop project.  Other prominent technology companies, including Facebook, eBay and Nokia, quickly adopted Hadoop and began contributing back to the Hadoop community.

Because of these origins, many of Hadoop's features and usage models are targeted at web-scale companies with highly differentiated operational models, such as Facebook.  These features and design decisions, while worthwhile for web-scale firms, are not always a good fit for the varied operational models used in traditional enterprise IT environments.

Traditional enterprise IT environments are characterized by a mix of many different software packages, hardware platforms and storage platforms that must integrate and coexist.  Enterprise environments often have very different requirements for monitoring, lifecycle management and security than single, highly integrated platforms like those at Facebook and Amazon.

Many web-scale firms have the luxury of building internal monitoring and orchestration frameworks that are tightly coupled across the environment.  Compare that with the fragmented, legacy-laden deployments common in enterprise computing shops, and it becomes clear that enterprise environments have a unique set of needs when deploying scalable, open source software.

To ensure Hadoop is successful in your enterprise, start by evaluating which features and functionality are priorities for your deployment; that evaluation can then be used to determine the optimal Hadoop distribution, additional tools and any custom development required to run Hadoop in production.

Some common areas of consideration for running Hadoop within an integrated enterprise are:
  • Access Controls – With any consolidation of data, access to that data becomes a primary concern for the organization.  Hadoop provides only minimal capabilities for limiting access to data, and comes nowhere near the cell-level granularity commonly expected of enterprise software.  Several projects in the Hadoop ecosystem aim to close this gap, including Accumulo, Zettaset and Sqrrl.  A minimal sketch of the file-level permissions Hadoop does provide appears after this list.
  • IDM Integration – Integration with outside authentication mechanisms is important to ensure that a user's identity is tracked across all interconnected applications.  Hadoop can leverage outside systems, including LDAP and Kerberos, for user authentication; a short Kerberos login sketch also appears after this list.
  • Monitoring/Auditing/Alerting – Understanding what is occurring within a Hadoop environment is key to ensuring stability, a managed lifecycle and the ability to act on user feedback.  The tools deployed for managing Hadoop should encompass the entire lifecycle of the cluster and provide an integrated view of the users, applications, Hadoop core and hardware, enabling administrators to quickly make changes to the environment and assess their impact.
  • Skills & Expertise – Hadoop is a new technology, and as a result the market of job candidates has not caught up to the demand for Hadoop skills.  When developing a team, a two-pronged approach is recommended.  First, give existing staff time to obtain training and hands-on experience with Hadoop and its supporting technologies.  Second, leverage outside consulting expertise to help train staff and assist the organization as it deploys new technologies.  These two methods balance the need to ensure skills are available in the organization long term with the immediate need to deploy new technologies in a low-risk, proven architecture.
  • Legacy System Connectivity – Hadoop is rarely deployed as a standalone island within IT; more often it is a connection point between other data repositories, BI tools and user-access technologies.  When defining a Hadoop deployment strategy, it is key to account for the end-to-end flow of data in the organization and to ensure the right tools are in place to facilitate that movement and any transformation of the data.  Some proven tools for this are Pentaho Data Integration, Informatica and Syncsort.
  • Process Modification – As with any new technology, organizational changes will occur around how people execute daily tasks.  As Hadoop is deployed, it is important to plan for process changes across the organization so that value is gained from the new tool and the new types of information Hadoop can surface help to drive decisions within the organization.
  • User Empowerment – As with any new technology, not all users will be able to utilize Hadoop on day one.  Some users will prefer more graphical interfaces, while others will prefer a software development interface.  As IT departments deploy Hadoop, all user types should be considered to ensure they have access to tools that match their usage models, their skill sets and the flow of their daily jobs.  Some common tools to deploy along with Hadoop are Pentaho, Datameer and Karmasphere.
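
To make the access-control point above concrete, here is a minimal sketch using Hadoop's Java FileSystem API to apply coarse, POSIX-style owner/group permissions to a directory in HDFS.  The path, service account and group names are hypothetical, and this directory- and file-level model is roughly the extent of what stock Hadoop offers, which is why projects like Accumulo pursue finer-grained control.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsAccessControlSketch {
        public static void main(String[] args) throws Exception {
            // Cluster settings are picked up from core-site.xml / hdfs-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical directory, service account and group, used only for illustration.
            Path financeData = new Path("/data/finance");

            // Owner gets rwx, group gets r-x, everyone else gets nothing (mode 750).
            // File- and directory-level permissions are the granularity HDFS provides;
            // there is no column- or cell-level control at this layer.
            fs.setPermission(financeData,
                    new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));

            // Changing ownership requires HDFS superuser privileges.
            fs.setOwner(financeData, "finance-svc", "finance-analysts");
        }
    }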
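
Similarly, a minimal sketch of the Kerberos integration mentioned under IDM Integration, assuming the cluster is already secured and a keytab exists for a service principal; the property values, principal name and keytab path are illustrative, not prescriptive.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
        public static void main(String[] args) throws Exception {
            // These properties mirror what would normally live in core-site.xml on a
            // secured cluster; they are set here only to keep the sketch self-contained.
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("hadoop.security.authorization", "true");
            UserGroupInformation.setConfiguration(conf);

            // Hypothetical service principal and keytab path, used only for illustration.
            UserGroupInformation.loginUserFromKeytab(
                    "svc-etl@EXAMPLE.COM",
                    "/etc/security/keytabs/svc-etl.keytab");

            // Subsequent HDFS calls run as the authenticated principal.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Authenticated as: " + UserGroupInformation.getLoginUser().getUserName());
            System.out.println("Home directory: " + fs.getHomeDirectory());
        }
    }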


Hadoop is not easy.  That is a function both of new, still-maturing capabilities on the market and of the flexibility that gives Hadoop its power.  Both can be overcome with careful planning, slow, methodical rollouts and an upfront investment in the expertise to assist and drive Hadoop deployments in your environment.

Tuesday, September 11, 2012

Types of Analytic Users


Analytics is becoming more important for today's companies.  As companies look to make data-driven decisions, it is paramount that the data driving those decisions is accessible to all the necessary users.  Each user has a unique set of needs based on their role within the organization, their skill set and their knowledge of corporate decision-making processes.

Analytical environments should be built to enable a variety of user types from a single set of data and resources.  Each user should be given an interface that meets their individual job needs and integrates with any necessary business processes.

Within Analytics, we see three primary types of users:

  • Empowered – Empowered users are those who gain real value from analytic technologies and tools but have little knowledge of the underlying technology, nor does their primary role focus on it.  These users are typically experts in a business process and the inputs and outputs of that process.  They commonly use a tool that presents data in a pre-packaged format while allowing them the flexibility to modify reports to meet their specific process needs.
  • Aware – Aware users are educated about the underlying technology and understand its power, but are not responsible for the daily operations of the analytic environment.  Aware users commonly consume analytical technologies, have a background in computer science or information technology, and focus on creating tools for Empowered users to consume.  Aware users are the connection point between information technology operations and the Empowered users who consume the data stored within analytical environments.
  • Enabled – Enabled users are the staff who deeply understand the workings of the analytical environment and are responsible for its operations.  Enabled users will commonly be deeply technical staff with some understanding of the business processes, but they defer deep process and data-value questions to Aware users.


When building analytic environments and deploying new technologies, it is key to map those technologies to these user types and ensure all users have the tools they require for the jobs they perform.