Apache Hadoop is one of the hottest technologies today, garnering
attention from small startups to the largest corporations and government
agencies. Hadoop provides the middleware
needed to store and analyze very large data sets, enabling businesses to
make better data-driven decisions.
Hadoop started as an internal project at Yahoo and was eventually released as an open source project called Apache Hadoop. Other prominent technology companies, including Facebook, eBay, and Nokia, quickly adopted Hadoop and began contributing back to the Hadoop community.
Because of these origins, many of Hadoop's features and usage
models are targeted at web-scale companies with highly specialized
operational models, such as Facebook. These
features and design decisions, while worthwhile for web-scale firms, are
not always a good fit for the varied operational models used in
traditional enterprise IT environments.
Traditional enterprise IT environments are characterized by
a mix of many different software packages, hardware platforms, and
storage platforms that must integrate and coexist. Enterprise environments often have very
different requirements for monitoring, lifecycle management, and security than
single, highly integrated platforms like Facebook's and Amazon's.
Many web-scale firms have the luxury of building out
internal monitoring and orchestration frameworks that are tightly coupled
across the environment. Compare that
with the fractured, legacy-laden deployments that are common in
enterprise computing shops, and it becomes clear that enterprise environments
have a unique set of needs when deploying scalable, open source software.
To ensure Hadoop is successful in your enterprise, start by
evaluating which features and functionality are a priority for
your deployment; that assessment can then be used to determine the optimal Hadoop distribution,
additional tools, or custom development required to run Hadoop
in a production environment.
Some common areas of consideration for running Hadoop within
an integrated enterprise are:
- Access Controls – With any consolidation of data, access to that data becomes a primary concern for the organization. Hadoop provides only minimal capabilities for limiting access to data, and it does not come close to the cell-level granularity commonly expected of enterprise software. Several projects associated with Hadoop, including Accumulo, Zettaset, and Sqrrl, aim to address this class of problem. A sketch of Hadoop's built-in, file-level permissions appears after this list.
- IDM Integration – Integration with outside authentication mechanisms is important to ensure that a user's identity is tracked across all interconnected applications. Hadoop can leverage outside systems, including LDAP and Kerberos, for user authentication; a second sketch after this list shows a typical Kerberos login from a Java client.
- Monitoring/Auditing/Alerting – Understanding what is occurring within a Hadoop environment is key to ensuring stability, a managed lifecycle, and the ability to act on user feedback. The tools deployed for managing Hadoop should encompass the entire lifecycle of the cluster and provide an integrated view of the users, applications, Hadoop core, and hardware, enabling administrators to quickly make changes to the environment and assess their impact.
- Skills & Expertise – Hadoop is a new technology and as a result the market of job candidates has not caught up to the demand for Hadoop skills. When developing a team a two-angle approach is recommended. First, enable existing staff time to obtain training and hands on experience with Hadoop and it’s supporting technologies. Second, leverage outside consulting expertise to help train, and assist the organization as they deploy new technologies. These two methods balance the need to ensure skills are available in the organization long term, with the immediate need to deploy new technologies in a low-risk, proven architecture.
- Legacy System Connectivity – Hadoop is rarely deployed as a standalone island within IT; more often it is a connection point between other data repositories, BI tools, and user access technologies. When defining a Hadoop deployment strategy, it is key to account for the end-to-end flow of data in the organization, ensuring the right tools are in place to facilitate this movement and any transformation of the data. Some proven tools for this are Pentaho Data Integration, Informatica, and Syncsort.
- Process Modification – As with any new technology, organizational changes will occur around how people execute daily tasks. As Hadoop is deployed, it is important to plan for process changes across the organization so that value is gained from the new tool and so that the new types of information Hadoop surfaces help drive decisions within the organization.
- User Empowerment – As with any new technology, not all users will be able to utilize Hadoop on day one. Some users will prefer graphical interfaces, while others will prefer a software development interface. As IT departments deploy Hadoop, all user types should be considered to ensure they have access to tools that fit their usage models, their skill sets, and the flow of their daily jobs. Some common tools to deploy alongside Hadoop are Pentaho, Datameer, and Karmasphere.
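To illustrate the access-control point above, the following is a minimal sketch, in Java, of what Hadoop's built-in, file-level controls look like: setting an owner, group, and POSIX-style permission bits on an HDFS directory through the FileSystem API. The path, owner, and group names are hypothetical placeholders; anything finer-grained than this would require a layer such as Accumulo.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsAccessSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical dataset path, owner, and group.
            Path dataset = new Path("/data/finance/transactions");

            // HDFS exposes POSIX-style owner/group/other permissions only;
            // cell-level access control is outside its scope.
            fs.setOwner(dataset, "etl", "finance");
            fs.setPermission(dataset, new FsPermission("750"));

            fs.close();
        }
    }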
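For the IDM integration point, the sketch below assumes a cluster that has already been secured with Kerberos and a principal and keytab provisioned by the identity team; it shows the standard UserGroupInformation login that Java clients use to authenticate. The principal name and keytab path are hypothetical placeholders, and the two configuration settings would normally live in core-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Normally set in core-site.xml on a secured cluster; set here only
            // to keep the example self-contained.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("hadoop.security.authorization", "true");
            UserGroupInformation.setConfiguration(conf);

            // Placeholder principal and keytab path.
            UserGroupInformation.loginUserFromKeytab(
                    "etl-user@EXAMPLE.COM", "/etc/security/keytabs/etl-user.keytab");

            System.out.println("Authenticated as: " + UserGroupInformation.getLoginUser());
        }
    }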
Hadoop is not easy.
That is a function both of new capabilities on the market that are still
maturing and of the flexibility that gives Hadoop its power. Both can be overcome by
careful planning, slow, methodical rollouts, and an upfront investment in the
expertise to assist and drive Hadoop deployments in your environment.