Apache Hadoop is one of the hottest technologies today, garnering
attention from small startups to the largest corporations and government
agencies. Hadoop provides the middleware needed to store and analyze very large
data sets, enabling businesses to make better data-driven decisions.
Hadoop started as an internal project at Yahoo and was
eventually released as an open source project called Apache Hadoop. Other prominent technology companies
including Facebook, eBay and Nokia quickly adopted Hadoop and began
contributing back to the Hadoop community.
Because of these origins, many of Hadoop's features and usage models are
targeted at web-scale companies such as Facebook, with their highly
differentiated operational models. These features and design decisions, while
worthwhile for web-scale firms, are not always a good fit for the varied
operational models used in traditional enterprise IT environments.
Traditional enterprise IT environments are characterized by a mix of many
different software packages, hardware platforms and storage platforms that must
integrate and coexist. These environments often have very different
requirements for monitoring, lifecycle management and security than single,
highly integrated platforms like Facebook and Amazon.
Many web-scale firms have the luxury of building out internal monitoring and
orchestration frameworks that are tightly coupled across the environment.
Compare that with the fragmented, legacy-laden deployments common in enterprise
computing shops, and it becomes clear that enterprise environments have a
unique set of needs when deploying scalable, open source software.
To ensure Hadoop is successful in your enterprise, start by evaluating which
features and functionality are a priority for your deployment; that evaluation
can then be used to determine the optimal Hadoop distribution and the
additional tools or custom development that will be required for a production
deployment.
Some common areas of consideration for running Hadoop within
an integrated enterprise are:
- Access Controls – With any consolidation of data, access to that data becomes
a primary concern for the organization. Hadoop provides only minimal
capabilities for restricting access to data, and does not come close to the
cell-level granularity commonly expected of enterprise software (the security
sketch after this list shows the file-level permissions Hadoop does provide).
Several projects associated with Hadoop aim to close this gap, including
Apache Accumulo, Zettaset and Sqrrl.
- IDM Integration – Integration with outside authentication mechanisms is
important to ensure that a user's identity is tracked across all interconnected
applications. Hadoop can leverage external systems, including LDAP and
Kerberos, for user authentication, as illustrated in the security sketch after
this list.
- Monitoring/Auditing/Alerting – Understanding what is occurring within a
Hadoop environment is key to ensuring stability, a managed lifecycle and the
ability to act on user feedback. The tools deployed for managing Hadoop should
cover the entire lifecycle of the cluster and provide an integrated view of the
users, applications, Hadoop core and hardware, enabling administrators to
quickly make changes to the environment and assess their impact (a small
example of reading Hadoop's built-in metrics endpoint follows this list).
- Skills & Expertise – Hadoop is a new
technology and as a result the market of job candidates has not caught up to
the demand for Hadoop skills. When
developing a team a two-angle approach is recommended. First, enable existing staff time to obtain
training and hands on experience with Hadoop and it’s supporting
technologies. Second, leverage outside
consulting expertise to help train, and assist the organization as they deploy
new technologies. These two methods
balance the need to ensure skills are available in the organization long term,
with the immediate need to deploy new technologies in a low-risk, proven
architecture.
- Legacy System Connectivity – Hadoop is rarely deployed as a standalone island
within IT; more often it is a connection point between other data repositories,
BI tools and user-access technologies. When defining Hadoop deployment
strategies, it is key to account for the end-to-end flow of data in the
organization and to ensure the right tools are in place to facilitate this
movement and any transformation of the data (a minimal JDBC-to-HDFS sketch
follows this list). Some proven tools for this are Pentaho Data Integration,
Informatica and Syncsort.
- Process Modification – As with any new technology, organizational changes
will occur around how people execute daily tasks. As Hadoop is deployed, it is
important to plan for process changes across the organization so that value is
realized from the new tool and the new types of information Hadoop yields help
drive decisions within the organization.
- User Empowerment – As with any new technology, not all users will be able to
utilize Hadoop on day one. Some users will prefer graphical interfaces, while
others will prefer a software development interface. As IT departments deploy
Hadoop, all user types should be considered to ensure they have access to tools
that match their usage models, their skill sets and the flow of their daily
jobs. Some common tools to deploy along with Hadoop are Pentaho, Datameer and
Karmasphere.
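
To make the access-control and IDM points concrete, the sketch below shows the building blocks Hadoop does ship with: Kerberos authentication through UserGroupInformation and POSIX-style file and directory permissions in HDFS. It is a minimal illustration, not a hardened setup; the principal name, keytab path and directory are hypothetical, and a real deployment would set these properties in core-site.xml on every node rather than in application code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.security.UserGroupInformation;

public class SecuritySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Tell the Hadoop client to authenticate with Kerberos instead of the
        // default "simple" (trust the OS username) mode, and to enforce
        // service-level authorization. In production these normally live in
        // core-site.xml rather than being set programmatically.
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");
        UserGroupInformation.setConfiguration(conf);

        // Log in from a keytab; the principal and keytab path are placeholders.
        UserGroupInformation.loginUserFromKeytab(
                "etl-svc@EXAMPLE.COM", "/etc/security/keytabs/etl-svc.keytab");

        // Hadoop's built-in authorization is file/directory level, similar to
        // POSIX permissions (owner, group, other). Here a hypothetical
        // /data/finance tree is restricted to its owner and group.
        FileSystem fs = FileSystem.get(conf);
        fs.setPermission(new Path("/data/finance"),
                new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
    }
}
```

Note that this is exactly the gap the projects named above try to close: permissions stop at the file level, so anything finer (cell- or column-level control) requires an additional layer.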
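On the monitoring point, one concrete starting place is the JSON metrics servlet that Hadoop daemons expose at /jmx; most monitoring tools ultimately read the same data. The sketch below simply pulls one NameNode bean and prints the raw response. The host, port and bean name are assumptions that vary by version and configuration; a real integration would parse the JSON and feed selected values into whatever alerting system the enterprise already runs.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class JmxPoller {
    public static void main(String[] args) throws Exception {
        // NameNode web host/port are placeholders; 50070 is a common default.
        // The qry parameter narrows the response to a single MBean instead of
        // dumping every metric the daemon exposes.
        URL url = new URL("http://namenode.example.com:50070/jmx"
                + "?qry=Hadoop:service=NameNode,name=FSNamesystemState");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // raw JSON metrics for the bean
            }
        }
    }
}
```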
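Finally, for legacy system connectivity, the integration tools named above (and open source options such as Apache Sqoop) ultimately do something like the sketch below: read rows from an existing relational system over JDBC and land them in HDFS where Hadoop jobs can reach them. The connection string, credentials, table and target path are hypothetical, and real pipelines add batching, schema handling and error handling.

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JdbcToHdfs {
    public static void main(String[] args) throws Exception {
        // Source: an existing relational system reached over plain JDBC.
        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://legacy-db.example.com/sales", "etl", "secret");
             Statement stmt = db.createStatement();
             ResultSet rows = stmt.executeQuery(
                     "SELECT order_id, customer_id, amount FROM orders")) {

            // Target: a CSV file in HDFS, written through the FileSystem API.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            try (PrintWriter out = new PrintWriter(
                     fs.create(new Path("/data/landing/orders.csv")))) {
                while (rows.next()) {
                    out.println(rows.getLong("order_id") + ","
                            + rows.getLong("customer_id") + ","
                            + rows.getBigDecimal("amount"));
                }
            }
        }
    }
}
```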
Hadoop is not easy. That is a function both of capabilities on the market that
are still maturing and of the flexibility that gives Hadoop its power. Both can
be overcome with careful planning, slow, methodical rollouts and an upfront
investment in the expertise needed to assist and drive Hadoop deployments in
your environment.