Tuesday, September 18, 2012

Adopting Hadoop in the Enterprise


Apache Hadoop is one of the hottest technologies today, garnering attention from small startups to the largest corporations and government agencies.  Hadoop provides the middleware needed to store and analyze very large data sets, enabling businesses to make better data-driven decisions.

Hadoop grew out of the open source Nutch project and matured at Yahoo, which invested heavily in it and ran it at large scale as the Apache Hadoop project.  Other prominent technology companies, including Facebook, eBay and Nokia, quickly adopted Hadoop and began contributing back to the Hadoop community.

Because of the origin of Hadoop, many of its features and usage models are targeted at web-scale companies, such as Facebook, whose operational models are highly specialized.  These features and design decisions, while worthwhile for web-scale firms, are not always a good fit for the varied operational models used in traditional enterprise IT environments.

Traditional enterprise IT environments are a mix of many different software packages, hardware platforms and storage platforms that must integrate and coexist.  Enterprise environments often have very different requirements for monitoring, lifecycle management and security than single, highly integrated platforms like Facebook and Amazon.

Many web-scale firms have the luxury of building internal monitoring and orchestration frameworks that are tightly coupled across the environment.  Compare that with the fragmented, legacy-laden deployments that are common in enterprise computing shops, and you see that enterprise environments have a unique set of needs when deploying scalable, open source software.

To ensure Hadoop is successful in your enterprise, start by evaluating which features and functionality are a priority for your deployment.  Those priorities then determine the best-fitting Hadoop distribution and any additional tools or custom development required to run Hadoop in a production environment.

Some common areas of consideration for running Hadoop within an integrated enterprise are:
  • Access Controls – With any consolidation of data, access to that data becomes a primary concern for the organization.  Hadoop provides minimal capabilities for limiting access to data, and does not come close to the cell-level granularity that is commonly expected of enterprise software.  Several projects and products associated with Hadoop aim to overcome this category of problem, including Accumulo, Zettaset and Sqrrl.
  • IDM Integration – Integration with outside authentication mechanisms is important to ensure that a user’s identity is tracked across all interconnected applications.  Hadoop can leverage outside systems, including LDAP and Kerberos, for user authentication.
  • Monitoring/Auditing/Alerting – Understanding what is occurring within a Hadoop environment is key to ensuring stability, a managed lifecycle and the ability to act on user feedback.  The tools deployed for managing Hadoop should cover the entire lifecycle of the cluster and provide an integrated view into the users, applications, Hadoop core and hardware, so that administrators can quickly make changes to the environment and assess their impact.
  • Skills & Expertise – Hadoop is a new technology, and as a result the market of job candidates has not caught up to the demand for Hadoop skills.  When developing a team, a two-pronged approach is recommended.  First, give existing staff time to obtain training and hands-on experience with Hadoop and its supporting technologies.  Second, leverage outside consulting expertise to help train and assist the organization as it deploys new technologies.  These two methods balance the need to ensure skills are available in the organization long term with the immediate need to deploy new technologies in a low-risk, proven architecture.
  • Legacy System Connectivity – Hadoop is rarely deployed as a standalone island within IT; more often it is a connection point between other data repositories, BI tools and user access technologies.  When defining Hadoop deployment strategies, it is key to account for the end-to-end flow of data in the organization to ensure the right tools are in place to facilitate this movement and any transformation of data.  Some proven tools for this are Pentaho Data Integration, Informatica and Syncsort.
  • Process Modification – As with any new technology, organizational changes are going to occur around how people execute daily tasks.  As Hadoop is deployed, it is important to plan for process changes across the organization to ensure that value is gained from this new tool and that the new types of information that can be gained from Hadoop help to drive decisions within the organization.
  • User Empowerment – As with any new technology, not all users will be able to utilize Hadoop on day one.  Some users will prefer more graphical interfaces, while others will prefer a software development interface.  As IT departments deploy Hadoop, all user types should be considered to ensure they have access to tools that match their usage models, their skill sets and the flow of their daily jobs.  Some common tools to deploy along with Hadoop are Pentaho, Datameer and Karmasphere.
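
To make the access control gap in the list above concrete, here is a minimal Python sketch contrasting all-or-nothing file-level access with cell-level filtering.  It is only an illustration of the concept: it does not use any Hadoop or Accumulo API, and the Cell type, the label names and both read functions are hypothetical.  Real cell-level systems support richer visibility rules than the simple "user must hold every label" check assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One stored value plus the labels required to see it (hypothetical model)."""
    row: str
    column: str
    value: str
    labels: frozenset  # assumption: a reader must hold ALL of these labels


def read_file_level(user_can_read_file, cells):
    """File-level control: the user sees every cell in the file or none of them."""
    return list(cells) if user_can_read_file else []


def read_cell_level(user_labels, cells):
    """Cell-level control: return only the cells whose labels the user holds."""
    return [c for c in cells if c.labels <= user_labels]


# Illustrative data only; the values and label names are made up.
cells = [
    Cell("cust-1", "name", "Alice", frozenset()),
    Cell("cust-1", "ssn", "000-00-0000", frozenset({"pii"})),
    Cell("cust-1", "balance", "1200", frozenset({"finance"})),
]

analyst_labels = frozenset({"finance"})
visible = read_cell_level(analyst_labels, cells)
print([c.column for c in visible])
```

With file-level control, the analyst in this sketch would either see the ssn column along with everything else or be locked out of the balance column too; the cell-level filter returns the unlabeled and finance-labeled cells and withholds the one labeled pii.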


Hadoop is not easy.  That is a function of both the new capabilities on the market, which are still maturing, and the flexibility that gives Hadoop its power.  Both can be overcome with careful planning, slow, methodical rollouts and an upfront investment in expertise to assist and drive Hadoop deployments in your environment.
