I speak to a lot of customers, and a common question is, “Why do I need Hadoop?” The question often comes with additions like “I already have a data warehouse” or “I have lots of databases, what is one more?” These questions are valid; no IT department wants to keep adding tools without understanding the value they will bring. But Apache Hadoop serves a very different set of uses than traditional relational databases or specialized solutions like data warehouses. Most of the customers I speak with have heard of Hadoop, an overused buzzword these days, and they want to understand not only where Hadoop fits in their environment (if it does), but also whether it overlaps with tools or technology they already have.
There are already a multitude of tools on the market that store data, make it accessible, and let modeling and business intelligence tools make sense of it. These come in the form of relational databases, modeling tools, network-attached storage (NAS), and data warehouse appliances, to name a few. Each tool in this set is good for specific use cases.
Relational databases are excellent at ensuring data integrity while balancing reads and writes and providing consistent, high-speed access, and today’s relational databases scale quite well for structured data. They are, however, difficult to optimize for heavily read-biased environments, and they struggle to hold large amounts of unstructured data because a schema must be defined before data can be inserted or used.
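To see the schema-first constraint concretely, here is a toy sketch using Python’s built-in `sqlite3` module (the `events` table and its columns are hypothetical, chosen just for illustration): a record that does not match the predefined schema is rejected at insert time, which is exactly what makes loose, unstructured data awkward for a relational store.

```python
import sqlite3

# Schema-on-write: the table's shape must be declared before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, message TEXT)")

# A record matching the schema inserts cleanly.
conn.execute("INSERT INTO events VALUES (1, 'login')")

# A record with an extra field is rejected outright, rather than stored as-is.
try:
    conn.execute("INSERT INTO events VALUES (2, 'logout', 'extra-field')")
except sqlite3.OperationalError as err:
    print("rejected:", err)
```

The same rigidity that rejects the malformed row is what guarantees integrity for structured data, so this is a trade-off, not a flaw.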
NAS offerings are great as a low-cost, generally accessible location for files and data that do not fit easily into a relational database because of their size or format. NAS devices are easy to manage, but they are focused on storing information, not on deriving value from it.
At its core, Hadoop’s value is its ability to combine the storage and analysis of data in a single software stack. This tight coupling allows a broad range of information types and formats to be stored and analyzed by the same set of tools. Hadoop’s core capabilities, and its value to an organization, are:
Flexibility – Hadoop imposes no predefined schema on the data it stores, so any type of information can be stored regardless of format or data type. That data is kept in a single, uniform way, allowing a wide variety of tools to access it through the same interface. This enables reporting and analysis across different types of data without the need for different tools.
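This “no schema until you read” idea is often called schema-on-read. You don’t need a Hadoop cluster to see the principle; the following plain-Python sketch (the record formats and the `read_user` helper are hypothetical, invented for illustration) stores heterogeneous raw records as-is and only applies structure at query time:

```python
import csv
import io
import json

# Raw records land in storage exactly as they arrived, with no upfront schema.
raw_store = [
    '{"user": "alice", "action": "login"}',  # a JSON event
    "bob,logout",                            # a CSV event
    "2024-01-01 ERROR disk full",            # an unstructured log line
]

def read_user(record):
    """Apply a schema at read time, based on each record's apparent format."""
    if record.startswith("{"):
        return json.loads(record)["user"]
    if "," in record:
        return next(csv.reader(io.StringIO(record)))[0]
    return None  # unstructured line: no user field to extract

users = [u for u in map(read_user, raw_store) if u]
print(users)  # -> ['alice', 'bob']
```

Adding a new data format later means only updating the read-side logic; nothing already stored has to be migrated, which is the flexibility the paragraph above describes.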
Community Adoption – Hadoop has been widely adopted by both the open source community and a broad range of commercial software firms. This broad adoption lets customers use Hadoop to store their data while running a variety of tools alongside it for data analysis, consumption, and presentation.
Independent Scaling – Hadoop’s distributed architecture allows operations staff to scale compute and storage capacity independently as business needs dictate. Because of this flexibility, you can add the ratio of disk spindles to CPUs that matches your workload and business growth.
Hadoop is a powerful tool that enables enterprises to store and understand volumes and types of information never before possible. It allows a variety of information types to be stored and analyzed through a single interface, while providing a strong ecosystem of tools that simplify deployment, ease adoption, and let end users consume the growing data volumes stored in Hadoop environments.