Monday, March 26, 2012

The real value of Hadoop

I speak to a lot of customers, and a common question is, “Why do I need Hadoop?” The question often comes with qualifiers like “I already have a data warehouse” or “I have lots of databases, what is one more?” These questions are valid; no IT department wants to continually add tools without understanding the value they will bring. But Apache Hadoop serves a very different set of uses than traditional relational databases or specialized solutions like data warehouses. Most of the customers I speak to have heard of Hadoop (it is an overused buzzword these days), and they want to understand not only where Hadoop fits in their environment (if it does), but whether it overlaps with tools or technology they already have.

There are already a multitude of tools on the market that store data, allow it to be accessed and allow for modeling and business intelligence tools to make sense of it. These tools come in the form of relational databases, modeling tools, network attached storage (NAS) and data warehouse appliances just to name a few. Within this set, each tool is good for specific use cases.

Relational databases are excellent for ensuring data integrity while balancing read and write workloads and providing consistent, high-speed access. Today’s relational databases scale quite well for structured data, but they are difficult to optimize for heavily read-biased environments and struggle to hold large amounts of unstructured data because a schema must be defined before data can be inserted or used.

NAS offerings are great as a low cost, generally accessible location for putting files and data that does not easily fit in a relational database because of size or format. NAS devices are easy to manage, but are focused on storage of information, not deriving value from using that information.

At its core, Hadoop’s value is its ability to combine storage and analysis of data in a single software stack. This tight coupling allows a broad range of information types and formats to be stored and analyzed by the same set of tools. Hadoop’s core capabilities, and its value to an organization, are:

Flexibility – Hadoop has no predefined schema for the data it stores, so any type of information can be stored regardless of format or data type. That data is stored in a single, uniform way, allowing a wide variety of tools to access it through the same interface. This enables reporting and analysis across different types of data without the need for different tools.

Community Adoption – Hadoop has become widely adopted by both the open source community and a broad range of commercial software firms. This widespread adoption enables customers to use Hadoop for storing their data while running a variety of tools in conjunction with it for data analysis, consumption and presentation.

Independent Scaling – Hadoop’s distributed architecture allows operations staff to scale compute and storage capacity independently based on business needs. Because of the flexibility of the Hadoop architecture, you can add the ratio of disk spindles and CPUs that matches your workload and business growth.
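The flexibility point above is often described as “schema on read”: structure is applied when the data is analyzed, not when it is written. A minimal sketch of the idea in Python (the record formats and the parse logic here are illustrative assumptions, not part of Hadoop itself):

```python
import json

# Raw records land in storage exactly as produced; no schema at write time.
raw_records = [
    '{"user": "alice", "action": "click"}',         # JSON event
    "2012-03-26,server01,ERROR,disk full",          # CSV log line
    "free-text note: maintenance window at 02:00",  # unstructured text
]

def parse(record):
    """Apply structure only at read time ("schema on read")."""
    try:
        return ("json", json.loads(record))
    except ValueError:
        if record.count(",") >= 3:
            return ("csv", record.split(","))
        return ("text", record)

kinds = [parse(r)[0] for r in raw_records]
print(kinds)  # ['json', 'csv', 'text']
```

Each analysis decides how to interpret the stored bytes, which is why very different data types can sit side by side in the same store and still be reached through one interface.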

Hadoop is a powerful tool that enables enterprises to store and understand volumes and types of information never before possible. It allows a variety of information types to be stored and analyzed from a single interface, while providing a strong ecosystem of tools to simplify deployment, ease adoption and help end users consume the growing data volumes stored in Hadoop environments.

Thursday, March 15, 2012

Isn't Big Data just a new name for HPC?

High Performance Computing (HPC) is a field of study and associated technologies that has been around for multiple decades. Big Data is an emerging term used to describe the new types of operational challenges and complexity that are common with today’s growing data sets that must be stored, analyzed and understood. Big Data has many roots in the HPC space related to parallel programming methods, data set size and complexity, and algorithms used for data analysis, manipulation and understanding.

Ultimately, HPC and Big Data are not technologies. They are a common set of concepts and practices, supported by specific tools and technology. Each is commonly used to represent a set of problems and the technical solutions for solving them. HPC and Big Data overlap in many places, but each also has domains that are unique to it and do not overlap.

HPC commonly includes technologies and concepts like (certainly not an exhaustive list):

  • Message Passing Interface (MPI) – MPI provides a common set of functions to enable distributed processes to communicate at high speeds.
  • Parallel File Systems – Parallel file systems allow for high levels of throughput by simultaneously writing and reading across many different storage servers that appear as a single file system and name space. Parallel file systems enable access to single data sets from many different systems, at high speed, and ensure data integrity while many different systems are simultaneously reading and writing to different files and locations within the file system.
  • Lustre – Lustre is an open source, commercially supported parallel file system that is highly scalable, commonly allowing for use on multi-thousand node clusters and multi-thousand processor HPC systems.
  • InfiniBand – InfiniBand is a network technology that provides very high levels of bi-directional bandwidth in an HPC environment at lower latency than is commonly available with traditional Ethernet-based technologies.
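To make the MPI item above concrete, here is a loose single-machine analogy of point-to-point message passing, sketched with Python’s standard library. This is not MPI: real MPI programs (e.g. via a binding such as mpi4py) run across many machines, add ranks and collective operations, and ride on interconnects like InfiniBand. The scatter/gather structure of the sketch is the part that carries over.

```python
from multiprocessing import Process, Pipe

def worker(conn, rank):
    chunk = conn.recv()            # analogous to MPI_Recv: wait for our slice
    conn.send((rank, sum(chunk)))  # analogous to MPI_Send: return a partial sum

if __name__ == "__main__":
    data = list(range(100))
    conns, procs = [], []
    for rank in range(4):
        parent, child = Pipe()
        p = Process(target=worker, args=(child, rank))
        p.start()
        parent.send(data[rank::4])  # "scatter": deal out interleaved slices
        conns.append(parent)
        procs.append(p)
    # "gather": collect the partial sums and combine them
    total = sum(conn.recv()[1] for conn in conns)
    for p in procs:
        p.join()
    print(total)  # 4950
```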

Big Data commonly includes technologies and concepts like:

  • Map Reduce – Map Reduce is a programming model and set of functions that allow for the distributed analysis of large data sets. Map Reduce is the result of many years of work in the computer science field and research papers from a variety of universities and technology-focused companies.
  • Distributed File Systems – Distributed file systems provide the ability to store large data sets with no pre-defined structure, distributed across many commodity nodes. Distributed file systems work in the Big Data space to provide scalability, data locality and data integrity through replication and checksum validation on read.
  • Hadoop – Hadoop is an open source, commercially supported implementation of Map Reduce and the Hadoop Distributed File System (HDFS). Hadoop has enabled a large ecosystem of additional tools to process and exploit the data stored in a Hadoop environment, and it offers many integration points for data pipelines, allowing for complex storage and analysis of data across a variety of tools.
  • HPCC Systems (LexisNexis) – Using the Enterprise Control Language (ECL) for development, HPCC Systems is an open source application stack from LexisNexis for storing and analyzing large, complex data sets.
  • NoSQL – Not-only-SQL (NoSQL) is an emerging set of tools for storing loosely structured data, often with access through SQL-like interfaces. These tools remove some of the more complex functionality that is common in traditional relational databases but unnecessary in many current NoSQL use cases.
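The Map Reduce model above can be illustrated with the classic word-count example. This is a single-process sketch: in a real Hadoop job the map and reduce phases run in parallel across many nodes, and the “shuffle” step moves intermediate pairs over the network rather than through an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (key, value) pairs; for word count, (word, 1) per word.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread across as many machines as the data requires, which is the property that lets the model scale.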

Common workloads for use in Big Data environments:

  • Better advertising – Many online retailers and businesses utilize technologies within the Big Data space to provide targeted advertising, ensuring a higher acceptance rate and more purchases by customers.
  • Social Networking – The rapid rise of social networking sites is the most common example of Big Data. Today, tools like Hadoop, MongoDB and Cassandra are the supporting technology for the majority of social networking sites; these tools were developed specifically to meet the needs and requirements of social networking companies.
  • Recommendation engines and matching – Big Data tools like Hadoop are commonly used to make purchase recommendations to customers. These recommendations are driven by a large data set specific to the customer, based on previous purchases, recommendations by friends and other items priced and reviewed online.
  • Differential Pricing – Differential pricing is becoming more and more common as tools that can quickly determine market value get deployed. Differential pricing is the adjustment of a good or service’s price, up or down, to influence demand. While this has long been leveraged by retailers to control inventory, Big Data technologies allow it to be done more rapidly, with pricing set automatically based on market dynamics.
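The recommendation-engine workload above can be sketched as a toy item-to-item co-occurrence model: count how often items are bought together, then recommend the most frequent co-purchases. The baskets and item names here are made up for illustration; production engines run this kind of counting as distributed jobs over far larger purchase histories and richer signals.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: each basket is one customer's order.
baskets = [
    {"camera", "tripod", "sd-card"},
    {"camera", "sd-card"},
    {"tripod", "bag"},
    {"camera", "tripod"},
]

# Count how often each pair of items appears in the same basket.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    # Rank other items by how often they co-occurred with `item`.
    scores = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [other for other, _ in scores.most_common(k)]

print(recommend("camera"))
```

For "camera" this returns "sd-card" and "tripod" (each co-purchased twice), which is the shape of answer a recommendation engine surfaces, just computed over millions of baskets instead of four.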

Common workloads for use in HPC environments:

  • Chemistry and Physics research – Many workloads related to chemistry and physics modeling require a level of inter-process communication that Big Data technologies do not provide. These workloads commonly run in traditional HPC environments, providing researchers with proven methods to model new chemicals and the physical reactions expected in actual experiments.
  • Oil & Gas Modeling – The types of modeling the Oil & Gas industry commonly does involve large data sets captured in the field that require processing before decisions can be made about how best to exploit an energy reserve. This modeling commonly runs on traditional HPC environments and has many years of proven technology behind it.

Fundamentally, Big Data and HPC differ in one major aspect: what data moves and where it moves. The big difference between traditional HPC environments and Big Data environments is where the data is relative to where it is processed. In HPC environments the data is moved to the location where it will be processed; in Big Data environments the job is moved to the location of the data to minimize data movement. The challenge is that HPC users and Big Data users commonly demand dedicated environments, which can increase operational costs for IT departments by requiring separate sets of hardware, each managed with its own efficiency metrics.
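The “move the job to the data” idea can be sketched in a few lines. The node names and partitions below are hypothetical; the point is that only the small per-node results cross the network, instead of whole data partitions being shipped to a central compute node as in the traditional HPC pattern.

```python
# Hypothetical data partitions, each stored locally on one "node".
partitions = {
    "node-a": [5, 12, 7],
    "node-b": [3, 9],
    "node-c": [8, 1, 4, 6],
}

def run_local(job, data):
    # In Hadoop, the scheduler tries to run the task on the node that
    # already stores `data` (data locality), rather than moving the data.
    return job(data)

# Only the per-node partial results (a few bytes each) move across the
# network; the raw partitions never leave their nodes.
partials = {node: run_local(sum, data) for node, data in partitions.items()}
total = sum(partials.values())
print(total)  # 55
```

When the partitions are terabytes rather than a handful of integers, the difference between shipping the data and shipping the job is the difference between saturating the network and barely touching it.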

So, can these different methods be mixed? It is becoming more and more common for HPC departments to receive requests to run newer Big Data applications in traditional HPC environments. Ultimately, both HPC and Big Data are about taking very large, complex data sets and analyzing the information to enable better understanding and decisions. The methods taken are what differ.

I will target some future postings on the technical implementations of running Hadoop and other Map Reduce frameworks on traditional HPC environments.