Thursday, March 15, 2012

Isn't Big Data just a new name for HPC?

High Performance Computing (HPC) is a field of study, and an associated set of technologies, that has been around for multiple decades. Big Data is an emerging term used to describe the new types of operational challenges and complexity that come with today’s growing data sets, which must be stored, analyzed and understood. Big Data has many roots in the HPC space, particularly around parallel programming methods, data set size and complexity, and the algorithms used for data analysis, manipulation and understanding.

Ultimately, HPC and Big Data are not technologies. They are common sets of concepts and practices, supported by specific tools and technology. Each is commonly used to represent a set of problems and the technical solutions for solving those problems. HPC and Big Data overlap in many places, but each also has domains that are uniquely its own and do not overlap with the other.

HPC commonly includes technologies and concepts like the following (certainly not exhaustive):

  • Message Passing Interface (MPI) – MPI provides a common set of functions that enable distributed processes to communicate at high speeds (a minimal sketch follows this list).
  • Parallel File Systems – Parallel file systems allow for high levels of throughput by simultaneously writing and reading across many different storage servers that appear as a single file system and namespace. They enable many different systems to access a single data set at high speed, and they ensure data integrity while those systems simultaneously read and write different files and locations within the file system.
  • Lustre – Lustre is an open source, commercially supported parallel file system that is highly scalable, commonly used on HPC clusters with thousands of nodes and thousands of processors.
  • InfiniBand – InfiniBand is a network technology that provides very high levels of bi-directional bandwidth in an HPC environment at lower latency than is commonly available with traditional Ethernet-based technologies.
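
As a concrete, if minimal, illustration of the MPI programming model, the sketch below uses the Python mpi4py bindings to pass a single message between two ranks. The choice of Python and mpi4py is an assumption made purely for illustration; MPI codes in HPC environments are more commonly written in C, C++ or Fortran.

```python
# Minimal point-to-point MPI sketch using mpi4py (illustrative assumption only).
# Launch with an MPI runner, e.g.: mpirun -n 2 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator containing all launched processes
rank = comm.Get_rank()     # this process's rank (0, 1, ...)
size = comm.Get_size()     # total number of processes

if rank == 0:
    # Rank 0 sends a small Python object to rank 1.
    comm.send({"greeting": "hello from rank 0"}, dest=1, tag=11)
    print(f"rank 0 of {size}: message sent")
elif rank == 1:
    # Rank 1 blocks until the message from rank 0 arrives.
    msg = comm.recv(source=0, tag=11)
    print(f"rank 1 of {size}: received {msg}")
```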

Big Data commonly includes technologies and concepts like:

  • Map Reduce – Map Reduce is a set of algorithms and functions that allow for the distributed analysis of large data sets. It is the result of many years of work in the computer science field and of research papers from a variety of universities and technology-focused companies (a toy example follows this list).
  • Distributed File Systems – Distributed file systems provide the ability to store large data sets with no pre-defined structure, distributed across many commodity nodes. Distributed file systems work in the Big Data space to provide scalability, data locality and data integrity through replication and checksum validation on read.
  • Hadoop – Hadoop is an open source, commercially supported implementation of Map Reduce and the Hadoop Distributed File System (HDFS). Hadoop has enabled a large ecosystem of additional tools to process and exploit the data stored in a Hadoop environment, and it integrates into larger data pipelines, allowing for complex storage and analysis of data across a variety of tools.
  • HPCC Systems (LexisNexis) – Using the Enterprise Control Language (ECL) for development, HPCC Systems is an open source application stack from LexisNexis for storing and analyzing large, complex data sets.
  • NoSQL – Not-only-SQL (NoSQL) is an emerging set of tools for storing loosely structured data and providing access through SQL-like interfaces, while removing some of the more complex functionality that is common in traditional relational databases but unnecessary in many of the current use cases for NoSQL tools.
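
To make the Map Reduce idea concrete, the toy sketch below runs the map, shuffle and reduce phases of a word count inside a single Python process. It is only a sketch of the programming model; Hadoop and similar frameworks distribute these same phases as tasks across many nodes, and none of the names below come from any particular framework’s API.

```python
# Toy, single-process sketch of the Map Reduce pattern (word count).
# Real frameworks such as Hadoop distribute the map and reduce phases across
# many nodes; the function and variable names here are illustrative only.
from collections import defaultdict

def map_phase(record):
    # Emit a (key, value) pair for every word in an input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as a framework would between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)

if __name__ == "__main__":
    records = ["big data and hpc", "big data is not hpc", "hpc and big data overlap"]
    mapped = (pair for record in records for pair in map_phase(record))
    counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
    print(counts)  # {'big': 3, 'data': 3, 'and': 2, 'hpc': 3, ...}
```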

Common workloads for use in Big Data environments:

  • Better advertising – Many online retailers and businesses utilize technologies within the Big Data space to provide targeted advertising, ensuring a higher acceptance rate and more purchases by customers.
  • Social Networking – The rapid rise of social networking sites and tools is the most common example of Big Data. Today, tools like Hadoop, MongoDB and Cassandra are the supporting technology for the majority of social networking sites, and they have been developed specifically to meet the needs of social networking companies.
  • Recommendations engine and matching – Big Data tools like Hadoop are commonly used to make purchase recommendations to customers. These recommendations are driven by large data sets specific to a customer type, based on previous purchases, recommendations by friends, and other items priced and reviewed online (see the sketch after this list).
  • Differential Pricing – Differential pricing is becoming more and more common as tools that can quickly determine market value are deployed. Differential pricing is the adjustment of a good or service’s price, up or down, to influence demand. While retailers have long leveraged this to control inventory, Big Data technologies allow it to be done more rapidly, with pricing set automatically based on market dynamics.
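
As a rough illustration of the recommendations idea, the sketch below ranks items by how often they were bought together. The data and names are invented for this example; a real recommendations engine would compute similar co-occurrence or similarity statistics over far larger data sets with tools like Hadoop.

```python
# Hypothetical, heavily simplified item-to-item recommendation sketch based on
# purchase co-occurrence counts. All data here is invented for illustration.
from collections import Counter
from itertools import combinations

purchase_histories = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse"},
    {"keyboard", "monitor"},
]

# Count how often each pair of items appears in the same purchase history.
co_occurrence = Counter()
for basket in purchase_histories:
    for a, b in combinations(sorted(basket), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1

def recommend(item, top_n=2):
    # Rank other items by how often they were bought alongside `item`.
    scores = {b: n for (a, b), n in co_occurrence.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("laptop"))  # e.g. ['mouse', 'keyboard']
```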

Common workloads for use in HPC environments:

  • Chemistry and Physics research – Many workloads related to chemistry and physics modeling require a level of inter-process communication that Big Data technologies do not provide. These workloads are commonly run in traditional HPC environments, providing researchers with proven methods to model new chemicals and the physical reactions expected in actual experiments.
  • Oil & Gas Modeling – The types of modeling the Oil & Gas industry commonly does involve large data sets captured in the field that require processing before decisions can be made about how to properly exploit an energy reserve. This modeling is commonly run in traditional HPC environments and has many years of proven technology behind it.

Fundamentally, Big Data and HPC differ in one major aspect – what data moves and where it moves. The big difference between traditional HPC environments and Big Data environments is where the data sits relative to where it is processed. Within HPC environments, the data is always moved to the location where it will be processed. In Big Data environments, the job is moved to the location of the data to minimize data movement. The struggle with these differing approaches is that HPC users and Big Data users commonly demand dedicated environments, which can increase operational costs for IT departments by requiring separate sets of hardware, each managed with its own efficiency metrics to be monitored. A toy sketch of the scheduling difference follows.
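
To make that difference concrete, here is a toy sketch (with invented node and block names) of the two scheduling styles: the HPC-style path copies the data set to the compute node that will run the job, while the Big Data-style path prefers a node that already holds a replica of the data.

```python
# Toy sketch of "move data to compute" versus "move compute to data".
# Node names, block names and replica placement are invented for illustration.
block_locations = {
    "block-1": {"node-a", "node-b"},   # nodes holding a replica of each block
    "block-2": {"node-c"},
}

def hpc_style(block, compute_node):
    # HPC pattern: copy the data to the chosen compute node, then run the job there.
    return f"copy {block} to {compute_node}, then run the job on {compute_node}"

def big_data_style(block, idle_nodes):
    # Big Data pattern: prefer an idle node that already stores the block (data locality).
    local = block_locations[block] & idle_nodes
    node = next(iter(local), None) or next(iter(idle_nodes))
    return f"run the task for {block} on {node} (data-local: {node in block_locations[block]})"

print(hpc_style("block-2", compute_node="node-a"))
print(big_data_style("block-2", idle_nodes={"node-a", "node-c"}))
```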

So, can these different methods be mixed? It is becoming more and more common for HPC departments to receive requests to enable newer Big Data applications in traditional HPC computing environments. Ultimately, both HPC and Big Data are about taking very large, complex data sets and analyzing the information to enable better understanding and decisions. The methods they use are what differ.

I will focus some future postings on the technical implementations of running Hadoop and other Map Reduce frameworks in traditional HPC environments.
