Monday, September 5, 2011

What is "Big Data"?

One of the most commonly used terms today is Big Data. It appears regularly in blogs, product launches, architecture documents and speeches, just to name a few. Big Data is used to describe products, capabilities, features and new ideas about how to build and manage many of today’s new applications and the data that drives them. The struggle is that Big Data means different things to different people, and therein lies the problem. For any technology to survive the test of time in IT, it must be understood and accepted by a large enough segment of the user and administrator population that the term and its associated products evolve into a self-sustaining ecosystem.


I want to outline what I see as Big Data and the common definition I use with people who are struggling with the problems that Big Data often includes and helps to address. The most common emerging definition of Big Data is one that includes one or more of these three parameters: Volume, Velocity and Variety. There are many different ways to define Big Data, but I believe these three parameters let you clearly decide which problems fall into Big Data versus traditional data management and analysis.


  • Volume – Volume is the measure of how much data a company has under its management, operation and analysis. Volume is typically measured in Gigabytes, Terabytes or Petabytes. Volume is not only an absolute number describing current capacity; it can also be expressed as data growth over time. Deciding whether a company has Big Data takes an evaluation of the company's total data assets and their growth over time.

  • Velocity – Velocity is the time that elapses between the moment a company receives a new data point and the moment it must act on it and make a decision. This decision could be to make changes to a stock portfolio, change the pricing of a product or prompt staff to make a change to the environment. Big Data customers typically have a velocity requirement of real-time or near-real-time decision making every time new data is received.

  • Variety – The third parameter that defines Big Data is Variety. Variety describes the types of data a company uses within its analysis tools, its customer applications and its business-driven workloads. Big Data customers are typically characterized by a multitude of data types, including user information, movies, pictures, GPS data, log files and sales information. While storing these types of data is not new in IT, Big Data has brought about users who make connections between data that was previously left in separate islands, each analyzed, managed and manipulated on its own.
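
The "one or more of three parameters" test above can be written down as a small check. This is only a minimal sketch in Python: the Workload fields, the function names and every threshold are assumptions made for illustration, not industry standards, and each company would pick its own cut-offs.

```python
from dataclasses import dataclass

# All thresholds are illustrative assumptions, not industry standards.
VOLUME_THRESHOLD_TB = 100.0        # total data under management
GROWTH_THRESHOLD_PER_YEAR = 0.5    # 0.5 means the data grows 50% per year
VELOCITY_THRESHOLD_SECONDS = 5.0   # time allowed between new data and a decision
VARIETY_THRESHOLD_TYPES = 4        # distinct kinds of data analyzed together


@dataclass
class Workload:
    """Hypothetical description of one company's data workload."""
    name: str
    volume_tb: float
    yearly_growth: float
    decision_window_seconds: float
    data_types: frozenset


def big_data_parameters(w: Workload) -> list:
    """Return which of Volume, Velocity and Variety the workload exhibits."""
    matched = []
    # Volume counts either as absolute size or as growth over time.
    if w.volume_tb >= VOLUME_THRESHOLD_TB or w.yearly_growth >= GROWTH_THRESHOLD_PER_YEAR:
        matched.append("Volume")
    # Velocity: a short window between receiving data and acting on it.
    if w.decision_window_seconds <= VELOCITY_THRESHOLD_SECONDS:
        matched.append("Velocity")
    # Variety: many different types of data used together.
    if len(w.data_types) >= VARIETY_THRESHOLD_TYPES:
        matched.append("Variety")
    return matched


def is_big_data(w: Workload) -> bool:
    """A workload qualifies when it exhibits one or more parameters."""
    return len(big_data_parameters(w)) >= 1


if __name__ == "__main__":
    pricing = Workload("pricing feed", 1.0, 0.1, 1.0, frozenset({"sales information"}))
    print(pricing.name, big_data_parameters(pricing), is_big_data(pricing))
```

In this sketch a small pricing feed still qualifies, because it must act on each new data point within a second even though its Volume and Variety are modest.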


Now that we have defined Big Data as data exhibiting one or more of Volume, Velocity and Variety, we can look at how people use Big Data in their environments to drive better decision making for a company, faster responses to customer demands and more accurate forecasting of possible business trends. Several common themes appear in Big Data environments:

  • Related but Unstructured – Many Big Data environments have lots of related data, but that data is unstructured. These types of data could include movies, images, log files or user information. While all of this data is associated, the associations are constantly changing based on how each item changes over time and on what answers people are trying to obtain from the data sets.

  • Traditional Tools Don't Scale – The Big Data ecosystem is evolving rapidly with new tools for storing data, managing data, analyzing data and finding new uses for that data. These new tools have come about because traditional tools for data storage do not scale to support the volume, velocity and variety that are common in Big Data.


Now that we have looked at what defines Big Data and what Big Data environments have in common, who are some of the common consumers and operators of Big Data? How are they using their Big Data environments?


  • Facebook – Facebook is the first name that comes to mind for a lot of people when talking about Big Data. Facebook exhibits all three parameters of Big Data. Its Volume of data is well into the Petabytes and growing daily. The Velocity at which it must receive a piece of information and make suggestions to others based on that information is measured in seconds. The Variety of data that Facebook stores includes movies, pictures, places, users, usage information, log files and suggestions, just to name a few.

  • Amazon – Amazon has a Big Data environment, and one of the best-known Amazon features driven by Big Data is its recommendation engine. Every time you purchase an item from Amazon, a list of other items that may interest you is generated in near-real time, based on previous users' recommendations and purchases. This Big Data need is driven by the immediacy of the recommendation: Amazon cannot reasonably run batch queries and recommend other items an hour after you finished your previous purchase.

  • LinkedIn – LinkedIn uses Big Data to recommend both contacts that you may know and jobs that you may be interested in. Both of these problems are solved through the analysis of large sets of data based on constantly changing relationships and associations.
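
To make the recommendation idea concrete, here is a minimal sketch in Python of suggesting items based on what previous customers bought together. It is not how Amazon or LinkedIn actually implement their engines; the function names and the sample orders are hypothetical. The point it illustrates is that the co-purchase index is built ahead of time, so the lookup at purchase time is fast enough for a near-real-time suggestion.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_purchase_index(orders):
    """Map each item to a Counter of other items bought by the same customer.

    `orders` maps a customer name to the list of items that customer bought.
    """
    index = defaultdict(Counter)
    for items in orders.values():
        # Every ordered pair (a, b) of distinct items from one customer
        # counts as one vote that buyers of a may also want b.
        for a, b in permutations(set(items), 2):
            index[a][b] += 1
    return index


def recommend(index, purchased_item, already_owned=(), limit=3):
    """Return up to `limit` items most often bought alongside `purchased_item`."""
    owned = set(already_owned) | {purchased_item}
    candidates = index.get(purchased_item, Counter())
    # Rank by co-purchase count (highest first), then by name for stable ties.
    ranked = sorted(
        ((item, count) for item, count in candidates.items() if item not in owned),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [item for item, _ in ranked[:limit]]


if __name__ == "__main__":
    # Hypothetical purchase history.
    orders = {
        "alice": ["camera", "memory card", "tripod"],
        "bob": ["camera", "memory card"],
        "carol": ["camera", "camera bag", "memory card"],
        "dave": ["tripod", "camera bag"],
    }
    index = build_co_purchase_index(orders)
    print(recommend(index, "camera"))
```

The same counting approach could be pointed at shared contacts instead of shared purchases to suggest people you may know, although a production system would need to update the index continuously as relationships change.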


By using a company's data to its full advantage, companies can use Big Data to be more efficient at business operations, more connected to users' needs and quicker to respond than competitors. But the concept of Big Data is only so useful on its own. To really exploit these capabilities, there must be tools that allow companies to quickly store, analyze and utilize their growing data sets. Some of the most common tools for exploiting Big Data are:

  • Hadoop – Hadoop is an Apache project and one of the more commonly used tools among customers that have Big Data. Hadoop provides a framework for storing and analyzing large data sets with no restrictions on what types of data can be stored and analyzed.

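  • R – R is an open source statistical programming language, with commercial distributions and support available from Revolution Analytics. It provides an extremely powerful set of libraries and capabilities for analyzing large data sets, finding data associations and creating applications that exploit Big Data.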

  • Accumulo – Accumulo is a tool recently released by the United States Government for storing and working with data in a Bigtable-style format.

  • HPCC – HPCC is an open source tool from LexisNexis that enables companies to store and process large, complex data sets that have typically required proprietary technology to analyze.
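
As an illustration of the style of analysis that Hadoop is known for, the following minimal sketch mimics the map, shuffle and reduce steps of the MapReduce model on a single machine. It is plain Python and does not use any Hadoop API; the function names are hypothetical, and it assumes each log line starts with its severity level. In a real cluster the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (key, 1) pairs; the key is the first word of the log line.

    Assumption for this sketch: the first word is the severity level.
    """
    parts = line.split()
    if parts:
        yield parts[0], 1


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines.
    log_lines = [
        "ERROR disk full",
        "INFO service started",
        "ERROR request timeout",
        "WARN slow response",
    ]
    print(run_job(log_lines))
```

Because each line is mapped independently and each key is reduced independently, the work can be split across machines, which is what lets this approach keep up as data volume grows.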


Big Data is a powerful new concept within today’s IT environments. Implemented through a variety of tools, Big Data solutions enable companies to analyze data in new ways, raising productivity and allowing faster responses to customers.