Big Data is a term we hear more and more often to describe the growing challenges of managing the volume, variety and velocity of today’s data sets. Data today is gathered from more diverse sources, analyzed more often and used to make decisions more quickly than ever before. Ultimately, though, Big Data is not a technology; it is a change in the methodologies an organization uses to turn data into information. Big Data challenges can be addressed with a variety of new technologies on the market, but at a fundamental level, Big Data is a process.
One growing area of focus within the Big Data space is Data Curation (DC). DC is the organization and tracking of data that has been turned into information, to ensure that it remains accessible in an understandable and reliable fashion over a given period of time. There are three primary components to DC:
Finding the Data – Ensuring that the data is uniformly organized against a documented standard.
Ensuring Data Availability – Data, like any physical asset, has a lifecycle. Access must be planned over that lifecycle so that the data is available today and in the future, according to a pre-defined set of criteria for retention and accessibility.
Data Quality – Ensuring that data within an environment meets a minimum quality standard so that reporting and analysis activities are not tainted by non-compliant data.
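As a rough illustration of how the three components might be checked in practice, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `CatalogEntry` fields, the file-naming standard, and the 5% missing-value threshold are hypothetical and would be replaced by an organization's own documented standards.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical naming standard: <department>_<dataset>_<YYYYMMDD>.csv
NAME_STANDARD = re.compile(r"^[a-z]+_[a-z0-9]+_\d{8}\.csv$")


@dataclass
class CatalogEntry:
    """Illustrative metadata kept about one data set."""
    name: str
    created: date
    retain_until: date      # end of the agreed retention period
    null_fraction: float    # share of missing values, measured elsewhere


def curation_issues(entry, today, max_null_fraction=0.05):
    """Return a list of curation problems found for one catalog entry."""
    issues = []
    # Finding the data: is it organized against the documented standard?
    if not NAME_STANDARD.match(entry.name):
        issues.append("name does not follow the documented standard")
    # Availability: is the data still inside its retention period?
    if entry.retain_until < today:
        issues.append("retention period has expired")
    # Quality: does the data meet the minimum quality standard?
    if entry.null_fraction > max_null_fraction:
        issues.append("quality is below the minimum standard")
    return issues


entry = CatalogEntry("sales_orders_20240101.csv", date(2024, 1, 1),
                     date(2030, 1, 1), 0.01)
print(curation_issues(entry, today=date(2025, 1, 1)))
```

An empty list means the entry passes all three checks; each string in a non-empty list points back to one of the components above.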
If you think of incoming data as a giant set of piles with no labels or patterns, Data Curation is the process that turns those piles into information neatly organized in file folders, consistent with a set of standards, so that users can quickly locate the information they require.
As data volumes grow, the lifecycle of the associated information becomes more and more difficult to manage, in terms of both the process and the underlying technology. All technologies that store data must ultimately be replaced at the end of their useful life. DC is the process that ensures data is properly migrated to new platforms, that its integrity is protected and that its accessibility is not negatively affected. DC must span the entire life of the data, including creation, management, migration and ultimately disposal, per appropriate policies.
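One common way to protect integrity during a migration is to compare checksums before and after the copy. The sketch below assumes the simplest possible case, a single file moved between two file systems; the function names `sha256_of` and `migrate_with_verification` are hypothetical, and a real platform migration would involve far more than this.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_with_verification(source, destination):
    """Copy a file to its new location and confirm the content is unchanged."""
    source, destination = Path(source), Path(destination)
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copy2 also tries to keep timestamps
    if sha256_of(destination) != expected:
        raise IOError(f"checksum mismatch after migrating {source}")
    # Returning the digest lets it be recorded for later integrity checks.
    return expected
```

The source file is deliberately left in place here; disposing of it would be a separate step, taken only once the copy has been verified and policy allows it.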
Data Curation is about organizing data to create information while ensuring access to that information. DC is technology agnostic and requires a mindset that goes beyond acquiring the data to include how long the data must be retained and made available. Firms that work with Big Data must consider the implications of their growing data sets as they work to understand the data and make better business decisions based on the derived information.