
Data Lineage in a Big Data Environment

Actian Corporation

March 1, 2018


Data lineage describes the life cycle of data. It is a detailed record of a piece of data over time: its origin, the processes applied to it, and its transformations. Although this isn’t a brand new concept, a paradigm shift is taking place.

Obtaining data lineage from a data warehouse, for example, used to be a fairly simple task. Because this centralized storage system kept all the data in one place, it provided lineage “by design.”

Since the emergence of Big Data, the data ecosystem has evolved at a rapid pace. A variety of new technologies and storage systems has appeared, complicating enterprise information systems.

It has become impossible to maintain and impose a single centralized tool across an organization. The software and methods used by the enterprise and information system (IS) architects of the “old world” have become less and less maintainable, leaving their output obsolete and hard to read.

So, How Can You Visualize an Efficient Data Lineage in a Big Data Environment?

New tools are emerging that give a global view of an enterprise’s IS data: data catalogs. A data catalog collects as much metadata as possible from all data stores and presents it through a user-friendly interface. By centralizing this information, it becomes possible to build data lineage in a Big Data environment at different levels:

At the Datasets Level

A dataset can be a table in Oracle, a topic in Kafka, or even a directory in the data lake. At this level, a data catalog highlights the processes and upstream datasets that were used to create the final dataset.

However, this level of data lineage does not, on its own, allow data users to answer all of their questions. Some questions remain open, such as: What about sensitive data? Which columns were created, and by which processes?
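Dataset-level lineage can be pictured as a graph in which processes connect input datasets to an output dataset. The following is a minimal sketch of that idea, not how any particular data catalog implements it. The dataset names, process names, and the `upstream` helper are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical catalog metadata: each process reads some datasets and writes one.
PROCESSES = [
    {"name": "ingest_orders", "inputs": ["oracle.sales.orders"], "output": "kafka.orders_raw"},
    {"name": "clean_orders", "inputs": ["kafka.orders_raw"], "output": "lake/orders_clean"},
    {
        "name": "build_report",
        "inputs": ["lake/orders_clean", "oracle.crm.customers"],
        "output": "lake/sales_report",
    },
]


def upstream(dataset, processes):
    """Return the datasets and process names that contributed to `dataset`."""
    # Index processes by the dataset they produce.
    producers = defaultdict(list)
    for process in processes:
        producers[process["output"]].append(process)

    seen_datasets, seen_processes = set(), set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for process in producers.get(current, []):
            seen_processes.add(process["name"])
            for source in process["inputs"]:
                # Visit each upstream dataset once, which also guards against cycles.
                if source not in seen_datasets:
                    seen_datasets.add(source)
                    stack.append(source)
    return seen_datasets, seen_processes


datasets, process_names = upstream("lake/sales_report", PROCESSES)
print(sorted(datasets))
print(sorted(process_names))
```

A graph like this shows which datasets and processes fed the final dataset, but it carries no information about individual columns.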

At the Column Level

A more granular approach is to represent the different transformation stages of a dataset as a timeline of actions and events. By selecting a specific field, users can see which columns and actions created it.
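The timeline idea can be sketched as an ordered list of events, each recording an action, the columns it read, and the column it produced. This is an illustrative sketch under assumptions. The `ColumnEvent` structure, the `trace_column` helper, and the sample column names are hypothetical and do not come from any specific catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEvent:
    step: int      # position of the event in the timeline
    action: str    # transformation applied
    inputs: tuple  # columns read by the action
    output: str    # column produced by the action


# Hypothetical transformation timeline for one dataset.
TIMELINE = [
    ColumnEvent(1, "concat", ("first_name", "last_name"), "full_name"),
    ColumnEvent(2, "hash", ("email",), "email_hash"),
    ColumnEvent(3, "upper", ("full_name",), "display_name"),
]


def trace_column(column, timeline):
    """Return the events that led to `column`, in timeline order."""
    wanted = {column}
    contributing = []
    # Walk backwards so each column is resolved to the columns it came from.
    for event in sorted(timeline, key=lambda e: e.step, reverse=True):
        if event.output in wanted:
            contributing.append(event)
            wanted.update(event.inputs)
    return list(reversed(contributing))


for event in trace_column("display_name", TIMELINE):
    print(event.step, event.action, event.inputs, "->", event.output)
```

Selecting `display_name` returns only the `concat` and `upper` events. The unrelated `hash` event is left out, which is the kind of filtering a column-level view provides.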

