Knowing Your Data Starts with Data Lineage

Data lineage can be a tremendously useful tool for data engineering and analytics, but it is often treated as an afterthought, both because it is challenging to implement and because it has not been broadly available within organizations. Many practitioners have never had access to data lineage information and may not know what they are missing.

Assessing the value of lineage for your team requires defining its scope, understanding the specific benefits it delivers, and putting it in context within the broader architecture of tools used for data management.

In this blog post (and the accompanying webinar recording) I provide a high-level overview of lineage and how it applies in practice.

DEFINING DATA LINEAGE

Simply stated, data lineage is a description of the history of a piece of data, the sources from which it is derived, and the transformations applied as the data is derived. It allows us to better characterize data and to move beyond simply capturing the structure of the information.

Lineage is history: What is the change log for any element of metadata? Who has made changes?
Lineage is dependencies: What is upstream of the final data that we are accessing?
Lineage is state: What is the status of the data? How do I know if it is current?
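
To make these three facets concrete, here is a minimal Python sketch of what a lineage record might hold. The class and field names (LineageNode, ChangeEvent, upstream, last_refreshed) and the sample datasets are hypothetical, chosen only for illustration; they are not Magpie's data model or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ChangeEvent:
    """History: one entry in the change log for a dataset."""
    changed_at: datetime
    changed_by: str
    description: str


@dataclass
class LineageNode:
    """Hypothetical lineage record for a single dataset."""
    name: str
    # Dependencies: the datasets this one is derived from.
    upstream: List["LineageNode"] = field(default_factory=list)
    # The transformation applied to the upstream data.
    transformation: str = ""
    # History: who changed what, and when.
    history: List[ChangeEvent] = field(default_factory=list)
    # State: when the data was last refreshed (None if never).
    last_refreshed: Optional[datetime] = None

    def is_current(self, max_age: timedelta, now: Optional[datetime] = None) -> bool:
        """Return True if the data was refreshed within max_age of now."""
        if self.last_refreshed is None:
            return False
        now = now or datetime.now(timezone.utc)
        return now - self.last_refreshed <= max_age


# Illustrative data only.
raw_orders = LineageNode(
    name="raw_orders",
    last_refreshed=datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),
)
daily_revenue = LineageNode(
    name="daily_revenue",
    upstream=[raw_orders],
    transformation="SUM(amount) GROUP BY order_date",
    history=[
        ChangeEvent(
            changed_at=datetime(2024, 1, 2, 7, 0, tzinfo=timezone.utc),
            changed_by="nightly_etl",
            description="scheduled refresh",
        )
    ],
    last_refreshed=datetime(2024, 1, 2, 7, 0, tzinfo=timezone.utc),
)
```

With a record like this, "What is upstream?" becomes a lookup on daily_revenue.upstream, and "Is it current?" becomes a call such as daily_revenue.is_current(timedelta(hours=24)).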

To provide a more concrete sense of what lineage might look like, the figure below shows a simple example of data lineage in our data engineering platform, Magpie.

LINEAGE IN CONTEXT

Data lineage doesn’t exist in a vacuum; it is one of many tools you can use during the data engineering process. The figure below shows how different tooling maps to some common activities within the data engineering lifecycle.

For example, lineage, coupled with the ability to explore data using ad hoc queries and access to detailed user activity and system logs, provides a comprehensive tool set for diagnosing issues. In Magpie, you can access all of this tooling through a single consistent interface and benefit from an integrated repository for the metadata supporting ETL pipelines, security, and data lineage.

IN PRACTICE: GREATER TRANSPARENCY & TRUST

Data lineage is valuable because it creates greater transparency and trust in enterprise data. Lineage lets users see the upstream dependencies associated with a particular dataset and the transformations applied to create it. This makes lineage a useful tool not only in analysis but also in identifying and diagnosing issues.

The table below lists a few of the common questions associated with problem diagnosis. It illustrates how lineage can simplify and accelerate answering some of the routine questions associated with understanding your data.

Problem Diagnosis without and with Lineage

QUESTION	WITHOUT LINEAGE	WITH LINEAGE
When was this data refreshed?	• Review ETL tool logs if you have access. • Find out who owns the update process and ask that person. • Query the data and infer last update based on timestamp.	View the lineage
What are the upstream sources that might be the root cause of the problem?	• Review the ETL process flow if you have access. • Ask the owner of the process.	View the lineage
What are the transformation rules applied in updating the data?	• Review the ETL process flow if you have access. • Review upstream dependencies. • Ask the owner of the process.	View the lineage
Are we sure the data wasn’t updated outside of the routine process?	• Query the data and look for signs that something happened out of process (timestamps, etc.). • Ask around.	View the lineage
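
To illustrate what "View the lineage" might amount to behind the scenes, the following Python sketch answers each of the four questions from a small, in-memory lineage store. Everything here is an assumption made for illustration: the LINEAGE dictionary layout, the dataset names, the "nightly_etl" job name, and the helper functions are hypothetical and do not represent Magpie's implementation.

```python
from datetime import datetime, timezone


def utc(day, hour, minute=0):
    """Helper for building sample timestamps in January 2024 (UTC)."""
    return datetime(2024, 1, day, hour, minute, tzinfo=timezone.utc)


# Hypothetical lineage metadata: upstream sources, the transformation
# applied, and a log of who updated each dataset and when.
LINEAGE = {
    "raw_orders": {
        "upstream": [],
        "transformation": "loaded from order system export",
        "updates": [{"at": utc(3, 6), "by": "nightly_etl"}],
    },
    "raw_customers": {
        "upstream": [],
        "transformation": "loaded from CRM export",
        "updates": [{"at": utc(3, 6), "by": "nightly_etl"}],
    },
    "orders_enriched": {
        "upstream": ["raw_orders", "raw_customers"],
        "transformation": "JOIN raw_orders to raw_customers ON customer_id",
        "updates": [{"at": utc(3, 6, 30), "by": "nightly_etl"}],
    },
    "daily_revenue": {
        "upstream": ["orders_enriched"],
        "transformation": "SUM(amount) GROUP BY order_date",
        "updates": [
            {"at": utc(2, 7), "by": "nightly_etl"},
            {"at": utc(2, 15, 30), "by": "jsmith"},
            {"at": utc(3, 7), "by": "nightly_etl"},
        ],
    },
}


def last_refreshed(dataset):
    """When was this data refreshed?"""
    return max(update["at"] for update in LINEAGE[dataset]["updates"])


def all_upstream(dataset):
    """What are the upstream sources (direct and indirect)?"""
    seen = []
    stack = list(LINEAGE[dataset]["upstream"])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(LINEAGE[current]["upstream"])
    return seen


def transformation_rules(dataset):
    """What transformations were applied, here and upstream?"""
    names = [dataset] + all_upstream(dataset)
    return {name: LINEAGE[name]["transformation"] for name in names}


def out_of_process_updates(dataset, routine_job="nightly_etl"):
    """Was the data updated outside of the routine process?"""
    return [u for u in LINEAGE[dataset]["updates"] if u["by"] != routine_job]
```

For example, out_of_process_updates("daily_revenue") surfaces the manual update by "jsmith" in the sample data, which would otherwise require inspecting timestamps or asking around.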

IMPLEMENTATION CHALLENGES

Unfortunately, it can be difficult to fully realize the value of data lineage for a number of reasons:

Tracking lineage often requires third-party tools that many organizations may not have access to or be able to implement.
Even when tools are present, lineage is often only available to the users of specialized data management tools. For example, many ETL tools support some limited form of lineage, but they are generally used only by a subset of data engineers.
Most organizations move and transform data through a range of tools, so no single tool can generate a clear, end-to-end lineage.

While these challenges exist, they can be overcome. In many organizations, limiting the availability of lineage to the subset of users who work with ETL tools is better than nothing. Standalone data governance tools can often provide visibility into lineage, but they can be difficult for organizations to adopt and require specialized skills. Finally, end-to-end tools like our platform, Magpie, combine lineage with a broader set of capabilities, making lineage accessible to a wider audience of users.

If you are considering how to implement lineage in your own environment, start with an inventory of the tools you are currently using and explore how lineage could fit in. That could mean adding tools, leveraging capabilities that already exist but have not been implemented, or reconsidering your toolset to move toward a more integrated platform.

SEE IT IN ACTION: DATA LINEAGE WITH MAGPIE

The best way to understand the value of lineage is to see it in action. The video linked below includes a brief review of some of the concepts in this blog post and a demonstration of lineage in the Magpie platform.