What is data lineage (and why should I care)?

Written by Ross Turk on June 22, 2021

Any real-world data architecture is made up primarily of madness and chaos.

Your most cared-for data pipeline, the one you spend a lot of time keeping neat, the one that moves your most important data, is a total mess. If you’ve got one that’s tidy and easy to understand, it must not do much. This is inherently messy work. The information you cherish is in a web of datasets, continuously manipulated by jobs operating on a collection of platforms run by all kinds of different people. It’s an imperfect, beautiful mess…and it kind of has to be.

Don’t worry! This is normal, and reflects the basic nature of knowledge: fragmented, chaotic, and multidimensional. It also reflects the fact that data in any organization is produced and used by a whole bunch of different people with different processes, technologies, and needs. Even for those of us in small organizations, important insights most often come from a collection of data sources. That’s why data lineage is important.

What is data lineage?

By mashing together datasets, we create a collection of interdependencies. These relationships are entropic in nature and seldom straightforward. When someone asks us whether a specific dataset is reliable – and they inevitably will – the first thing we want to know is where it came from. Next, we want to understand everything we can about its creation. 

That’s data lineage: a dataset’s context within a broader data ecosystem, obtained by observing and analyzing its origins and influence on other datasets.

If, for example, you care a whole lot about how much revenue your food delivery service has generated this month (and of course you do!), that means you also care about how many orders you had and the total value of each order. That’s where your monthly revenue number comes from. This information can’t all be found in a single place, sadly. Multiple ordering systems and channels complicate things, meaning that your data probably lives in a half dozen different places.

Your monthly revenue number is only accurate if all of those data sources are accurate. That’s pretty easy to keep inside your head, but what if you want to forecast the use of particular ingredients? That adds a new set of data sources – suppliers, recipes, inventory – and it’s suddenly not so easy to keep everything inside your head. It’s no longer easy to be confident that you know where the number comes from, or how it was derived.

Data lineage solutions keep track of these relationships as data moves through the pipeline, so you don’t end up trying to piece things together when something goes wrong.
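To make the idea concrete, here is a minimal sketch in Python of what keeping track of these relationships might look like. It is not how any particular lineage product works; the LineageGraph class and every job and dataset name in it are made up for the food delivery example.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store (hypothetical): which jobs read and write which datasets."""

    def __init__(self):
        self.inputs = defaultdict(set)     # job -> datasets it reads
        self.producers = defaultdict(set)  # dataset -> jobs that write it

    def record_run(self, job, inputs, outputs):
        """Remember what a single job run read and wrote."""
        self.inputs[job].update(inputs)
        for dataset in outputs:
            self.producers[dataset].add(job)

    def upstream(self, dataset):
        """Return every dataset that feeds into `dataset`, directly or indirectly."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for job in self.producers.get(current, ()):
                for source in self.inputs.get(job, ()):
                    if source not in seen:
                        seen.add(source)
                        stack.append(source)
        return seen


# Made-up jobs and datasets for the food delivery example.
graph = LineageGraph()
graph.record_run("combine_orders", ["web_orders", "app_orders", "phone_orders"], ["orders"])
graph.record_run("monthly_revenue_report", ["orders"], ["monthly_revenue"])
graph.record_run("ingredient_forecast", ["orders", "recipes", "inventory"], ["ingredient_usage"])

print(sorted(graph.upstream("monthly_revenue")))
```

Asking the graph where monthly_revenue comes from returns the orders table plus each ordering channel behind it, which is exactly the thing that gets hard to keep inside your head once the forecast and its extra sources show up.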

Why should I care?

Let’s say something has blown up in your pipeline. Somebody altered the schema for one of your datasets. It’s not a big deal – an INT in your orders table has become a VARCHAR – but you don’t know that yet. All you see is a list with a bunch of failed jobs. You imagine the failures are related, but you don’t know exactly how. Where do you start?

This is a very common situation. There is a widespread outage with a specific cause, and remediation requires a simple adjustment, but what you are presented with is the chaos of cascading failure. The problem is that you’re looking at a list. It may be an incredibly detailed list: it shows you the name of each job, its status, and stats on its last run. It gives you a lot of information you won’t need to solve this problem, but it leaves you with few clues on where to start.

As you sift through the muck, you create a mental graph of what things do and how they’re related. At some point your mental graph becomes complete enough to know where the problem is. Hooray! You fix it and probably forget most of what you learned. But this isn’t optimal. You have to do all of this while something is broken, people are concerned, and all eyes are on you.

A real-time data lineage repository does all of this for you, and it doesn’t wait until everything is on fire to get started. It observes running jobs, maps input and output datasets, and traces complex relationships as they change. Then it can use the lineage metadata it collects to display a lineage graph that helps you understand everything, resolve issues, and prevent new ones.

We’ll go out on a limb here and claim this as a universal truism: in the same way that a picture is worth a thousand words, graphs are waaaayy more useful than lists. They make the critical relationships between datasets clear. They help you create order from chaos.
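Here is a rough sketch, in Python, of how a graph turns that list of failed jobs into a place to start. It assumes you already know each job’s input and output datasets; the first_failures function and all of the job and dataset names are hypothetical, not part of any real tool.

```python
def first_failures(job_inputs, job_outputs, failed):
    """Return the failed jobs that have no failed job anywhere upstream of them."""
    # Index: dataset -> jobs that write it.
    producers = {}
    for job, outputs in job_outputs.items():
        for dataset in outputs:
            producers.setdefault(dataset, set()).add(job)

    def upstream_jobs(job):
        """Every job whose output feeds `job`, directly or indirectly."""
        seen = set()
        stack = [job]
        while stack:
            current = stack.pop()
            for dataset in job_inputs.get(current, ()):
                for parent in producers.get(dataset, ()):
                    if parent not in seen:
                        seen.add(parent)
                        stack.append(parent)
        return seen

    failed = set(failed)
    return sorted(
        job for job in failed
        if not ((upstream_jobs(job) - {job}) & failed)
    )


# Made-up pipeline: the schema of "orders" changed, so its readers broke.
job_inputs = {
    "load_orders": {"raw_orders"},
    "daily_totals": {"orders"},
    "monthly_revenue_report": {"daily_totals"},
    "ingredient_forecast": {"orders", "recipes"},
}
job_outputs = {
    "load_orders": {"orders"},
    "daily_totals": {"daily_totals"},
    "monthly_revenue_report": {"monthly_revenue"},
    "ingredient_forecast": {"ingredient_usage"},
}
failed_jobs = ["monthly_revenue_report", "ingredient_forecast", "daily_totals"]

starting_points = first_failures(job_inputs, job_outputs, failed_jobs)
shared_inputs = set.intersection(*(job_inputs[job] for job in starting_points))
print(starting_points, shared_inputs)
```

In this made-up pipeline, the three failed jobs collapse to two starting points, and the only input those two share is the orders table. That is a much better clue than a list of job names and statuses.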

This seems obvious. What’s the big deal?

Yes, collecting data about the movement of data is a pretty obvious thing to do. Like all powerful ideas, it’s very simple: you should understand everything that’s going on as it’s happening. But that’s easier said than done.

Tracing data lineage in real time is a surprisingly hard problem. Most pipelines are built using a heterogeneous collection of tools from multiple suppliers. Real-time data lineage systems can’t learn the full truth by studying datasets and queries after the fact; they need to observe jobs as they run. Does that mean you have to rebuild your entire pipeline? Let’s hope not, because we all know that isn’t going to happen. Even if you wanted to, it’s just not practical. There needs to be a way to collect real-time lineage metadata from existing pipelines.

Solving this doesn’t require complicated technology, although there are certainly some very interesting challenges involved. The only magic is the commoditization of lineage metadata collection through the creation of an open standard, OpenLineage. The number of data solution vendors supporting OpenLineage is growing. You can learn more in this introduction post.
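To give a feel for what lineage metadata can look like, here is a small Python sketch that builds an event describing one job run. The field names follow the general shape of an OpenLineage run event as we understand it, so treat them as an approximation and check the specification before relying on them. Real events carry more fields than shown here. The make_run_event helper, the namespace, the producer URL, and the job and dataset names are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs, namespace="food_delivery"):
    """Build a dict shaped roughly like an OpenLineage run event (approximation)."""

    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,                              # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),  # when the event happened
        "run": {"runId": str(uuid.uuid4())},                  # unique ID for this run
        "job": {"namespace": namespace, "name": job_name},    # which job ran
        "inputs": [dataset(name) for name in inputs],         # datasets it read
        "outputs": [dataset(name) for name in outputs],       # datasets it wrote
        "producer": "https://example.com/hypothetical-scheduler",  # made-up emitter URI
    }


event = make_run_event("COMPLETE", "monthly_revenue_report", ["orders"], ["monthly_revenue"])
print(json.dumps(event, indent=2))
```

The point of a shared format like this is that a scheduler, a warehouse, and a notebook can all describe their runs the same way, so a lineage system can collect metadata from an existing pipeline without anyone rebuilding it.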

Collecting lineage metadata (with OpenLineage!) is just the beginning. Real-time data lineage unlocks new possibilities in operations, data quality, governance, and more. To see some of the things that can be done, request an invite to the Datakin beta.