A real-time approach to data lineage
Written by Ross Turk on August 5, 2021
A data ecosystem that spans multiple pipelines, teams, and platforms can be overwhelming. Each dataset and job exists in a unique operational context, with interdependencies that may seem simple…until they multiply. Every tiny piece has something in common, though: when it breaks, it becomes the most important thing to everyone you know.
Lineage metadata is the thread that connects it all together. By capturing the relationships between datasets – and how those relationships change over time – you can begin to see the chaos of your data ecosystem as a woven fabric: still intricate and delicate, but in a familiar and recognizable shape. However, there is more than one way to capture lineage metadata.
How lineage metadata has traditionally been captured
For most of us, lineage metadata is collected when we need it most: during an active investigation. Maybe a job has failed and we are looking for the cause…which we determine, after looking at source code and SQL queries, is another failed job. When we look into that job, it failed because of yet another failed job. And it’s turtles all the way down: you will be following the trail as far as it goes, and you don’t know how far that is. You don’t even really know how long it’s been broken. This is clearly not ideal, but we’ve all done it.
Fortunately, there are tools that can help. Some of them look through your source code, parsing SQL queries to determine which data sources and tables they SELECT from and INSERT into. For the purposes of this discussion, let’s call these tools static code analysis systems. Of course, anyone who has spent any time with SQL knows there’s more than one way to put data into a table. Static code analysis tools have developed sophisticated methods to deal with all of the corner cases – but it’s still easy to miss something.
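To make the idea concrete, here is a toy sketch of what static code analysis does at its simplest. It is not how production tools work; they use full SQL parsers, and this regex approach handles only the plainest INSERT INTO … SELECT … FROM pattern, which is exactly why corner cases get missed:

```python
import re

# Toy static analyzer: extract input/output tables from a SQL string.
# Real static code analysis tools use full SQL parsers; this sketch
# only handles simple INSERT INTO ... SELECT ... FROM/JOIN statements.
def extract_lineage(sql):
    outputs = re.findall(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    inputs = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {"inputs": inputs, "outputs": outputs}

sql = """
INSERT INTO analytics.daily_totals
SELECT order_date, SUM(amount)
FROM sales.orders
JOIN sales.customers ON orders.customer_id = customers.id
GROUP BY order_date
"""
print(extract_lineage(sql))
# → {'inputs': ['sales.orders', 'sales.customers'],
#    'outputs': ['analytics.daily_totals']}
```

A dynamically generated query, a CREATE TABLE AS, or a templated table name would slip right past this sketch, which is the gap real tools spend enormous effort closing.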
There is another class of tools that study query logs, looking for the same lineage information. By studying queries that have been executed, they create a rough mapping of input/output datasets. Let’s call them query log analysis systems. They search the road already traveled to find challenges already experienced. Like archeologists, these tools study history and infer elements of a story based on what they find. It isn’t an incorrect story, but it’s rarely complete.
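The query-log approach can be sketched the same way. This hypothetical example assumes the log is just a list of executed SQL strings (real logs are far messier) and shows both what log analysis recovers and what it loses:

```python
import re
from collections import defaultdict

# Toy query-log analyzer: infer a dataset dependency graph from a log
# of executed queries. Assumes the log is a list of raw SQL strings;
# real query logs are messier and carry much less structure.
def infer_graph(query_log):
    edges = defaultdict(set)
    for sql in query_log:
        outs = re.findall(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
        ins = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
        for out in outs:
            edges[out].update(ins)
    return {out: sorted(ins) for out, ins in edges.items()}

log = [
    "INSERT INTO reports.summary SELECT * FROM staging.events",
    "INSERT INTO reports.summary SELECT * FROM staging.clicks",
]
print(infer_graph(log))
# → {'reports.summary': ['staging.clicks', 'staging.events']}
```

Notice what is missing: the two log entries collapse into one set of edges, with no record of which job ran them, when, or whether they belong to the same pipeline. That is the "incomplete story" in practice.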
It is possible to collect lineage metadata in real-time
You don’t need to imagine what happened, or piece it together…you can know it. It’s just a matter of being in the right place at the right time, observing the data as it moves, and collecting everything in a lineage repository.
Lineage repositories are a neat bit of tech, with all sorts of interesting problems to solve, but being in the right place at the right time is the hard part. There are a ton of different frameworks that schedule jobs, tabulate totals, process streams, and train ML models – and some parts of your pipeline might even be custom-built. However, the introduction of an open standard for lineage metadata collection (OpenLineage) with connectors for the major frameworks makes this a whole lot easier.
Once you make the shift to real-time, data lineage grows beyond its governance and compliance roots and becomes an indispensable operational tool. When a job fails, it’s important to know exactly what it affected without any guesswork. Real-time data lineage systems that implement OpenLineage, like Datakin and Marquez, push lineage metadata as jobs run, keeping a historical record.
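What does "pushing lineage metadata as jobs run" look like? Here is a hand-built sketch of an OpenLineage-style run event. The top-level field names (eventType, eventTime, run, job, inputs, outputs, producer) follow the OpenLineage spec, but the namespaces, job name, and producer URL are made up for illustration; in practice you would emit events through an OpenLineage client and a framework connector rather than constructing them yourself:

```python
import json
import uuid
from datetime import datetime, timezone

# Sketch of an OpenLineage-style run event, built with the stdlib.
# Field names follow the OpenLineage spec; namespaces and the producer
# URL below are hypothetical placeholders.
def make_run_event(event_type, job_name, inputs, outputs, run_id):
    return {
        "eventType": event_type,  # START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my-pipeline", "name": job_name},
        "inputs": [{"namespace": "my-db", "name": n} for n in inputs],
        "outputs": [{"namespace": "my-db", "name": n} for n in outputs],
        "producer": "https://example.com/my-scheduler",  # hypothetical
    }

# The connector emits one event when the job starts and another when it
# finishes, sharing a runId so the repository can tie them together.
run_id = str(uuid.uuid4())
start = make_run_event("START", "daily_totals",
                       ["sales.orders"], ["analytics.daily_totals"], run_id)
complete = make_run_event("COMPLETE", "daily_totals",
                          ["sales.orders"], ["analytics.daily_totals"], run_id)
print(json.dumps(start, indent=2))
```

Because each event carries the run, the job, and its inputs and outputs at the moment of execution, the lineage repository never has to reconstruct the story after the fact.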
Why you should collect real-time lineage
Better accuracy: Static code analysis systems rely on queries being “fairly normal”. If the query is overly complex, dynamically generated, or contains tokenized input/output tables, static code analysis systems might not understand what they’re looking at. Query log analysis systems generate metadata that is partial at best. They learn what they know from parsing logs that were designed to be read by humans, not machines. Important metadata is lost along the way, making these logs a poor source for automated analysis. Real-time lineage systems like Datakin and Marquez employ a more reliable method: direct observation.
Greater coverage: Static code and query log analysis systems both rely on their understanding of SQL queries, which are commonplace for relational databases. But what if your data isn’t in a relational database? What about streaming or non-SQL batch workloads? Real-time lineage systems built around open, extensible standards like OpenLineage can work in heterogeneous pipelines…even ones we haven’t imagined yet.
Fresher operational data: Static code analysis systems don’t have operational context for the queries they find. They don’t know when queries ran, or if they ran at all. Similarly, log analysis systems know about queries, but don’t have context on jobs themselves. They may see a similar query every hour, but won’t know whether it was created by the same job. Real-time lineage systems immediately become aware of job changes and can analyze their impact.
Datakin is an end-to-end operations solution based on real-time data lineage. Give it a try for free!