Data Pipeline Diffing with Datakin Written by Peter Hicks on June 3, 2021
Figure 1: Datakin enables pipeline diffing across runs
Envision yourself with a suddenly failing ETL task that you haven’t touched in months. You look at the code and nobody else has touched it, and nothing comes to mind about how this could have happened. You’re getting incredulous messages from engineers and managers in different departments. Tasks are backing up in the queue and you’re wary of a looming postmortem on the horizon, but need to understand what could have possibly caused this to happen and how to unblock other people in your organization.
Where could you start looking to resolve this?
This is one of the questions we started asking ourselves as we started aggregating lineage information with Marquez. Our original vision was to collect the links between jobs and datasets, collate the metadata associated with those entities, and provide broad observability to a highly complex data ecosystem. We soon realized that our metadata collection actually enables us to reconstruct the entire history of a data lineage pipeline. The snapshots captured by Marquez checkpoint and document the evolution of data lineage over time.
When code starts failing in production, software engineers will often look at a code diff as a first debugging step to understand what has changed from one release to the next. We wanted to bring forward this concept to a data pipeline that can span database providers, source code locations, teams, and maintainers. Job failure can occur for many reasons, including code bugs, dataset availability, or unpredictable pipeline issues upstream. In our example below, we see that Job 5 has failed, but is in the middle of a web of upstream and downstream dependencies.
Figure 2: The most recent run of Job 5 has failed.
The agents of change in a data pipeline are runs, and as a result, they are the foundation for our interaction to debug a job failure. If a failing run is selected along with a successful run, we can reconstruct both lineage graphs and see show the elements that have changed between those runs.
Figure 3: Select Runs 3 and 4 to see the difference in the entire run level graph between those two executions.
When constructing our comparison graph, our original graph is used as a basis since it reflects the current state of the data pipeline. This allows us to ignore previous inputs and outputs that are no longer present and retain focus on jobs and datasets that still exist in our current generation. Attention can be brought to upstream jobs with code changed or upstream datasets with an underlying schema change. Since Marquez has collected all of the corresponding metadata for all the previous versions of jobs and datasets, underlying changes can be brought forward to the data engineer trying to debug a pipeline.
Figure 4: We see that both the code for Job 3 and the schema for Dataset D have changed between our selected runs.
Bringing our Vision to Life
In building an interface to navigate job failure, we wanted to closely preserve the abstractness of both jobs and datasets and closely reflect our technical prototypes. The application enables both the broad exploration of a particular data pipeline as well drilling down into pipeline issues with our comparison tab. We hope to empower data engineers to debug pesky data pipeline issues in a way they could have never before.
Figure 5: Select two runs to compare across.
Figure 6: Showing the diff of a dataset for spanning runs of a downstream job while anchored to job and two of its runs.