
Blog
How I learned to stop worrying and love lineage
Written by Laurent Paris on July 13, 2021
You’re a data engineer, and you’re dreading the coming week. It’s the “week of hell”: your turn to be on-call. You will be responsible for any issue that arises in the hundreds of data pipelines your company relies on to conduct business every single day. What makes it worse is that while you understand how some of these pipelines work because you wrote them yourself, the vast majority were written by your colleagues and remain an enigma to you.
The typical on-call of a data engineer
Naturally, what you feared most happens: an analyst pings you, worried that the values in a dashboard look wrong, and asks you to troubleshoot the issue.
Thus starts the usual detective work 🙄. It takes you hours to slowly reverse-engineer the flow of data – from files imported from public sources, to database tables, to the ML models feeding the broken dashboard. Along the way you have to read the code of the jobs producing these datasets, identify who wrote them, and ask for their help when you’re stuck.
Finally you identify the root cause of the problem. You modify the code of the problematic job and push the fix to production. Now you just have to remember all the jobs that you need to rerun manually to update all the downstream datasets and populate the problematic dashboard with correct values.
The next day, to add insult to injury, another analyst raises the alarm on a different dashboard. You quickly realize you forgot to manually trigger a secondary pipeline that consumes the dataset you repaired the day before. Frustrated, you decide to try out a solution you’ve heard about that automatically tracks lineage in real time.
Lineage makes identifying the root cause of a problem (and repairing it) a breeze
A few weeks later you are on-call again, but this time your experience is completely different.
Just like last time, an analyst asks you to investigate a suspicious dashboard. You log into your new tool and, looking at a chart showing the number of rows added every hour, you quickly realize that a table hasn’t been updated since yesterday.
A glance at the lineage graph helps you immediately identify the various upstream datasets and jobs involved in the production of the faulty table. You quickly zoom in on a job that seems to be failing.
It is not immediately obvious why this job is failing, so you use an advanced feature that allows you to analyze what changed in the overall environment between the last time the job successfully ran and the first time it failed.
This helps you quickly narrow down to a code change in a job that altered the schema of an output dataset, breaking the code in the failing job.
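The kind of breaking change such a diff surfaces is easy to check for yourself. The sketch below is purely illustrative – the field names, types, and helper are made up, not part of any real lineage API. It compares a dataset’s schema before and after an upstream code change and flags anything that can break consumers:

```python
def schema_diff(old, new):
    """Compare two dataset schemas (field name -> type) and report
    the changes that can break downstream consumers."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    retyped = sorted(f for f in set(old) & set(new) if old[f] != new[f])
    return {"removed": removed, "added": added, "type_changed": retyped}

# Hypothetical example: the upstream job renamed `user_id` to
# `account_id` and widened `amount` from DECIMAL to DOUBLE.
old = {"user_id": "BIGINT", "ts": "TIMESTAMP", "amount": "DECIMAL"}
new = {"account_id": "BIGINT", "ts": "TIMESTAMP", "amount": "DOUBLE"}
print(schema_diff(old, new))
```

Any downstream job still selecting `user_id` fails as soon as the renamed schema lands, which is exactly the failure mode in the story above.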
The whole troubleshooting process took only 10 minutes, compared to the hours you spent last time. Even better, triggering the downstream jobs to backfill correct data is less risky now, because you can visualize all of them in the lineage view.
Once you realize that everything in the lineage graph can also be accessed through an API, you start to write a script to automate backfills. Since you can precisely identify the root cause, reprocessing all of the downstream jobs can be done automatically to avoid manual errors.
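A first version of such a script might look like the sketch below. The graph layout, dataset names, and job names are all hypothetical – a real script would fetch these edges from the lineage API and call your scheduler instead of printing:

```python
from collections import deque

# Hypothetical lineage edges: which jobs consume each dataset,
# and which datasets each job produces.
CONSUMERS = {
    "raw.events": ["build_sessions"],
    "analytics.sessions": ["train_model", "refresh_dashboard"],
    "ml.model": ["refresh_dashboard"],
}
OUTPUTS = {
    "build_sessions": ["analytics.sessions"],
    "train_model": ["ml.model"],
    "refresh_dashboard": [],
}

def downstream_jobs(repaired_dataset):
    """Breadth-first walk from the repaired dataset, collecting each
    affected job exactly once (a production script would also order
    the jobs topologically before triggering them)."""
    seen, order = set(), []
    queue = deque([repaired_dataset])
    while queue:
        dataset = queue.popleft()
        for job in CONSUMERS.get(dataset, []):
            if job not in seen:
                seen.add(job)
                order.append(job)
                queue.extend(OUTPUTS[job])
    return order

# After fixing "raw.events", rerun every downstream job:
for job in downstream_jobs("raw.events"):
    print(f"triggering backfill for {job}")
```

Because the job list is derived from the lineage graph itself, a forgotten secondary pipeline like the one from the first on-call can no longer slip through.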
Lineage can even prevent issues
The day after, you share your experience with your colleagues and show them that, before changing the code of a job, they can quickly identify all the downstream dependencies of the dataset it produces.
This lets them inspect the code of the data pipelines potentially impacted by their change, and warn the relevant teams in advance to avoid surprises.
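A minimal impact check along those lines can be a few lines of code. Everything below is a hypothetical sketch – the graph, dataset names, and owner mapping are invented for illustration, not taken from a real API:

```python
# Hypothetical lineage edges and dataset ownership.
DOWNSTREAM = {
    "analytics.sessions": ["ml.features", "bi.revenue_dashboard"],
    "ml.features": ["ml.churn_model"],
}
OWNERS = {
    "ml.features": "ml-team",
    "bi.revenue_dashboard": "analytics-team",
    "ml.churn_model": "ml-team",
}

def teams_to_warn(dataset):
    """Walk every dataset downstream of `dataset` and collect the
    teams that own them, so they can be warned before the change."""
    affected, stack = set(), [dataset]
    while stack:
        for child in DOWNSTREAM.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return sorted({OWNERS[d] for d in affected})

print(teams_to_warn("analytics.sessions"))  # → ['analytics-team', 'ml-team']
```

Running this before merging a schema change turns “who do I need to tell?” from guesswork into a one-liner.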
The next time you’re on-call, you know you’ll be much less stressed than you used to be, thanks to a real-time, lineage-based operations tool.
Want to make your life easier during your next on-call? Try Datakin for free.