Studying Job Duration

Written by Peter Hicks

October 25, 2021

A modern data pipeline is a large, complex, and often fragmented system with cascading interactions across multiple tools and platforms. It can be difficult to evaluate longer-term pipeline health in the absence of discrete warnings and failures, and to track tasks and dependencies across multiple teams and disparate systems. At Datakin, we’ve homed in on the runtime of pipeline jobs as a key metric to watch in daily data operations.

Duration Issues

Job runtime is a powerful metric because it’s fairly simple to identify (just watch for statistical outliers) and because it can be indicative of many common pipeline issues. An abnormally brief job run might indicate that no rows were available to be processed from an input dataset, and therefore the job returned almost immediately. An abnormally long job run could indicate various problems such as deadlock, lack of resource availability, or query inefficiency. In the traditional list representation of a DAG, duration issues are not only hard to identify but also costly and time-consuming to diagnose.
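The outlier check described above can be sketched with a simple z-score over a job's run history. This is an illustrative toy, not Datakin's actual detection logic; the function name and threshold are made up for the example:

```python
import statistics

def flag_outlier_run(durations, latest, threshold=3.0):
    """Flag a run whose duration deviates from history by more than
    `threshold` standard deviations. A simple z-score sketch; real
    anomaly detection would be more robust to small samples."""
    if len(durations) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)
    if stdev == 0:
        return latest != mean
    return abs((latest - mean) / stdev) > threshold

# A job that usually takes ~300s suddenly finishes in 2s,
# suggesting an empty input dataset:
history = [295, 310, 302, 288, 305, 299]
flag_outlier_run(history, 2)    # True: abnormally short run
flag_outlier_run(history, 301)  # False: within normal range
```

In practice you'd also want to distinguish short outliers (likely empty inputs) from long ones (likely deadlock or resource contention), since the appropriate response differs.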

Data Operations

At Datakin, we’ve worked to make it easy both to detect operational issues in current jobs and to inspect a run anomaly more deeply in context.

On your home page, we surface jobs with recent runs that we’ve determined could be problematic in a pipeline. Sometimes these discrepancies are due to expected environmental changes, like shifts in dataset size or hardware adjustments, but when they aren’t anticipated it’s often because of an issue with the job itself. From our dashboard you can navigate directly to the issue on our lineage page, where you can see concrete information about why we flagged a given job in our initial analysis.

See if we've detected any issues for you.

Job Duration & Lineage

Our job duration tab enables a more detailed inspection of how a given job fits into the overall pipeline. You can evaluate the execution times of all the upstream jobs in the most recent run cycle.

This allows you to visualize the current state of a DAG pipeline in context, helping to identify potential bottlenecks. You can also compare runtimes between runs of a single job within a pipeline to watch for systemic changes. This can be helpful when trying to identify whether a code change had a beneficial or detrimental effect on the runtime of a job. Below we’ll see both a contextual view for a long-running job, as well as a historical comparison that shows a sudden trend upward.

The contextual view of a sluggish job.
The historical view for a single job.
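The historical comparison above can be sketched as a windowed-mean check: compare the most recent runs against the earlier baseline and report the direction of any systemic shift. The function and tolerance below are hypothetical, chosen only to illustrate the idea:

```python
import statistics

def runtime_trend(durations, window=3, tolerance=0.25):
    """Compare the mean of the most recent `window` runs against the
    baseline mean of all earlier runs. Returns "up", "down", or
    "stable". A toy sketch for spotting a systemic shift after a
    code change; the tolerance here is arbitrary."""
    if len(durations) <= window:
        return "stable"  # not enough history for a baseline
    baseline = statistics.mean(durations[:-window])
    recent = statistics.mean(durations[-window:])
    if recent > baseline * (1 + tolerance):
        return "up"
    if recent < baseline * (1 - tolerance):
        return "down"
    return "stable"

# A deploy between run 5 and run 6 doubles the job's runtime:
runs = [120, 118, 122, 119, 121, 240, 236, 244]
runtime_trend(runs)  # "up": the sudden trend upward is detected
```

A check like this is what makes before/after comparisons of a code change tractable: a single slow run is noise, but a sustained shift in the windowed mean is a signal.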

What's Next?

Our current knowledge about jobs and their corresponding runs provides the groundwork for even more features that benefit pipeline operators. In the future, we hope to be able to capture user-defined SLAs and generate specific preemptive pipeline feedback. For instance, if a dataset needs to be available by midnight for consumption by another team, we could evaluate upstream dependencies and generate early notifications about impending lateness. We also plan to integrate with pipeline and job orchestration tools, which will allow you to perform automated repairs and dynamic changes to your system in response to issues.
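The midnight-SLA scenario above amounts to projecting a finish time from upstream run history and comparing it against the deadline. The sketch below is deliberately naive (it assumes the remaining upstream jobs run sequentially and at their average durations), and every name in it is hypothetical:

```python
from datetime import datetime, timedelta

def predict_lateness(now, deadline, upstream_avg_durations):
    """Estimate whether the remaining upstream jobs can finish before
    the SLA deadline, using their average durations in seconds.
    Assumes sequential execution; real pipelines run jobs in
    parallel and durations vary."""
    projected_finish = now + timedelta(seconds=sum(upstream_avg_durations))
    return projected_finish > deadline

# Three remaining upstream jobs averaging 2h, 1h, and 30m,
# checked at 9 PM against a midnight SLA:
now = datetime(2021, 10, 25, 21, 0)
deadline = datetime(2021, 10, 26, 0, 0)
predict_lateness(now, deadline, [7200, 3600, 1800])  # True: projected 12:30 AM
```

The value of an early check like this is that the notification can go out hours before the deadline is actually missed, while there is still time to intervene.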

So that’s it for now! Check out the Duration Tab in our product demo and don’t hesitate to sign up for Datakin to gain visibility into your data pipeline and job durations.