A modern data pipeline is a large, complex, and often fragmented system with cascading interactions across multiple tools and platforms. It can be difficult to evaluate longer-term pipeline health in the absence of discrete warnings and failures, and to track tasks and dependencies across multiple teams and disparate systems. At Datakin, we’ve homed in on the runtime of pipeline jobs as a key metric to watch in daily data operations.
We’ve worked to make it easy both to detect current operational issues with jobs and to perform more advanced introspection of a run anomaly in context.
On your home page, we surface jobs with recent runs that we’ve determined could be problematic in a pipeline. Sometimes these anomalies are due to expected environmental changes, like a growing dataset or a hardware adjustment; when they aren’t anticipated, the cause is often an issue with the job itself. From our dashboard, you can navigate directly to the issue on our lineage page, where you’ll see concrete information about why we flagged a given job in our initial analysis.
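One simple way to flag a problematic run is to compare its duration against the job’s own history. The sketch below is a minimal illustration of that idea using a z-score cutoff; the function name and threshold are assumptions for this example, not Datakin’s actual detection logic.

```python
from statistics import mean, stdev

def flag_anomalous_run(durations, latest, threshold=3.0):
    """Flag a run whose duration deviates sharply from the job's history.

    durations: past run durations in seconds
    latest:    duration of the newest run
    Returns True when the latest run is more than `threshold`
    standard deviations away from the historical mean.
    """
    if len(durations) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(durations), stdev(durations)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# A job that normally takes ~10 minutes suddenly takes 30:
history = [600, 610, 595, 605, 600]
print(flag_anomalous_run(history, 1800))  # True
```

In practice a production system would also account for expected drivers of variance, such as input dataset size, before raising an alert.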
Job Duration & Lineage
Our job duration tab enables a more detailed inspection of how a given job fits into the overall pipeline. You can evaluate the execution times of all the upstream jobs in the most recent run cycle.
This allows you to visualize the current state of a DAG pipeline in context, helping to identify potential bottlenecks. You can also compare runtimes between runs of a single job within a pipeline to watch for systemic changes. This can be helpful when trying to identify whether a code change had a beneficial or detrimental effect on the runtime of a job. Below we’ll see both a contextual view for a long-running job and a historical comparison that shows a sudden trend upward.
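A sudden upward trend like the one above can be detected by comparing the most recent runs against the job’s earlier baseline. Here is a minimal sketch of that comparison; the window size and ratio are illustrative assumptions, not product defaults.

```python
def runtime_trend_up(durations, window=3, ratio=1.5):
    """Detect a sustained upward shift in a job's run durations.

    Compares the mean of the most recent `window` runs to the mean
    of all preceding runs; returns True when recent runs are at
    least `ratio` times slower than that baseline.
    """
    if len(durations) <= window:
        return False  # too little history for a meaningful baseline
    baseline = durations[:-window]
    recent = durations[-window:]
    return (sum(recent) / window) >= ratio * (sum(baseline) / len(baseline))

# Durations jump after a code change: ~5 minutes -> ~8 minutes
print(runtime_trend_up([300, 310, 305, 480, 500, 520]))  # True
```

Using a window rather than a single run helps distinguish a systemic change (e.g. from a deploy) from a one-off slow run.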
Our current knowledge about jobs and their corresponding runs provides the groundwork for even more features that benefit pipeline operators. In the future, we hope to capture user-defined SLAs and generate specific preemptive pipeline feedback. For instance, if a dataset needs to be available by midnight for consumption by another team, we could evaluate upstream dependencies and generate early notifications about impending lateness. We also plan to integrate with pipeline and job orchestration tools, which will let you perform automated repairs and make dynamic changes to your system in response to issues.
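The SLA idea can be sketched with expected durations and upstream dependencies: walk the dependency chain, estimate when the final job can finish, and warn if that falls past the deadline. The pipeline below and the `estimated_finish` helper are hypothetical examples, not a planned API.

```python
from datetime import datetime, timedelta

def estimated_finish(job, jobs, start):
    """Estimate when `job` can finish, given expected durations and
    upstream dependencies. `jobs` maps name -> (duration, [deps]).
    Assumes each job starts as soon as all its dependencies finish."""
    duration, deps = jobs[job]
    ready = max((estimated_finish(d, jobs, start) for d in deps), default=start)
    return ready + duration

# Hypothetical pipeline that must deliver `report` by midnight:
jobs = {
    "extract":   (timedelta(hours=2), []),
    "transform": (timedelta(hours=3), ["extract"]),
    "report":    (timedelta(hours=1), ["transform"]),
}
start = datetime(2021, 6, 1, 19, 0)    # pipeline kicks off at 7 PM
deadline = datetime(2021, 6, 2, 0, 0)  # midnight SLA
finish = estimated_finish("report", jobs, start)
print(finish > deadline)  # True: predicted 1 AM, so notify early
```

Because the prediction is available before any job actually runs late, the consuming team can be warned hours ahead of the miss.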