Data Quality + Data Lineage = 💕💕💕 Written by Peter Hicks on Sep 2, 2021
In a prior life, I dwelled in the day-to-day cycles of an e-commerce platform. I worked with a quite generalized system with orders, products, variants, SKUs, and customers that pined for every discount they could come by.
The system built around the core business schema was the kind of chaos that data engineers are all too familiar with; large volumes of clickstream data, etl_warehouses, read replicas, and machine learning models were all consumed to run daily operations. We had both an application engineering team and a data team consuming our data and building dashboards at various cadences and informing business decisions at large. I was able to see firsthand the broad impacts of not having simple quality checks on our datasets and the subsequent post-mortems riddled with curious shrugs when things went awry.
Often, datasets have built-in assumptions that aren’t quite so obvious looking from the outside in. Below is an example from a simplified variant of an e-commerce database schema. Can you spot the issue with this task?
-- job: etl_last_month_categorized_line_items -- description: Find the number of line items purchased in the last month per category/taxon select t.name as taxon, sum(li.quantity) as categorized_line_items from ecomm.orders as o inner join ecomm.line_item li on o.id = li.order_id inner join ecomm.variant v on v.id = li.variant_id inner join ecomm.products p on v.product_id = p.id inner join ecomm.taxons t on p.taxon_id = t.id where o.status = 'COMPLETE' and completed_at > now() - interval '1 month' group by t.name order by categorized_line_items desc
It’s very likely that you’ll have trouble finding an issue in the above schema and query. It has several quirks that require context and insider information to interact with. In this particular instance, there is an innocuous column on the
orders table called
type and, looking at the schema alone, it isn’t clear what it’s there for. The column is also missing its enumerations, so all you can know about the field is that it’s a raw
It turns out in our case this business has what is sometimes called a
COMP order, which is a special type of order generated internally to fulfill replacement orders, marketing campaigns, and internal test purchases that have
$0.00 totals. Our query above is going to aggregate rows that include refunds that were issued for previous orders and result in some degree of double counting.
There are some simple assertions that we can add to avoid this problem and protect future engineers from encountering this same issue.
1. expect_column_values_to_match_like_pattern # expect our `orders` status to be a `CUSTOMER` order 2. expect_column_values_to_be_between # expect our `orders` total to be greater than 0
Note: These are from the excellent Great Expectations glossary of expectations.
Returning to Lineage
So far we’ve only discussed the perils of lacking assertions, but not how to assess the collateral damage as data percolates through a data pipeline. In our example, our query destroyed lots of context by unifying data from five tables down to two columns. Downstream consumers won’t be able to realize they are consuming, modifying, and aggregating bad data. That leads us back to using our lineage view to assess the real impact.
In previous blog posts, we’ve talked before about the importance of understanding data lineage from debugging to privacy and governance. Our lineage graph helps us visualize the greater landscape of a data pipeline and track change over time. By merging versioned quality metrics with versioned lineage an engineer is able to see historical quality metrics per table in the context of an entire data pipeline.
Above: Downstream datasets can potentially be compromised
We’ve been able to fit quality metrics seamlessly into our existing Datakin lineage graph and have brought similar mechanics to our datasets that we’ve already had for our jobs. We also have a new dataset-specific quality tab that allows you to see the evolution of table and column level dataset metrics over time.
Many of our integrations support standard metrics like row count, byte count, and null count out-of-the-box. We also support custom metrics that can help build a more robust data pipeline, as we discussed with our e-commerce example. Be sure to play around with our
food_delivery example in our Datakin Live Demo and the assertion power of Great Expectations.
Above: Datasets with quality issues are highlighted in red
Above: The Datakin quality tab