
Operational data lineage with dbt

Written by Ross Turk

October 14, 2021

dbt is an amazing way to transform data within a data warehouse. So amazing, in fact, that it’s easy to end up doing tons and tons of transformations on all kinds of datasets. After a while, it can become an unnavigable collection of overlapping tables. That’s a problem when it comes time to troubleshoot.

If you use Datakin to observe your dbt models as they run, you can always know exactly where your datasets came from and how they were created. You can study how the entire pipeline has changed over time, and find the root cause of issues quickly. Data lineage is super powerful like that.

This is a quick tutorial that shows you how to observe dbt pipelines with Datakin. It is based on a pre-built sample project – a study of the Stack Overflow public data set – but you can apply this approach to a dbt project of your own.

Prerequisites

Set up a BigQuery project and service account

This exercise requires familiarity with dbt and Google BigQuery. Most importantly, you will need to know:

  • the file path of your Google Cloud service account’s JSON key
  • the name of your Google Cloud project

If that doesn’t mean anything to you, never fear! There is an outstanding tutorial at the dbt site that can teach you about both of them at the same time. Or, on the other hand, maybe you do know all about that stuff but have forgotten how to set it up. Same. The tutorial is great, it’ll get you squared away.

The queries in this exercise can be performed within the Google BigQuery free tier. If you try hard, perhaps by running them in a loop, you could conceivably wind up with a bill. But you can easily avoid this outcome by not directly trying to achieve it.

Sign up for a Datakin account

Visit datakin.com/signup to create a new Datakin account. Creating an account is free, it doesn’t require a credit card, and this exercise can be completed within the 50-dataset free tier.

Clone the sample project and set up the dbt environment

The sample project contains a series of models that summarize questions and answers from Stack Overflow that have been tagged with ‘etl’. Using the resulting tables, the most active and helpful users can be identified quickly.

Clone the project repository from GitHub and cd into the new directory:

    % git clone https://github.com/DatakinHQ/demo.git
    % cd demo/dbt/stacko

Install dbt and the OpenLineage integration inside a Python virtual environment:

    % python3 -m venv datakin-dbt
    % source datakin-dbt/bin/activate
    % pip3 install dbt openlineage-dbt

Add an entry to ~/.dbt/profiles.yml

So that our sample project knows how to connect to BigQuery, we’ll need an entry in ~/.dbt/profiles.yml. Here’s what mine looks like:

    stackostudy:
      target: dev
      outputs:
        dev:
          type: bigquery
          method: service-account
          keyfile: "{{ env_var('HOME') + '/.dbt/ecosystem.json' }}"
          project: ecosystem
          dataset: stackostudy
          threads: 1
          timeout_seconds: 300
          location: US
          priority: interactive

Configure the project for Datakin

To store the captured lineage information, we need to specify some Datakin connection parameters.

These are most conveniently found on the Docs page of your Datakin instance. Click on the icon in the lower-left corner of the interface, and then choose the dbt documentation page from the list on the left. Once there, you will see two lines of code that look similar to these:

    export OPENLINEAGE_URL=https://api.{{YOUR_COMPANY_NAME}}.datakin.com
    export OPENLINEAGE_API_KEY={{YOUR_API_KEY}}

Run these two export commands, making sure to replace the {{ TOKENS }} if you didn’t copy and paste them from the docs.
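If you want to confirm the variables are in place before running anything, a small Python sketch like this does the trick. The `check_openlineage_env` helper is hypothetical, written just for this illustration; `dbt-ol` itself reads these variables from the environment:

```python
import os

# Hypothetical helper (not part of dbt or openlineage-dbt): confirm the two
# environment variables that dbt-ol reads are set before kicking off a run.
REQUIRED = ("OPENLINEAGE_URL", "OPENLINEAGE_API_KEY")

def check_openlineage_env(env=None):
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    return {"ok": not missing, "missing": missing}

print(check_openlineage_env({"OPENLINEAGE_URL": "https://api.example.datakin.com"}))
# -> {'ok': False, 'missing': ['OPENLINEAGE_API_KEY']}
```

A check like this is handy in CI, where a silently unset API key would otherwise mean lineage events quietly go nowhere.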

Check that everything is working

Run dbt debug, which will check that dbt is installed, our profile is correct, and our connection to BigQuery is configured properly.

    % dbt debug
    Running with dbt=0.21.0
    dbt version: 0.21.0
    python version: 3.9.7
    python path: /Users/rturk/projects/stackostudy/dbt-datkin/bin/python3.9
    os info: macOS-11.4-x86_64-i386-64bit
    Using profiles.yml file at /Users/rturk/.dbt/profiles.yml
    Using dbt_project.yml file at /Users/rturk/projects/stackostudy/dbt_project.yml

    Configuration:
      profiles.yml file [OK found and valid]
      dbt_project.yml file [OK found and valid]

    Required dependencies:
     - git [OK found]

    Connection:
      method: service-account
      database: stacko-study
      schema: stackostudy
      location: US
      priority: interactive
      timeout_seconds: 300
      maximum_bytes_billed: None
      Connection test: [OK connection ok]

    All checks passed!

If you see output like that, you’re cookin! Everything’s set up correctly.

Run models & capture lineage metadata

When working with Datakin (or any other OpenLineage backend) it’s important to generate the dbt docs first. That way, dbt writes out a bunch of great information to target/catalog.json, target/run_results.json and target/manifest.json.

This information includes:

  • The schemas for all source and target tables
  • The SQL source code for each model
  • The start and stop times, number of rows, and bytes affected for each model
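For a sense of what lands in those files, here’s a sketch that pulls per-model timing and bytes processed out of a run_results.json-style document. The field names below match what dbt around 0.21 emits for BigQuery; other versions and adapters may differ, so treat this as illustrative:

```python
import json

# A miniature run_results.json-shaped document (illustrative; real files
# contain many more fields and one entry per executed model).
sample = json.loads("""
{
  "results": [
    {
      "unique_id": "model.stackostudy.filtered_questions",
      "execution_time": 3.80,
      "adapter_response": {"rows_affected": 0, "bytes_processed": 34600000000}
    }
  ]
}
""")

def summarize(run_results):
    # Pull out (model id, seconds, bytes processed) for each result entry.
    return [
        (r["unique_id"],
         r["execution_time"],
         r["adapter_response"].get("bytes_processed"))
        for r in run_results["results"]
    ]

for model, seconds, nbytes in summarize(sample):
    print(f"{model}: {seconds:.2f}s, {nbytes} bytes processed")
```

This is roughly the kind of information the OpenLineage integration harvests and attaches to each run event.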

To generate these files, run dbt docs generate:

    % dbt docs generate
    Running with dbt=0.21.0
    Found 8 models, 0 tests, 0 snapshots, 0 analyses, 184 macros, 0 operations, 0 seed files, 4 sources, 0 exposures

    00:27:46 | Concurrency: 1 threads (target='dev')
    00:27:46 |
    00:27:46 | Done.
    00:27:46 | Building catalog
    00:27:54 | Catalog written to /Users/rturk/projects/demo/dbt/stacko/target/catalog.json

Ok! That looks good. We are now ready to give those models a try. To run the models and capture lineage information in Datakin, use the wrapper script dbt-ol. This script accepts the same parameters as the standard dbt command, so we run our models with dbt-ol run:

    % dbt-ol run
    Running with dbt=0.21.0
    Found 8 models, 0 tests, 0 snapshots, 0 analyses, 184 macros, 0 operations, 0 seed files, 4 sources, 0 exposures

    00:30:56 | Concurrency: 1 threads (target='dev')
    00:30:56 |
    00:30:56 | 1 of 8 START incremental model stackostudy.filtered_questions........ [RUN]
    00:31:00 | 1 of 8 OK created incremental model stackostudy.filtered_questions... [MERGE (0.0 rows, 34.6 GB processed) in 3.80s]
    00:31:00 | 2 of 8 START incremental model stackostudy.filtered_answers.......... [RUN]
    00:31:05 | 2 of 8 OK created incremental model stackostudy.filtered_answers..... [MERGE (0.0 rows, 26.8 GB processed) in 4.58s]
    00:31:05 | 3 of 8 START incremental model stackostudy.filtered_users............ [RUN]
    00:31:15 | 3 of 8 OK created incremental model stackostudy.filtered_users....... [MERGE (0.0 rows, 2.5 GB processed) in 10.38s]
    00:31:15 | 4 of 8 START incremental model stackostudy.filtered_votes............ [RUN]
    00:31:28 | 4 of 8 OK created incremental model stackostudy.filtered_votes....... [MERGE (0.0 rows, 6.5 GB processed) in 12.91s]
    00:31:28 | 5 of 8 START view model stackostudy.summary_daily.................... [RUN]
    00:31:29 | 5 of 8 OK created view model stackostudy.summary_daily............... [OK in 0.85s]
    00:31:29 | 6 of 8 START view model stackostudy.answer_stats..................... [RUN]
    00:31:30 | 6 of 8 OK created view model stackostudy.answer_stats................ [OK in 0.88s]
    00:31:30 | 7 of 8 START view model stackostudy.question_stats................... [RUN]
    00:31:31 | 7 of 8 OK created view model stackostudy.question_stats.............. [OK in 0.77s]
    00:31:31 | 8 of 8 START view model stackostudy.user_stats....................... [RUN]
    00:31:31 | 8 of 8 OK created view model stackostudy.user_stats.................. [OK in 0.96s]
    00:31:31 |
    00:31:31 | Finished running 4 incremental models, 4 view models in 36.14s.

    Completed successfully

    Done. PASS=8 WARN=0 ERROR=0 SKIP=0 TOTAL=8
    Emitted 16 openlineage events

The models have now run, creating four tables and four views. In total, this run cycle should have emitted 16 OpenLineage events to Datakin. If there are any errors in your run, they will be obvious.
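That event count isn’t arbitrary: OpenLineage emits a START event when each model begins and a COMPLETE event when it finishes, so 8 models produce 16 events. Here’s a hand-rolled sketch of that pairing, using plain dicts rather than the real openlineage-python client:

```python
# The eight models in this sample project (from the dbt run output above).
models = [
    "filtered_questions", "filtered_answers", "filtered_users",
    "filtered_votes", "summary_daily", "answer_stats",
    "question_stats", "user_stats",
]

# One START and one COMPLETE event per model run; the "job" payload here is
# a simplified stand-in for the full OpenLineage RunEvent structure.
events = []
for model in models:
    for event_type in ("START", "COMPLETE"):
        events.append({
            "eventType": event_type,
            "job": {"namespace": "dbt", "name": f"stackostudy.{model}"},
        })

print(f"Emitted {len(events)} openlineage events")  # -> 16
```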

One of the views created is called user_stats. It contains stats about the users who engaged with questions tagged with the ‘etl’ term on Stack Overflow. For example, here is a list of the community members with the most upvoted answers:

[Screenshot: top_upvotes]

Looks like we have a clear leader!

This view combines data from several tables. To find out which ones, you could look through the list of models one by one. Or you could consult the lineage graph in Datakin, which maps out the entire pipeline.

To view your lineage graph, click on the user_stats job from the explore page of your Datakin instance. It will take you to the graph with that table already selected. Here’s what my full lineage graph looked like:

Neat, huh? The user_stats model combines information from several other tables and views, each of which was created with a preceding dbt model. The graph shows how the data flows, grouping the BigQuery public datasets separately from the ones in our project.
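Under the hood, “where did this data come from?” is just a graph traversal. Here’s a toy version; the edges are my reading of this sample project’s ref() calls, so treat them as illustrative rather than authoritative:

```python
# Illustrative dependency edges (dataset -> its direct upstream datasets).
# These mirror roughly how the sample project's models refer to one another;
# the real graph in Datakin is built from captured OpenLineage events.
upstream = {
    "user_stats": ["filtered_users", "question_stats", "answer_stats"],
    "question_stats": ["filtered_questions", "answer_stats"],
    "answer_stats": ["filtered_answers", "filtered_votes"],
}

def all_upstream(dataset, edges=upstream):
    """Every dataset reachable by walking upstream from `dataset`."""
    seen = set()
    stack = [dataset]
    while stack:
        for parent in edges.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

print(all_upstream("user_stats"))
```

Every node the traversal visits is a place a data problem could have originated, which is exactly why having the graph drawn for you beats grepping through model files.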

To examine a single run cycle, click on the Duration tab with the user_stats node highlighted. It will show you how long the entire pipeline took, from beginning to end.

Change the pipeline and run it again

Running models and seeing them mapped out is neat, for sure, but it’s not the most useful part of lineage tracing. It becomes most useful when pipelines change…and they change all the time.

In fact, I’ve got something I’d like to change about this one. Let’s add a new metric: the number of times each user provided an answer that was accepted as being the “right” one. That’s a pretty interesting metric because it shows whether a user was able to solve a problem effectively, not just that they had a popular answer.

To get this metric, we can change our question_stats model to add a new accepted_answer_owner_user_id column. Edit the models/question_stats.sql file accordingly:

    accepted_answer_id,
    (
        select owner_user_id
        from {{ ref('answer_stats') }}
        where id = filtered_questions.accepted_answer_id
    ) as accepted_answer_owner_user_id,
    answer_count,

Once we know the user who created the accepted answer for each question, we can count the number of accepted answers per user by adding an answers_accepted column to the user_stats table. Edit the models/user_stats.sql file to add the following clause:

    ) as answers_posted,
    (
        select count(*) from {{ ref('question_stats') }}
        where accepted_answer_owner_user_id = filtered_users.id
    ) as answers_accepted,
    (

Ok! Now let’s run those models again. But – and this is important – we first need to regenerate the dbt docs. Our schema has changed, and we want Datakin to have the latest metadata about tables and columns.

    % dbt docs generate

When building automation in a production pipeline, generation of dbt docs should be done any time the underlying models change…but it can be done every time the models run without hurting anything. With the latest schema in target/catalog.json, we can now run our models again:
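In a script, that two-step cycle might look like the following sketch. It uses plain subprocess calls; a real production version would add error handling, logging, and a scheduler:

```python
import subprocess

# The run cycle described above: regenerate dbt docs (so catalog.json and
# manifest.json stay current), then run models with lineage capture.
RUN_CYCLE = [
    ["dbt", "docs", "generate"],  # refresh target/catalog.json etc.
    ["dbt-ol", "run"],            # run models and emit OpenLineage events
]

def run_pipeline(dry_run=True):
    """Execute the cycle in order; dry_run just prints what would run."""
    for cmd in RUN_CYCLE:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

run_pipeline()  # dry run by default, so this is safe to execute anywhere
```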

    % dbt-ol run

If this runs without errors, our two views have been updated. We can easily pull a list of the most effective answerers for ETL questions on Stack Overflow, and the results are pretty interesting. We still have a clear leader:

[Screenshot: top_answerers]

Study the pipeline changes

So far we’ve created a micro-simulation of what happens inside pipelines all the time: a set of interrelated models has been created, run, modified, and re-run. In this case, the changes are pretty easy to keep inside our head. In a real-world situation, though, there are likely to be way more tables and way more changes to keep track of.

Let’s go back to the lineage graph and look at our user_stats table one more time. Open up the I/O tab and verify that the schema has been updated in Datakin:

Good. Our new answers_accepted field is showing up in the schema for user_stats. But let’s say we needed to know exactly when it was introduced, and whether there were any correlated changes. This is what the Compare tab is for.

Click on the Compare tab, where you will see a list of runs for the user_stats job. There are probably only two, one from the original run and one after the models were updated. Select both of them:

[Screenshot: stacko-compare]

You will notice the lineage graph refreshes to grey out some of the nodes. The greyed-out nodes are the ones that have not changed between the two selected runs, so we can quickly see which parts of the pipeline we can rule out as the root cause of an emerging issue. And, of course, we can easily identify the other model that changed: question_stats.
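Conceptually, that comparison boils down to diffing the metadata captured on each run. Here’s a simplified sketch using made-up column lists shaped like this project’s tables; Datakin’s actual comparison considers more than schemas:

```python
# Hypothetical per-run schema captures (dataset -> ordered column names).
# These are invented for illustration; real runs carry richer metadata.
run1 = {
    "user_stats": ["id", "display_name", "answers_posted"],
    "question_stats": ["id", "accepted_answer_id", "answer_count"],
    "filtered_users": ["id", "display_name"],
}
run2 = {
    "user_stats": ["id", "display_name", "answers_posted", "answers_accepted"],
    "question_stats": ["id", "accepted_answer_id",
                       "accepted_answer_owner_user_id", "answer_count"],
    "filtered_users": ["id", "display_name"],
}

# Datasets whose captured schema differs between the two runs stay
# highlighted; identical ones are the candidates to grey out.
changed = sorted(d for d in run2 if run1.get(d) != run2[d])
unchanged = sorted(d for d in run2 if run1.get(d) == run2[d])

print("changed:", changed)        # -> ['question_stats', 'user_stats']
print("greyed out:", unchanged)   # -> ['filtered_users']
```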

Try with your own pipeline

This example demonstrates how Datakin can be used to observe dbt pipelines, troubleshoot performance issues, and trace historical changes. It’s just as easy to add lineage tracing to a project of your own.

For more information, view the documentation for the openlineage-dbt integration. If you have questions, we invite you to find us on Datakin Slack.