dbt is a wildly popular tool, and companies do (and should!) use dbt tests to mitigate problems in their data pipelines.
dbt tests are great for organizations that want to improve data quality and reliability. But sometimes, testing alone doesn't go far enough. Testing has inherent limitations: tests only provide information about the areas they cover. What about the areas you never get to, because of resource or time constraints?
When an organization gets big enough, it might need to fold in some additional data observability. In this post, we’ll talk about 7 ways to level up your dbt tests, and some indicators that it’s time to move beyond the test.
When it comes to dbt tests, you have two options: data tests and schema tests.
Data tests are one-off queries that you run against a specific model. They live in the tests folder of your dbt project.
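For instance, here's a minimal sketch of a data test; dbt fails the test if the query returns any rows. The orders model and total_amount column are hypothetical:

```sql
-- tests/assert_no_negative_order_totals.sql
-- A data test passes when this query returns zero rows.
-- "orders" and "total_amount" are hypothetical names.
select
    order_id,
    total_amount
from {{ ref('orders') }}
where total_amount < 0
```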
Schema tests are generic. They can be configured and applied to multiple models.
Out of the box, dbt ships with four schema tests you can use immediately: unique, not_null, accepted_values, and relationships. You can also write custom schema tests as macros prefixed with test_, stored in the macros folder.
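Applying the built-in schema tests happens in your model's YAML file. A minimal sketch, with hypothetical model and column names:

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```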
If you have one model that you want to validate in a specific way, you can simply write a data test in the tests folder. But if you find yourself copying, pasting, and reconfiguring that query across multiple models, you might want to promote your data test into a schema test.
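As a sketch, the data test above could become a reusable schema test. In older dbt versions this is a macro prefixed with test_ in the macros folder (newer versions use a {% test %} block instead):

```sql
-- macros/test_non_negative.sql
-- Older-style custom schema test: a macro named test_<name>.
-- On dbt 0.20+ the test passes when this returns zero rows;
-- on earlier versions schema tests return a count of failures instead.
{% macro test_non_negative(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} < 0

{% endmacro %}
```

You can then attach it to any column in schema.yml by listing non_negative under that column's tests.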
Level up your dbt deployment with integrations like:

- OpenLineage: an open platform for data lineage collection and analysis. It tracks metadata about datasets, jobs, and runs, and consumes events produced by dbt runs.
- dbt_expectations: a dbt package that lets you deploy Great Expectations-style tests in your data warehouse directly from dbt (see the sketch after this list).
- Unit tests: small test cases that verify your macros behave as expected.
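For example, once the dbt_expectations package (calogica/dbt_expectations) is installed via packages.yml, its tests can be applied like any other schema test. A sketch with hypothetical model and column names:

```yaml
# models/schema.yml
# Assumes dbt_expectations is installed via packages.yml
models:
  - name: orders
    columns:
      - name: total_amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
```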
For daily dbt jobs, you likely run the associated tests on the same schedule. You can configure your dbt job to run multiple commands, including tests, runs, freshness checks, and more, and set up automated email or Slack notifications in the case of failure.
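A typical daily job might chain commands like the following (a sketch; the exact steps depend on your scheduler):

```bash
dbt source freshness   # check that raw data has landed recently
dbt run                # build models
dbt test               # run data and schema tests
```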
When a PR is created in your database repo with a migration, use dbt tests to verify changes won’t break your dbt models before merging the PR.
You can do this with blue-green deployment for your data models. Have your CI/CD tool run the dbt run and dbt test commands against a staging database environment, or a staging schema within the production environment. If the tests pass, continue with dbt run against the production environment or production schema.
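In a CI pipeline, that might look something like this sketch, where staging and prod are hypothetical targets defined in your profiles.yml:

```bash
# Build and test against staging first
dbt run --target staging
dbt test --target staging

# Only promote if the staging tests passed
dbt run --target prod
dbt test --target prod
```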
Also, make sure that your staging environment is a full replica of production. That way, the initial staging dbt test results are an accurate proxy for how production will play out.
To catch potential problems as early as possible, run dbt test against your source models first, then build and test your downstream models. In other words:

```bash
dbt test --models source:[sources]
dbt run -m [models]
dbt test -m [models]
```
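Source tests live alongside your source declarations. A minimal sketch, with a hypothetical source name:

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw_shop          # hypothetical source name
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
```

With this in place, dbt test --models source:raw_shop runs only these source checks.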
Set up an interactive way to display and visualize dbt test results. That way, you’ll always be aware when something in your dbt transformations goes wrong.
Several tools on the market can help you with these displays. For example, re_data and elementary do these visualizations for you. Or, you can create a custom visualization.
For example, the team at Snapcommerce leveraged dbt artifacts and the query log to build reporting and alerts on dbt model/job runs and performance. This may also be a good point to begin considering a more generic data observability tool like Bigeye.
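If you build a custom visualization, one useful building block is dbt's ability to persist failing test rows to the warehouse, where any BI or charting tool can query them. A sketch (available in dbt 0.20+):

```bash
dbt test --store-failures
```

Failing rows are written to audit tables (the schema suffix defaults to dbt_test__audit), which you can query and visualize like any other table.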
Testing and data observability have the same end goal: detecting problems. However, they approach that goal differently. Testing is opinionated; it originates with a human expressing a condition they believe should be true. Observability is neutral; it asks, "Do we know the state of the data system at all times? How has it changed?" Much like the dashboard on a vehicle, data observability provides insight into how fast you're going, whether the engine light is on, and whether you're due for an oil change.
Data engineers often go to great lengths to bridge the gap between dbt tests and complete data observability. The result is frequently a hacked-together setup that leaves out large parts of the data stack that sit outside the realm of dbt.
As organizations scale, the opportunities to test become effectively infinite. You can only write tests for so many things, but you can layer data observability on top of your test suite to catch the things you don't want to (or can't) write tests for.