Introduction
Almost anyone who has used dbt is familiar with its tests. During development, we learn things about the data that should be true. For example, the sales team might assert that the only possible order statuses are “Pending”, “Completed”, and “Canceled”, and that only “Completed” orders should be counted in the revenue calculation. So, in our revenue model, we write a where clause to filter only “Completed” orders. But, we also write an accepted values test to check that no other statuses have snuck into the data. If a “Sent” status ever appears, we want to know because we probably want to include that in the revenue calculation, too. The promise of dbt is that, so long as we run the dbt tests regularly, we should know about the new status.
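For concreteness, a minimal sketch of such an accepted values test in a dbt schema file might look like this (the file path, model name, and column name are hypothetical):

```yaml
# models/staging/stg_orders.yml (hypothetical path and names)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['Pending', 'Completed', 'Canceled']
```

If a “Sent” status ever appears, the test query returns the unexpected value and the test fails, which is exactly the signal we want.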
But, what happens in practice?
We often see clients with hundreds or thousands of tests in their warehouse. Rarely are all of them passing; often, a set of tests fails on every run. The data team develops alert fatigue and begins ignoring all test failures. Then the source data for orders changes, and completed orders start arriving with a status of “Sent”. The accepted values test starts failing, but no one notices it appear among the long list of already failing tests. The total revenue calculated in the warehouse becomes increasingly wrong. Eventually, the COO pings the head of data to say that she thinks the revenue metric is wrong. The data team investigates, finds the failing test, and says “oops, you’re right, the metric is wrong”. After a few such incidents, the executives lose confidence in the data team.
Other clients realize that alert fatigue is a problem, so they use dbt’s thresholds and severities to suppress tests after they start failing. Maybe a few orders have a status of “Garbage” because of a bug in the ordering system. Since all existing test failures have already been suppressed, the data team notices the new test failure immediately. They investigate with the sales and engineering teams, but the source data is hard to fix, and the COO says she is not concerned about a few orders with weird data, since they are unlikely to change the revenue calculation much. So, the data team sets a threshold number of rows below which the accepted values test will not fail: “no test failures, no problems here”. Then, the client is audited. The data team says “all of the tests are passing, so there should be no problems with the revenue calculation”. But, the auditor still finds a large discrepancy because one of the orders with “Garbage” status was actually completed, and it turned out to be a big order. The head of data is in hot water for failing to give an accurate picture of the data quality.
In neither case did dbt tests deliver their promised value.
So, what should you do?
This post attempts to answer that question based on our experience with a variety of data teams.
Background on dbt Tests
dbt tests are queries that return rows when a certain (undesirable) condition is met.
Tests are executed by dbt and the results (number of rows returned) are written to the console and to other logs and artifacts.
dbt tests can be used to:
- catch mistakes during development.
- catch changes in source data that violate assumptions or “data contracts”.
- expose to data consumers any issues that may make marts or downstream analyses/reports unreliable or of poor quality.
- avoid overloading the warehouse (and incurring the associated costs) by stopping models from running on bad input data, such as duplicate primary keys.
- reduce the exploration time needed for analytics by documenting what can be relied upon to be true about the data.
A test can be configured with a severity of either “warn” or “error” when it returns rows.¹ Thresholds can also be used to configure a test to “pass” even if a small number of rows are returned by the query.
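As a rough sketch (reusing the hypothetical stg_orders model from above), both severity and thresholds live in the test’s configuration:

```yaml
models:
  - name: stg_orders            # hypothetical model and column names
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['Pending', 'Completed', 'Canceled']
              config:
                severity: error   # the default; set to warn to downgrade failures
                warn_if: ">0"     # warn whenever any rows are returned
                error_if: ">100"  # only fail once more than 100 rows are returned
```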
How Should Tests Work?
Testing only provides value if failures are made visible to the relevant producers, manipulators, and consumers of data, and if an appropriate response is taken when one is needed.
In general, dbt is good at defining and executing tests, but alerting on and responding to failures relies on manual processes or other tools, many of which are developed in-house.
dbt Cloud is making some inroads into alerting (for example, sending emails and Slack messages if any tests fail), and monitoring tools like Monte Carlo are developing dbt integrations that apply their existing incident creation and response functionality to dbt test failures.
In our view, a good dbt testing platform should have functionality to:
- Run the test.
- Alert the right people when tests do not pass.
- Allow those people to decide what they want to do about it (ignore on current run only, ignore for X days, immediately fix, etc).
- Store the results so you know what was happening with your dataset and can troubleshoot.
- Include metadata that tells you why it’s a problem that the test is failing.
Secondarily, it may be helpful for the testing platform to:
- prevent downstream models from running in case of a severe failure.
- surface upstream dbt test failures and warnings alongside critical reports, as a measure of “data quality”.
Our Recommendations
| Recommendation | Why? |
| --- | --- |
| Configure tests to fail with a severity of error when a failure means the potential for a large load on the warehouse or an expensive query, and downstream models should not be run. Do this in conjunction with using dbt build to execute tests. (A configuration sketch follows this table.) | Failures should indicate serious problems that need to be addressed immediately. In most warehouses, the only problems that truly demand immediate attention are those that endanger the warehouse infrastructure. |
| Use dbt build to execute models and tests. | Tests are executed immediately after all upstream data required for the test is ready, so any failures are as timely as possible. Models downstream of models with failing tests are prevented from running by default. |
| Configure all other tests, where a failure does not require blocking downstream models, as warnings. | Warnings should indicate all problems that have a potential negative impact on data quality. |
| Do not use thresholds. | Thresholds suppress issues that may have an impact on data quality. |
| Write test results to the warehouse by uploading the run results artifact (use PUT in Snowflake, write your own Python script in BigQuery, or use a package like re_data or Elementary; a package sketch also follows this table). | dbt has a --store-failures option, but it only stores the most recent results and overwrites them. By storing the run results artifact, you keep a complete record of test failures over time. |
| Alert on every failure, on new warnings, and on warnings with substantially changed row counts. | Some warnings may persist for an extended period of time. Some flavor of anomaly detection is required to generate alerts for new or changed warnings. |
| Make test results visible in a dashboard. | A dashboard can convey a visual sense of the number of test failures and the number of failing rows over time. |
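To make the first and third recommendations concrete, here is a minimal sketch, assuming a hypothetical project called my_project and the hypothetical stg_orders model from earlier: set warn as the project-wide default severity in dbt_project.yml, then escalate the few tests whose failures should block downstream models (for example, primary key tests) back to error.

```yaml
# dbt_project.yml -- make warn the default severity for every test in the project
tests:
  my_project:
    +severity: warn

# models/staging/stg_orders.yml -- escalate the tests that should block downstream models
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique:
              config:
                severity: error   # with dbt build, a failure here stops downstream models
          - not_null:
              config:
                severity: error
```

Run with dbt build so that the error-severity tests execute as soon as stg_orders is built and stop its downstream models, while every other test merely warns.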
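If you prefer a package over a hand-rolled upload of the run results artifact, the sketch below shows how a package such as Elementary is typically added in packages.yml. The package coordinates and version range here are illustrative assumptions; check the dbt package hub for the current values, and note that these packages require their own setup (described in their documentation) before they start writing results to the warehouse.

```yaml
# packages.yml -- package name and version range are illustrative; confirm on hub.getdbt.com
packages:
  - package: elementary-data/elementary
    version: [">=0.13.0", "<0.14.0"]
```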
The Upshot
dbt tests can add tremendous value to a data warehouse, but only when they are part of a coherent testing platform that includes alerting and visibility.
Footnotes
¹ On a technical note, the exit code for dbt test is 0 if all selected tests pass or only raise warnings, but 1 if any test fails: dbt considers a warning to mean “pass” from a command execution perspective.
If you have any questions around configuring dbt tests, Data CRT is happy to help.
What’s important to you? Did we miss anything?
Send us a message at hi(at)datacrt.com!