Great post :) As an ex-DWH team in a company DL scenario, I frequently see us tackling problems like this. This is a really nice and broad overview!
One of our current projects is a company-wide dictionary on how to name fields and their contents. This serves as the basis for both data discovery input and contract generation. Let's see where we end up.
This post is pure gold. It's a must-read if you own a data platform or a data warehouse and you want to ensure that data quality gets better over time. The challenge is to get the executive buy-in for the upfront effort and investment required.
Great post! For dbt, the schema.yml file for each model also provides the space to implement data contract validations. Have you thought about utilizing these schema files for data contract enforcement/validation?
Well done with this post. It is concise yet very informative.
I have one question, though. You stated that:
Similar to contracts in production services, contracts in the warehouse should be implemented in code and version controlled. The implementation of contracts can take many forms depending on your data tech stack and can be spread across tools.
Considering that our tech stack includes dbt as well, would you consider the dbt model itself (with tests, metadata, metrics, etc.) to be the definition of a data contract?
The advantage of that over Protobuf, for example, is that I don't need to write custom code to set up the monitoring. As you mentioned, dbt + Great Expectations can validate the schema and the semantic layer.
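For example, something along these lines already gives me schema checks without writing extra plumbing. This is only a rough sketch using the older pandas-style Great Expectations API (newer GX releases use a different entry point), and the column names and types are just illustrative:

```python
# Sketch only: validate a staged table against the expected schema using the
# classic pandas-style Great Expectations API. Columns/types are made up.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.5, 7.25]})
gdf = ge.from_pandas(df)

# Schema-level expectations, analogous to what dbt schema.yml tests assert.
gdf.expect_table_columns_to_match_ordered_list(["order_id", "amount"])
gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_of_type("amount", "float64")

print(gdf.validate())  # summary of which expectations passed or failed
```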
That’s a great question!
We wanted to have a single interface for our contracts in production systems and in the data warehouse. Because we started with contracts in production systems and were using Protobuf, we had a starting point to work off of. As you mentioned, there’s a nice ecosystem around Protobuf that allowed us to easily integrate with the production data pipelines (CDC, Kafka, etc.). When we moved to the data warehouse, we wanted to keep a consistent abstraction for contracts even if the implementation of how we monitor and enforce is different. The tradeoff of having a consistent interface was writing code to translate Protobuf into data-warehouse-centric tooling like dbt and Great Expectations (there's a rough sketch of that translation idea at the end of this reply).
We also wanted to be mindful to keep a layer of abstraction between the contract definition and the tools that do the monitoring, enforcement, and fulfillment. This way, we could define the contract in a single spot and then have a process run to implement the contract using other tools. We could also change the underlying mechanisms for implementation as needed while keeping the definition the same. The tradeoff there is, again, having to write code to create that abstraction.
Ultimately though, if you are already using dbt and Great Expectations and you’re looking for a good way into contracts in the warehouse without a lot of investment in writing these abstractions, you could get started just using the dbt files. Just keep in mind the tradeoffs and when/if it makes sense to create those abstractions.
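To make the translation idea above a bit more concrete, here is a minimal sketch of what that generation step could look like: walk a compiled Protobuf message descriptor and emit a dbt schema.yml-style entry. This is not our actual generator; the message class, the type mapping, and the blanket not_null rule are all illustrative.

```python
# Sketch: turn a compiled Protobuf message into a dbt-style schema.yml entry.
import yaml  # PyYAML
from google.protobuf.descriptor import FieldDescriptor

# Hypothetical generated class; substitute one of your own compiled messages.
# from orders_pb2 import OrderCreated

# Illustrative mapping from proto field types to warehouse column types.
PROTO_TO_WAREHOUSE_TYPE = {
    FieldDescriptor.TYPE_STRING: "varchar",
    FieldDescriptor.TYPE_INT64: "bigint",
    FieldDescriptor.TYPE_DOUBLE: "float",
    FieldDescriptor.TYPE_BOOL: "boolean",
}

def proto_to_dbt_model(message_cls, model_name):
    """Build a dict shaped like a dbt schema.yml 'models' entry."""
    columns = []
    for field in message_cls.DESCRIPTOR.fields:
        columns.append({
            "name": field.name,
            "data_type": PROTO_TO_WAREHOUSE_TYPE.get(field.type, "varchar"),
            # proto3 has no 'required', so every field gets not_null here;
            # a real generator would read custom options to decide.
            "tests": ["not_null"],
        })
    return {"version": 2, "models": [{"name": model_name, "columns": columns}]}

# print(yaml.safe_dump(proto_to_dbt_model(OrderCreated, "stg_order_created")))
```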
Hi @Raoni
When using dbt test, it doesn't seem very easy to write a test based on an external JSON file. I created a custom validation within dbt and a Snowflake Python UDF called JSONschema-validation that does the JSON Schema check. This is run against the “source” definitions, i.e. via “dbt test” on the sources. The limitation, as I experience it in dbt Cloud, is alerting the right users about the errors.
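To give a rough idea of what that UDF does, the handler is essentially the function below. This is a sketch rather than the actual implementation, and it assumes the jsonschema package is available to your Snowflake Python environment; the names are illustrative.

```python
# Sketch of a handler that could back a Snowflake Python UDF: validate a JSON
# payload against a JSON Schema and return the error messages.
import json
from jsonschema import Draft7Validator

def validate_json(payload: str, schema: str) -> str:
    """Return a JSON array of validation error messages (empty if the payload conforms)."""
    validator = Draft7Validator(json.loads(schema))
    errors = [err.message for err in validator.iter_errors(json.loads(payload))]
    return json.dumps(errors)

# Example:
# validate_json(
#     '{"order_id": "abc"}',
#     '{"type": "object", "properties": {"order_id": {"type": "integer"}}}',
# )
# -> '["\'abc\' is not of type \'integer\'"]'
```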
If you are going in this direction, I will be more than happy to collaborate.
Hi @Anders
We found that the only way to alert is to run a dedicated Airflow job that calls the dbt Cloud API, parses the errors, and sends them to Slack/email.
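Roughly like the sketch below. The account/run IDs, the token and webhook environment variables, and the hourly schedule are placeholders, and I'm assuming the dbt Cloud v2 artifacts endpoint for run_results.json here.

```python
# Sketch: Airflow DAG that pulls run_results.json for a finished dbt Cloud run
# and posts any test failures to a Slack webhook. (Airflow 2.4+ 'schedule' arg.)
import os
from datetime import datetime

import requests
from airflow.decorators import dag, task

RUN_RESULTS_URL = (
    "https://cloud.getdbt.com/api/v2/accounts/{account_id}"
    "/runs/{run_id}/artifacts/run_results.json"
)

@dag(schedule="@hourly", start_date=datetime(2023, 1, 1), catchup=False)
def dbt_cloud_alerts():
    @task
    def alert_on_failures(account_id: int, run_id: int) -> None:
        headers = {"Authorization": f"Token {os.environ['DBT_CLOUD_TOKEN']}"}
        results = requests.get(
            RUN_RESULTS_URL.format(account_id=account_id, run_id=run_id),
            headers=headers,
            timeout=30,
        ).json()
        failures = [
            r["unique_id"]
            for r in results.get("results", [])
            if r.get("status") in ("error", "fail")
        ]
        if failures:
            requests.post(
                os.environ["SLACK_WEBHOOK_URL"],
                json={"text": "dbt test failures: " + ", ".join(failures)},
                timeout=30,
            )

    alert_on_failures(account_id=12345, run_id=67890)  # placeholder IDs

dbt_cloud_alerts()
```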
What option do you use for such a case?
I also see a need for Airflow, Dagster, or similar to handle this task.