15 Comments

I like the idea of data contracts in theory. It is always good to have layers of abstraction. But it feels like another way for some companies to sell services and products. How do we know this won't be another item in our growing data portfolio that we will need to maintain?

And we were supposed to fix this with data governance work and data management tools. Then we had data mesh and the data product idea. What happened to those? It reminds me of that xkcd comic where we create another standard to fix everything (https://xkcd.com/927/)!

Using your example, if we have a strongly typed schema, an event-based architecture, and processes and policies in place, why do you need a contract? Isn't the contract implicit in the SLA and internal procedures? Are we just putting an API on top of this to make it fancy?

I like the idea that this is machine-readable and could be handled via an API, but I believe the problem is social and cultural, not technical.

"Data consumers define the schema and properties they NEED, instead of being forced to accept what's coming from production even if it is of low quality or missing critical components for their use cases."

This was supposed to be fixed by GraphQL, right?

Anyway, good idea. But I am a bit skeptical.

Thanks for writing this!

Coming from a background in backend development and microservices, I love the idea of Data Contracts. But whenever we have discussed this at work, we haven't found a good way to implement it on our current setup. More specifically:

- Airflow batch pipelines that fetch snapshots directly from the DB. If the contract doesn't match the operational DB schema, is the Product team responsible for implementing the transformation step? How? With Airflow itself?

- ELT pipelines and integration with tools such as Fivetran/Data Transfer/GA which implement ingestion directly into the DWH. How could a contract be placed in between?

"Non-consensual API" - I am so going to use this

Aug 25, 2022·edited Aug 25, 2022

Data Contracts, or as we now call them, Interface Agreements (sounds less rigid), have been part of our practice for around 15 years. They work pretty well and have limited the number of issues with data. Moreover, they have limited the number of data products, because people could see where the data could be found (and subscribe to the contract).

The biggest issue is that it means work must be done. Work to define what, how often, retention time, ... That is documentation, and few people like documentation. We are now attempting to integrate the Agreement part into our Business Glossary, where you should already have a large part of the information needed in the Interface Agreement anyway. If combined with lineage in the same tooling, you get a pretty complete overview of what you have and what could break down.
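
To make the "what, how often, retention time" part concrete, here is a minimal sketch of how such an agreement might be recorded in machine-readable form. It is only an illustration under stated assumptions: the `InterfaceAgreement` class, its field names, and the example team names are all hypothetical and not taken from any tool mentioned in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class InterfaceAgreement:
    """Hypothetical record of one agreement; every name here is illustrative."""
    name: str
    producer: str
    schema: dict[str, type]      # the "what": column name -> expected Python type
    delivery_frequency: str      # the "how often", e.g. "daily"
    retention_days: int          # the retention time
    subscribers: list[str] = field(default_factory=list)

    def subscribe(self, consumer: str) -> None:
        """Register a consumer so it is visible who depends on this data."""
        if consumer not in self.subscribers:
            self.subscribers.append(consumer)

    def violations(self, record: dict) -> list[str]:
        """Return a list of ways a single record deviates from the agreed schema."""
        problems = []
        for column, expected in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif not isinstance(record[column], expected):
                problems.append(f"wrong type for {column}")
        return problems


# Example usage with made-up teams and columns.
orders = InterfaceAgreement(
    name="orders",
    producer="checkout-team",
    schema={"order_id": int, "amount": float},
    delivery_frequency="daily",
    retention_days=365,
)
orders.subscribe("finance-reporting")
print(orders.violations({"order_id": 1, "amount": "12.50"}))
```

Keeping the subscriber list next to the schema is a deliberate choice in this sketch: it mirrors the point above that people can see where data lives and who relies on it before anything changes.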

The anomaly table covers a number of business and validation rules from the source as well as the target. The requirements are in the details. Anyhow, old concepts get new names and are developed differently. The most important thing is that at least it has been recognised.

Data Contracts are not a new concept. Back in the early 2000s I designed and implemented anomaly tables in data repositories such as data warehouses. Of course the technology used was not the same. I used PL/SQL and C. I can’t believe that these days there are no such checks in place and that it’s assumed to be a new concept.

Why would you call an API a “contract”? That doesn’t make sense in any other business context.

Great article. Thank you for sharing your wisdom. By the way, I didn't understand one point. If a data contract is an agreement between the engineers who write services that generate data and the consumers of that data, why is defining the source data a step in creating such an agreement: "...Clearly define the upstream source data..."? I mean, in the end isn't the purpose to deliver data in the quality and shape consumers desire? Source data definition seems to fall on the engineering side.

Is the official Convoy blog post out already? Or is it the one about using CDC as the basis for building the data contract?

Also I came across this Data Contracts — The Mesh glue https://towardsdatascience.com/data-contracts-the-mesh-glue-c1b533e2a664

What are your thoughts?

I want to introduce the idea of using data contracts, aka interface agreements, at an MNC customer.

They don’t have a coherent data strategy, and most of the ingestion is meant for coordination across different parts of the company.

Very little to no ML is needed.

The data is also pretty small: in the tens or hundreds of megabytes at most.

Can you help me understand a minimum viable data contract to start with, assuming I may build systems that are both a provider and a consumer of data for my MNC customer's other application teams?

Thank you!

Your newsletter and Yali's are really awesome in talking about data contracts.
