I like the idea of data contracts in theory. It is always good to have layers of abstraction. But it feels like another way for some companies to sell services and products. How do we know this won't be another item in our growing data portfolio that we will need to maintain?
And we were supposed to fix this with data governance work and data management tools. Then we had the data mesh and data product ideas. What happened to those? It reminds me of that xkcd comic where we create another standard to fix everything (https://xkcd.com/927/)!
Using your example, if we have a strongly typed schema, event-based architecture, processes and policies in place, why do you need a contract? Isn't the contract implicit with SLAs and internal procedures? Are we just putting an API on top of this to make it fancy?
I like the idea that this is machine-readable and could be handled via API but I believe the problem is social and cultural, not technical.
"Data consumers define the schema and properties they NEED, instead of being forced to accept what's coming from production even if it is of low quality or missing critical components for their use cases."
This was supposed to be fixed by GraphQL, right?
Anyway, good idea. But I am a bit skeptical.
Good questions!
1. In some ways, yes, this is another item to manage. But proper usage of contracts should significantly reduce the need for other tools, since the right, high-quality data is coming over the line.
2. Data Mesh is an organizational framework that takes a micro-service approach to data. Data Mesh doesn't inform which data should be emitted, or validate that the data being emitted from production is correct or conforms to a consumer's expectations.
3. Data contracts are a form of data governance. Creating new standards can and should happen over time!
4. If you have a strongly typed schema, EDA, processes and policies, then you HAVE a contract. However, you would still lack enforcement of the contract: it is incredibly challenging to roll out manually, or to ensure a handshake agreement is never broken.
5. There are both technical and socio-technical challenges.
6. GraphQL subscriptions somewhat bridge the gap by pushing notifications of entity updates, but they are intended to be used in more of a real-time WebSocket pattern, and they don't guarantee consumers will receive all updates to an entity. The main difference is that Contracts are guaranteed to capture all updates to the entities in a service's data model, which isn't something GraphQL supports (or is really intended to be used for).
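To make the difference in point 4 between having a contract and enforcing one a little more concrete, here is a minimal sketch of a programmatic check. Everything in it is an assumption for illustration: the `CONTRACT` mapping, the field names, and the `violations` and `enforce` helpers are hypothetical and not taken from any particular tool.

```python
# Hypothetical contract: field name -> (expected Python type, is null allowed?)
CONTRACT = {
    "shipment_id": (str, False),
    "weight_kg": (float, False),
    "delivered_at": (str, True),
}


def violations(record: dict) -> list:
    """Return human-readable contract violations for a single record."""
    problems = []
    for field, (expected_type, nullable) in CONTRACT.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                problems.append(f"null not allowed: {field}")
        elif not isinstance(value, expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


def enforce(record: dict) -> dict:
    """Raise instead of letting a non-conforming record flow downstream."""
    problems = violations(record)
    if problems:
        raise ValueError("; ".join(problems))
    return record
```

The point of the sketch is only that the check runs automatically on the producer side, so a broken agreement surfaces as a failure rather than as a surprise in a downstream dashboard.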
Thanks for writing this!
Having a background in backend development and microservices, I love the idea of Data Contracts. But whenever we have discussed this at work, we haven't found a good way to implement it on our current setup. More specifically:
- Airflow batch pipelines that fetch snapshots directly from the DB. If the contract doesn't match the operational DB schema, is the Product team responsible for implementing the transformation step? How? With Airflow itself?
- ELT pipelines and integration with tools such as Fivetran/Data Transfer/GA which implement ingestion directly into the DWH. How could a contract be placed in between?
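For the batch snapshot case, one possible shape of the transformation step is sketched below. This is an assumption, not an answer given in this thread: the column names, the `COLUMN_TO_CONTRACT_FIELD` mapping, and both helper functions are hypothetical, and the sketch deliberately uses plain Python so it does not depend on any orchestrator.

```python
# Hypothetical mapping from operational DB columns to contract field names.
COLUMN_TO_CONTRACT_FIELD = {
    "id": "order_id",
    "cust_id": "customer_id",
    "amt_cents": "amount_cents",
}
# Hypothetical rule: every contract field must be present and non-null.
REQUIRED_FIELDS = set(COLUMN_TO_CONTRACT_FIELD.values())


def to_contract_shape(db_row: dict) -> dict:
    """Rename operational columns to contract fields and drop internal columns."""
    return {
        contract_field: db_row[column]
        for column, contract_field in COLUMN_TO_CONTRACT_FIELD.items()
        if column in db_row
    }


def split_snapshot(rows: list) -> tuple:
    """Partition a snapshot into contract-shaped rows and rejected source rows."""
    accepted, rejected = [], []
    for row in rows:
        shaped = to_contract_shape(row)
        if all(shaped.get(name) is not None for name in REQUIRED_FIELDS):
            accepted.append(shaped)
        else:
            rejected.append(row)
    return accepted, rejected
```

A function like `split_snapshot` could run as its own task between extraction and load. Which team owns that task is the organizational question raised above, and the sketch does not settle it.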
"Non-consensual API" - I am so going to use this
Data Contracts, or as we name them now, Interface Agreements (sounds less rigid), have been part of our practice for around 15 years. They work pretty well and have limited the number of issues with data. Moreover, they limited the number of data products, because people could see where the data could be found (and subscribe to the contract).
The biggest issue is that it means work must be done. Work to define what, how often, retention time, ... That is documentation, and few people like documentation. We now attempt to integrate the Agreement part into our Business Glossary, where you should already have a large part of the information needed in the Interface Agreement anyway. If combined with lineage in the same tooling, you get a pretty complete overview of what you have and what could break down.
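As a rough illustration of the "what, how often, retention time" documentation described above, an agreement can be captured as a small machine-readable record. This is only a sketch under assumptions: the `InterfaceAgreement` class, its attribute names, and the `undocumented_fields` helper are hypothetical and do not describe the commenter's actual tooling.

```python
from dataclasses import dataclass, asdict


@dataclass
class InterfaceAgreement:
    dataset: str             # what is delivered
    producer: str            # who owns the data at the source
    fields: dict             # field name -> business glossary definition
    delivery_frequency: str  # how often, e.g. "daily"
    retention_days: int      # how long the data is kept
    subscribers: tuple = ()  # teams that subscribed to the agreement


def undocumented_fields(agreement: InterfaceAgreement) -> list:
    """List fields that still lack a glossary definition."""
    return sorted(
        name
        for name, definition in agreement.fields.items()
        if not definition.strip()
    )


agreement = InterfaceAgreement(
    dataset="customer_orders",
    producer="orders-team",
    fields={"order_id": "Unique identifier of an order", "status": ""},
    delivery_frequency="daily",
    retention_days=365,
)
```

Calling `undocumented_fields(agreement)` flags `status` as missing a definition, and `asdict(agreement)` gives a plain dictionary that could be stored next to glossary and lineage metadata.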
The anomaly table covers a number of business and validation rules from source as well as target. The requirements are in the details. Anyhow, old concepts have new names and are developed differently. The most important thing is that at least it has been recognised.
Anomaly detection is fine - but this is not the same as a data contract, any more than a monitor for JavaScript errors is the same as a service contract. Old concepts may have new names, but the example you have given is not an equivalent.
Data Contracts is not a new concept. Back in the early 2000s I designed and implemented an anomaly table in a data repository such as a data warehouse. Of course the technology used was not the same. I used PL/SQL and C. I can't believe that these days there are no such checks in place, and that it's assumed to be a new concept.
I'm not sure I would classify an anomaly table as an interface between producers and consumers, and when applied in the Data Warehouse it fails to satisfy the core requirements of a contract (being owned/managed by the producer of the data in upstream source systems). However, I completely agree contracts are not a new concept, much in the same way the first electric vehicles were created in the 1800s! Still, a concept existing and a concept being easy to implement and scale are two very different things. Thank you for the comment!
Why would you call an API a "contract"? That doesn't make sense in any other business context.
We call APIs contracts today. 'Service contract' is vernacular that every engineer who has worked on applications at scale would be familiar with. How it applies to other business contexts is irrelevant, because the context we are discussing is engineering, where contracts have a very specific and well understood meaning. Hope that makes sense!
Thanks for your quick response. But the term DATA CONTRACTS is getting more into the business vernacular due to the popularity of mesh and other well-promoted data concepts. So the business context is very relevant. To the non-technical ear, the word contract means an agreement between parties. Part of the reason that data is not embraced by businesses as it should be is that the terminology used by data people is confusing. Words like product, contract, and domain already have existing meanings the business understands.
Also, I have found several definitions of data contract that state it is similar to an SLA, i.e. an agreement between parties.
Data contracts are independent from data mesh. As stated previously - contracts are interfaces that allow producers and consumers of data to define and enforce requirements upstream, ideally through programmatic means. That already has a well understood meaning in the context of software engineering. Data not being embraced by the business has little to nothing to do with engineering/data terminology, and everything to do with the data they need not being accessible, available, accurate, or semantically valid.
Thanks for the comment!
Great article. Thank you for sharing your wisdom. By the way, I didn't understand one point. If a data contract is an agreement between the engineers who write services that generate data and the consumers of this data, why is source data definition a step in creating such an agreement: "...Clearly define the upstream source data..."? I mean, in the end, isn't the purpose to deliver data in the quality and shape consumers desire? Source data definition seems to fall on the engineering side.
Is the official Convoy blog post out already? Or is it the one about using CDC as the basis to build the data contract?
Also I came across this Data Contracts — The Mesh glue https://towardsdatascience.com/data-contracts-the-mesh-glue-c1b533e2a664
What are your thoughts?
I want to introduce the idea of using data contracts, aka interface agreements, at an MNC customer.
They don't have a coherent data strategy, and most of the ingestion is meant for coordination across different parts of the company.
Very little to no ML is needed.
The data is also pretty small, in the tens or hundreds of megabytes at most.
Can you help me understand a minimum viable data contract to start with, assuming I may build systems that are both a provider and a consumer of data for my MNC customer's other application teams?
Thank you!
Your and Yali's newsletters are really awesome in talking about data contracts.