6 Comments

Great article! It is good to read about an actual technical implementation of a topic that is mostly discussed at a more conceptual and functional level. I would love to get some insight into a technical solution like this, but for typical batch / file-based exchange of data between producers and the data lake/warehouse.

"The entity gates of reality" :)

Love it. In implementing these kinds of things throughout my career, the main problem as you point out is the enforcement. This ends up going down one of two paths:

1) The battle over the PR. The product engineers/app owners (or as you say, Data Producers) have criteria added to their PRs. This almost never works in my experience, though I'd like to know how people have been successful with it.

2) Chargebacks. If you screw up, you pay. This is 'the stick' approach. This type of model is generally hard to implement at tech and digital-native companies, which tend to have less mature budget management, though it is fairly common at large organizations in any industry, tech or otherwise. I'm also curious about success stories on when to enforce 'the stick' at startups/scaleups, etc.

It's a great point. What we've observed at Convoy is that everyone wants the best for the company. Engineers want to be helpful, but a request for additional criteria from producers is usually seen as an effort completely disconnected from service maintenance. The ask typically lacks context, the tooling to do it is outside what SWEs are used to, and it creates a set of new constraints (never alter a db) which no engineer wants to take on.

Where we have had a lot of success is in being incremental and use-case driven, focusing on production-grade use cases (financial reporting pipelines, ML training sets, business-critical dashboards, and so on). The first stage of contract implementation is limited to schema management: "Please don't break what already exists." We've also rolled out lineage solutions that help data producers understand how the data is being consumed downstream, which has been very helpful.
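As a rough illustration of what a "don't break what already exists" check could look like, here is a minimal sketch. It assumes a schema is simply a list of named, typed fields; the `Field` class, the `breaking_changes` function, and the example field names are all hypothetical and are not a description of Convoy's actual tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One column/attribute in a (hypothetical) contract schema."""
    name: str
    type: str


def breaking_changes(old: list[Field], new: list[Field]) -> list[str]:
    """Return changes in `new` that would break consumers of `old`.

    Removing a field or changing its type is treated as breaking;
    adding a new field is allowed.
    """
    new_by_name = {f.name: f for f in new}
    problems = []
    for f in old:
        current = new_by_name.get(f.name)
        if current is None:
            problems.append(f"removed field: {f.name}")
        elif current.type != f.type:
            problems.append(
                f"type changed: {f.name} {f.type} -> {current.type}"
            )
    return problems


# Hypothetical example: one type change (breaking), one added field (fine).
old_schema = [Field("shipment_id", "string"), Field("price_usd", "double")]
new_schema = [
    Field("shipment_id", "string"),
    Field("price_usd", "string"),
    Field("carrier", "string"),
]
problems = breaking_changes(old_schema, new_schema)
```

A check like this could run in CI on the producer's side and fail the build when `problems` is non-empty, which keeps the first stage limited to schema management rather than semantics.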

Once the contract exists, the conversations around the enforcement criteria become much more frequent and meaningful. Teams in which the discussion wasn't happening at all are now talking constantly, and the enforcement (while not yet the stick-based model) is being included as part of the on-call burden. As I see it, the transition towards data contracts isn't too dissimilar from Agile: it will begin iteratively at the team level and then gradually expand until it requires top-down governance.

Your Data Quality Camp link does not work.

That's a pretty great article, and it makes a lot of sense to me.

I'd say the requirements for contracts are pretty spot on and draw a lot from proven software engineering practices.

I'd say the hardest to implement are "Data contracts must be enforced at the producer level" and then "Data contracts cover semantics" (so I am really curious to see part 3).

"Data contracts must be enforced at the producer level" - that seems hardest. On one hand, the producer (e.g. the software engineers creating the product or service) needs to be aware of the impact of changes in the semantics of the data, which may not change product behaviour but would change analytics. On the other hand, there are purely seasonal changes (normal semantic/data drift) that depend on things outside even the producer's control, and these can also change the semantics of the data.

"A contract by definition requires enforcement." I probably missed this in the text, but who will be in charge of this enforcement? Let's say I have a DevOps team and a DataOps team with product owners, and then a CIO at the top of the org.
