And Why the Modern Data Stack Isn't Enough To Save Them
Ok this seriously triggered me! And I see pretty much all of this where I work! You can’t get two numbers to reconcile? Just create another query. Or a table. And so on...
Really enjoyed this article, recognize quite a bit. Looking forward to part 2 - the solutions ;)
Fantastic article — also looking forward to part 2!
I am adding this to my summarization Substack. There's too much to keep up with, so I can't write my own pieces; I'm just going to write summaries and hot takes there...
The problem is very real and is directly related to scale. The suggestion (which is for the author’s next edition) is better communication. But I think the notion of Data Contracts is what will emerge (the author also writes about it; I haven’t had a chance to read that yet, but we are thinking in the right direction). In the same way that blockchain protocols for data are adding slashing penalties, oracles, and ZKP-backed attestations to verify quality, freshness, and source of truth, those models will enter enterprises to create trustworthy contracts between data products and consumers, both within and outside the perimeter or even the business boundaries.
I believe we are already seeing the infrastructure emerge, with ways to secure and trust Kafka pipelines for consumption. We may even start to see enterprises monetize these contracts and offer them as proof-of-trust to ecosystem partners or permissionless third parties.
Unfortunately, it's a true and harsh reality, one that none of the enterprises want to admit because they get carried away competing on scale and other aspects, even though they understand the challenge.
This was a fun read, you need to do this more often!
Such a great read. I miss the SSOT 🥲
After freelancing as a data scientist I had my first data job at a company where the Data & Analytics team consisted of 3.5 data scientists and 40 BI/ETL engineers. It felt a bit lonely as a data scientist and I did quite a bit of ETL work, which wasn’t really my cup of tea. Little did I know how lucky I was to gain this experience.
We were a Microsoft Platinum Partner. We had three data engineering MVPs. All our solutions were best (or pretty good) in class. It taught me many valuable principles.
After that I never saw ETL and data warehousing at the same level anywhere else. Most things seem duct-taped together. I can relate to pretty much everything you write and sometimes I wonder how things around us are still working 😂
Many times I have seen data scientists “descend” to the “datawarehouse” to create new views and tables in an additional layer to quickly fix problems, because the data engineers didn’t have time. I tried to challenge this in multiple organizations, saying “you have an actual DWH, but you also have a ghostship DWH… we need to get rid of that ghostship”.
I never succeeded. It’s hard to get it done from a data science position. I’m hoping that from an MLOps perspective I will get a better mandate.
Let’s see 🤷🏽‍♂️
This is a very good article!
Great post. You might be interested in what we're building at www.dagworks.io using https://github.com/dagWorks-Inc/hamilton! We think we can be part of the solution here -- and we think it starts with changing how people write code... ! Would love any feedback.