How Scale Kills Data Teams

May 19, 2023

And Why the Modern Data Stack Isn't Enough To Save Them

9 Comments

May 22, 2023

Ok this seriously triggered me! And I see pretty much all of this where I work! You can’t get two numbers to reconcile? Just create another query. Or a table. And so on...

Maaike Siegel

May 25, 2023

Really enjoyed this article, recognize quite a bit. Looking forward to part 2 - the solutions ;)

Richard Tran

May 29, 2023

Fantastic article — also looking forward to part 2!

Timbo

Jun 9, 2023

I am adding this to my summarization Substack -- there's too much to keep up with and I can't write my own articles, so I'm just going to write summaries and hot takes in my Substack...

Hot Take:

The problem is very real and is directly related to scale. The suggestion (which is for the author’s next edition) is better communication. But I think the notion of Data Contracts is what will emerge (the author also writes about these; I haven’t had a chance to read that yet, but we are thinking in the right direction). In the same way that blockchain protocols for data are adding slashing penalties, oracles, and ZKP-backed attestations to verify quality, freshness, and source of truth, those models will enter enterprises to create trustworthy contracts between data products and consumers, both inside and outside the perimeter or even the business boundaries.

I believe we are already seeing the infrastructure emerge, with ways to secure and trust Kafka pipelines for consumption. We may even start to see enterprises monetize these contracts and offer them as proof of trust for ecosystem partners or permissionless third parties.

Harish Ajjarapu

Jun 2, 2023

Unfortunately, this is a true and harsh reality, one that no enterprise wants to admit because they get carried away competing with rivals on scale and other aspects, even though they understand the challenge.

Arpit Choudhury

Jun 1, 2023

This was a fun read, you need to do this more often!

Raphaël Hoogvliets

Sep 28, 2023

Such a great read. I miss the SSOT 🥲

After freelancing as a data scientist, I got my first data job at a company where the Data & Analytics team consisted of 3.5 data scientists and 40 BI/ETL engineers. It felt a bit lonely as a data scientist and I did quite a bit of ETL work, which wasn’t really my cup of tea. Little did I know how lucky I was to gain this experience.

We were a Microsoft Platinum Partner. We had three data engineering MVPs. All our solutions were best (or pretty good) in class. It taught me many valuable principles.

After that I never saw ETL and data warehousing at the same level anywhere else. Most things seem duct-taped together. I can relate to pretty much everything you write and sometimes I wonder how things around us are still working 😂

Many times I have seen data scientists “descend” to the “data warehouse” to create new views and tables in an additional layer to quickly fix problems, because the data engineers didn’t have time. I tried to challenge this in multiple organizations, saying “you have an actual DWH, but you also have a ghost ship DWH… we need to get rid of that ghost ship”.

I never succeeded. It’s hard to get it done from a data science position. I’m hoping that from an MLOps perspective I will get a better mandate.

Let’s see 🤷🏽‍♂️

Aleš Najmann

Aug 17, 2023

This is a very good article!

DAGWorks Inc.

Jun 26, 2023 (Edited)

Great post. You might be interested in what we're building at www.dagworks.io using https://github.com/dagWorks-Inc/hamilton! We think we can be part of the solution here -- and we think it starts with changing how people write code. Would love any feedback.

Data Products
