Not responding on LinkedIn, because I need to be careful what I share.
"Modern" data modelling tends to only encompass analytical modelling. Even then, there isn't always a clear demarcation between a system model, a process model, a dimensional model, or an aggregation model. It's all just a hodgepodge of whatever gets metrics the fastest. This tends to depend a LOT on the accessibility and normalization form of a given application.
Moreover, analytical workloads and operational/integration workloads are extremely different, and the storage/access patterns for each SHOULD be different. Reverse ETL is just a messier, less performant implementation of MDM or integration middleware.
Also - business process owners are trending AWAY from using data-enforced standards for declaring their business processes. Microservice data producers often don't even use anything declarative to model their processes; instead they make "archi-toons." But there's a whole universe of standards that solidified over 20 years ago and cover everything from hardware design to service modelling:
https://www.omg.org/about/omg-standards-introduction.htm
The biggest opportunity, in my mind, is a system with a killer UX that enables business process owners to describe their processes using a mix of behavior and artifact assets. Process models could link to system-level models declared in one of those standards, as well as to more rarefied/groomed models that share the same business logic across both ODS/developer-services and analytical/metadata systems. These inter-relations, and the inter-relations between models, could collectively be kept in a knowledge graph - with both LDP and RDF aspects.
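To make the knowledge-graph idea a bit more concrete, here is a rough sketch in Python with rdflib - all URIs and model names are made up, and it's just one way those inter-relations could be expressed as triples:

```python
# A rough sketch with rdflib, all URIs hypothetical: process, system and
# analytical models linked as triples in a small knowledge graph.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("https://example.org/models/")  # made-up namespace

g = Graph()
g.bind("ex", EX)

# A business process model and the system-level model that realises it.
g.add((EX.OrderToCash, RDF.type, EX.ProcessModel))
g.add((EX.OrderToCash, RDFS.label, Literal("Order to Cash")))
g.add((EX.BillingService, RDF.type, EX.SystemModel))
g.add((EX.OrderToCash, EX.realisedBy, EX.BillingService))

# An analytical model that shares the same business logic.
g.add((EX.RevenueMart, RDF.type, EX.AnalyticalModel))
g.add((EX.RevenueMart, EX.derivesLogicFrom, EX.OrderToCash))

print(g.serialize(format="turtle"))
```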
There are tools out there that offer two-thirds of this, but nothing that unifies an EAM stack with a metadata system and analytical modelling tools. There's also very little in the way of a replacement convention for dimensional models when it comes to model-driven time-travel. There needs to be a reinvigoration of modelling standards for the config-as-code era, with a really sexy UX that brings information workers to the table. Every vendor and their mom will claim to have all this, but it will be readily clear when the real data-messiah system comes out...
Totally agree with you!
The real problem started with the decline of data ownership by business stakeholders. The moment data ownership was pseudo-owned by IT, because they were the only ones who understood how to fetch the data, a snowball effect began that has culminated in today's data modelling reality.
You have to go back to when we were using filing cabinets and folders to store data - the height of data ownership - to understand how far data ownership has declined. Business stakeholders even held the actual physical keys to those filing cabinets. Twice removed, stakeholders now rely on IT to apply ownership privileges and responsibilities to data on their behalf, but we all know IT's focus is on technology and applications, not data.
Business stakeholders need to be told that they must hold the reins of their data again, so that it represents business reality as closely as possible and adheres to all of their strategic business values. Applications should conform to the data, not the data to the applications. Then business stakeholders would once again rely on these data models as a reliable and indispensable interface (the "keys") with which to enforce, apply and practice ownership of their data.
Totally agree with everything in this post.
What companies just don't understand is that you cannot resile from the need to model data. If you don't model it up-front (explicitly), then it will be modelled after the fact (implicitly). In my experience, this is always more expensive in the long run.
We see this all the time in distributed teams recreating the same models over and over again, only with minor incompatible variations, baked into analytics models, ETL pipelines, data products, and application code. What was once explicit, discoverable and shareable becomes implicit, hidden, and siloed.
The other big problem is the almost total focus on relational data models, which work against building relationships in data. This is where semantic graph technology can make the biggest contribution - make relationships first-class citizens in data models.
Data Mesh is pretty much silent on these issues, which is its greatest weakness. Computational federated governance needs to embrace data modelling at its core to be truly useful.
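As a small illustration of what "relationships as first-class citizens" could mean in practice, here is a sketch with rdflib (made-up URIs, not a prescription) where the relationship is itself a triple and a SPARQL property path traverses it directly:

```python
# A small sketch with rdflib, made-up URIs: the relationship itself is a
# first-class triple, and a SPARQL property path traverses it directly.
from rdflib import Graph, Namespace

EX = Namespace("https://example.org/data/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)
g.add((EX.AcmeCorp, EX.supplies, EX.WidgetsInc))
g.add((EX.WidgetsInc, EX.supplies, EX.GadgetShop))

# "Who sits downstream of AcmeCorp, directly or indirectly?" - one property
# path instead of a recursive join over foreign keys.
results = g.query("""
    PREFIX ex: <https://example.org/data/>
    SELECT ?downstream WHERE { ex:AcmeCorp ex:supplies+ ?downstream . }
""")
for row in results:
    print(row.downstream)
```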
Totally agree on the framing of graphs. The user experience of querying and navigating graph DBs is pretty terrible, but once that is improved and SQL-like (or abstracted away entirely), navigating relationships will become far simpler and easier for everyone. Still waiting on that tech.
> If you don't model it up-front (explicitly), then it will be modelled after the fact (implicitly). In my experience, this is always more expensive in the long run.
strong agree with this
> The other big problem is the almost total focus on relational data models, which work against building relationships in data.
Until the technology and practices for graph-first databases mature enough to be on par with relational ones, we may have to do both.
Use relational first, then add graph DB features.
Not ideal but at least time to value might be faster.
Completely agree with this post, especially the part around friction. The biggest challenge I find you have to solve is convincing the organisation that it needs to review and update how the data is stored. Getting organisations to commit to changing something they think works OK is the hardest part, and that is where, as an industry, we may have made our lives harder. We have often found a way to extract the data that is required, even when we know it is not the optimal way. I have had instances where I delivered a request to a client after a week and could see they were significantly underwhelmed, because they didn't realise the pain I had to go through to get that data and visualise it in a way that makes sense for them.
Let me understand in more detail: is data modeling's aim to produce the same models created during application design, or is it a completely different thing? If it's the former, can those already-created models be reused by the data team in some way?
The aim of the models produced by application design is to most effectively run the application, whereas the aim of models in the data warehouse is to most accurately describe the real world.
The two use cases should actually become decoupled. What I find is that engineers shift more towards the data-driven semantic model than the application model over time - expressing real-world concepts accurately through events turns out to be the best method for moving to a data-driven architecture.
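To make the decoupling concrete, here's a rough sketch (names entirely hypothetical) of the difference between an application model and the semantic event that states the real-world fact:

```python
# A rough sketch, hypothetical names: the application model stores whatever the
# app needs to run, while the domain event states the real-world fact in
# business terms, independent of any table layout.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SubscriptionRow:
    """Application model: shaped for the app's CRUD needs."""
    id: int
    user_id: int
    plan_code: str
    status: str          # e.g. "A" / "C" - app-internal codes
    updated_at: datetime


@dataclass(frozen=True)
class SubscriptionCancelled:
    """Semantic event: what happened in the real world, in business terms."""
    subscription_id: int
    customer_id: int
    plan: str
    cancelled_at: datetime
    reason: str


def to_event(row: SubscriptionRow, reason: str) -> SubscriptionCancelled:
    """Translate the app's internal representation into the business fact."""
    return SubscriptionCancelled(
        subscription_id=row.id,
        customer_id=row.user_id,
        plan=row.plan_code,
        cancelled_at=datetime.now(timezone.utc),
        reason=reason,
    )
```

The application table can change shape freely; downstream consumers only ever see the event.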
Eagerly awaiting part 2
Really looking forward to part 2 of this article, which is outstanding! We as a data community contributed to the downfall of the data model in a big, big way - in some places the model became the "thou shalt", and in other places there was little regard given to modelling at all. The result: a spaghetti junction, with very few people understanding the full landscape of data and a heavy reliance on the few who actually have to look after it all from an end-to-end perspective.
The model has to be flexible, robust and ready to change at the drop of a hat. Designing exclusively to either Kimball or Inmon has pitfalls - if the models produced cannot find a happy medium between the two, it becomes that much more difficult to adapt to ever-changing business needs. And if the model is not business-driven and outcomes-led, it becomes super hard for the business to see the value of a well-thought-out model.
Super interested not only in how to bring agile to the model, but also in how to migrate the spaghetti to a better place without sending the execs running for the hills...
Could not agree more. The Dataedo cartoon is spot on, hilarious.
Thank you for sharing the article.
> Data Engineers are forced to be 'middlemen' between data producers and consumers.
I feel this so much.
Because of rapid changes in the business landscape, even robust data modeling simply cannot produce robust data models that stay static.
There's no more once and done for data modeling. Data modeling is a continuous task that will never end.
And yes, if the time to value takes too long, that's also poor data modeling.
So we may have to bridge the short-term solutions/models (which deliver value fast but aren't sustainable, scalable, or robust enough) with the long-term solutions/models (which don't deliver value as fast, but are more robust and scalable).
By the time the transition from the short term to the long term is done at one part of the infrastructure, another part requires a different short term solution.
And we're just running around from place to place, patching things to make sure no single spot gets too ugly too fast.
Trying to wish this away by looking for that silver-bullet technology is just wishful thinking - anyone who does is not meant to be in data engineering. We should just accept that.
This reminds me of a quiet scene in Die Hard 4.
https://www.youtube.com/watch?v=Qga3aLPB0YE
"because there's nobody else to do it." So we have to be that guy.
We have all tried to look for an alternative solution or arrangement, like John McClane looking for somebody else, but there's nobody else to do it. So we have to be that guy.
Interesting article. Data modeling requires a different way of thinking. Our department has only one data modeler, so they are trying to teach some of us to do modeling for our own projects. I had to design a table with about four columns we need to fetch, and it took me the better part of a week.
Thanks for writing this, lots of great insights. The tools you've described as UI-based systems for front-end tracking (Amplitude, Mixpanel, Snowplow) are often misunderstood. Their data collection capabilities (focused on event or behavioural data) go beyond the front-end and don't really rely on a UI.
The bigger problem is that the data collected by these tools is often high-volume, but as you rightly pointed out, the implementation or the modeling is almost always left to engineers who have little to no business context.
And this is precisely the problem that led me to create Data-led Academy (https://dataled.academy), where I've published a series of guides that help non-engineering folks understand the instrumentation process and play an active role in preventing the eventual data swamp. There are now well-defined tools and frameworks for collecting good-quality event data - maybe there's room to create something similar for data that comes from 3P sources.
Even if these tools can go beyond the front-end with webhooks, they cannot be used within the service itself or for transactional data - which is usually the main source of data for machine learning teams and the vast majority of in-house analytics. What an ML team needs to record is every time a transactional event is written to a database, and this data must be 100% accurate or else drift will occur over time. Tools like Mixpanel and Snowplow are black boxes and the wrong tools for this type of job. Protocol Buffers and Avro are much better.
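To illustrate the contrast, here's a minimal sketch using the fastavro library, with a hypothetical event and field names, where the Avro schema is the explicit contract for a transactional event rather than something buried inside a black-box tool:

```python
# A minimal sketch, not anyone's actual setup: an explicit Avro schema for a
# hypothetical "order placed" transactional event, using the fastavro library.
# The schema is the contract, so a malformed record fails at write time.
from io import BytesIO

from fastavro import parse_schema, schemaless_reader, schemaless_writer

# Hypothetical event and field names.
ORDER_PLACED_SCHEMA = parse_schema({
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "example.events",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "amount_cents", "type": "long"},
        {"name": "occurred_at_ms", "type": "long"},  # epoch millis of the DB write
    ],
})


def encode_order_placed(event: dict) -> bytes:
    """Serialize one event; fastavro raises if the record violates the schema."""
    buf = BytesIO()
    schemaless_writer(buf, ORDER_PLACED_SCHEMA, event)
    return buf.getvalue()


def decode_order_placed(payload: bytes) -> dict:
    """Read the event back using the same writer schema."""
    return schemaless_reader(BytesIO(payload), ORDER_PLACED_SCHEMA)


if __name__ == "__main__":
    raw = encode_order_placed({
        "order_id": "o-123",
        "customer_id": "c-456",
        "amount_cents": 1999,
        "occurred_at_ms": 1700000000000,
    })
    print(decode_order_placed(raw))
```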
I think this will make a great topic for a Data Beats episode: https://arpitc.substack.com/s/data-beats
Great post! I've just recently entered the data world myself, and this post lines up very well with my outsider analysis of the modern data stack. It's quite interesting that the trend is to apply advanced technologies on top of broken things rather than focusing on fixing the fundamentals. One thing I've been thinking: if companies want to become more data-driven, and thus everyone in the company should be able to easily interact with data, why not use data models not just for data projects, but as a way for anyone in the company to easily see what data is available and what the semantics behind it are? Again, I'm still an outsider with very limited knowledge, but from a business perspective I think that would make sense, as a good sales pitch. That being said, models are still tricky to read if you're not a technical person, so there should be some more simplified version that is readable by anyone. Looking forward to part 2 of your post!
Already shared this post with every person I know. I understand some decisions not to model data thoroughly when your business is probably changing a lot in its early days, but that mess comes with a price. And that price is high! Something I noted too is that books on data modelling are extremely rare; the good ones are decades old.
Totally agree. That's why we built a tool that lets data consumers work on the data lake directly without asking a data engineer for help. Users who know Excel can use the tool with no difficulty, and they just need to work in a spreadsheet to clean, enhance and transform data as they want. Data modeling then becomes just a natural part of using data.
Excellent summary and introduction to the topic.
I think one of my trigger moments was discovering the approach of modeling your business data first and then seeing how the incoming data fits into it. This was heavily influenced by one of the first chats we had about this topic.
I started with Kimball when I began building models, because everyone (at least in Berlin) did so. And it basically moved me away from the classics: it was a coherent approach but never really fitted me or the challenges I had. Data Vault looked more promising but was also a ton to learn and test.
So I navigated on my own and got better, but I was always on the lookout for a data model v2022. So really looking forward to your ideas in the next post.