Congratulations for this discovery. I am surprised this is not tought in uni and bootcamps - after all this is not new insight, it has been known for at least 3 decades now.
Hi Tom, definitely nothing new being discovered, but I have personally found that engineers (producers) and data developers (analytsts/data scientists/consumers) are frighteningly unfamiliar with each other's work. Just because something is known by data engineers or architects doesn't mean it's known by everyone! It's important to ensure both sides A.) remember the details and B.) are empathetic towards the other side.
OLAP and the "need" for denormalisation is a misconception. OLTP/OLAP refer to the physical aspect of SQL database products but underlying it we suppose the relational data model and this is the same independently of OLTP/OLAP. An "analytical table" should also be normalised (5N or 6N not 3N [or more commonly Boyce-Codd Normal Form]). The "denormalisation" might appear as an intermediate logical step (not part of the data model per se) before an aggregation (or other function) is calculated but once data is aggregated the result relation should be normalised too.
Besides that super nice article and I completely agree with you approach recommendations and the stress on empathy. I have been working through meetings between the data team and the platform engineers team in the company I work and it i very clear that going there and being on-boarded into their (platform engineers or developers) getting to learn how the collect model and think data in their context and the related challenges is super important. Also participating on the phase where data modelling at producers level is defined is super important to foresee later issues in the "data" team. I do thin that a lack of knowledge on how data is defined, why data and information (as semantic content [Luciano Floridi]) are not the same thing and also a lack of knowledge on what is actually a data model and more specifically a relational data model (most people I met never read C. J. Date, H. Darwen, E. F. Codd, D. McGoveran or Fabian Pascal).
Thank you Mark, I agree that the OLTP vs OLAP divide is not talked about enough in the data world. However, I thought your database example was a bit contrived. There is no reason for the User Product Reviews table in 3NF, you can just have the product_id and user_id column in the Reviews table because each review has exactly one product_id and user_id.
Hello Angela! First, thanks so much for taking the time to read the article, it means the world to me. Regarding the User Product Reviews table, I actually went back and forth with my technical reviewer on this. The reason we opted for a junction table was to satisfy 2NF where "all non-key columns must be fully dependent on the primary key". Specifically, it can be argued that "product" is not fully dependent on "reviews" but the converse is.
For the point about the database example being "contrived" I completely agree and that was the intention. The article was about workflows and perceptions of data, and the database example served as a way to get both sides on the same page in very simple terms. If this article was about data modeling itself, then a different example is needed.
Hi Mark, thank you for your answer! I quite agree that the data modeling is not the point of the article, I just thought presenting this data model as being due to 3NF was misleading. I think your interpretation of 2NF is not quite correct because your source took a shortcut in their definition, which should actually read "It does not have any non-prime attribute that is functionally dependent on any proper subset of any candidate key of the relation". Unclear, I know, but a candidate key is "any set of columns that have a unique combination of values in each row". I found the examples provided in the Wikipedia articles helpful:
In my understanding, product_id and user_id are actually part of the candidate key for a review. I think putting them in the reviews table would quite fit under the 3NF definition: "[every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key".
Congratulations for this discovery. I am surprised this is not tought in uni and bootcamps - after all this is not new insight, it has been known for at least 3 decades now.
Hi Tom, definitely nothing new being discovered, but I have personally found that engineers (producers) and data developers (analytsts/data scientists/consumers) are frighteningly unfamiliar with each other's work. Just because something is known by data engineers or architects doesn't mean it's known by everyone! It's important to ensure both sides A.) remember the details and B.) are empathetic towards the other side.
OLAP and the "need" for denormalisation is a misconception. OLTP/OLAP refer to the physical aspect of SQL database products but underlying it we suppose the relational data model and this is the same independently of OLTP/OLAP. An "analytical table" should also be normalised (5N or 6N not 3N [or more commonly Boyce-Codd Normal Form]). The "denormalisation" might appear as an intermediate logical step (not part of the data model per se) before an aggregation (or other function) is calculated but once data is aggregated the result relation should be normalised too.
Besides that super nice article and I completely agree with you approach recommendations and the stress on empathy. I have been working through meetings between the data team and the platform engineers team in the company I work and it i very clear that going there and being on-boarded into their (platform engineers or developers) getting to learn how the collect model and think data in their context and the related challenges is super important. Also participating on the phase where data modelling at producers level is defined is super important to foresee later issues in the "data" team. I do thin that a lack of knowledge on how data is defined, why data and information (as semantic content [Luciano Floridi]) are not the same thing and also a lack of knowledge on what is actually a data model and more specifically a relational data model (most people I met never read C. J. Date, H. Darwen, E. F. Codd, D. McGoveran or Fabian Pascal).
Thank you Mark, I agree that the OLTP vs OLAP divide is not talked about enough in the data world. However, I thought your database example was a bit contrived. There is no reason for the User Product Reviews table in 3NF, you can just have the product_id and user_id column in the Reviews table because each review has exactly one product_id and user_id.
Hello Angela! First, thanks so much for taking the time to read the article, it means the world to me. Regarding the User Product Reviews table, I actually went back and forth with my technical reviewer on this. The reason we opted for a junction table was to satisfy 2NF where "all non-key columns must be fully dependent on the primary key". Specifically, it can be argued that "product" is not fully dependent on "reviews" but the converse is.
For the point about the database example being "contrived" I completely agree and that was the intention. The article was about workflows and perceptions of data, and the database example served as a way to get both sides on the same page in very simple terms. If this article was about data modeling itself, then a different example is needed.
quote source: https://www.databasejournal.com/mysql/clearing-a-path-through-the-3nf-join-jungle/
Hi Mark, thank you for your answer! I quite agree that the data modeling is not the point of the article, I just thought presenting this data model as being due to 3NF was misleading. I think your interpretation of 2NF is not quite correct because your source took a shortcut in their definition, which should actually read "It does not have any non-prime attribute that is functionally dependent on any proper subset of any candidate key of the relation". Unclear, I know, but a candidate key is "any set of columns that have a unique combination of values in each row". I found the examples provided in the Wikipedia articles helpful:
https://en.wikipedia.org/wiki/Second_normal_form
https://en.wikipedia.org/wiki/Candidate_key
https://en.wikipedia.org/wiki/Third_normal_form
In my understanding, product_id and user_id are actually part of the candidate key for a review. I think putting them in the reviews table would quite fit under the 3NF definition: "[every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key".
What do you think?