Note from Chad: 👋 Hi folks, thanks for reading my newsletter! My name is Chad Sanderson, and I write about data products, data contracts, data modeling, and the future of data engineering and data architecture. I wanted to quickly share that I worked out a deal with O’Reilly so that people can download our upcoming book for free on our website. Follow the link below to get access to the early-release chapters. People who sign up will also receive follow-up emails containing updated PDFs with new chapters until the book's full release.
ELT: Everybody Looking Tirelessly
Before becoming a data engineer, I worked as a data scientist, relying heavily on the data lake managed by my engineering colleagues. In my first role, we used Athena on top of S3, where the intricacies of the data sources were hidden from me as I focused on SQL queries or pulled data into my Jupyter notebook. However, when data quality issues arose, I could identify the problems but couldn't fix them myself. This led to frequent delays, as the data scientists on my team and I waited for upstream engineers to resolve issues, which often took days or weeks. Despite being told that having data readily accessible within data lakes was beneficial, my experience was quite different. I often wondered whether this problem was unique to my organization or an industry-wide issue. This was also the early spark that inspired me to consider a career in data engineering.
A few years later, I found myself on the other side of the situation as a data engineer, still facing the same problems. Conversations with industry peers revealed that this issue was widespread, not limited to the companies I had worked at. I argue that one driver of this industry-wide problem was the push for the "Modern Data Stack" (MDS), a double-edged sword that created both opportunities and problems in today's data landscape. The rise of cloud computing around 2011, coupled with the hype surrounding data science (dubbed "The Sexiest Job of the 21st Century"), led to a rush of companies expanding their data capabilities. Specifically, cloud platforms lowered the barrier to entry for data work compared to costly and time-consuming on-prem data warehouses. This data gold rush birthed the MDS, which vendors promoted as the default approach to becoming data-driven, conveniently offering the necessary tools to achieve it.
While it’s easy to criticize the limitations of the MDS today (even Tristan Handy of dbt, who helped popularize the term, is now distancing himself from it), I believe the data industry owes much to the rise of the MDS. Its prominence spurred the creation of numerous data roles and business considerations centered around data. However, the dark side of the MDS is that it led companies to default to ELT processes: dumping vast amounts of data into a data lake (often via Fivetran), replicating it into an analytical database (like Snowflake or Databricks), and then modeling it into a medallion architecture (typically via dbt). Although this architecture is a valid option, companies found themselves in trouble because it became the industry's default choice via the MDS. Many companies opted for a data lake, dumped a bunch of data into it hoping for potential value, and ultimately ended up with a data swamp that was hard to navigate. As a result, companies are now working to clear these swamps, which brings us to the focus on S3.
Why S3?
There are numerous solutions for data lake storage, so why focus only on S3 and its implications for data quality? Simply put, AWS is the most popular cloud platform: as of Q1 2024, AWS accounted for 31% of the cloud market, compared to Azure at 25% and GCP at 11%. This is further demonstrated by the Google Trends graph below, highlighting AWS’s dominance over time. This is not to make a judgment on which platform is better, but rather an observation that many data systems will likely utilize AWS and thus S3.
In addition to being the most popular, S3 is also the use case I most often come across when discussing data contracts with teams. While I argue that data quality needs to be considered before the data lake (e.g., at CRUD operations and ingestion), most people start thinking about data quality at this stage in the data lifecycle because it’s the most visible and painful. Coupled with the data lake often being the most upstream source that data teams have access to, many data quality issues are perceived to point back to the data lake (S3).
Data Quality Considerations for S3
A great starting point for considering data quality in S3 is to look at AWS Glue. While not everyone will leverage Glue, it provides insight into AWS’s perspective on data quality for its own platform. It was actually my first step when I was tasked with scoping what data evaluation on S3 would look like for a project I worked on. Below are the categories supported by Glue’s Data Quality Definition Language (DQDL); note that the category groupings are my own, but the rule types within them come directly from DQDL:
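As a hedged illustration of how a few of these DQDL rule types can be combined in practice, here is a minimal sketch of a ruleset registered against a Glue Catalog table via boto3. The rule types (IsComplete, ColumnValues, RowCount, DataFreshness) are real DQDL constructs, but the database, table, column names, and thresholds are assumptions made for this example.

```python
import boto3

# An illustrative DQDL ruleset. The rule types are from Glue's Data Quality
# Definition Language; the columns and thresholds are made up for this sketch.
RULESET = """
Rules = [
    IsComplete "order_id",
    ColumnValues "order_status" in ["PLACED", "SHIPPED", "DELIVERED"],
    RowCount > 1000,
    DataFreshness "updated_at" <= 24 hours
]
"""

glue = boto3.client("glue")

# Register the ruleset against a (hypothetical) Glue Catalog table that sits
# on top of the S3 data; evaluation runs can then be started from the Glue
# console or the API.
glue.create_data_quality_ruleset(
    Name="orders_basic_checks",
    Ruleset=RULESET,
    TargetTable={
        "DatabaseName": "lake_db",  # assumed database name
        "TableName": "orders",      # assumed table name
    },
)
```

Once a ruleset like this exists, evaluation runs can be scheduled and their results routed to whatever monitoring the team already uses.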
Outside of Glue, we can also reference the O’Reilly book, Data Quality Fundamentals, which categorizes the components of data quality as:
Freshness
Distribution
Volume
Schema
Lineage
From these two sources, a pattern emerges: data consumers consistently care about whether data is up-to-date, whether its schema matches expectations, and whether the data's semantics are changing unexpectedly (beyond typical data drift). While ensuring data quality in a relational database is already challenging, achieving it in a data lake becomes a herculean effort for two main reasons: volume and variability.
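Before reaching for any particular tool, it helps to see how small these first checks can be when run directly against S3. Below is a minimal sketch assuming a Parquet dataset under a known bucket and prefix that already holds at least one object, plus an agreed-upon set of expected columns; every name here is illustrative.

```python
from datetime import datetime, timedelta, timezone

import boto3
import pyarrow.dataset as ds

BUCKET = "example-data-lake"   # assumed bucket
PREFIX = "bronze/orders/"      # assumed dataset prefix

s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])

# Freshness: has anything landed under this prefix in the last 24 hours?
latest = max(obj["LastModified"] for obj in objects)
is_fresh = datetime.now(timezone.utc) - latest < timedelta(hours=24)

# Volume: is the total size of the data under this prefix plausible?
total_bytes = sum(obj["Size"] for obj in objects)

# Schema: do the Parquet files expose the columns consumers expect?
expected_columns = {"order_id", "order_status", "updated_at"}  # assumed contract
dataset = ds.dataset(f"s3://{BUCKET}/{PREFIX}", format="parquet")
missing = expected_columns - set(dataset.schema.names)

print(f"fresh={is_fresh}, bytes={total_bytes}, missing_columns={missing}")
```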
Challenges Data Teams Need to Overcome in S3
We now understand the levers we can pull to manage data quality in S3, but why is it still so difficult to enforce? As mentioned, the primary challenges data teams face in S3 (and data lakes generally) are volume and variability. Unfortunately, data quality issues can't be resolved with tooling alone. I often proclaim, “Poor data quality is a people and process problem masquerading as a technical problem.” The challenges of volume and variability underscore this point.
The strength of S3 is also its Achilles' heel: you can store virtually anything in it. Structured data, unstructured data, incomplete data, image data, audio data, Avro, Parquet, JSON—you name it, and you can likely store it in S3. This flexibility is incredibly powerful for machine learning and other downstream workflows with diverse requirements. However, it also creates a long tail of potential data quality problems, with silent edge cases often causing the most significant issues.
Another strength—but also an Achilles' heel—of S3 is its volume capabilities. S3's ease of loading data and its infinite scalability can turn it into a data dumping ground, especially when the data's use case is unclear. This lack of constraints within S3 fosters an environment ripe for data quality disasters, as teams can load any data into the lake without limits. As the data volume grows, teams must navigate the long tails of variability while facing a "needle in a haystack" challenge to identify and resolve issues.
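To make the "needle in a haystack" point concrete, here is a rough sketch that inventories a bucket by top-level prefix and file extension, which is often the quickest way to see how much variability has quietly accumulated. The bucket name is an assumption.

```python
import os
from collections import Counter

import boto3

BUCKET = "example-data-lake"   # assumed bucket

s3 = boto3.client("s3")
formats_by_prefix: dict[str, Counter] = {}

# Walk the whole bucket (paginated), grouping objects by their top-level
# prefix and counting file extensions to surface how varied the lake is.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        prefix = obj["Key"].split("/", 1)[0]
        ext = os.path.splitext(obj["Key"])[1] or "<no extension>"
        formats_by_prefix.setdefault(prefix, Counter())[ext] += 1

for prefix, counts in sorted(formats_by_prefix.items()):
    print(prefix, dict(counts))
```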
The lack of technical limitations in S3 means that it’s up to the people and processes of technical teams to protect the data lake. Organizations should be asking themselves the following questions:
How do we differentiate between data we need to operationalize and data we are still evaluating for value?
What data should we avoid storing in S3?
For the data in S3, how do we keep track of available data assets, their schemas, and underlying data statistics? (This is why teams often consider AWS Glue; see the sketch after this list.)
As use cases grow and data variability increases, how do we ensure the engineering and data teams stay aligned on data usage?
What is a data asset's “blast radius” within a data lake if it suffers from data quality issues, and which teams need to be notified?
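For the cataloging question above, a minimal sketch of pulling an inventory of tables, S3 locations, and schemas out of the Glue Data Catalog with boto3 might look like the following; the database name is an assumption.

```python
import boto3

glue = boto3.client("glue")

# List every table registered in one (assumed) Glue Catalog database, along
# with its S3 location and column schema, as a rough inventory of the lake.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="lake_db"):  # assumed database
    for table in page["TableList"]:
        storage = table.get("StorageDescriptor", {})
        columns = [(col["Name"], col["Type"]) for col in storage.get("Columns", [])]
        print(table["Name"], storage.get("Location"), columns)
```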
The challenge of volume and variability within the data lake eventually pushed me toward evangelizing data contracts and inspired me to write a book on the topic. As I delved into the data pipelines upstream of the data lake, I saw how siloed the pre- and post-data lake teams were. In my article OLTP vs. OLAP: The Core of Data Miscommunication, I discuss these silos in detail. The core idea is that upstream software engineers operate under constraints that have little to do with how their data is consumed downstream, while downstream data teams don't account for how transactional data needs to be structured. Both teams want to collaborate effectively, but broken communication between them, stemming from issues with people and processes, hinders this. I argue that data contracts can improve communication across these silos, ensuring data assets meet consumers' needs, especially for data lakes leveraging S3.
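To make the idea slightly more tangible, here is one hedged sketch of what a contract for a single S3-backed asset might capture, expressed as a plain Python structure; the fields and values are illustrative assumptions rather than a prescription from the book.

```python
# A minimal sketch of a data contract for one S3-backed asset.
# Every field name and value here is an illustrative assumption.
orders_contract = {
    "asset": "s3://example-data-lake/bronze/orders/",
    "owner": "checkout-service-team",
    "consumers": ["analytics", "ml-forecasting"],
    "schema": {
        "order_id": "string, required, unique",
        "order_status": "enum: PLACED | SHIPPED | DELIVERED",
        "updated_at": "timestamp, UTC",
    },
    "slas": {
        "freshness": "new data lands within 24 hours",
        "completeness": "order_id is never null",
    },
    "change_policy": "breaking schema changes announced 14 days in advance",
}
```

Whether something like this lives as YAML, JSON Schema, or code matters less than the fact that both the producing and consuming teams agree to it and can check it automatically.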