Note from Chad: 👋 Hi folks, thanks for reading my newsletter! My name is Chad Sanderson, and I write about data products, data contracts, data modeling, and the future of data engineering and data architecture. I've been hard at work building Gable these last few months, but today, Mark is taking over and providing a deep dive into how data teams manually resolve data quality issues. As always, I'd love a follow and share if you find it valuable.
While the data industry loves to highlight shiny new advancements **cough** generative AI **cough**, the reality is that many data practitioners find themselves in low-data-maturity organizations. This is less a reflection of those businesses than of just how hard data is to work with in the Modern Data Stack™ era.
I myself was in this position, but it served as the best training for understanding data quality in real-world settings. Through this, I was able to establish a data quality resolution process that helped me navigate many data fires. It consists of the following steps…
The Data Quality Resolution Process
0. Stakeholder Surfaces Issue
1. Issue Triage
2. Requirements Scoping
3. Issue Replication
4. Data Profiling
5a. Downstream Pipeline Investigation
5b. Upstream Pipeline Investigation
5c. Consult Technical Stakeholders
6. Pre-Deploy - Implement DQ Fix
7. Deploy - Implement DQ Fix
8. Stakeholder Communication
Below I provide details, code examples, diagrams, and communication strategies to help you resolve data quality issues when you are at a low-data-maturity company. While this can be adapted to many parts of the data lifecycle, it specifically takes the perspective of working with a data warehouse and/or analytical database.
Stakeholder Surfaces Issue
Stakeholders on the business side will either message the data group in Slack or create a Jira ticket regarding a data quality issue.
This stage in the process is a dangerous area for data teams! Specifically, you need to find the right balance of being responsive but not reactive. Responsiveness fosters trust between data teams and stakeholders, while reactiveness fosters burnout among data teams.
How you respond to these inquiries from the business will greatly shape your culture around data quality and, more importantly, the boundaries of the data team! Thus, it falls on managers and team leads within data departments to manage expectations with their direct reports and with the managers of the teams making these requests.
Ways I’ve seen this become successful:
Absolutely no requests via private messages as these requests often fall through the cracks (lose trust), are not visible to managers for workload (burnout), and lead to duplicate requests on the team (communication breakdown).
I respond to private requests with the following: “Thanks so much for bringing this to my attention! Can you please either post in <slack channel> or submit a Jira ticket for this? This helps the data team keep track of things and highlight the issue to others who might already have answers.”
Creating a Slack channel specifically for data quality requests or questions that the data team monitors; BONUS: I use these channel posts as proof points as to why we need to invest in infra or another data initiative.
VERY IMPORTANT - Managers go out of their way to protect the time of their respective data team, provide larger context to problems being surfaced, and ensure the data team has a balanced workload to avoid reactive burnout.
Issue Triage
I determine if the request is a "just curious" (backlogged), an issue that can wait (ticket assigned and scheduled), or an immediate need (work on request now).
I argue that the root of this prioritization challenge is an asymmetry of understanding: each party sees the data request and its business context from a different vantage point.
Stakeholders often see the "symptom" of bad data quality but don't have access to see the "cause" of the issue. While data teams understand the underpinnings of data sources, they may not fully see how the business is utilizing the numbers. Thus, ask clarifying questions such as:
For the impacted data, how do you or your team utilize these numbers?
Are these numbers shown to any external parties that are important (e.g., customers, regulators, etc.), and if so, when?
How does this issue impact the timelines of downstream parties?
How much "risk" is associated with the decisions made from these numbers; do these decisions require validated numbers or directional estimates?
Are there any alternative data sources you can use in the meantime that will still help you resolve your issue?
These are just examples, but my goal is to get both myself AND the requestor to understand if the issue can be a) "backlogged," b) "assigned and scheduled," or c) "work on request now."
Requirements Scoping
My issues were often in the data warehouse, but first, I reach out to the stakeholder to assess where and what they are seeing (bonus if they leave a detailed ticket description so I can skip this).
A common theme you will hear me say is that "data is abstract," and it's only amplified for folks who aren't data professionals. Thus, it can be a considerable challenge for stakeholders to communicate their challenges and needs properly. It's our job as data pros to help our stakeholders get there.
A huge trap is to rush into solving the problem first presented to you! Instead, take a moment to understand WHERE they are experiencing the DQ issue, WHY they think it's a DQ issue, and WHAT it impacts for them. Some great questions to ask:
What surface (e.g., product feature, dashboard, etc.) were you using when you first noticed the potential issue?
Are you seeing conflicting numbers in different surfaces for the same entity, and/or seeing the potential DQ issue propagate to other systems?
What other stakeholders are impacted by this that I should keep in the loop as we investigate this?
Then comes the fun part of combining the information from your stakeholders with your expertise of the data systems at your org. Some potential considerations:
What's the lineage of the impacted data?
For the systems touched by this lineage, do they have any quirks?
Was a change recently made to the related codebase?
How complex is the logic tied to the impacted data?
Are there any logs to give clues, or is this a dreaded silent error?
Once you have a grasp of the impact and the technical implications, you go back to the stakeholders to provide a realistic timeline and proposed solution.
Issue Replication
I now try to replicate the issue either by using the dashboard or pulling the data in the data warehouse via SQL-- this step helps me determine if it's a user error or an actual data quality issue.
You can do this check anywhere in the data lifecycle, but for this scenario we are focusing on the data warehouse! Replicating the issue serves two roles:
One:
You can quickly assess whether it's a user error with the data or a genuine data quality issue. Even if it's a user error-- this is more common than you think-- you are not off the hook! A user error hints at a potential issue in processes, how the data or tool is presented, or a lack of training impacting your data culture.
Don't just write off user errors; instead, make sure you can address the behavioral component of data quality as well. This behavioral component is the most difficult to manage, but, in my opinion, it will have the largest ROI if you successfully create a cultural change.
Two:
If it is a true data quality issue, replicating the issue gives you your first tangible clues in your investigation. In the scoping phase you are making informed assumptions and relying on previous knowledge of the data systems. In this step, you are validating that these assumptions are true.
Most times, your informed assumptions are validated, but this step becomes very interesting when your assumptions don't hold up. These are the most valuable clues of them all, as you may have stumbled across a hidden data quality issue, and now you need to investigate its actual root cause and impact.
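To make this concrete, replication in the warehouse is often as simple as recomputing the number the stakeholder is looking at. Below is a minimal sketch, assuming a hypothetical analytics.purchases table and Postgres/Snowflake-style SQL; your table, columns, and metric will differ:

```sql
-- Recompute the metric the stakeholder sees on their dashboard
-- (hypothetical table/columns: analytics.purchases, purchase_date, purchase_amount).
-- If this matches the dashboard but not the stakeholder's expectation, it's likely
-- a data issue; if it doesn't match the dashboard, also dig into the BI layer.
SELECT
    DATE_TRUNC('month', purchase_date) AS purchase_month,
    COUNT(*)                           AS purchase_count,
    SUM(purchase_amount)               AS total_purchase_amount
FROM analytics.purchases
WHERE purchase_date >= DATE '2024-01-01'
GROUP BY 1
ORDER BY 1;
```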
Data Profiling
If it's a true data quality issue, I then profile the data to determine: a) the timeline of the data (e.g., min and max dates), b) any nulls in the data (bonus if the nulls follow a pattern), c) any spikes or drops in counts, and d) a lot of group-by metrics across populations (e.g., org_id).
I can hear people in the comments already saying, "WHY DON'T YOU AUTOMATE THIS VIA A DATA OBSERVABILITY TOOL?" To which I respond that I completely agree... but... companies that invest in data observability are further along in data maturity. There are a lot of businesses, especially non-tech-industry businesses, that struggle with the basics. At a previous company, we spiked in some areas of data maturity, but data observability was not one of them.
Thus, your best bet is hopping into SQL and profiling the data to validate your assumptions and see where things break. Below is a made-up example of a purchases table with some weird data.
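Something like this, where the table and values are entirely fabricated for illustration (note the impossible future year and the NULLs):

```sql
-- A tiny, fabricated stand-in for the purchases table used in this example.
SELECT *
FROM (
    VALUES
        (1, 101, 'product_a', DATE '2023-04-02', 49.99),
        (2, 102, 'product_b', DATE '2023-04-03', 19.00),
        (3, 103, 'product_a', DATE '2105-04-03', 25.50),  -- impossible future year
        (4, 104, 'product_a', NULL,              NULL)    -- missing date and amount
) AS purchases(purchase_id, org_id, product, purchase_date, purchase_amount);
```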
Keep in mind that this is the exploration phase, so don't run out to your stakeholders once you find something weird. Instead, collect information to start building up a list of facts (e.g., "max purchase_date year is showing 2105, which isn't possible."). I especially like to save these in the Jira ticket so others can see what I explored if they are watching the ticket or referencing it at a later date.
Once you have this list of facts, you can review at a higher level what's happening with the data and form some great questions to ask stakeholders and engineering to start identifying the root cause.
Example of how I profile data:
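The sketch below illustrates the idea against the made-up purchases table above, in Postgres/Snowflake-style SQL; the real queries depend entirely on your schema:

```sql
-- a) Timeline of the data: are the min/max dates plausible?
SELECT MIN(purchase_date) AS min_purchase_date,
       MAX(purchase_date) AS max_purchase_date
FROM analytics.purchases;

-- b) Nulls: how many are there, and do they cluster in a particular population?
SELECT product,
       COUNT(*)                          AS row_count,
       COUNT(*) - COUNT(purchase_date)   AS null_purchase_dates,
       COUNT(*) - COUNT(purchase_amount) AS null_purchase_amounts
FROM analytics.purchases
GROUP BY product
ORDER BY null_purchase_dates DESC;

-- c) Spikes or drops in counts over time.
SELECT DATE_TRUNC('day', purchase_date) AS purchase_day,
       COUNT(*)                         AS purchase_count
FROM analytics.purchases
GROUP BY 1
ORDER BY 1;

-- d) Group-by metrics by population (e.g., org_id) to see if the issue is isolated.
SELECT org_id,
       COUNT(*)           AS purchase_count,
       MAX(purchase_date) AS max_purchase_date
FROM analytics.purchases
GROUP BY org_id
ORDER BY max_purchase_date DESC;
```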
Downstream Pipeline Investigation
From the problem table, I trace the lineage of transformations in the data warehouse all the way back to the ingestion table, specifically looking for bugs in the SQL code.
This is the step where we start diving into the data to identify the cause of the data quality issue. My main goal here is to begin filtering down my scope to help identify where exactly the problem is presenting itself.
Thus, these are the two levers I utilize to make that happen based on the data profiling I did in the step before:
1. Data Lineage
There are probably numerous tables that could be the source of the issue (some companies have thousands), but you can quickly narrow down the tables of interest by utilizing data lineage to reference the DAG (directed acyclic graph). The DAG shows how the tables connect to each other and where each value is sourced. Ideally, you will have a data lineage tool for this, or you can utilize dbt docs. If not, you have to do it manually by referencing the FROM statements in the SQL code.
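To illustrate the manual approach, here is a hypothetical chain of FROM references for our example; every table name here is made up for illustration, with combine_transactions.sql assumed to build the staging table:

```sql
-- purchases.sql (the problem table): its only upstream dependency is the FROM reference.
SELECT purchase_id, org_id, product, purchase_date, purchase_amount
FROM staging.combined_transactions;   -- built by combine_transactions.sql

-- combine_transactions.sql: its FROM references point at the ingestion tables.
SELECT purchase_id, org_id, 'product_a' AS product,
       CAST(purchase_date_text AS DATE) AS purchase_date,   -- product_a logs dates as text
       purchase_amount
FROM raw.product_a_events
UNION ALL
SELECT purchase_id, org_id, 'product_b' AS product,
       purchase_date,
       purchase_amount
FROM raw.product_b_events;

-- Manually derived lineage:
-- raw.product_a_events, raw.product_b_events -> combined_transactions -> purchases
```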
2. SQL File Review
Once you have your tables of interest, review the SQL code utilized to generate the tables. This step is why having maintainable and easy-to-read SQL code is so important! Here, I’m looking for business logic that either introduces an error, interprets logic incorrectly, or misses an edge case. In addition, this step can give you clues about potential upstream issues as well.
Remember our dataset in question?
This is the DAG behind the problem table:
Remember, these were our errors:
Since the error is related to purchases, we will focus mainly on the highlighted tables:
Drill down further to isolate the error:
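A drill-down query for this might look something like the following, again using the illustrative tables and columns from earlier:

```sql
-- Which product(s) do the bad purchase dates come from?
SELECT product,
       COUNT(*)           AS bad_row_count,
       MIN(purchase_date) AS min_bad_date,
       MAX(purchase_date) AS max_bad_date
FROM analytics.purchases
WHERE purchase_date > CURRENT_DATE   -- dates in the future shouldn't exist
   OR purchase_date IS NULL
GROUP BY product
ORDER BY bad_row_count DESC;
```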
Results show that the issue is isolated to product_a.
We now review the SQL file of the potential error, and we find the following:
There is an error in the transformation logic within the SQL file combine_transactions.sql. It seems like the quick fix to an upstream change didn't account for an edge case.
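To make this concrete, a bug like the one described might look roughly like the following sketch of combine_transactions.sql (table and column names are illustrative, not the real ones):

```sql
-- combine_transactions.sql (illustrative sketch of the bug).
-- product_a logs purchase_date as free text, so a "quick fix" cast was added.
-- The cast happily accepts typo years like '2105-04-03' -- the unhandled edge case.
SELECT purchase_id, org_id, 'product_a' AS product,
       CAST(purchase_date_text AS DATE) AS purchase_date,   -- no sanity check on the year
       purchase_amount
FROM raw.product_a_events
UNION ALL
SELECT purchase_id, org_id, 'product_b' AS product,
       purchase_date,                                        -- already a proper DATE
       purchase_amount
FROM raw.product_b_events;

-- One possible downstream guard (a patch, not the root-cause fix): null out
-- implausible dates so they can be flagged while the upstream logging is fixed.
SELECT purchase_id, org_id, 'product_a' AS product,
       CASE
           WHEN CAST(purchase_date_text AS DATE) BETWEEN DATE '2000-01-01' AND CURRENT_DATE
               THEN CAST(purchase_date_text AS DATE)
           ELSE NULL
       END AS purchase_date,
       purchase_amount
FROM raw.product_a_events;
```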
The next steps are to determine the best way to immediately resolve this error in the SQL code AND explore what's happening upstream in "product_a."
Upstream Pipeline Investigation
Once I identify the potential bugs in the SQL code, I need to determine whether a) we misinterpreted the upstream logic in our SQL code, b) the logic changed upstream, or c) there is a bug upstream. I do this by looking into the engineering codebase for changes as well as reviewing their documentation.
As a data consumer, such as a data scientist or data analyst, one of the most powerful skills you can have is being able to navigate production codebases. In my last role as a data scientist, this made me an extremely valuable asset, as we no longer had to ask the upstream engineering team to investigate (hint: it's typically not a top priority for them)-- ultimately saving days or weeks in resolving data quality issues.
Specifically, the initial search for impacted files, classes, and functions in the repository related to the data quality issue makes it substantially easier for the engineering team to help you. In addition, they most likely use version control, so you can identify who made the latest changes to the impacted code.
Similar to the data warehouse, I am tracing the lineage of the function calls that generate the impacted table in the data warehouse. Sometimes, this is very straightforward. Other times, you are clicking through functions and classes until you reach the file you need.
This again highlights the importance of well-documented, clean code when resolving data quality issues. Below is an example of this process in action.
Previously, we identified that the data quality issue is related to "product_a":
We could just resolve the issue by updating the logic in the SQL file... but that fixes the symptom and not the root cause. Thus, we need to go upstream in the codebase to understand what's happening in product_a.
We trace back the Python files that produce the 'product_a' table in our data warehouse to 'product_a.py'. Let's investigate!
In 'product_a.py' we find the issue in the function 'get_purchase_data()' within the 'ProductA' class.
Specifically, this function logs the problematic 'purchase_date' value as a free-text field that the user enters manually. It looks like there are cases of users accidentally entering the wrong dates.
The next steps are to:
1. Identify the eng team who owns the logging for 'product_a'.
2. Surface the downstream data issue of self-entered fields and get buy-in from eng to update it.
Consult Technical Stakeholders
Once I gather the facts from above, I reach out to engineering, confirming my logic and coming up with a plan to resolve either upstream in the eng codebase or downstream in the data warehouse.
The main surprise when I first entered my data career was how much communication there is. We are working on very complex and abstract problems; thus, our communication needs to be even stronger than our technical skills!
A huge mistake I see data professionals make at this step is just stating the problems without connecting them to the needs of their technical stakeholders, which gives them little reason to care. A great way to make this connection is to clearly articulate the problem, the impacted downstream stakeholders, what you have already done to research and resolve the issue, and tangible ways they can help.
A great framework for communicating technical problems is via the SBAR method from nursing, which stands for Situation, Background, Assessment, and Recommendation. The below example breaks down this framework with a mock use case of me speaking to the “eng team.”
Recap of our data quality issue:
Downstream users are seeing purchase dates that shouldn't exist.
Data lineage and profiling of the data point to the 'product_a' table.
There is a bug in the SQL transformation not accounting for the year edge case (that we can fix).
Upstream in the eng codebase, we traced the data quality issue to 'product_a.py', specifically the 'get_purchase_data()' function within the 'ProductA' class.
This function logs the problematic 'purchase_date' value as a free-text field that users enter manually, sometimes incorrectly.
Just sending this list of issues to a technical team is the wrong approach. You need to frame the information in a way that builds buy-in so other teams care about resolving the data quality issue. Also, remember that you have been close to a problem these stakeholders have not even seen yet... so you need to make the situation as easy as possible to understand.
Situation
Downstream users are seeing purchase dates in the purchases data that shouldn't exist (e.g., dates in the year 2105), and these values surface in the reporting the business relies on.
Background
Data lineage and profiling point to the 'product_a' table. Upstream, 'product_a.py' (the 'get_purchase_data()' function in the 'ProductA' class) logs 'purchase_date' as free text that users enter manually, and our SQL transformation contains a bug that doesn't account for the year edge case.
Assessment
We can patch the SQL transformation on our side to handle the edge case, but that only addresses the symptom; as long as 'purchase_date' is captured as free text, invalid dates will keep flowing downstream.
Recommendation
We'd like the team that owns the 'product_a' logging to validate or constrain the 'purchase_date' field at entry (or log it as a proper date type). We're happy to share our profiling results and help test the change.
Next, we will create a fix for this data quality issue and test the changes before deploying!
Pre-Deploy - Implement DQ Fix
I implement the change, profile the data from before and after to confirm the new change only impacts the expected value, and have the change go through code review.
One of the dangers of fixing a data quality issue is introducing a new data quality issue with your "fix."
Below is my 5-step process for mitigating this risk (a SQL sketch of the count checks follows the list):
Get the counts of the impacted table(s) BEFORE the fix, such as unique ID counts, table row counts, or counts by a specific time period.
Get the same counts you utilized in the above step AFTER implementing the fix for all downstream tables in the next two "layers" of the DAG.
Document what changed and what stayed the same before and after the fix, and why these changes match your expectations.
If there is a downstream component that a stakeholder can see (e.g. dashboard), ensure the changes are correct and document the difference to assist with change management with your stakeholders.
Include the above due diligence in your pull request for the change so your code reviewer has the proper context.
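As a rough sketch of steps 1 and 2 above, using the illustrative purchases table from earlier (your tables and counts will differ):

```sql
-- Step 1: BEFORE the fix, capture baseline counts for the impacted table
-- (plus any counts that matter for your use case, e.g., counts by time period).
SELECT COUNT(*)                    AS row_count,
       COUNT(DISTINCT purchase_id) AS unique_purchase_ids,
       COUNT(DISTINCT org_id)      AS unique_orgs,
       MIN(purchase_date)          AS min_purchase_date,
       MAX(purchase_date)          AS max_purchase_date
FROM analytics.purchases;

-- Step 2: AFTER the fix, rerun the exact same query against the rebuilt table and
-- against the downstream tables in the next two "layers" of the DAG, then compare.
-- Only the metrics you expect to move should move: here, max_purchase_date should
-- drop back to a plausible value while row_count and unique_purchase_ids stay flat
-- (unless the fix intentionally drops rows).
```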
The goal of this exercise is to conduct and document your due diligence to foster trust in your data within your organization. Your data is bound to get messed up and you can't catch everything, but you can control the perception of the data processes within your organization. Seeing this due diligence work puts your stakeholders at ease, even when inevitable data fires break out.
Deploy - Implement DQ Fix
Once the change is merged, I again review the production system and determine if the stakeholder's issue is resolved, and write up the issue cause and how we solved it for others to review in the future.
In the last step, we did a bunch of due diligence work to ensure our data quality fix didn't create new bugs and we saw the expected changes in downstream tables.
You have two scenarios after doing this work and deploying the code...
1) You have no infrastructure for CI/CD and unit tests (surprisingly very common among data teams), and you hope your change doesn't create a silent failure in production.
2) You have implemented CI/CD checks and taken the time to create unit tests, and you are confident that the code you deployed won't break silently for known issues.
If you find yourself in scenario one, I implore you to start creating tests for your changes and utilizing CI/CD. Doing this often forces you to write better code that is easier to maintain, which is especially important when dealing with data semantics.
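Even a single assertion query wired into CI (or run right after a deploy) goes a long way toward catching regressions of known issues. Here is a minimal sketch, assuming the illustrative purchases table from earlier and a dbt-style convention where a test passes when the query returns zero rows:

```sql
-- Assertion: no purchase dates in the future.
-- Any rows returned represent failures and should block (or at least flag) the deploy.
SELECT purchase_id,
       purchase_date
FROM analytics.purchases
WHERE purchase_date > CURRENT_DATE;
```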
Once the code is deployed, I write up the issue cause and how we solved it for others to review in the future. This helps train others on your data team and further pushes a culture of data quality.
Stakeholder Communication
I communicate the updates to the impacted stakeholder(s) with an emphasis on maintaining trust in our data and how we are able to resolve issues quickly.
One of the biggest mistakes I see data professionals make is forgoing communication about their impact. Many data professionals assume that the impact is apparent once the code is merged, but that's the furthest from the truth.
You need to answer three key questions for your stakeholders:
What was the problem and impact of not addressing this problem?
What is the solution that you implemented?
[Most Important] How can they use the solution today?
Why go through all of this effort? Because your most valuable asset as a data professional is TRUST-- and trust is built by fostering relationships.
You can have the most accurate data, the most advanced tech stack, and the most valuable insights... but it all means nothing if your stakeholders don't trust you or the data. Every time you communicate (or lack communication) about your data, especially among non-technical stakeholders, you are impacting that trust.
Final Thoughts
If this sounds like a lot, it's because it is. The hidden costs of data quality issues are 1) lost trust in your data, and 2) the time expensive staff spend reactively putting out fires rather than generating value. Both are further amplified when the data quality issue is tied to a revenue-generating data pipeline.
This is why automating the various steps in the data quality checks above is so important. The question "How can I improve this arduous process and prevent issues?" is what ultimately led me to quit my previous startup job to help build a data contract platform at Gable with Chad. Please feel free to reach out on LinkedIn if you want to learn more.
Though this process focuses on the manual side of resolving data quality, this is often the starting point for many data teams. Use this as a foundation to build off of once you are ready to automate!
Note: This was originally a LinkedIn post of mine. Felt that Substack was a better format for it.