What to Consider When Building a Data Quality Strategy

Data Quality Series Part 2: Ensuring data quality is about finding the right balance—over-cleaning can remove valuable insights, while evolving data demands flexibility. This blog post explores how businesses can define quality thresholds, manage costs, and leverage AI-driven automation to maintain consistency and usability.


When we talk to customers about their data quality challenges and needs, regardless of the industry or company size, we hear a few common themes:

  • How do you define “quality”?
  • Can data be “too clean”?
  • How can we consistently apply data quality rules when data changes every day?
  • How can we ensure data quality within budget?

In this blog, we’ll review each of these topics with guidance on where data leaders and their teams need to focus to build a strong and lasting data quality strategy.

Current Quality vs. Ideal Quality: Striking the Right Balance

The struggle between current quality and ideal quality often comes down to where you set the threshold for acceptable quality. In traditional data systems, quality was often assessed in a silo; today, businesses need to think about data quality in the context of its broader use in achieving business outcomes. What quality score threshold is required to meet business needs?

Ultimately, data quality needs to be good enough to support sound decisions in the context of business goals. Pushing for higher quality is important, but it must be balanced against those goals; perfection is not always necessary if the data serves its purpose.
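As a minimal sketch of what a rule-based quality score and a fit-for-purpose check might look like (the rules, field names, and sample data below are hypothetical, not Pentaho functionality), the score can simply be the share of checks that pass across a set of records:

```python
from typing import Callable

# Hypothetical validation rules: each returns True if a record passes the check.
RULES: dict[str, Callable[[dict], bool]] = {
    "email_present": lambda r: bool(r.get("email")),
    "country_valid": lambda r: r.get("country") in {"US", "DE", "JP"},
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 < r["age"] < 120,
}

def quality_score(records: list[dict]) -> float:
    """Share of rule checks that pass across all records (0.0 to 1.0)."""
    results = [rule(r) for r in records for rule in RULES.values()]
    return sum(results) / len(results) if results else 0.0

customers = [
    {"email": "ana@example.com", "country": "US", "age": 34},
    {"email": "", "country": "FR", "age": 29},  # fails two of the three checks
]
print(f"quality score: {quality_score(customers):.2f}")  # 0.67
```

Whether 0.67 is acceptable depends entirely on the use case, which is the point of setting an explicit threshold rather than chasing perfection.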

The Risks of Over-Cleaning Data

While cleaning data is necessary, there’s a risk of over-cleaning, especially when the cleaning process removes important details. A classic example is middle initials in names: clean them away too aggressively and you may lose valuable information, potentially introducing bias into the data. Customer data can also be incorrectly excluded when cleaned values no longer match the golden record, causing critical information, like address changes, to be missed.
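A rough sketch of how this failure mode plays out (the names, cleaning rule, and matching logic are invented for illustration): aggressive normalization collapses distinct customers onto the same key, and a raw incoming record no longer matches the over-cleaned golden record, so a legitimate update is dropped.

```python
def aggressive_clean(name: str) -> str:
    """Over-cleaning: lowercase and drop middle initials, e.g. 'John Q. Smith' -> 'john smith'."""
    parts = [p.rstrip(".").lower() for p in name.split()]
    return " ".join(p for p in parts if len(p) > 1)  # single-letter initials are discarded

# Two distinct customers collapse onto the same cleaned key...
assert aggressive_clean("John Q. Smith") == aggressive_clean("John A. Smith")

# ...and a raw incoming update no longer matches the cleaned golden record,
# so a legitimate address change is silently missed.
golden_names = {aggressive_clean("John Q. Smith")}             # {'john smith'}
incoming = {"name": "John Q. Smith", "address": "98 Elm Ave"}  # address change
matched = incoming["name"] in golden_names                     # False: raw vs. cleaned mismatch
print("update applied" if matched else "address change missed")
```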

In some cases, too much cleaning could unintentionally eliminate valid records that would have been useful. It’s important to remember that data quality should not just be about removing “bad” data but also about understanding which data is valuable to retain.

The Changing Nature of Data

Over the past decade, the landscape of data has drastically changed. The concept of a golden record—a single source of truth—has become more complex. With the rise of social data and real-time interactions, organizations now need to be more flexible in how they collect and use data.

When organizations look back at their data from 10 years ago, they must acknowledge that it may no longer be as relevant. The world has changed, and so has the data we use to make decisions. The need for more dynamic and up-to-date data has become more critical.

Data as an Asset and Its Cost

Data is often referred to as the new oil, but it comes with significant challenges. Organizations must grapple with the balance between how much data they collect, the regulatory limitations surrounding it, the cost of storing and cleaning it, and whether it will ultimately be useful. Moreover, when models are trained using data from one region, they may not translate effectively to another. For instance, a model trained on US data may not perform well with EMEA data due to cultural and regulatory differences.

Creating the Conditions for Consistent Data Quality

These challenges – how to define quality, where to set cleaning thresholds, the changing nature of data, and the cost of cleaning data for different purposes – will only grow in complexity going forward.

No organization can meet a 100% quality threshold – doing so would be prohibitively expensive and would grind operations to a halt. Data leaders need to create a consistent policy approach, with clear guidelines on what quality means for each use case and role.
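One way to make such guidelines concrete is a simple policy table keyed by use case and role; a minimal sketch follows, where the use cases, roles, and numbers are purely illustrative:

```python
# Hypothetical policy: minimum acceptable quality score by (use case, role).
# Regulated workloads demand more than exploratory analysis.
QUALITY_POLICY: dict[tuple[str, str], float] = {
    ("regulatory_reporting", "compliance_officer"): 0.99,
    ("regulatory_reporting", "analyst"): 0.97,
    ("marketing_analytics", "analyst"): 0.85,
    ("exploratory_research", "data_scientist"): 0.70,
}

def required_threshold(use_case: str, role: str, default: float = 0.90) -> float:
    """Look up the minimum quality score for a use case and role, with a safe default."""
    return QUALITY_POLICY.get((use_case, role), default)

print(required_threshold("marketing_analytics", "analyst"))             # 0.85
print(required_threshold("regulatory_reporting", "compliance_officer"))  # 0.99
```

The value of writing the policy down is consistency: the same dataset can legitimately be “good enough” for one consumer and unacceptable for another, and an explicit table makes that visible rather than leaving it to ad-hoc judgment.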

Data leaders also need to consider how to leverage AI and machine learning to automate many of the processes that inform data quality – classification, scoring, and sensitivity detection. Solutions that automate these processes can do the heavy lifting while containing costs, enabling the organization to deploy a consistent data quality framework at scale across the business.
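As a deliberately simplified stand-in for the ML-driven classification a real platform would provide (the patterns, labels, and threshold below are hypothetical), even a pattern-based classifier shows the shape of automated sensitivity tagging: sample a column, assign a label, and route it to the appropriate masking and quality rules.

```python
import re
from typing import Optional

# Illustrative patterns only; a production system would use trained classifiers.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
    "national_id": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_column(sample_values: list[str], min_hit_rate: float = 0.8) -> Optional[str]:
    """Return a sensitivity label if enough sampled values match one pattern."""
    non_empty = [v for v in sample_values if v]
    for label, pattern in SENSITIVE_PATTERNS.items():
        hits = sum(bool(pattern.match(v)) for v in non_empty)
        if non_empty and hits / len(non_empty) >= min_hit_rate:
            return label
    return None

print(classify_column(["ana@example.com", "bo@example.org", "cy@example.net"]))  # email
```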

In our next blog on data quality, we’ll explore what data quality means in the age of GenAI and Agentic AI.