Data Quality in the Age of AI and Machine Learning

Data Quality Series Part 3

Data quality is a crucial aspect of any organization’s operations, and its impact is growing as artificial intelligence (AI) and machine learning (ML) continue to evolve. However, determining what qualifies as “good enough” data can be a challenge. How do we define where to stop when it comes to ensuring data quality? What are the costs involved, and who is responsible for paying for it? These are just some of the questions that arise as businesses increasingly rely on data and AI for decision-making. Let’s break down some of the key considerations.

Tailored vs. Generic Data Cleaning Approaches

When it comes to cleaning data for AI or machine learning projects, the approach is typically use-case-specific or project-specific. Data scientists go straight to the source, then shape, cleanse, and augment the data in a sandbox for their project, modifying it in a way that aligns with the specific needs of the model. This stands in contrast to traditional data cleaning efforts within a data warehouse, where multiple levels of approvals and checks are in place.
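
To make this concrete, here is a minimal sketch of what project-specific cleansing in a sandbox might look like, assuming a raw extract loaded into a pandas DataFrame; the file name and column names are hypothetical and only illustrate the shape-cleanse-augment pattern described above.

```python
import pandas as pd

# Hypothetical raw extract pulled straight from the source system into a sandbox.
raw = pd.read_csv("customer_events_raw.csv")

# Shape: keep only the fields this particular model needs.
df = raw[["customer_id", "event_ts", "channel", "spend"]].copy()

# Cleanse: fix types and drop records unusable for this project,
# without altering the upstream source.
df["event_ts"] = pd.to_datetime(df["event_ts"], errors="coerce")
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")
df = df.dropna(subset=["customer_id", "event_ts"])

# Augment: derive features specific to this model.
df["is_weekend"] = df["event_ts"].dt.dayofweek >= 5
features = (
    df.groupby("customer_id")
      .agg(total_spend=("spend", "sum"),
           weekend_share=("is_weekend", "mean"))
      .reset_index()
)
```

The point is that the shaping, cleansing, and augmentation all happen inside the project workspace and are tailored to one model, while the source data itself is left untouched.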

The key question here is whether you can rely on the data warehouse as a source for your AI model. The rise of AI and generative AI (GenAI) has led to a diminished reliance on data warehouses, as models often need data in its raw, unprocessed form to make accurate predictions and discoveries.

Who Pays for Data Cleaning?

One of the most significant challenges in data management is understanding who bears the cost of data cleaning. The team that uses the data is not always the team that pays to clean it. In traditional use cases, a line of business (LOB) would determine whether data quality is sufficient for their needs. However, in an AI-driven world there is a new intermediary, the data scientist or developer, who often sits at the center of the decision-making process when it comes to data quality.

For instance, in a marketing email campaign, the LOB is directly involved in evaluating the data’s quality. For a sales territory analysis, however, the chief revenue officer (CRO) or the data scientists are more likely to decide what constitutes acceptable data quality. Data scientists may not always grasp the full impact of quality issues on the data’s usability for purposes beyond ML/AI, since they rarely experience the consequences of incomplete or inaccurate data directly.
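
As a hedged illustration of how “good enough” can differ by consumer, the sketch below applies the same completeness metric against two different thresholds; the records, field names, and threshold values are invented for the example.

```python
# Hypothetical customer records with gaps in different fields.
records = [
    {"email": "a@example.com", "region": "EMEA"},
    {"email": None,            "region": "EMEA"},
    {"email": "c@example.com", "region": None},
    {"email": "d@example.com", "region": "APAC"},
]

def completeness(rows, field):
    """Share of rows where the given field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

# Illustrative thresholds: the email campaign needs near-complete addresses,
# while the territory analysis tolerates more gaps in region data.
thresholds = {
    "email_campaign":     {"field": "email",  "min_completeness": 0.95},
    "territory_analysis": {"field": "region", "min_completeness": 0.70},
}

for use_case, rule in thresholds.items():
    score = completeness(records, rule["field"])
    status = "good enough" if score >= rule["min_completeness"] else "needs cleansing"
    print(f"{use_case}: {rule['field']} completeness {score:.0%} -> {status}")
```

The same dataset passes for the territory analysis and fails for the email campaign, which is exactly the kind of judgment call that shifts when data scientists rather than the LOB set the bar.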

AI and the Element of Discovery

AI’s role in automating data processes has already proven invaluable. However, it also introduces the potential for discovery. AI might uncover correlations that were previously overlooked, but these insights can only emerge if the data hasn’t been excessively cleaned beforehand. For example, small shifts in data, like divorce statistics or the transition from landlines to cell phones, might go unnoticed until systems are updated to account for these changes. AI and ML can help spot these trends and offer valuable insights—but only if the data is allowed to evolve and not prematurely scrubbed of its nuances.
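
As a hedged illustration of the landline-to-cell-phone example, the sketch below compares category shares across two periods; the data is invented, and the point is only that this kind of shift stays visible when unexpected values are not scrubbed away up front.

```python
from collections import Counter

# Hypothetical contact records from two periods. If "unexpected" values such as
# "mobile" had been cleansed to fit an older landline-only schema, the shift
# below would never surface.
period_2010 = ["landline"] * 80 + ["mobile"] * 20
period_2020 = ["landline"] * 25 + ["mobile"] * 75

def shares(values):
    """Fraction of records in each category."""
    counts = Counter(values)
    total = len(values)
    return {k: v / total for k, v in counts.items()}

before, after = shares(period_2010), shares(period_2020)
for category in sorted(set(before) | set(after)):
    delta = after.get(category, 0) - before.get(category, 0)
    print(f"{category}: {before.get(category, 0):.0%} -> {after.get(category, 0):.0%} ({delta:+.0%})")
```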

The Governance Dilemma

The evolution of data governance becomes increasingly complex as organizations adopt AI and machine learning. Technologies like Hadoop highlighted some of the risks associated with direct pipelines, such as losing data lineage or creating copies that introduce potential privacy concerns. These risks are magnified with large language models (LLMs), where there is no human gatekeeper overseeing the quality of data. Poor quality data can lead to misleading outputs, with no clear way to detect or correct these issues.
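
As a hedged sketch of the lineage risk described above, the example below records minimal provenance metadata whenever a dataset copy is made. The pattern, structure, and field names are assumptions for illustration, not a Pentaho, Hadoop, or LLM-platform API.

```python
import datetime
import json
import shutil
from pathlib import Path

def copy_with_lineage(source: Path, dest: Path, purpose: str) -> None:
    """Copy a dataset and write a sidecar lineage record so the copy's origin
    is not lost. Illustrative pattern only, not a specific product feature."""
    shutil.copy2(source, dest)
    lineage = {
        "source": str(source),
        "destination": str(dest),
        "purpose": purpose,
        "copied_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    Path(str(dest) + ".lineage.json").write_text(json.dumps(lineage, indent=2))

# Example: a data scientist pulls a copy into a sandbox for an LLM experiment.
# copy_with_lineage(Path("warehouse/customers.parquet"),
#                   Path("sandbox/customers_llm.parquet"),
#                   purpose="LLM fine-tuning experiment")
```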

Striking the Right Balance

Data quality is now clearly recognized as a key component in providing AI with data that can be confidently used to create insights that drive decisions, whether by a human or by a downstream application or system. Getting the data quality balance right, so that the data models use is accurate and trustworthy while still leaving room for exploration and deeper insights, will only become more important as companies rush to adopt agentic AI. The Air Canada customer experience snafu is a clear public example of why strong data quality parameters are vital to democratizing AI and to having both organizations and their customers trust and adopt AI experiences as authentic and valuable.