Data Quality Series Part 1: Discover how strong data quality fundamentals drive AI and GenAI success by ensuring accuracy, completeness, and consistency through end-to-end data management.
Per the Oxford English Dictionary, quality is defined as “the standard of something as measured against other things of a similar kind; the degree of excellence of something.”
Data quality is both a quantitative and a qualitative measure of data’s excellence. Together, the two provide real insight into the value of data. Quantitative measures, typically driven by statistical insights, are easier to calculate, can be interpreted readily, and provide a level of clarity on the suitability of data.
Qualitative measures, when applied to data or information, are typically subjective and open to interpretation. I like to think of qualitative as ‘in context of’ or ‘in reference to’ when applied to data quality.
When breaking down data quality, the most common framework is quality dimensions. Quality dimensions mix quantitative and qualitative evaluation models that can be measured in isolation but are most useful and powerful when they are brought together. Consider completeness, uniqueness, and consistency as a starting point for quantitative dimensions.
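To make the quantitative dimensions concrete, here is a minimal sketch of how completeness and uniqueness could be scored on a small customer table. The column names and the pandas-based approach are illustrative assumptions, not a prescribed implementation:

```python
import pandas as pd

# Hypothetical customer records; columns and values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 103, 105],
    "email": ["a@example.com", None, "c@example.com", "c@example.com", "e@example.com"],
    "phone": ["212-555-0143", "555-0199", None, "212-555-0143", "+1 (415) 555-0101"],
})

# Completeness: the share of non-null values in each column.
completeness = customers.notna().mean()

# Uniqueness: the share of rows that are not exact duplicates of an earlier row.
uniqueness = 1 - customers.duplicated().mean()

print(completeness)                     # per-column completeness scores
print(f"uniqueness: {uniqueness:.2f}")  # table-level uniqueness score
```

Both scores are computed purely from the data itself; nothing here says whether a present, unique value is actually right. That gap is exactly what the qualitative dimensions below address.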
All of these lack external references, so by themselves they do not inform the appropriateness of data for a given use. This is where additional qualitative insights are needed, including accuracy, timeliness, and correctness (or validity). Timeliness provides details on data’s age. Correctness ensures that, for instance, a phone number provided for an individual in the US is indeed a valid US phone number with 10 digits. Continuing with this example, accuracy determines whether the phone number given for an individual is their actual phone number. These are crucial elements that inform the design and application of the policies that feed data quality scores.
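Continuing the phone number example, a correctness (validity) rule can be expressed as a simple format check. The function below is a hypothetical sketch of such a rule; note that accuracy would still require verifying the number against an external reference:

```python
import re

def is_valid_us_phone(raw) -> bool:
    """Correctness/validity only: a US number must contain exactly 10 digits,
    optionally preceded by the +1 country code. Checks form, not ownership."""
    if not raw:
        return False
    digits = re.sub(r"\D", "", raw)       # strip punctuation, spaces, and '+'
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]               # drop the country code
    return len(digits) == 10

print(is_valid_us_phone("+1 (415) 555-0101"))  # True: well-formed
print(is_valid_us_phone("555-0199"))           # False: too few digits
```

A number can pass this check and still belong to someone else; that is the accuracy question, and it is why validity rules alone cannot produce a complete quality score.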
It becomes very clear very quickly that without context, data quality efforts will fall far short of what organizations need, not only for core operations but also for AI and GenAI. In many cases, this context lives in unstructured data, which is crucial for AI and GenAI and which, as we know, most organizations struggle to organize, classify, analyze, and activate.
The potential gaps in this one small example are writ large when you consider a mid-sized or large enterprise with hundreds of thousands of customer records. This is why hospitals, banks, and commercial enterprises of any size struggle with data quality when they lack an end-to-end approach that leverages automation to apply policies, lineage, traceability, and quality across the data estate.
At Pentaho, we consider and account for all of the above in our platform. It’s why we’re so focused on the relationships between data, the importance of accurately classifying data at the source, and the importance of carrying metadata properties throughout the lifespan of data.
In the next blog post, we’ll explore how these fundamentals shape the considerations teams must account for to build a strong, scalable data quality strategy, how data quality is shifting in an AI world, and what it means to get ‘data fit’ for AI.