
How to Assess Data Quality Readiness for Modern Data Pipelines

By Manu Bansal

For growth-minded organizations, the ability to respond effectively to market conditions, competitive pressures, and customer expectations depends on one key asset: data. But massive troves of data alone aren't enough. The key to being truly data-driven is access to accurate, complete, and reliable data. In fact, Gartner recently found that organizations attribute an average of $15 million per year in losses to poor data quality – a figure that can cripple most companies. Unfortunately, ensuring and maintaining data quality can be incredibly difficult, and an organization's data architecture choices often make it harder. Legacy architectures frequently cannot scale to support ever-increasing volumes of real-time data, and they create data silos that slow the democratization of data an entire organization needs in order to benefit from it.

Now more than ever, it's critical that business decisions be driven by the highest-quality, most reliable data. But what is the best approach to ensuring this? Do you need to improve your data quality implementation? Where should you start, and which quality metrics should you focus on? This two-part blog series provides a step-by-step guide to help you decide where your organization stands in terms of data quality readiness.

Understanding the Core Symptoms of Bad Data

It is important to understand that not all data is created equal. As much as 85% of the data an organization collects is acquired through routine computer network operations (e.g., log files) and is never used to derive insights or support decision-making.

It's the remaining 12–15% of data – the business-critical data that is actively used for informed decision-making or that can be monetized – that matters most to many organizations. It is for this data that quality and reliability are paramount. Here are some common business scenarios that are symptomatic of poor data quality:

  • Data errors that trigger compliance penalties
  • Inaccurate risk assessments that lead to poor decisions (e.g., approving bad credit)
  • Misbehaving fraud detection models that lead to excessive risk or denial of service
  • Executives complaining about incorrect BI dashboards and reports
  • Revenue being lost to pricing errors caused by bad data
  • Your data partners complaining about being fed bad data
  • Your data teams spending too much time fixing broken data

Do any of these sound familiar?

If you are running into issues like these, it is highly likely that you have gaps in data quality coverage and readiness. Now let’s look at how to evaluate your data quality.

Considerations for Evaluating Your Data Quality Readiness

First, it's important to characterize the data volumes your organization actively works with to derive insights. The higher the data volumes, the more opportunities there are for data quality issues to arise. Conversely, the smaller the data volumes, the greater the immediate business impact of any single data quality issue: with fewer variables in play, each individual defect has more influence on the resulting insights. Whether you need basic checks across lots of data or deep checks on a small set of data elements, volume significantly shapes your approach to data quality.

Second, it's helpful to understand the behavior of your data pipelines: where data is sourced, how it is transformed and optimized, how often it is updated, and whether it arrives in a state that can be analyzed and used to develop reliable business insights. This tells you where data is most likely to show defects.

Finally, it's important to understand how these elements of your data landscape work together. Knowing what to watch for, and which data quality indicators (DQIs) to monitor, helps ensure that data quality is maintained and that your analytics, decision-support dashboards, and reporting front ends provide accurate, actionable information. A simple sketch of such indicators is shown below.
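To make this concrete, here is a minimal sketch of how a few common DQIs – row count, null rate, duplicate rate, and freshness – might be computed for a single batch of records. It assumes pandas and a hypothetical table with an `updated_at` timestamp column; the names, metrics, and sample data are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of computing a few data quality indicators (DQIs) for one
# batch of records. Column names and sample data are hypothetical.
import pandas as pd


def compute_dqis(df: pd.DataFrame, timestamp_col: str = "updated_at") -> dict:
    """Return a handful of simple data quality indicators for one batch."""
    now = pd.Timestamp.now(tz="UTC")
    freshness = None
    if timestamp_col in df.columns and len(df):
        # Minutes since the most recent record was updated (timeliness).
        freshness = (now - df[timestamp_col].max()).total_seconds() / 60
    return {
        "row_count": len(df),                             # volume
        "null_rate": float(df.isna().mean().mean()),      # completeness
        "duplicate_rate": float(df.duplicated().mean()),  # uniqueness
        "freshness_minutes": freshness,                    # timeliness
    }


if __name__ == "__main__":
    # Tiny illustrative batch.
    batch = pd.DataFrame(
        {
            "order_id": [1, 2, 3, 3],
            "amount": [10.0, None, 25.5, 25.5],
            "updated_at": pd.to_datetime(
                ["2024-01-01T10:00", "2024-01-01T10:05",
                 "2024-01-01T10:10", "2024-01-01T10:10"],
                utc=True,
            ),
        }
    )
    print(compute_dqis(batch))
```

Indicators like these only become useful once they are tracked over time and compared against expectations, which is where the service-level checks described next come in.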

Once you have this broader picture of your environment, there are minimum service levels you should check for as you operate your data pipelines, each of which contributes to higher data quality.

These include:

  • Updating on time according to the expected update cadence (e.g., hourly, daily)
  • Getting the expected amount of new data on every update for each data entity
  • Ensuring new values are populated with data and are not coming in empty or missing
  • Having confidence that new values being added to an entity conform to the expected schema or data type
  • Confirming that new values fit an expected data distribution and are not invalid
  • Certifying that new values in an entity are consistent with a reference point in the data pipeline (such as at the point of ingestion)

This is not an exhaustive list of data quality checks, but it lays out the most common assertions one can make on a continuously operating data pipeline. These are fundamental checks, and a failure of any one of them should raise an alert. The sketch below shows how such checks might look in code.
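As an illustration only, the following sketch expresses the service levels above as simple batch checks in Python. It assumes pandas, a hypothetical orders-style table, and made-up thresholds and column names; a real deployment would use whatever data quality tooling, cadences, and thresholds fit your own pipeline.

```python
# Minimal sketch of the basic service-level checks listed above, run against
# one pipeline update. All expectations below are illustrative assumptions.
from datetime import timedelta

import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64"}  # assumed types
EXPECTED_CADENCE = timedelta(hours=1)   # e.g., hourly updates
MIN_NEW_ROWS = 100                      # expected amount of new data
MAX_NULL_RATE = 0.01                    # tolerated share of missing values


def check_batch(
    df: pd.DataFrame,
    last_update: pd.Timestamp,          # tz-aware timestamp of this update
    ingestion_row_count: int | None = None,
) -> list[str]:
    """Run basic checks on one update; return alert messages (empty = pass)."""
    alerts = []

    # 1. Freshness: did the update arrive within the expected cadence?
    if pd.Timestamp.now(tz="UTC") - last_update > EXPECTED_CADENCE:
        alerts.append("late update: data is older than the expected cadence")

    # 2. Volume: did we receive the expected amount of new data?
    if len(df) < MIN_NEW_ROWS:
        alerts.append(f"low volume: {len(df)} rows, expected >= {MIN_NEW_ROWS}")

    # 3. Completeness: are new values populated rather than empty or missing?
    null_rate = float(df.isna().mean().mean()) if len(df) else 1.0
    if null_rate > MAX_NULL_RATE:
        alerts.append(f"null rate {null_rate:.2%} exceeds {MAX_NULL_RATE:.2%}")

    # 4. Schema conformance: do new values match the expected types?
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns or str(df[col].dtype) != dtype:
            alerts.append(f"schema drift on column '{col}'")

    # 5. Distribution: do new values fall within a plausible range?
    if "amount" in df.columns and ((df["amount"] < 0) | (df["amount"] > 1e6)).any():
        alerts.append("'amount' values fall outside the expected distribution")

    # 6. Consistency: does the batch agree with a reference point earlier in
    #    the pipeline, such as the row count recorded at ingestion?
    if ingestion_row_count is not None and len(df) != ingestion_row_count:
        alerts.append("row count differs from the ingestion-time reference")

    return alerts
```

In practice, checks like these would typically run as part of a scheduled job or a data quality tool after each pipeline update, with any alerts routed to the owning data team.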

If you're running into issues with your data quality coverage, know that you are not alone – many organizations have yet to properly address their data quality posture. In the second part of this series, we will look at how to quantify your data quality health.

Originally published on the Lightup blog.