
Testing and Monitoring Data Pipelines: Part Two

By Max Lukichev

In part one of this article, we discussed how data testing can specifically test a data object (e.g., table, column, metadata) at one particular point in the data pipeline. While this technique is practical for in-database verification, since the tests are embedded directly in the data modeling effort, it becomes tedious and time-consuming when entire end-to-end pipelines have to be examined.

Data monitoring, on the other hand, helps build a holistic picture of your pipelines and their health. By tracking various metrics in multiple components in a data pipeline over time, data engineers can interpret anomalies in relation to the whole data ecosystem.

Implementing Data Monitoring

To understand why and how to implement data monitoring, you must understand how it lives in perfect harmony with data testing.

To write data tests, you need to know in advance the scenarios you want to test for. Large organizations might have hundreds or thousands of tests in place, but they’ll never be able to catch data issues they didn’t know could happen, often due to extreme complexity and unknown unknowns. Data monitoring allows them to be notified about oddities and find the root cause quickly.

Data changes. Downstream tests are rarely designed to catch data drift, or changes in the data input. Additionally, businesses evolve, and their data products evolve with them. Implemented changes often break the existing logic downstream in ways the available tests don’t account for. Proper monitoring tools can help identify these problems fairly quickly, both in testing and production environments.

An organization’s data pipelines might have been in place for years. They could be from an era when internal data maturity was low and testing was not a priority. With such technical debt, debugging pipelines can take an eternity. Monitoring tools can guide organizations in setting up proper tests.

Data Monitoring Approaches

Data monitoring’s main task is to continuously produce metrics about existing data sets, whether they’re intermediate or production tables. To do this, it processes data objects and their metadata on a recurring basis. For example, it might count the rows in a table on every run. If that row count suddenly spikes, the tool should alert the data team that manages the table.
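To make this concrete, here is a minimal sketch of such a metadata-level monitor in Python. It assumes a SQLite connection purely for illustration, and the 50% change threshold is an arbitrary assumption rather than a recommendation:

```python
# Minimal sketch of a metadata-level monitor: record a table's row count on each
# run and alert when it jumps sharply compared to the previous observation.
import sqlite3
from datetime import datetime, timezone

ALERT_THRESHOLD = 0.5  # assumed: alert on a >50% change between consecutive runs

def collect_row_count(conn: sqlite3.Connection, table: str) -> int:
    """Produce one metric observation: the current row count of a trusted table name."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count

def check_row_count(history: list[int], current: int) -> str | None:
    """Compare the newest observation with the previous one; return an alert message if it deviates."""
    if not history:
        return None  # first run: nothing to compare against yet
    previous = history[-1]
    if previous == 0:
        return f"row count went from 0 to {current}"
    change = abs(current - previous) / previous
    if change > ALERT_THRESHOLD:
        return (f"{datetime.now(timezone.utc):%Y-%m-%d %H:%M} UTC: row count changed "
                f"by {change:.0%} ({previous} -> {current})")
    return None
```

In practice, a monitoring tool runs checks like this on a schedule, persists the metric history, and routes any alert to the team that owns the table.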

Since many data pipelines span multiple data storage and processing technologies (e.g., a data lake and a data warehouse), data monitoring should encompass all of them. As with data testing, end-to-end monitoring is extremely valuable for root cause analysis of data issues.

On top of monitoring tables and their metadata, it’s possible to monitor the data values themselves. This way, organizations establish oversight not only of their pipelines and automated processing but also of the data that moves through them. Suppose you’re alerted that today’s data lake partition contains far more rows than it did last week (information gathered by monitoring the metadata). By also monitoring the data itself, you can spot anomalies in the values (e.g., new regions appearing) and immediately know that an upstream filter or transformation did not work as intended.
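As a hedged illustration of value-level monitoring, the sketch below compares the categorical values observed in a hypothetical daily partition against an assumed baseline of expected regions and flags anything new; the column name, baseline set, and sample rows are all illustrative assumptions:

```python
# Minimal sketch of value-level monitoring: flag categorical values that the
# baseline has never seen. EXPECTED_REGIONS and the sample rows are assumed
# for illustration, not taken from any specific pipeline.
EXPECTED_REGIONS = {"EMEA", "NA", "APAC"}

def find_unexpected_values(rows: list[dict], column: str, expected: set[str]) -> set[str]:
    """Return values observed in `column` that are missing from the expected baseline."""
    observed = {row[column] for row in rows if row.get(column) is not None}
    return observed - expected

# Example: an upstream filter stopped excluding a region, so a new value appears.
todays_partition = [
    {"order_id": 1, "region": "EMEA"},
    {"order_id": 2, "region": "LATAM"},  # value the baseline has never seen
]
new_regions = find_unexpected_values(todays_partition, "region", EXPECTED_REGIONS)
if new_regions:
    print(f"Unexpected region values in today's partition: {sorted(new_regions)}")
```

A real observability tool would derive the baseline automatically from historical partitions rather than hard-coding it.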

Data Monitoring Considerations

Whether you implement data monitoring yourself or choose a monitoring tool, there are several factors to consider.

No-Code Implementation and Configuration

Unlike with data testing, the trade-offs around how and where to implement data monitoring are less pronounced. That’s because setting up data monitoring is largely a turnkey operation. Today’s data monitoring tools, often marketed as data observability tools, have out-of-the-box integrations with various databases, data lakes, and data warehouses. This way, you don’t have to figure out how to read and interact with each system’s dialect or implement testing frameworks across each step of your pipeline.

However, just because the trade-offs are less clear-cut doesn’t mean they aren’t there. Like with data testing, the same principle holds: end-to-end monitoring trumps partial monitoring.

Automated Detection

Because data monitoring is open-ended, neither you nor your monitoring tool knows exactly what to look for. That’s why data monitoring tools offer visualization capabilities: instead of staring at raw numbers, you can explore the collected data quality metrics over time.

However, exploring metrics by hand is a time-consuming, manual process. For this reason, many monitoring tools offer ML-driven anomaly detection: when a metric deviates from its normal pattern, the tool automatically surfaces the deviation and sends an alert to a channel of your choice.
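Commercial observability tools use considerably more sophisticated models, but the core idea can be sketched with simple rolling statistics; the window size, z-score threshold, and sample row counts below are assumptions for illustration only:

```python
# Minimal sketch of anomaly detection on a collected metric series (e.g., daily
# row counts): flag the latest observation if it deviates strongly from the
# recent history of the metric.
from statistics import mean, stdev

WINDOW = 14        # assumed: learn the "normal" pattern from the last 14 observations
Z_THRESHOLD = 3.0  # assumed: flag observations more than 3 standard deviations away

def is_anomalous(history: list[float], latest: float) -> bool:
    """Return True when `latest` falls far outside the metric's recent distribution."""
    window = history[-WINDOW:]
    if len(window) < 3:
        return False  # not enough history yet to establish a baseline
    mu, sigma = mean(window), stdev(window)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > Z_THRESHOLD

# Example: a sudden spike in daily row counts would trigger an alert.
daily_row_counts = [10_200, 10_450, 10_310, 10_390, 10_280, 10_500, 10_420]
if is_anomalous(daily_row_counts, 31_000):
    print("ALERT: today's row count deviates from its normal pattern")
```

In a real deployment, the print statement would be replaced by a notification to the team’s channel of choice.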

Scale as Data Grows in Complexity and Volume

Data is always changing. Whereas data testing adapts to new data shapes and unknown unknowns the hard way, often only after an unexpected data downtime, data monitoring observes data over time, learning and predicting its expected values. This allows it to detect unwanted values and changes early, before they reach downstream business applications.

Conclusion

This article elaborated on the need for thorough data testing and monitoring, both of which help prevent data issues and minimize the time spent on debugging and downstream recovery. Implementing data testing in an end-to-end manner can be a daunting task. Luckily, data monitoring can detect the issues your tests didn’t account for.

A data observability tool that provides a holistic overview of your data’s health and can be embedded across the entire pipeline will help you monitor data in structured, semi-structured, or even streaming form, from ingestion to downstream data lakehouses and data warehouses. Consider a no-code platform for a simple, fast, and automatic way to monitor data drift and analyze the root cause of data quality issues, without burdening your data engineering resources with code-heavy data testing frameworks.