Last week, a payment processing error disrupted the regular operations of several US banks, as reported by the Federal Reserve. Specifically, it delayed paychecks and consumer payments. The issue stemmed from a processing error within The Clearing House, which is responsible for managing the Automated Clearing House (ACH) system. This system facilitates the electronic transfer of funds between banks, handling a wide array of transactions, including direct deposit paychecks and customer payments for mortgages and utility bills.
The Clearing House was quick to explain that this was not a cybersecurity breach, but rather, the result of “manual error.” The explanation is a bit of a cop-out, as if the average media consumer will be placated into thinking, “Thank goodness it wasn’t a security hack, it was just manual error.” It sounds like someone simply forgot to hit the submit button, but when discovered, all was fine and well.
But all was not fine and well. People missed payment deadlines. Expected paychecks weren’t deposited in bank accounts for people who desperately needed them. And let’s be very clear, this wasn’t just a simple oversight - it was bad data management.
Looking more closely, the root cause appears to have been data errors, likely related to schema drift and problems in the data reconciliation process. These are major concerns for data teams, and when they occur, failures like this one follow and make headline news. Thankfully, they don't have to happen if data teams use data observability to establish and enforce rigorous data reliability standards. With those standards in place, such issues are caught before they can set off a domino effect of data quality problems and bad outcomes.
Defining Data Quality
Better data quality is essential for informed decision-making, but it raises the question: is superior data quality alone sufficient? If your data adheres to correct formats and maintains consistent coding, can you confidently assume that it reconciles with its source? Is it reasonable to expect that you can balance this month's total sales back to the source systems without further scrutiny? Similarly, if you anticipate a specific data distribution and encounter deviations, wouldn't you want to be alerted to these discrepancies? Consider a scenario where you typically receive data from all 50 states daily but find that only 45 states are represented today. Wouldn't you be compelled to investigate, even if the data has passed all your quality checks?
In gauging the reliability of your data, a multifaceted evaluation is imperative. Consider the following dimensions:
- Does your data align with your established quality standards?
- Is your data easily reconcilable with its sources?
- Have there been any noticeable deviations or drift in your data?
Much like a three-legged stool, data reliability requires the stability of each component. If you remove one leg, the balance becomes somewhat precarious. However, with all three legs securely in place, you can comfortably sit and make data-driven decisions with unwavering confidence.
Start with Data Quality
The topic of data quality is far from new, with many organizations dedicating substantial efforts to enhance their data quality over the years. It's likely that your organization has already established a comprehensive set of data quality rules, meticulously defined and maintained by a center of excellence, covering a wide range of scenarios from fundamental to highly intricate ones.
This leads us to the initial question: Why do data quality issues persist? The answer is relatively straightforward: Coverage. The most common problem is the failure to apply data quality rules across a sufficiently large portion of your data. Often, it's the most fundamental checks that are inadvertently overlooked: whether there are null values where they shouldn't be, whether data adheres to the correct format, or whether values are genuinely unique.
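The three foundational checks named above (nulls, format, uniqueness) can be sketched in a few lines. This is a minimal illustration, not a production rule engine: the field names (`payment_id`, `routing_number`, `amount`) and the nine-digit format rule are assumptions made up for the example.

```python
import re

# Hypothetical format rule for this example: a routing number is exactly nine digits.
ROUTING_RE = re.compile(r"\d{9}")

def basic_checks(rows, key="payment_id",
                 required=("payment_id", "routing_number", "amount")):
    """Run null, format, and uniqueness checks; return (row_index, problem) tuples."""
    problems = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Null check: required fields must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append((i, f"null:{field}"))
        # Format check: only applied when a value is actually present.
        routing = row.get("routing_number")
        if routing not in (None, "") and not ROUTING_RE.fullmatch(str(routing)):
            problems.append((i, "format:routing_number"))
        # Uniqueness check: the key must not repeat across rows.
        k = row.get(key)
        if k not in (None, ""):
            if k in seen_keys:
                problems.append((i, f"duplicate:{key}"))
            seen_keys.add(k)
    return problems

sample = [
    {"payment_id": "p1", "routing_number": "021000021", "amount": 100.0},
    {"payment_id": "p2", "routing_number": "2100002", "amount": 55.5},   # bad format
    {"payment_id": "p2", "routing_number": "021000021", "amount": None}, # null + duplicate
]
problems = basic_checks(sample)
print(problems)
```

Because the checks are generic over field names, the same function can be pointed at many feeds, which is one simple way to widen coverage without writing a new rule for every table.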
Without thorough coverage of these foundational aspects, data quality issues will persist. What's needed is a more effective and straightforward way to implement these data quality checks, treating them as table stakes. Why should you have to guess which checks to apply? Can't your data quality system proactively guide you, automating 80% of the work with just 20% of the effort?
This brings us to the second question: What about the present moment? You might be confident about data quality yesterday, but what about today or tomorrow? Shouldn't these autonomous checks be seamlessly integrated into the data pipeline that provides you with the information you need? If you can trust that you'll receive alerts when something goes awry, you can be assured that your data remains accurate and dependable.
The Importance of Effective Data Reconciliation
Data reconciliation goes way beyond merely tracking the number of rows fed into your pipeline and processed. It's about the assurance that business-critical data can be linked back to its source, that aggregate sales figures align with transaction-level details, that essential data fields remain uncorrupted, and that filters are accurately applied. This assurance extends beyond the present feed to encompass historical and future data feeds.
While many data quality platforms delegate this responsibility to the tooling or integration platforms, perceiving it as primarily operational, the intrinsic value of independent validation should not be underestimated. Just as you wouldn't expect an accountant responsible for posting credits and debits to audit the books, creating an independent reconciliation process builds trust in your data and its processing.
Incorporating independent reconciliation into your overall pipeline is essential, and it should be seamlessly triggered by your pipeline, serving as a pivotal decision point in the pipeline's workflow.
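As a rough sketch of what an independent reconciliation step might look like, the function below compares row counts and totals per group between a source extract and the loaded target, and the result is used as a go/no-go decision. The column names (`region`, `amount`) and the zero tolerance are assumptions for illustration only.

```python
from decimal import Decimal

def reconcile(source_rows, target_rows, key="region", amount="amount",
              tolerance=Decimal("0.00")):
    """Compare per-key row counts and totals; return the keys that do not match."""
    def summarize(rows):
        summary = {}
        for r in rows:
            count, total = summary.get(r[key], (0, Decimal("0")))
            # Decimal avoids float rounding noise when summing money.
            summary[r[key]] = (count + 1, total + Decimal(str(r[amount])))
        return summary

    src, tgt = summarize(source_rows), summarize(target_rows)
    empty = (0, Decimal("0"))
    mismatches = {}
    for k in sorted(set(src) | set(tgt)):
        s, t = src.get(k, empty), tgt.get(k, empty)
        if s[0] != t[0] or abs(s[1] - t[1]) > tolerance:
            mismatches[k] = {"source": s, "target": t}
    return mismatches

source = [
    {"region": "east", "amount": "10.50"},
    {"region": "east", "amount": "4.50"},
    {"region": "west", "amount": "7.00"},
]
target = source[:2]  # simulate a load that silently dropped the "west" row

mismatches = reconcile(source, target)
proceed = not mismatches  # the pipeline's decision point
print(proceed, mismatches)
```

Keeping this logic outside the loading tool itself is the point: the step that moves the data is not the step that audits it.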
Why Schema Drift and Data Drift Must be Addressed
Drift comes in two distinct flavors: schema drift and data drift. Either one can pose real challenges or turn out to be inconsequential, but both, when undetected, can undermine the third leg of data reliability.
Schema drift can be broadly defined as alterations in the structure of your data. These changes may involve the addition of new columns, the removal of existing ones, or modifications to the precision and format of fields. In the past, when you had control over all your data sources, such changes were typically well-documented events. However, in today's landscape of external data sources, schema alterations occur frequently. Depending on your data processing and analysis setup, this may or may not be problematic. Nevertheless, discovering a schema change before it ripples through your environment can save considerable trouble. Consider a scenario where a vendor introduces an additional column in a daily CSV file, causing misalignment during the file load into your cloud data warehouse. While your data quality checks should eventually detect this, the subsequent cleanup can be arduous. It would undoubtedly be more advantageous to receive an alert about the schema change before it impacts your operations.
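The vendor CSV scenario can be caught with a header comparison before the file is ever loaded. The sketch below assumes a hypothetical expected column list (`claim_id`, `state`, `amount`); in practice the expected schema would come from wherever your team records it.

```python
import csv
import io

# Hypothetical expected schema for the daily vendor file.
EXPECTED_COLUMNS = ["claim_id", "state", "amount"]

def detect_schema_drift(csv_text, expected=EXPECTED_COLUMNS):
    """Compare the file's header row with the expected columns."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return {
        "added": [c for c in header if c not in expected],
        "removed": [c for c in expected if c not in header],
        # Same columns, different order: positional loads would misalign.
        "reordered": header != expected and sorted(header) == sorted(expected),
    }

vendor_file = "claim_id,state,adjuster,amount\n1,NY,Kim,100\n"
drift = detect_schema_drift(vendor_file)
has_drift = bool(drift["added"] or drift["removed"] or drift["reordered"])
print(has_drift, drift)
```

This only inspects column names and order. Changes to precision or field format, which the text also counts as schema drift, would need type-level checks on top of it.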
On the other hand, data drift is essentially a shift in the "shape" or distribution of your data. Interestingly, this data may not necessarily violate your primary data quality rules and could reconcile seamlessly across various tests, yet it remains a cause for concern. Let's explore a few examples:
- An insurance company evaluates claims reserves based on a monthly claims feed typically received from all 50 states but, this month, only 45 states are represented. Although the coding is accurate, and all data was successfully processed and reconciled, the evaluation of claims reserves may be compromised.
- A customer service department relies on an AI model for determining the Next Best Action (NBA) in critical customer interactions. The model, initially trained on specific demographic features, experiences a drop in performance as the median age of callers shifts over time.
- An online retailer, renowned for swift order fulfillment, closely monitors order distribution with multiple BI platforms. However, one day, there's a shift in orders for a popular product in a specific region, which goes unnoticed, resulting in a drop in fulfillment efficiency.
In each case, there may be valid reasons for the data distribution shift. Perhaps statistical anomalies caused the lack of claims in certain states, or the AI model adapts to the age shift, or the change in order distribution ultimately doesn't significantly impact fulfillment. Nevertheless, early awareness of such data drift can prove invaluable. Whether it's acquiring the complete claims data before resetting reserves, retraining the NBA model with current data, or promptly addressing the shift in order distribution, proactive action is always preferable.
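For categorical fields such as state or region, one simple way to surface this kind of drift is to compare today's categories and their shares against a baseline. The following is a minimal sketch; the 10-point share threshold and the sample values are arbitrary assumptions, and a real system would likely use a more principled statistical test.

```python
from collections import Counter

def category_drift(baseline, current, max_share_change=0.10):
    """Report categories missing from the current feed and large shifts in share."""
    base_counts, cur_counts = Counter(baseline), Counter(current)
    base_n, cur_n = len(baseline), len(current)
    report = {
        "missing": sorted(set(base_counts) - set(cur_counts)),
        "shifted": {},
    }
    for cat, base_count in base_counts.items():
        if cat in cur_counts:
            change = cur_counts[cat] / cur_n - base_count / base_n
            if abs(change) > max_share_change:
                report["shifted"][cat] = round(change, 3)
    return report

# Hypothetical feeds: the baseline covers three states, today's feed only two.
baseline_states = ["NY"] * 40 + ["CA"] * 40 + ["TX"] * 20
todays_states = ["NY"] * 60 + ["CA"] * 40

report = category_drift(baseline_states, todays_states)
print(report)
```

Note that every row in today's feed could pass quality rules and reconcile cleanly, and this check would still flag the missing state.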
Ultimately, monitoring for data drift is indispensable, even if you're uncertain about the critical features to monitor. An ideal data observability solution would autonomously scan all your data for anomalies, serving as an early warning system, much like a "canary in a coal mine," alerting you to potential issues before they become significant concerns.
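To make the idea of scanning without pre-selected features concrete, here is a deliberately simple sketch that walks every numeric column and flags those whose median has moved by more than a chosen number of baseline standard deviations. The column names, sample values, and threshold are assumptions for illustration; real observability tools use far more sophisticated detection.

```python
import statistics

def scan_numeric_drift(baseline_rows, current_rows, threshold=1.0):
    """Flag numeric columns whose median moved by >= threshold baseline std devs."""
    alerts = {}
    numeric_cols = [
        col for col, value in baseline_rows[0].items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    ]
    for col in numeric_cols:
        base = [r[col] for r in baseline_rows]
        cur = [r[col] for r in current_rows]
        spread = statistics.pstdev(base)
        if spread == 0:
            continue  # constant baseline column: this simple score is undefined
        shift = (statistics.median(cur) - statistics.median(base)) / spread
        if abs(shift) >= threshold:
            alerts[col] = round(shift, 2)
    return alerts

# Hypothetical caller data: ages drift upward, wait times do not.
baseline_calls = [{"age": a, "wait_minutes": w}
                  for a, w in zip([30, 35, 40, 45, 50], [5, 6, 5, 7, 6])]
current_calls = [{"age": a, "wait_minutes": w}
                 for a, w in zip([55, 60, 62, 58, 65], [6, 5, 6, 5, 7])]

alerts = scan_numeric_drift(baseline_calls, current_calls)
print(alerts)
```

No column is named in advance, so a shift like the one in caller age surfaces even if nobody thought to monitor it.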