Best practices provide the foundation on which great data teams can optimize their platforms, processes, and operations. They are well-established in many mature product categories, and provide guardrails to development and engineering teams that enable them to innovate, move quickly, and adapt to changing product and market needs.
In emerging sectors such as data observability, best practices not only allow data teams to optimize their efforts but also provide a learning path for what to do and how to do it.
In this guide, we will outline some best practices for data reliability, which is an essential component of data observability. As data engineering teams ramp up their data reliability efforts, these best practices can show teams how to effectively scale their efforts in ways that don’t require significant investment in new resources.
Solving Increasingly Complex Data Supply Chain Issues
As analytics have become increasingly critical to an organization’s operations, more data than ever is being captured and fed into analytics data stores, which helps enterprises make decisions with greater accuracy.
This data comes from a variety of sources – internally from applications and repositories, and externally from service providers and independent data producers. For companies that produce data products, an even greater percentage of their data may come from external sources. And since the end product is the data itself, reliably bringing that data together at a high level of quality is critical. In essence, high-quality data can help the organization achieve competitive advantages and continuously deliver innovative, market-leading products. Poor-quality data delivers bad outcomes and bad products, and that can break the business.
The data pipelines that feed and transform data for consumption are increasingly complex. The pipelines can break at any point due to data errors, poor logic, or the necessary resources not being available to process the data.
The Role of Data Reliability
The data flowing through the pipelines that make up the data supply chain can generally be broken down into three zones:
- The data landing zone, where source data is fed,
- The transformation zone, where data is transformed into its final format, and
- The consumption zone, where data is in its final format and is accessed by users.
In the past, most organizations would only apply data quality tests in the final consumption zone due to resource and testing limitations. The role of modern data reliability is to check data in any of these three zones as well as to monitor the data pipelines that are moving and transforming the data.
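The idea of checking data in every zone rather than only at the end can be sketched in a few lines. This is a minimal illustration using plain Python dicts as stand-in records; the field names and checks are hypothetical, not from any specific platform.

```python
# Illustrative reliability checks for each of the three zones.

def check_landing(rows):
    """Landing zone: verify source rows arrived and have the required fields."""
    return len(rows) > 0 and all("id" in r and "amount" in r for r in rows)

def check_transformation(rows):
    """Transformation zone: verify derived fields were actually computed."""
    return all(r.get("amount_usd") is not None for r in rows)

def check_consumption(rows):
    """Consumption zone: verify the final data meets business expectations."""
    return all(r["amount_usd"] >= 0 for r in rows)

raw = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
transformed = [dict(r, amount_usd=r["amount"] * 1.1) for r in raw]

assert check_landing(raw)
assert check_transformation(transformed)
assert check_consumption(transformed)
```

Running checks at each stage, rather than only on the final output, is what makes early detection possible in the first place.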
Cost Implications of Poor Data
In software development, as well as other processes, there is the 1 x 10 x 100 rule which applies to the cost of fixing problems at different stages of the process. In essence, it says that for every $1 it costs to detect and fix a problem in development, it costs $10 to fix the problem when that problem is detected in the QA/staging phase, and $100 to detect and fix it once the software is in production.
The same rule can be applied for data pipelines and supply chains. For every $1 it costs to detect and fix a problem in the landing zone, it costs $10 to detect and fix a problem in the transformation zone, and $100 to detect and fix it in the consumption zone.
To effectively manage data and data pipelines, data incidents need to be detected as early as possible in the supply chain. This helps data team managers optimize resources, control costs, and produce the best possible data product.
Best Practices for Data Reliability
As with many other processes, both in the software world and other industries, utilizing best practices for data reliability allows data teams to operate effectively and efficiently. Following best practices helps teams produce valuable, consumable data and deliver according to service level agreements (SLAs) with the business.
Best practices also allow data teams to scale their data reliability efforts in these ways:
- Scaling up to increase the number of quality tests on a data asset.
- Scaling out to increase the number of data assets that are covered.
- Scaling the data incident management to quickly correct issues.
Let’s explore some areas of best practices for data reliability.
Data Reliability Across the Entire Supply Chain
We mentioned earlier how data supply chains have gotten increasingly complex. This complexity is manifested through things like:
- The increasing number of sources that are being fed.
- The sophistication of the logic used to transform the data.
- The amount of resources required to process the data.
We roughly grouped data into three zones – the landing zone, the transformation zone, and the consumption zone. Our first best practice is to apply data reliability checks across all three zones and over the data pipelines. This allows us to detect and remediate issues such as:
- Erroneous or low-quality data from sources in the landing zone.
- Poor quality data in the transformation and consumption zones due to faulty logic or pipeline breakdowns.
- Stale data in the consumption zone due to data pipeline failures.
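Staleness, the last issue in the list, is one of the simplest checks to express: compare the age of the latest successful load against the agreed freshness SLA. This is a hedged sketch; the six-hour SLA and timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at, sla=timedelta(hours=6), now=None):
    """Flag a data asset whose last load is older than the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return (now - last_loaded_at) > sla

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)    # 3h old: OK
stale = datetime(2023, 12, 31, 12, 0, tzinfo=timezone.utc)  # 24h old: breach

assert not is_stale(fresh, now=now)
assert is_stale(stale, now=now)
```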
Shift-left Data Reliability
Consider that data pipelines flow data from left to right from sources into the data landing zone, transformation zone, and consumption zone. Where data was once only checked in the consumption zone, today’s best practices call for data teams to “shift-left” their data reliability checks into the data landing zone.
The result of shift-left data reliability is earlier detection and fast correction of data incidents. It also keeps bad data from spreading further downstream where it might be consumed by users and could result in poor and misinformed decision-making.
The 1 x 10 x 100 rule applies here. Earlier detection means data incidents are corrected quickly and efficiently at the lowest possible cost (the $1). If data issues were to spread downstream they would impact more data assets becoming far more costly to correct (the $10 or $100).
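Shift-left in miniature: reject bad source rows at the landing zone so they never reach the transformation step, where they would be ten times more expensive to fix. The schema and field names here are illustrative assumptions.

```python
def validate_landing_row(row):
    """Landing-zone check: correct types and non-negative amounts."""
    return isinstance(row.get("order_id"), int) and row.get("amount", -1) >= 0

def land(rows):
    """Split incoming rows into accepted and quarantined sets."""
    good = [r for r in rows if validate_landing_row(r)]
    bad = [r for r in rows if not validate_landing_row(r)]
    return good, bad

good, bad = land([
    {"order_id": 1, "amount": 9.5},
    {"order_id": "oops", "amount": 3.0},  # wrong type, caught at the $1 stage
])
assert len(good) == 1 and len(bad) == 1
```

Quarantining the bad rows rather than silently dropping them keeps them available for root cause analysis later.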
Effective Use of Automation
With data becoming increasingly sophisticated, manually writing a wide number and variety of data checks can be time-consuming and error-prone. A third best practice is effectively using automation features in a data reliability solution.
The Acceldata Data Observability platform combines artificial intelligence, metadata capture, data profiling, and data lineage to gain insights into the structure and composition of your data assets and pipelines. Using AI, Acceldata:
- Scours the data looking for multiple ways in which it can be checked for quality and reliability issues.
- Makes recommendations to the data team on what rules/policies to use and automates the process of putting the policies in place.
- Automates the process of running the policies and constantly checks the data assets against the rules.
The Data Observability platform also uses AI to automate more sophisticated policies such as data drift and the process of data reconciliation used to keep data consistent across various data assets. Acceldata uses the data lineage to automate the work of tracking data flow among assets during data pipeline runs and correlates performance data from the underlying data sources and infrastructure so data teams can identify the root cause of data incidents.
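To make the idea of profiling-driven rule generation concrete, here is a deliberately simplified sketch: learn a plausible numeric range from a historical sample, then flag new values that drift outside it. Real platforms use far richer statistics and metadata than this; the three-sigma heuristic and sample data are illustrative only.

```python
import statistics

def recommend_range_policy(sample, k=3.0):
    """Derive a recommended min/max rule from profiled historical values."""
    mean = statistics.mean(sample)
    stdev = statistics.stdev(sample)
    return (mean - k * stdev, mean + k * stdev)

def check_policy(values, low, high):
    """Return the values that violate the recommended range."""
    return [v for v in values if not (low <= v <= high)]

history = [100, 102, 98, 101, 99, 103, 97]   # profiled sample
low, high = recommend_range_policy(history)
violations = check_policy([100, 250], low, high)
assert violations == [250]                    # the drifted value is flagged
```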
Because the number of data assets and pipelines continues to grow, there is a corresponding growth in data volume. It is critical for data teams to use best practices to scale their data reliability efforts, and, as we saw earlier, there are three forms of scaling: scaling up, scaling out, and scaling your incident response. Let’s explore two of these:
- Scale Up: Using automation features such as those described above, data teams can scale up the number of tests and checks that are performed on a data asset and put in place more sophisticated checks such as schema- and data drift. Other policies can also be automated such as data reconciliation.
- Scale Out: Automation features also help with scaling out. Creating templated policies that bundle multiple rules, then applying those templates across many data assets in one sweep, helps data teams gain data reliability coverage on more assets.
Our last form of scaling is incident management. With data pipelines running more frequently and touching more data and the business teams’ increased dependency on data, there needs to be continuous monitoring to keep the data healthy and flowing properly. A principal ingredient of that is effectively handling incident management.
Having a consolidated incident management and troubleshooting operation control center allows data teams to get continuous visibility into data health and enables them to respond rapidly to incidents. Data teams can avoid being the “last to know” when incidents occur, and can respond proactively.
To enable continuous monitoring, your data observability platform should have a scalable processing infrastructure. This facilitates the scale-up and scale-out capabilities mentioned earlier and it allows tests to be run frequently.
To support continuous monitoring, data reliability dashboards and control centers should be able to:
- Offer instantaneous, 360° insights into data health.
- Provide alerts and information on incidents when they occur.
- Integrate with popular IT notification channels such as Slack.
- Allow data teams to drill down into data about the incident to identify the root cause.
Identifying and Preventing Issues
Quickly identifying the root cause of data incidents and remedying them is critical to ensure data teams are responsive to the business and meet SLAs. To meet these goals, data teams need as much information as possible about the incident and what was happening at the time it occurred.
Acceldata provides correlated, multi-layer data on data assets, data pipelines, data infrastructure, and the incidents at the time they happened. This data is continuously captured over time providing a rich history of information on data health.
Armed with this information, data teams can implement practices such as:
- Perform root cause analysis of any incident and adjust data assets, data pipelines, and data infrastructure accordingly.
- Automatically re-run data pipelines when incidents occur to quickly recover.
- Weed out bad or erroneous data rows to keep data flowing without the low-quality rows.
- Compare execution, timeliness, and performance at different points in time to see what’s changing.
- Perform time-series analysis to determine whether data assets, pipelines, or infrastructure are fluctuating or deteriorating.
Data volumes are constantly growing and new data pipelines and data assets put additional load and strain on the data infrastructure. Continuous optimization is another data reliability best practice data teams should embrace.
Multi-layer data observability data can provide a great deal of detailed information about incidents, execution, performance, timeliness, and cost. Not only can this information provide insights to identify the root cause of problems, but it can also provide tips on how to optimize your data assets, pipelines, and infrastructure.
Acceldata provides such detailed multi-layer data insights and goes beyond by making recommendations on how to optimize your data, data pipelines, and infrastructure and in some cases automate the adjustments. These recommendations are highly tuned and specific to the underlying data platforms being used, such as Snowflake, Databricks, Spark, and Hadoop.
Get the Entire Team Involved
Data teams are skilled at knowing the technical aspects of the data and the infrastructure supporting it. However, they may not be as aware of the nuances of the data content and how the business teams use the data. This is more in the domain of data analysts or scientists.
Another best practice for data reliability is to get a wider team involved in the process. Data analysts can contribute more business-oriented data quality checks. They can also collaborate with data teams to determine tolerances on data quality checks (e.g., percentage of null values that is acceptable) and the best timing of data pipelines for data freshness that meets the needs of the business.
Acceldata provides collaborative, easy-to-use low-code and no-code tools with automated data quality checks and recommendations so data analysts, who might not have sophisticated programming skills, can easily set up their own data quality and reliability checks. Acceldata offers role-based security to ensure different members of the wider data team work securely.
Resulting Benefits From Data Reliability
Implementing data reliability best practices will result in a number of benefits, including:
- Better deployment of your data engineering resources to maintain or lower data engineering costs.
- Better management and visibility of the data infrastructure to keep those costs low.
- Lower legal risk around your data.
- Keeping a high reputation for your data products and increasing trust in the data.
- Maintaining strong compliance around your data and eliminating potential fines for non-compliance.
Best practices are an essential part of every domain in the IT and data world and that now includes the category of data reliability. Best practices not only allow teams to optimize their efforts and eliminate problems, but they also provide a faster ramp-up in the solution area.
In this document we have described a number of key best practices for data reliability that data teams can incorporate, including:
- Apply tests across the entire supply chain.
- Shift-left your data reliability to test data before it hits your warehouse/lakehouse.
- Effectively use automation in data reliability solutions.
- Scale up and scale out your data reliability to increase your coverage.
- Continuously monitor and receive alerts to scale your data incident management.
- Take full advantage of detailed multi-layer data to rapidly solve data incidents.
- Continuously optimize your data assets, pipelines, and infrastructure using recommendations from your solution.
- Get the wider team involved in your data reliability processes and efforts.
These best practices enable data teams to scale their efforts in ways that don’t require significant investment in new resources while also allowing them to run efficient and effective data operations. Incorporate these into your planning and everyday work for smooth data reliability processes.