Companies are embracing the modern data stack today because of the promise of speed, agility and cost savings. And make no mistake: the modern data stack overwhelmingly delivers on these promises.
Data used to take weeks to be collected, ingested, and analyzed by an on-premises data warehouse and exported either as an executive dashboard or paper report.
With managed, cloud-native services, this happens in hours or minutes.
Point-and-click user interfaces radically simplify the creation of new data pipelines, lowering the bar to create a production workflow, and democratizing access to analytics.
So do pay-per-use models that start low and scale, unlike the huge upfront fixed investments in on-premises software and servers.
We are big fans of the modern data stack at Acceldata. We believe it is the right architecture for today’s data-driven enterprises.
However, everything has a flip side.
The initial ease and agility of deploying and managing a modern data stack masks the creeping complexity that inevitably emerges.
Performance bottlenecks and data errors emerge as well, just in different and unexpected places than in your older on-premises system.
And cloud costs, unless closely watched and managed, have a tendency to spiral out of control.
What is the Modern Data Stack?
Before diving deeper, let’s quickly define what the modern data stack is. Typically, it includes most if not all of these technologies:
- Cloud data warehouse (e.g. Snowflake, Amazon Redshift, or Databricks)
- Data integration service
- ELT data transformation tool (e.g. dbt)
- BI visualization tool (e.g. Looker or Mode)
- Reverse ETL tool (e.g. Census or Hightouch)
These tools started emerging a decade ago. The bloom was starting to come off the rose that was big data, and companies were starting to look for alternatives to hard-to-manage, expensive Hadoop data lakes (a process of migration that continues today). Others were seeking a better way to do BI than the rigid, sluggish, and tightly controlled on-premises data warehouse.
The modern data stack solves a ton of problems that both of these legacy technology stacks could not. However, even well-oiled machines break down, especially when you do not monitor and maintain them. And the modern data stack is an exceptionally complex machine, much more so than prior technologies. Problems can emerge anywhere, and they will catch data engineers by surprise.
Why is that? Here are four reasons.
The cloud-native modern data stack is more dynamic than the on-premises stack
A newly deployed modern data stack is a pristine, clean thing of beauty. However, because the modern data stack is so easy to modify and adapt, it can also quickly degrade.
The easier it is to deploy new data pipelines and build production workflows, the more your pipelines and workflows will multiply like weeds.
The more agile your infrastructure, the more that your data ops team will change it, forgetting to document and communicate the updates to the rest of the organization.
The faster it is to transform, transport and analyze your data, the greater the risk of data errors, data silos, and overextended or lost data lineages.
The easier it is to scale your queries and data writes, the bigger the risk that your costs will scale out of control.
It’s true that the modern data stack is, on the whole, much lower ops than on-premises-based data warehousing systems. There is a lot less manual, active management required for things to just run. But low ops is not a license to neglect management. If you don’t monitor and manage your modern data stack, any or all of the four scenarios above can and will happen.
The Modern Data Stack Broke Governance Best Practices That Haven’t Been Replaced
With on-premises databases and data warehouses, every change or upgrade was expensive and time-consuming, and required the work of highly trained IT and database administrators.
That lack of agility was one of the reasons that the cloud was so attractive to business divisions who had long chafed at the naysaying attitude of IT. And the fact that cloud tools were so inexpensive to deploy meant that lines of business could go around IT and put into production the data workflows they wanted.
Shadow IT enabled the lines of business to wrest control of their technology destinies from the CIO. But it also broke all sorts of governance processes and best practices the IT department had built up over many years. These included strong security; centralized compliance with data privacy rules and other government regulations; storing data in hot or cold tiers based on how heavily it is used and how useful it is to the company as a whole; and creating and maintaining a data catalog or an agreed-upon metadata vocabulary, which makes data discovery and data sharing easier and reduces costs.
Also, business divisions prioritize business goals. Strong data governance is not something their leaders care about, until problems occur.
Tristan Handy, CEO of dbt Labs, the vendor behind the modern data stack tool dbt, identified governance as the weak spot in most companies’ data infrastructures, which the modern data stack only made “more painful.”
“Without good governance, more data == more chaos == less trust,” he wrote. And despite those outcomes, most companies apart from massive FAANG-equivalent tech companies are forgoing any governance at all, according to Handy.
The Modern Data Stack is in Heavy Flux
The modern data stack keeps evolving, as data-driven digital enterprises push the envelope and try to leapfrog competitors. For one, while most modern data stacks can ingest real-time event streams, they still take time to prepare data for queries. But these data latencies keep shrinking. Soon, what took hours or minutes will take a minute or just a few seconds. This moves analytics from the batch realm into near-real-time or real-time. And that opens up many new uses, such as in-product analytics (e.g. dashboards inside of your own product for your users), process automation, and operational intelligence that provides real-time visibility into mission-critical fleet management, logistics, inventory systems, and more.
Besides event streams, change data capture (CDC) data pipelines connecting OLTP databases and data lakes with cloud data warehouses continue to grow. CDC enables data to be synchronized near-instantly between two data systems. And it provides another real-time source of data that can feed operational intelligence, process automation, and in-product analytical workflows.
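To make the synchronization idea concrete, here is a minimal sketch of how a stream of change events might be replayed against a copy of a table. It is an illustration under stated assumptions, not how any particular CDC tool works: the `ChangeEvent` shape, the operation names, and the in-memory dictionary standing in for the warehouse table are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change record emitted by a source database."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[dict[str, Any]] = None  # new row image; None for deletes


def apply_changes(target: dict[int, dict[str, Any]],
                  events: list[ChangeEvent]) -> None:
    """Replay change events, in order, against an in-memory copy of the table."""
    for event in events:
        if event.op in ("insert", "update"):
            if event.row is None:
                raise ValueError(f"{event.op} event for key {event.key} has no row image")
            target[event.key] = dict(event.row)  # upsert the latest row image
        elif event.op == "delete":
            target.pop(event.key, None)          # deleting a missing row is a no-op
        else:
            raise ValueError(f"unknown operation: {event.op}")


# The dictionary below stands in for the table on the warehouse side.
warehouse_copy: dict[int, dict[str, Any]] = {}
apply_changes(warehouse_copy, [
    ChangeEvent("insert", 1, {"status": "new"}),
    ChangeEvent("insert", 2, {"status": "new"}),
    ChangeEvent("update", 1, {"status": "shipped"}),
    ChangeEvent("delete", 2),
])
print(warehouse_copy)  # {1: {'status': 'shipped'}}
```

Because events are applied in order and each one carries the full row image, the copy converges on the source table's state without re-scanning the whole table.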
Also, some companies are introducing real-time analytics databases, which they argue are fundamentally faster than cloud data warehouses at ingesting and querying huge volumes of event and CDC streams. There is also plenty of innovation in how query results are delivered, whether through new visualization tools or analytics applications connected via data APIs. The use of reverse ETL tools to push analytical insights into the business applications that workers prefer, such as Salesforce or Zendesk, is also a burgeoning area.
The Modern Data Stack is Increasingly Mission Critical
Every company is looking to become a data-driven one, emulating the Ubers and Airbnbs of this world to disrupt its competitors. But when your entire business is built upon your real-time data pipelines and analytics, your data becomes your lifeblood. If your data doesn’t flow freely and reliably, your business falls apart.
How Can Multi-Dimensional Data Observability Help?
As observed in another blog, the old business expression, “You can only manage what you can measure,” is more true than ever in this data age.
A data observability platform can provide real-time data and analytical insights to keep your modern data stack healthy and efficient.
- Data performance: Modern data stacks may be easier to deploy and modify than legacy ones, but that means they need even more monitoring and management to ensure that ad hoc and tactical changes do not create accidental slowdowns and bottlenecks. When your company’s business model relies on instant analysis of streaming data, slow is the new down, and alerts arrive too late. Reactive performance monitoring is not sufficient. You need machine learning analytics to help you get to the root causes of slowdowns as quickly as possible or, even better, to predict bottlenecks so you can prevent them before they occur. That’s what data observability can provide.
- Data reliability: More data, more sources, and more transformations equal more problems. A centrally deployed data observability platform can fight this by minimizing the creation of data and schema errors. It can also provide the metadata tracking and visibility into data lineage that help self-service users discover the most reliable, relevant data sources. Data observability platforms can also actively enforce governance rules that prevent the proliferation of dark data pools or single-purpose data silos. The combination of the carrot and the stick helps companies create a culture of data reuse.
- Data costs: The cloud’s easy scalability means that processing and storage costs can spiral out of control, especially when IT no longer has authoritarian control over infrastructure. A data observability platform can arm data engineers with the real-time cost of every repository and pipeline, as well as ML-trained cost predictions and cost policy tools that create guardrails to allow companies to scale safely. Data observability is an essential tool for companies that are serious about value engineering and cloud FinOps.
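As a rough illustration of what a cost guardrail could look like, the sketch below projects month-end spend from spend to date and compares it with a budget. It is deliberately simple and makes several assumptions: it uses a linear run-rate projection rather than an ML-trained prediction, and the function names, the `warn_ratio` parameter, and the dollar figures are hypothetical rather than taken from any product.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Linear run-rate projection: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def check_budget(spend_to_date: float, monthly_budget: float, today: date,
                 warn_ratio: float = 0.8) -> str:
    """Classify projected spend against a hypothetical monthly budget policy."""
    projected = projected_month_end_spend(spend_to_date, today)
    if projected > monthly_budget:
        return "over_budget"
    if projected > warn_ratio * monthly_budget:
        return "warning"
    return "ok"


# Halfway through a 30-day month, 6,000 spent projects to 12,000 for the month.
status = check_budget(spend_to_date=6_000.0, monthly_budget=10_000.0,
                      today=date(2024, 6, 15))
print(status)  # over_budget
```

A real policy would likely track each pipeline or repository separately and account for usage that is not evenly spread across the month.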
As businesses become data-driven, data becomes mission critical. You need data on your data. But not just data, as that can become an unintelligible firehose of false alarms for your overworked data engineers. What is needed are analytical insights that use machine learning and AI to correlate data gathered from every corner and layer of your data infrastructure. This enables smaller problems to be fixed automatically through autonomic self-tuning and human-set policies and scripts. This minimizes false positives that drain your data engineers’ focus and energy from more strategic projects. It also ensures that issues that punch past particular thresholds do result in alerts, as well as actionable insights to get to root causes and fix problems.
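The split between automatic fixes and alerts can be sketched as a small triage policy. This is an assumption-laden illustration rather than a description of any platform's logic: the `Issue` type, the 0.0 to 1.0 severity score, the threshold value, and the remediation callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Issue:
    name: str
    severity: float  # hypothetical score from 0.0 (trivial) to 1.0 (critical)


def triage(issue: Issue, alert_threshold: float,
           auto_fixes: dict[str, Callable[[], None]]) -> str:
    """Alert on issues past the threshold; fix smaller ones by policy when possible."""
    if issue.severity >= alert_threshold:
        return "alert"           # a human needs to look at this
    fix = auto_fixes.get(issue.name)
    if fix is not None:
        fix()                    # run the human-set remediation script
        return "auto_fixed"
    return "logged"              # small and no known fix: record it, do not page anyone


actions: list[str] = []
policies = {"stale_partition": lambda: actions.append("refreshed partition")}

print(triage(Issue("stale_partition", 0.2), 0.7, policies))  # auto_fixed
print(triage(Issue("pipeline_stalled", 0.9), 0.7, policies))  # alert
print(triage(Issue("minor_skew", 0.1), 0.7, policies))        # logged
```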
Acceldata provides a multi-dimensional data observability platform that can help businesses solve and prevent problems with their modern data stack. Learn more at www.acceldata.io.