How to Manage Disparate Kafka Clusters in Your Enterprise Efficiently at Scale

For many organizations, Apache Kafka is the key layer in their modern data stack. Uber calls the event streaming platform “the cornerstone” of its tech infrastructure, processing trillions of real-time messages and multiple petabytes of data daily with Kafka. So do Walmart, Twitter, Tencent and many others.

What looks like overnight success usually took years. Kafka is no exception. First created and open-sourced by LinkedIn engineers in 2011, Kafka had to mature, build an ecosystem, and fend off no shortage of rival event streaming and message queuing platforms before emerging as a popular choice for reliable delivery of real-time data on a massive scale.

Similarly, the uptake of Kafka by most enterprises has been neither straightforward nor master-planned. Most businesses got started with Kafka inadvertently, when an individual department or team, driven by an urgent business need involving streaming data, went around IT to deploy it.

These shadow IT deployments accelerated with the rise of Kafka-as-a-service hosted by public cloud providers such as Google Cloud, AWS, and Microsoft Azure, and the fully-managed Kafka service offered by Confluent. For business departments, Kafka in the cloud was easier to deploy and manage. The pay-as-you-go pricing required less upfront investment, too.

Unscaled and Immature

Today, the state of Kafka is simmering chaos with the constant threat of boiling over. That’s because most companies still own a variety of small Kafka clusters. These clusters are owned by different departments, managed using different tools and best practices and according to different SLAs. They are also hosted in different on-premises and public data centers and/or fully-managed services. A surprising percentage are of the default 3-node size.

There is usually no central team with overall responsibility for Kafka. And if there is, they rarely have the tools to give them unified visibility and control over it.

Sure, there are exceptions, such as Pinterest. Its central Logging Platform team oversees 50+ Kafka clusters with 3,000+ brokers (servers), for an average Kafka cluster size of about 60 server nodes, all running the same version of Kafka (2.3.1).

Gartner has developed what it calls a Maturity Model to measure how various technologies are being used inside an enterprise. The Gartner Maturity Model for Data and Analytics is displayed below.

Greater maturity is correlated with greater ROI and business value. For Kafka, the problem is not just that most enterprises are not particularly mature in their usage, but that maturity swings wildly inside the organization, from department to department, or cluster to cluster. As a result, so does the ROI that Kafka is generating for that enterprise.

The chaotic state of Kafka inside most enterprises has four serious consequences:

  1. Higher management overhead. It takes a lot more time and work to manage 20 three-node Kafka clusters than a single 60-node cluster. As mentioned above, each cluster might be running a different version of Kafka and different applications. The clusters may be managed using different tools and SLAs, and/or scattered across various on-premises and third-party cloud locations. And that’s exacerbated by Kafka’s complexity and already-high ops requirements. Teams that don’t cooperate don’t develop best practices across the company. That results in more work overall. And that translates into more cost at the end of the day.
  2. No economies of scale. A single unified customer can bundle its server, storage and cloud needs together to negotiate volume discounts on Kafka licenses or hosting contracts from service providers. Twenty smaller customers have no such chance. Even more aggressively, companies can consolidate small disparate clusters into a single one that shares licenses, storage and bandwidth. This is easier to manage than having to provision multiple clusters. Moreover, it curbs the growth of expensive duplicate data, aka data silos, that arises when you have multiple clusters and low visibility into them. And such multi-tenant systems can be created without compromising security.
  3. More data blindspots and data errors. Kafka is already notoriously difficult to manage. There are too many options to set manually and too many metrics to watch. Even the most experienced Kafka admins will deploy Kafka clusters that over time will create consumer lag, or lead to replicas that get out of sync, resulting in lost data. That’s not the admin’s fault or Kafka’s fault. It’s the dynamic nature of heterogeneous data pipelines pumping massive volumes of data and adapting to fast-changing business conditions. But without central visibility and insight into things such as Kafka Producer-Topic-Consumer lineage, it’s impossible for Kafka admins to be alerted to bottlenecks and data errors in real time. Mean Times to Resolution (MTTRs) also balloon. And problems proliferate, since admins lack the ability to correlate events across infrastructure, data layers and pipelines and predict future issues. This leaves Kafka admins and the rest of your data ops team in constant firefighting mode, which will eventually destroy the morale of even the best teams.
  4. No time for transformative projects that create business value, causing innovation to suffocate and die. Kafka has the potential to transform your company’s operational processes and your business model. Customer personalization, supply chain management and logistics, fraud detection and cybersecurity are among the real-time Kafka use cases that can turn your company into a data-driven disruptor. Take the company ACERTUS, which built an automated end-to-end vehicle fleet management system around Kafka that generated $10 million in revenue in its first year and replaced a largely manual system. Unfortunately, such potential remains unfulfilled for most Kafka users for the reasons above. Most Kafka users don’t even have the time to upgrade from the obsolete and inefficient Lambda architecture, with multiple pipelines for real-time and batch applications, to a streamlined, stream-centric Kappa architecture.
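
To make the out-of-sync replica problem from the third point concrete, here is a minimal Python sketch of the check an admin might run against each cluster. It is an illustration under stated assumptions, not any vendor's implementation: the `PartitionState` record and the `under_replicated` function are hypothetical names, and the replica and in-sync replica (ISR) lists are assumed to have been gathered already from your existing Kafka tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition, as reported by cluster metadata."""
    topic: str
    partition: int
    replicas: Tuple[int, ...]  # broker ids assigned to hold a copy
    isr: Tuple[int, ...]       # broker ids whose copy is currently in sync


def under_replicated(partitions: List[PartitionState]) -> List[PartitionState]:
    """Return partitions where at least one assigned replica is out of sync."""
    return [p for p in partitions if set(p.isr) < set(p.replicas)]


# Example snapshot: partition 1 has lost two of its three in-sync replicas.
snapshot = [
    PartitionState("orders", 0, replicas=(1, 2, 3), isr=(1, 2, 3)),
    PartitionState("orders", 1, replicas=(1, 2, 3), isr=(1,)),
]

for p in under_replicated(snapshot):
    missing = sorted(set(p.replicas) - set(p.isr))
    print(f"{p.topic}-{p.partition}: replicas on brokers {missing} are out of sync")
```

Running the same check across 20 separately managed clusters means collecting 20 such snapshots with possibly different tools, which is the overhead the first point describes.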

Tradeoffs of Fully-Managed Kafka

Faced with these mounting problems around Kafka, some companies have thrown their hands up and fled towards the security blanket of a fully-managed Kafka service such as Confluent Cloud. Handing over control of their Kafka infrastructure to a third party can greatly simplify things for enterprises, and partly solve some of the problems above.

However, there are tradeoffs. By migrating to a fully-managed service, you are jumping from one extreme — the full manual control and customizability of Kafka — to another extreme, where you have much less control over Kafka, and even less visibility than you had before. That can have a negative effect on the rest of your data pipeline. 

Also, the migration into managed Confluent Cloud is not a simple lift and shift into the cloud, but a total refactoring into a multi-tenant, managed system. With or without planning, this can turn out to be a long-term, massive disruption that can lead to broken data pipelines and output errors.

Finally, Confluent Cloud’s low ops requirements also mean users have little visibility into, or control over, their costs. With the huge volumes of events being streamed through Kafka today, users moving to Confluent Cloud face the risk of unexpectedly high bills.

A Better Way: Data Observability

Acceldata provides a turnkey solution that grants enterprises ongoing visibility into Kafka and into their always-changing data pipelines. Our Kafka dashboard predicts and prevents potential problems, alerts you immediately when issues do arise, and enables you to safely cost-optimize your Kafka clusters and other infrastructure.

For instance, Acceldata helps data engineers monitor for instability and bottlenecks that can cause replicas to get out of sync in Kafka, increasing the chance of data loss. If Kafka is set to retain messages for seven days, and a Consumer is already lagging by four days, then Acceldata can proactively send an alert to the Kafka administrator to take action before any data is lost.
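
The retention example above boils down to a simple threshold check, sketched here in Python. This is only an illustration, not Acceldata's implementation: the `ConsumerLag` record, the `retention_alerts` function, and the 50 percent warning threshold are all hypothetical, and the lag figures are assumed to have been measured already by your monitoring tooling.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class ConsumerLag:
    """Hypothetical record: how far behind a consumer group is on a topic."""
    group: str
    topic: str
    lag: timedelta  # age of the oldest message the group has not yet read


def retention_alerts(
    lags: List[ConsumerLag],
    retention: timedelta = timedelta(days=7),
    warn_fraction: float = 0.5,  # assumed threshold: warn at half the retention window
) -> List[ConsumerLag]:
    """Return consumers whose lag has used up warn_fraction of the retention window."""
    threshold = retention * warn_fraction
    return [entry for entry in lags if entry.lag >= threshold]


# Example: seven-day retention, one consumer group already four days behind.
observed = [
    ConsumerLag("billing", "orders", timedelta(days=4)),
    ConsumerLag("search", "clicks", timedelta(hours=2)),
]

for entry in retention_alerts(observed):
    remaining = timedelta(days=7) - entry.lag
    print(f"ALERT {entry.group}/{entry.topic}: {remaining} left before data expires")
```

Alerting on a fraction of the retention window, rather than on a fixed lag, keeps the same rule usable across clusters whose retention settings differ.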

Another common issue with Kafka clusters is that Topics — the collection of events from a stream — can get imbalanced, or skewed. This is because Kafka tries to synchronize Topics among multiple Brokers (servers) for fault-tolerant backups and also to maintain pipeline performance. With Acceldata, it is easy to maintain a global view of Brokers that are failing or slowing down, thus causing Topics to become skewed. 
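
One way to picture skew is to count how many partitions each Broker leads and flag the Brokers that sit far from the average. The Python sketch below does that under stated assumptions: it treats skew as uneven partition leadership, which is only one possible reading of the term, and the `leader_counts` and `skewed_brokers` functions and the tolerance value are hypothetical. The partition-to-leader mapping is assumed to come from your existing cluster metadata.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical input shape: (topic, partition) -> id of the broker leading it.
Assignments = Dict[Tuple[str, int], int]


def leader_counts(assignments: Assignments, brokers: Iterable[int]) -> Counter:
    """Count led partitions per broker, including brokers that lead none."""
    counts = Counter({broker: 0 for broker in brokers})
    counts.update(assignments.values())
    return counts


def skewed_brokers(
    assignments: Assignments,
    brokers: Iterable[int],
    tolerance: float = 0.3,  # assumed: flag brokers more than 30% away from the mean
) -> Dict[int, int]:
    """Return {broker: led partitions} for brokers far from the average load."""
    counts = leader_counts(assignments, brokers)
    mean = sum(counts.values()) / len(counts)
    return {b: n for b, n in counts.items() if abs(n - mean) > tolerance * mean}


# Example: 12 partitions of one topic, with broker 1 leading half of them.
leaders = [1] * 6 + [2] * 3 + [3] * 3
assignments = {("orders", i): broker for i, broker in enumerate(leaders)}

for broker, led in skewed_brokers(assignments, brokers=[1, 2, 3]).items():
    print(f"broker {broker} leads {led} partitions, well off the cluster average")
```

A Broker that leads no partitions at all is also flagged, which is useful because a failed or slow Broker often shows up as one that has lost its leaderships.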

Our newest features, such as Kapxy, our Kafka Observability Utility for Topic Lineage, provide granular visibility into the root causes of Kafka performance issues and bottlenecks, and also optimize price-performance.

Real-World Customer

Here’s an example of a company suffering from the four Kafka problems outlined above — and how it solved them with Acceldata. 

PubMatic is one of the largest adtech companies in the United States, serving 200 billion ad impressions and processing 2 petabytes of new data every day. Core to this real-time flow of data is its Kafka deployment, which comprises 50 small Kafka clusters with 10-15+ nodes in each cluster.

Such scaled-out, disparate infrastructure was typical for PubMatic, which had been in hyper-scale mode for years trying to keep up with business demand. Unfortunately, this also meant that mission-critical technologies such as Kafka and Hadoop (a 150 PB, 3,000-node deployment fed by Kafka) were extremely brittle, suffering from frequent outages, performance bottlenecks, and high MTTRs. Daily firefighting was the norm, and high operational, infrastructure and OEM support costs were the result.

So PubMatic deployed Acceldata’s data observability platform and immediately gained improved visibility into its complex, interconnected network of data pipelines, repositories, and infrastructure, including Kafka. This enabled PubMatic to predict and prevent bottlenecks and data errors while also cost-optimizing the performance of its data pipelines, including Kafka.

Besides eliminating day-to-day firefighting by its engineering department, PubMatic heavily consolidated its Kafka clusters, reducing license, infrastructure, and management costs. PubMatic also reduced its overall support costs by $10 million, and its Hadoop storage footprint by 30 percent. 

Conclusion

Deploying a data observability platform is the first step to gaining control of your disparate Kafka clusters in order to manage them efficiently at scale. 

Gaining these benefits does not require that a company completely centralize how Kafka is used and managed. For companies like PubMatic, it made sense to aggressively consolidate Kafka clusters in order to reduce the number of Kafka server licenses, duplicated data and data pipelines, and centralize management.

Other companies can choose a less radical approach: sharing ownership between a central data ops or IT team, and individual business owners. Teams can work together to create SLAs and cost models that make the most business sense. Server consolidation and multi-tenant architectures are encouraged, but not mandated. 

This approach, which some companies such as JPMorgan Chase are calling a data mesh, offers an operational alternative for the data stack, one that supports business agility, economies of scale, and an optimized environment for Kafka servers and data pipelines.

Also, companies can use Acceldata with a fully-managed Kafka solution such as Confluent Cloud. Acceldata provides a plethora of features to make complicated migrations much safer and easier, with automated data preparation, validation, and more. And once data is migrated into Confluent Cloud, Acceldata provides additional visibility and control over performance and cost optimization that is natively lacking.

Whichever approach you ultimately take with Kafka, Acceldata will provide huge benefits.

Request a demo of the Acceldata Data Observability Cloud. 
