How Acceldata Revamped Data Observability at PhonePe at the Infrastructure Layer

PhonePe is a Walmart subsidiary that lets more than 350 million consumers across India send and receive money and make payments at more than ten million physical and online retail stores. It also lets customers transact at ATMs and helps them invest in mutual funds and other securities.

Enterprise data challenges

As PhonePe embarked on a massive expansion, the company’s data team came under intense pressure to manage system performance, even in the early stages. The team constantly had to troubleshoot issues and track down the causes of glitches and unnecessary downtime in its data applications. This is when Burzin Engineer, PhonePe’s Chief Reliability Officer, realized that the team needed advanced tools to improve visibility into its data operations. PhonePe’s OLTP (online transaction processing) and OLAP (online analytical processing) systems are extremely complex, and running them without sophisticated solutions that match the company’s critical data initiatives would put its future in jeopardy.

Using a data observability solution

Acceldata played a key role in improving productivity at PhonePe by minimizing everyday emergencies and downtime. Acceldata’s compute performance monitoring solution helped PhonePe monitor its HBase, Spark, and Kafka pipelines and distinguish between seasonal and campaign-based anomalies. Acceldata also helped PhonePe scale its data infrastructure from 70 to 1,500 nodes, a roughly 2,000 percent expansion of the organization’s entire technology framework.

Burzin Engineer said that in such a rapidly growing environment, PhonePe needed a soup-to-nuts service and a partner that could help the team study the job scheduler efficiently and precisely. When Acceldata’s CEO and co-founder, Rohit Choudhary, asked how PhonePe prioritizes reliability, SLAs and SLOs, and engineering productivity, Engineer chose reliability as the most crucial. In his view, when the livelihoods of millions of customers depend on your services, the platform must be highly secure and reliable.

Improving business outcomes with better data quality

With transactions reaching 400 million per month and up to three billion dollars a day, PhonePe needed immediate help managing the staggering expansion of its services. Besides reliability, real-time monitoring of its high-volume data operations is also important to PhonePe, primarily because it must report business performance and system status to external stakeholders 24x7.

Before Acceldata, PhonePe had tried other available tools, such as HBase Console and Ambari, and had built single-metric Grafana dashboards for root-cause analysis, but found them insufficient. HBase Console, for example, provided only aggregated information and required significant time and analysis from highly experienced data engineers before it could deliver useful intelligence.

According to Burzin Engineer, “Acceldata was able to implement their data observability solution in less than a day and provide ad hoc identification of problems with HBase region servers and tables, especially for those under pressure. The Acceldata platform was also able to help PhonePe differentiate between HBase cluster issues caused by hardware and poorly designed tables, and anomalies resulting from seasonal and campaign-related surges. Acceldata enabled us to direct users to the problem’s root-cause quickly and clearly through automated alerts and easy-to-read dashboards. In many cases, Acceldata even recommends fixes to solve the problem.”

Acceldata’s efforts resulted in a 65 percent reduction in the cost of managing data warehouses, equivalent to five million dollars in savings, and eliminated expensive commercial data warehousing licenses. Within the first 18 months of working with Acceldata, PhonePe was able to manage the hyper-growth of one of the world’s largest instant payment systems. Burzin Engineer found Acceldata unique in its ability to support multi-cluster data and workload management with uniform configurations in a short time span.

You can read more about how PhonePe is working with Acceldata in this case study.

Photo by Clay Banks on Unsplash