September 8, 2022
Sameer Narkhede

Five Ways That Solving a Rubik's Cube Is Just Like DataOps

The Rubik’s Cube was invented in 1974, the same year that IBM began work on System R, its pioneering relational database prototype. The similarities don’t end there. Since its original rise in the 1980s, the Rubik’s Cube has become the world’s most popular puzzle toy. More than 400 million Rubik’s Cubes have been sold in the last four decades. The constant release of more complex variants, along with the popularity of speedcubing competitions around the world, has kept the Rubik’s Cube just as challenging and relevant in 2022 as it was in 1982.

Similarly, data began its rise to become the business world’s most valuable resource with the embrace of OLTP databases in the 1980s, followed by business intelligence and data warehouses in the 1990s, big data analytics in the 2000s, machine learning and data science in the 2010s, and now real-time AI and customer personalization systems. Worldwide investment in data and analytics is growing from $216 billion in 2021 to $349 billion in 2025, according to IDC, a CAGR of 12.8 percent. With every business today becoming data driven, DataOps has never been more mission-critical, nor more challenging.
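For readers who want to check the growth arithmetic, here is a minimal Python sketch of the standard compound annual growth rate formula. The function name cagr is just an illustrative choice, and the inputs are the rounded IDC endpoints quoted above, so the result lands within about a tenth of a percentage point of the quoted 12.8 percent rather than matching it exactly.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values, as a fraction."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    # Growth factor over the whole period, annualized by taking the n-th root.
    return (end_value / start_value) ** (1 / years) - 1


# Rounded endpoints from the article: $216B in 2021, $349B in 2025.
rate = cagr(216, 349, 2025 - 2021)
print(f"Implied CAGR: {rate:.1%}")
```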

In this blog, I’ll explore some of the striking similarities between solving a Rubik’s Cube and managing DataOps. With each point, I have also included a link to a relevant report or whitepaper on DataOps or data engineering. You can also skim our full resource library here. 

The Rubik’s Cube is a Highly-Complex, Multi-Dimensional Logic Problem

When Hungarian professor Ernő Rubik invented his namesake puzzle, he quickly realized that its deceptive simplicity hid a deep complexity. Despite just eight corners and twelve edges, there were too many starting positions for Rubik, who was an architect, not a mathematician, to even begin to calculate. Not only did Rubik have no idea how to solve his creation, he was unsure whether it was even possible.

Rubik eventually solved his invention after a month sequestered in his bedroom. And when the Rubik’s Cube debuted in America in 1981, it was advertised as having “over 3,000,000,000 (three billion) combinations but only one solution.” 

That’s a huge number, but mathematicians and computer scientists knew this estimate was far too low. The actual count for a standard 3x3 Rubik’s Cube is roughly 43.25 quintillion different positions. 43 quintillion is 43 billion billions, or 43,000,000,000,000,000,000 (yep, that’s 18 zeros).
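The 43-quintillion figure can be reproduced with a well-known counting argument. That argument is not spelled out in this post, so treat the following as an outside sketch. The eight corners can be arranged in 8! ways and the twelve edges in 12! ways, but only half of those combinations are reachable by legal turns. Seven corner twists and eleven edge flips can be chosen freely, with the last one of each forced. The function name below is illustrative.

```python
from math import factorial


def cube_positions() -> int:
    """Count reachable positions of a standard 3x3 Rubik's Cube."""
    corner_arrangements = factorial(8)   # 8 corner pieces
    edge_arrangements = factorial(12)    # 12 edge pieces
    corner_twists = 3 ** 7               # the 8th corner's twist is forced
    edge_flips = 2 ** 11                 # the 12th edge's flip is forced
    # Corner and edge permutations must have matching parity,
    # so only half of the combined arrangements are reachable.
    return (corner_arrangements * edge_arrangements // 2) * corner_twists * edge_flips


print(f"{cube_positions():,}")  # 43,252,003,274,489,856,000
```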

Parallel with DataOps: Today’s enterprise data infrastructures are far more complex than those of yesteryear. They are multi-layered systems consisting of on-premises and cloud data repositories including old-school data lakes, data warehouses and data marts and newer lakehouses and delta lakes. They ingest data from a network of real-time and batch streams leveraging Kafka and other event publishing middleware, and in turn pump out data to a constantly-changing web of reporting dashboards, real-time data applications, machine learning feature stores, and more. And rather than storing gigabytes or terabytes of data, their combined repositories are holding petabytes or even exabytes of data.

DataOps, needless to say, has become extremely complex and dynamic. Optimizing the cost, performance and reliability of your DataOps is a quantifiable, logical problem; as such, it can be solved. Yet, without the aid of best practices and tools, DataOps is also extremely difficult. 

“Hope is not a strategy” for solving the Rubik’s Cube

Algorithms and sequences of moves to solve the Rubik’s Cube were developed shortly after its debut, and they have only gotten faster and simpler over time. Without learning one of these methods, solving a Cube is basically impossible for anyone who is neither an expert in group theory nor gifted with A Beautiful Mind-level pattern recognition. 

Parallel with DataOps: As outlined above, today’s enterprise data architectures are complex and ever-changing due to new business requirements, new data sources, the changing shape of your data, etc. Without a concrete, well-thought-out DataOps strategy, even the best data engineers will be stuck in exhausting daily firefights. Your business’s data performance and reliability will suffer, along with your business agility, while your data costs will spiral. 

Some businesses think they have found the cheat code to DataOps. Some completely outsource management of their data platforms to a third-party provider. Others try to migrate all of their legacy data repositories and data warehouses to a single, modern cloud-native solution that claims to be fully-automated and require zero administration. 

The nature of shortcuts is that there are always tradeoffs. Outsourcing your data infrastructure 100 percent to an outside company is expensive, reduces your visibility and control over your environment, and puts your business agility at the mercy of your provider. Migrating all of your data to a single, unified platform is a massive effort that could take years to complete and could fail at any point along the way. Even when it succeeds, data quality problems introduced by the migration may not emerge until months or years later. Cloud-native platforms that claim to be fully automated and zero-administration rarely live up to their claims. You’ll still need in-house data engineers to manage everything. And the tradeoff for low-ops is a loss of optimization and agility, and generally higher costs.

Read the whitepaper Increase Your Snowflake ROI with Data Quality, Resource Efficiency, and Spend Forecasting.

The Rubik’s Cube has a thriving expert community

In the community of twisty puzzle enthusiasts and experts, there are two main camps. The higher-profile group is the speedcubers. There were around 1,000 official speedcubing competitions around the world before the pandemic, many of them very popular on YouTube. While the fastest single solve ever recorded is just 3.5 seconds, speedcubers tend to focus on average times (official competitions require speedcubers to perform five solves, dropping their fastest and slowest times and averaging the remaining three times). The best speedcubers, like Australia’s Feliks Zemdegs, can achieve winning average times of 5-6 seconds.
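The averaging rule described above (five solves, drop the fastest and slowest, average the remaining three) is easy to express in code. This is a minimal Python sketch with a hypothetical function name and made-up sample times. It ignores competition details such as failed solves or penalties.

```python
def average_of_five(times: list[float]) -> float:
    """Trimmed mean of five solve times: drop best and worst, average the rest."""
    if len(times) != 5:
        raise ValueError("exactly five solve times are required")
    middle_three = sorted(times)[1:-1]  # discard the fastest and the slowest
    return sum(middle_three) / len(middle_three)


# Hypothetical solve times in seconds, for illustration only.
sample_times = [5.0, 6.0, 7.0, 4.0, 9.0]
print(average_of_five(sample_times))  # 6.0
```

Dropping the extremes rewards consistency: one lucky solve or one fumble does not decide the result.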

How do speedcubers achieve such impressive times? Through repeated practice, using methods with names like CFOP, Roux, ZZ and Corners-First, augmented by online trainers and the best equipment. Well-lubricated, Chinese-made, magnetic, stickerless cubes are generally favored by speedcubers; Rubik’s-branded cubes, ironically, are considered too stiff and unreliable, with an inconvenient tendency to spontaneously fall apart during competitions.

Parallel with DataOps: The DataOps field is burgeoning. Data engineers, including data reliability engineers and machine learning engineers, have replaced data scientists as the fastest-growing IT job today. Many data engineers are actually former data scientists, some of whom left after feeling burnt out by false career promises, and others who realized that they had mostly been doing data engineering work all along, and that they might as well enjoy the career growth benefits, too.

Being a successful data engineer or DataOps expert requires more than knowing how to track MTTR and other key data failure metrics. You need to be well-versed in data engineering and reliability best practices such as cloud data FinOps and value engineering, and to possess specific knowledge of popular platforms like Snowflake and cloud environments like AWS and Azure. Ideally, you should also be empowered by the best tools: in this case, a unified, multi-dimensional data observability platform. 

Learn how Gartner defines Data Observability

Rubik’s Cube Variants Can Scale Incredibly in Complexity

Besides speedcubers, many Rubik’s experts, having conquered the classic 3x3 cube, have clamored for ever-more-complex variants. Today, you can buy cubes ranging in size from 2x2 to 17x17, which provide a much-greater intellectual challenge, taking hours or days to solve. And twisting and rotating these massive puzzles also provides a demanding physical workout. The largest ever created — 3-D printed, actually — is a 33x33 fully-functional puzzle.

Parallel with DataOps: DataOps teams and infrastructures can vary wildly in size, from one-person teams where a lone data analyst or data scientist does double duty as the data engineer, to Big Tech and FAANG companies with hundreds or thousands of in-house data engineers. These include Facebook, which oversees dozens of exabytes of data; LinkedIn, with its one exabyte+ analytical data platform; Netflix, with 100,000+ data server instances on AWS; Spotify, which ingests 500 billion events of data a day; and many others.

Even if their DataOps has not scaled to the size of Facebook or LinkedIn, most companies still must contend with highly-diverse, changing, and fast-growing data architectures. Without an army of data engineers, the best way to efficiently manage this environment is by implementing best practices with the aid of a unified, multi-dimensional data observability platform. 

Download The Definitive Guide to Data Observability for Analytics and AI.

The Rubik’s Cube Is Solvable Thanks To Best Practices and Best Software

Despite its 43 quintillion different configurations, the Rubik’s Cube is actually quite solvable. Many algorithms have been developed. Speedcubers on YouTube have shown us how deliriously fast those algorithms can be performed. 

In 2010, a team of mathematicians and computer scientists, with the aid of computer time donated by Google, proved that any position of a 3x3 cube can be solved in at most 20 moves, a figure dubbed “God’s Number.” 

Even more impressively, engineers have built a software-driven robot that can physically twist and solve a 3x3 cube in just 0.38 seconds.

Parallel with DataOps: Managing data pipelines, applications, and repositories by manually monitoring dashboards and hand-configuring various knobs and settings is inefficient, expensive, and non-scalable. Today’s heterogeneous, sprawling data environments require a unified data observability platform that uses machine learning to automate your management and autonomously implement your best practices.

Get a Feature Checklist for Choosing the Right Data Observability Solution.

When you’re ready to learn more about Acceldata’s enterprise data observability platform, reach out for a free demo.