It’s simple. Reliable data is applied machine learning’s bottleneck.
I want to see a future filled with dynamic, affordable products and services that are smart – products that react to their environment, that know and adapt to their users, and that let us do more without thinking.
What got me and so many others excited about my previous company, DataRobot, was the idea of removing a bottleneck to this future. By putting world-class data science practices into software, we made it possible for thousands of companies – not just an elite few – to make their products and services smarter.
But if you’re focused on that outcome, you have to keep asking, what’s the bottleneck now? Ryan Petersen’s thread about the Port of Long Beach put this better than I’d ever heard it:
When you're designing an operation you must choose your bottleneck. If the bottleneck appears somewhere that you didn't choose it, you aren't running an operation. It's running you.
In my meetings with hundreds of DataRobot customers, I almost never found that the bottleneck keeping them from delivering better products or running better operations was an insufficiently powerful model. Certainly not after they adopted DataRobot, and certainly not after the first couple of use cases.
Rather, it was simply that the right data just didn’t exist, wasn’t doing what it should, or couldn’t be relied upon. If you can’t trust the data pipeline feeding and consuming models to be as reliable as, say, your company’s homepage, you can’t change your business with ML.
We’re all excited about the progress in the ML world. Data-centric AI and feature stores are starting to take off. Every month we get better neural network architectures, better libraries, better IDEs for all personas, better hardware, better accelerators, better data labeling services. It’s getting radically easier and cheaper to build high-quality models.
And yet…as crazy as it sounds…that amazing progress almost doesn’t matter for the widespread application of machine learning in the 2020s. Until the data infrastructure and teams are there to support and apply these models, the vast majority of companies will remain extremely limited in how they can use ML.
Think of it as spending $130,000 on a Tesla Model S Plaid with Full Self-Driving capability when you live in an area with no paved roads, just three cow paths. Or fine-tuning a Maglev bullet train’s performance when you’ve just finished laying your first hundred miles of wooden trackways. It’s not unlike pining after a PS5 while it’s stuck in Japan as supply chains unravel. Until the infrastructure is in place, the gains from optimizing the “cool” part will remain extremely limited.
Data engineering is rarely seen as “cool” in the same way ML is. But anyone who works in the data world knows that data engineering is where the AI-enabled enterprise or digital transformation sausage gets made. Or, for a more timely metaphor, think of data engineers as the offensive line that makes it possible for so-called “skill position” players like data scientists to shine.
Pictured: Your star data scientist trying to make your business AI-driven without an empowered data engineering team. Does she have a 10-year, $450M contract?
The value of reliable data pipelines soars alongside the power of the ML models they unlock. And this makes navigating the “Modern Data Stack”, the shift to cloud, and – lest we forget – the decades of investment in legacy data / ETL platforms, one of the single most critical battlegrounds for data professionals today.
The startups that solve the data engineering bottleneck best will have a huge impact on modern data environments and be valued accordingly. Data observability – the ability to inspect, operate, and optimize distributed data pipelines – has seen a surge of investment over the last couple of years as a result.
I’ve put my bet on Acceldata, the data observability company with the most potential to win the space, for a couple of simple reasons.
Across the board, and starting with the founders, Acceldata is made up of veteran big data engineers – people who not only can go deep into the internals of any data processing system, but have done so at some of the most complex enterprise deployments out there. Putting this hard-won expertise into our products allows our customers to immediately up-level the reliability of their data pipelines.
No one is going to solve the data observability crisis without connecting the data flowing through pipelines with the performance of the diverse engines actually crunching that data – and Acceldata is far ahead of the field in that dimension. With complex, distributed operational pipelines, rapid visibility is a critical edge, and it’s one I’m happy to have on Acceldata’s side.
Technical skills and market opportunity aside, when I got to know the Acceldata team, I saw that this was an experienced, ambitious, and yet grounded group – a combination you do not often see in high-growth startups! Seeing the incredible early customer adoption and the chance to build out the product organization, I decided to roll up my sleeves, close out my arXiv tabs on new ML techniques, and join the data observability startup with the deepest technical talent and most market traction in the industry.
We’re just getting started here and hiring bright, data-obsessed minds in all roles. I’m excited to get to know the next wave of volunteers fighting the data reliability bottleneck. Check our careers page or reach out to me to learn more.