We're talking about 2021’s hottest MLOps technology: feature stores. But what is a feature store, and what on earth does it do?
Atindriyo Sanyal, Technical Lead on the Machine Learning Platform team at Uber, shares Uber’s feature store journey, the benefits feature stores have provided for their data science team, and where they're going next.
Read on for the transcript!
Hi, I’m Monte Zweben, CEO of Splice Machine. You’re listening to ML Minutes, where cutting-edge thought leaders discuss Machine Learning in one minute or less.
This week, our guest is Atindriyo Sanyal, technical lead on the machine learning platform team (Michelangelo) at Uber. Welcome, Atin.
Thank you for having me. I'm really excited to be here.
I'm excited to have a conversation about Michelangelo, and in particular, Palette, the feature store within Michelangelo. But before we get there, tell us about your journey. How did you get to where you are right now?
So I grew up in New Delhi, India, the son of two doctors, and there are a lot of science and technology people in my family. I was always surrounded by scientists and engineers, and I always wanted to be a space scientist growing up. A lot of kids want to do that, but I literally had an uncle who worked at NASA for over 40 years, so I used to hear stories about space science and the Apollo project. So I was always into science and technology. Then I did my engineering in India and moved to the US, where I joined UCLA. I did my master's there, and I literally worked in the lab where the internet was founded. So it was an incredible experience for me, meeting such great people. I remember interning at LinkedIn, where I worked on distributed systems, and eventually I went to Apple, which led me to Uber. I've been doing machine learning for the last five or six years now. So yeah, that's been my journey.
That's a great journey. Many of our guests have been motivated by the space program and have had relatives in the space program, or were in the space program themselves. So it's great to have that theme spreading to our episode today. Today's episode is about your journey at Uber building a feature store. But this is a pretty detailed concept. Could you first define what a feature store is?
Yeah, absolutely. So a feature store is a key component of most machine learning systems. It's a relatively new concept, which has come up in the last two to three years, really. The idea is to simplify the most painful part of the machine learning workflow, which is the feature engineering process. It essentially automates taking raw data and creating features from it, which are the signals that go into machine learning models. So that, in summary, is a feature store.
Excellent. And how did Uber discover that they needed a feature store? What was that evolution?
Well, that's a great question. So Uber went through this hyper-growth phase in 2014-2015, where they literally scaled out worldwide. It was during that time that they set up a lot of their microservices and their big data ecosystems. And it really led to this culture of building centralized services that are specialized for doing one particular job. Machine learning was one of these key future-facing platforms, which really was sort of the key, given the amount of data that Uber deals with. So that led to the evolution of the concept of Michelangelo, which is a central ML platform. And within that, the feature engineering process was also sort of following the same principles of centralizing the process of taking raw data and building features. So that led to the creation of Palette, which is Michelangelo's feature store.
Very interesting. So as you centralized these machine learning shared services, the democratization of ML happened at Uber, because you really did have a focus on centralized machine learning services. I'm curious, how far into the journey of building Michelangelo did the team realize that they needed a feature store? Was Michelangelo there for a long time before that happened? Or was it at a certain point when a certain number of models went into production? How did that happen?
That's a great question. So the first version of Palette wasn't really a centralized service. It was essentially a couple of Java classes and Scala files, which were able to sort of join data from multiple tables and give you a single data frame for training. This was kind of an epiphany, really, from one of the founding engineers of Michelangelo, and I still have the honor of working with him today. That was pretty much built over a couple of weeks. But it was a static file, which you literally had to instantiate and run. Over time, we saw that people really started using the features of Palette, and people wanted more from it. So we literally took those few classes and built a centralized service out of them. And we've seen incredible benefits, not just for machine learning systems: features are now used in various rules engines across Uber, like the risk, fraud, and safety rules engines, and it has really led to the democratization of feature engineering.
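To make the "join data from multiple tables into a single data frame" idea concrete, here is a minimal sketch in plain Python. It is not Palette's code (the original was described as Java and Scala classes); the function name `join_feature_tables`, the `driver_id` key, and all table contents are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_feature_tables(labels: List[Row], feature_tables: List[List[Row]], key: str) -> List[Row]:
    """Left-join every feature table onto the label rows by a shared entity key."""
    # Index each table by key once so every lookup below is a dict access.
    indexed = []
    for table in feature_tables:
        columns = sorted({col for row in table for col in row if col != key})
        by_key = {row[key]: row for row in table}
        indexed.append((columns, by_key))

    training_rows = []
    for label_row in labels:
        out = dict(label_row)
        for columns, by_key in indexed:
            match = by_key.get(label_row[key], {})
            for col in columns:
                # Entities missing from a feature table get None, like a SQL left join.
                out[col] = match.get(col)
        training_rows.append(out)
    return training_rows


# Hypothetical data: one label table and two feature tables keyed by driver_id.
labels = [{"driver_id": 1, "label": 1}, {"driver_id": 2, "label": 0}]
driver_ratings = [{"driver_id": 1, "avg_rating": 4.9}, {"driver_id": 2, "avg_rating": 4.2}]
driver_activity = [{"driver_id": 1, "trips_7d": 31}]

training_rows = join_feature_tables(labels, [driver_ratings, driver_activity], key="driver_id")
for row in training_rows:
    print(row)
```

The result is one wide row per labeled entity, which is the shape a training job typically consumes. A production system would do the same join in a distributed engine rather than in memory.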
That's fantastic. I'm curious, if you were going to speak to the CEO of a major company in the world, and you had to tell him what the benefits were of utilizing a centralized feature store, how would you abstract away from some of the technical details that we love to focus upon in terms of how a feature store makes machine learning better? How do you think a feature store affects a business?
Oh, a feature store is critical to a business, especially as machine learning has evolved and the ML footprint of most businesses has gone above and beyond, especially in the last couple of years. As data is increasing, the world is constantly changing. There's so much entropy and complexity in the data that there's a huge need for centralizing the signals that are the key ingredients for any machine learning model, and making those signals available to the rest of the company. It becomes critical as the business grows: you'll see your machine learning workflows get very complicated and intertwined if you don't have proper centralization of the key ingredients of your models, which are the features.
Perfect. How would you distinguish a feature store from the traditional projects that a company might be very familiar with and have already initiated, like data warehouse projects or data lake projects?
So a feature store is an abstraction on top of data warehouses, and data warehouses are the place where your raw data materializes on a day-to-day basis. Feature stores sit on top of that. One of the key differentiators between a data warehouse and a feature store is that a feature store encompasses both aspects of machine learning, which are the training and evaluation phase of the model and the serving phase of the model. These two environments, the training and the serving environment, are vastly different. They have different SLAs, different latency requirements, different scale requirements, and are typically engineered in a very different way. So a feature store sort of provides you a common abstraction on top of the training and the serving environments, which makes ML development much easier for a data scientist, because they don't have to worry about the differences between training and serving and the engineering details of the two environments.
Excellent. So the feature store has to exhibit very different computational capabilities: it needs to be able to perform very high-throughput, scaled analytical computations for the training and evaluation capabilities that you mentioned, as well as meet the very low latency requirements of using a model in an application where it can make decisions, perhaps in milliseconds, at least in seconds. Those are two very different kinds of computational workloads. What this brings me to is a question: what tools do you use to be able to support those multiple workloads in Palette?
That's a great question. There's work outside in the open source ecosystem around common SQL and other similar technologies which try to bridge this gap between training and serving. But in Palette, the fundamental way we do it is by essentially using Spark and Parquet. Our entire training ecosystem is based on Spark. So all our training jobs are Spark jobs, and even the feature join jobs, which happen during training, are based on Spark. What we do is take the offline computations of the features: we literally take the Parquet representation of the feature tables and make those available to an online store like Redis or Cassandra. And we do that at a user-defined cadence. So that allows us to literally transport the offline computations and make them available in a low latency online serving environment, which we then serve through a low latency online API.
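The offline-to-online hand-off described here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Palette's implementation: a list of dicts stands in for a Parquet feature table, a plain Python dict stands in for an online store such as Redis or Cassandra, and the names `publish_snapshot`, `get_online_features`, and `driver_stats` are hypothetical. The user-defined cadence would come from an external scheduler calling `publish_snapshot` periodically; that part is not shown.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
# Assumption: an in-memory dict stands in for the online key-value store.
OnlineStore = Dict[Tuple[str, str], Row]


def publish_snapshot(offline_rows: List[Row], store: OnlineStore, feature_group: str, key: str) -> int:
    """Copy the latest offline feature values into the online store, keyed by entity."""
    written = 0
    for row in offline_rows:
        features = {col: val for col, val in row.items() if col != key}
        # Each publish overwrites the previous value for the same entity.
        store[(feature_group, str(row[key]))] = features
        written += 1
    return written


def get_online_features(store: OnlineStore, feature_group: str, entity_id: Any) -> Row:
    """Low-latency lookup path: a single key read, no joins or scans."""
    return store.get((feature_group, str(entity_id)), {})


store: OnlineStore = {}

# Hypothetical snapshots, as two successive runs of an offline job might produce them.
snapshot_day_1 = [{"driver_id": 1, "trips_7d": 31}, {"driver_id": 2, "trips_7d": 12}]
snapshot_day_2 = [{"driver_id": 1, "trips_7d": 35}]

publish_snapshot(snapshot_day_1, store, "driver_stats", key="driver_id")
publish_snapshot(snapshot_day_2, store, "driver_stats", key="driver_id")

print(get_online_features(store, "driver_stats", 1))
print(get_online_features(store, "driver_stats", 2))
```

The design point is that the expensive computation happens offline, and the serving path reduces to a key lookup. Entities absent from a newer snapshot keep their last published value in this sketch; whether stale values should expire is a policy choice the source does not discuss.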
Excellent. I think this separation of workloads is exactly why we started our company at Splice Machine and watching what you were doing with Michelangelo, and trying to make it a lot easier to bridge this Spark-based analytical computation with this very low latency serving. So I appreciate the combining of these tools. I wonder, what do you see as the next big breakthrough in feature stores?
That's a great question. A lot of it is still to be seen, but my personal view is that feature stores will evolve more and more. Number one, to support different kinds of features: there are near real-time features, there are super real-time features. All kinds of features are used in machine learning models these days, so feature stores will evolve to encapsulate more kinds of features. There's also the entire world of unstructured data, which is a much bigger black box than tabular data. I see feature stores evolving to support unstructured data as well. There's also a lot of work happening around feature search and discovery, which I'm very passionate about: figuring out the right features for the right models, using the mutual information between feature and model usage, and also mutual information across features, to eliminate the noise and really hone in on the golden set of features for your models. So that's the evolution of feature stores that I see in the shorter to medium term.
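As a rough illustration of how mutual information can separate signal from noise, the sketch below computes it for discrete values using the standard definition, I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y))). Scoring features against a training label is an assumption made here for simplicity; the interview refers to mutual information between features and model usage, and does not describe how Palette computes it. All feature names and values are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Mutual information in bits between two equally long discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        # p(x,y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        mi += p_joint * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: a binary label and three candidate features.
label = [1, 1, 0, 0]
features = {
    "informative": [1, 1, 0, 0],     # tracks the label exactly
    "noise": [1, 0, 1, 0],           # independent of the label
    "duplicate": ["a", "a", "b", "b"],  # same information as "informative"
}

# Rank features by how much they tell us about the label.
ranking = sorted(features, key=lambda name: mutual_information(features[name], label), reverse=True)
print(ranking)

# Mutual information across features flags redundancy between candidates.
redundancy = mutual_information(features["informative"], features["duplicate"])
print(redundancy)
```

A feature with near-zero mutual information against the target is a candidate for removal, and two features with high mutual information against each other are candidates for deduplication. Continuous features would need binning or a dedicated estimator, which this sketch leaves out.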
That's wonderful. I've often heard machine learning engineers and data scientists say that unstructured data that is being used in say deep learning models doesn't require a feature store because the features emerge naturally, as part of a deep learning network, perhaps in a transformer or something like that. Do you subscribe to that point of view?