Do you really need a Feature Store? | by YUNNA WEI | Mar, 2023



Feature Store — the interface between raw data and ML models

The feature store has been around for a few years. There are both open-source solutions (such as Feast and Hopsworks) and commercial offerings (such as Tecton, Hopsworks, and Databricks Feature Store). A lot of articles and blogs have been published on what a feature store is and why it is valuable, and some organizations have already adopted one as part of their ML applications. However, it is worth pointing out that a feature store is another component added to your overall ML infrastructure, one that requires extra investment and effort to build and operate. It is therefore necessary to truly understand and ask: is a feature store really necessary for every organization? In my opinion, the answer is, as usual, it depends.

The focus of today’s article is therefore to analyze when a feature store is needed, so that organizations can wisely invest effort and resources in ML technologies that actually add value to their business.

To answer this question, below are some critical considerations:

  • What kind of features do your ML applications need?
  • What type of ML applications does your organization manage?
  • Is there a need to share and reuse features among various teams in your organization?
  • Is training-serving skew often an issue that negatively impacts ML model performance?

Beyond answering the above questions, I will also explain the role of the feature store in an end-to-end ML lifecycle, in case you conclude that a feature store is necessary for your organization’s ML infrastructure.

Let’s dive into each of these considerations in detail.

Photo by Trnava University on Unsplash

What kind of features do your ML applications need?

Features for ML applications can be roughly divided into the following categories:

  • Batch features — features that remain largely static over time, such as a customer’s metadata: education, gender, age, and so on. Batch features generally describe a key business entity, such as a customer, product, or supplier. The input data sources for batch features are often data warehouses and data lakes.
  • Streaming features — unlike batch features, streaming features must be updated continuously and with low latency, for example the number of transactions a user has made in the last 30 minutes. Streaming features are generally computed by streaming engines such as Spark Structured Streaming or Apache Flink and pushed directly into an online feature store for low-latency serving (see the sketch after this list). The input data sources for streaming features are message stores such as Kafka, Kinesis, and Event Hubs.
  • Advanced features combining batch and streaming — features that require joining streaming data with static data to generate a new feature for ML models to learn from. This type of feature is also computed by streaming engines, as it likewise requires low latency. The only difference from a plain streaming feature is the join with another data source.
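
To make the streaming case concrete, here is a minimal sketch of how the 30-minute transaction count above could be computed with Spark Structured Streaming. The Kafka topic, broker address, payload schema, and column names are all assumptions for illustration, and the console sink stands in for a writer to your online store.

    # A minimal sketch, assuming a Kafka topic "transactions" whose JSON payload
    # carries a user_id and an event_time. All names here are hypothetical; the
    # console sink stands in for an online-store writer.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-features").getOrCreate()

    # Read raw transaction events from Kafka.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
    )

    # Parse the JSON payload into typed columns.
    events = raw.select(
        F.get_json_object(F.col("value").cast("string"), "$.user_id").alias("user_id"),
        F.get_json_object(F.col("value").cast("string"), "$.event_time")
        .cast("timestamp")
        .alias("event_time"),
    )

    # Streaming feature: transactions per user over a sliding 30-minute window,
    # refreshed every minute. The watermark bounds how late events may arrive.
    txn_count_30m = (
        events.withWatermark("event_time", "30 minutes")
        .groupBy(F.window("event_time", "30 minutes", "1 minute"), "user_id")
        .count()
        .withColumnRenamed("count", "txn_count_30m")
    )

    # Emit each update towards the online store (console as a placeholder).
    query = (
        txn_count_30m.writeStream
        .outputMode("update")
        .format("console")
        .start()
    )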

If your ML applications require lots of streaming features that must be served at very low latency, an online feature store can add significant value: one of the key functions of a feature store is to let you pre-compute these streaming features instead of computing them at model-serving time, where on-the-fly computation can slow down serving significantly.

What type of ML applications does your organization manage?

The second consideration is to be clear on the type of ML applications your organization manages, as each type requires quite a different ML infrastructure.

I categorize ML applications into the following categories:

  • The first category is batch feature engineering + batch inference: feature engineering, model training, and model serving are all conducted at a fixed interval. There is no need for streaming features, and the model-serving latency requirement is not very strict either. In this case, you do not need an online feature store or a streaming engine to pre-compute the features, as you have enough time to compute them on demand.
  • The second category is batch training + online inference (with both batch and streaming features): ML models are trained in batch, but the model is generally wrapped as an API and served online. Here, two considerations decide whether a feature store is required: the serving-latency requirement, and the number of features that must be computed on the fly. If the latency requirement is very strict and quite a few features must be computed within it, you very likely need a feature store to pre-compute those features, so that at serving time you fetch them from the online feature store instead of computing them on the fly (see the retrieval sketch after this list). The online store is a database that holds only the latest feature values for each entity, for example Redis, DynamoDB, or PostgreSQL. Conversely, if the serving-latency requirement is relaxed and the number of features required at serving time is small, you can probably still afford to compute the features on the fly, and an online feature store is not strictly needed.
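
To illustrate the serving path, below is a minimal sketch of fetching a pre-computed feature vector from Redis at request time. The key layout (a features:<entity_id> hash per entity), the feature names, and the connection settings are assumptions for this sketch, not a prescribed convention.

    # A minimal sketch of online feature retrieval, assuming features are stored
    # as Redis hashes keyed "features:<entity_id>". Key layout and feature names
    # are hypothetical.
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def get_feature_vector(user_id: str, feature_names: list[str]) -> list[float]:
        """Fetch the latest pre-computed feature values for one entity."""
        values = r.hmget(f"features:{user_id}", feature_names)
        # Missing features default to 0.0 here; production code would need an
        # explicit policy for absent or stale values.
        return [float(v) if v is not None else 0.0 for v in values]

    # At request time, fetch the features instead of computing them on the fly:
    vector = get_feature_vector("user_42", ["txn_count_30m", "age"])
    # prediction = model.predict([vector])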

Based on my experience, ML applications that require streaming features and extremely low-latency serving are generally operational ML applications, such as fraud detection, recommendation, dynamic pricing, and search. For these applications, the function of a feature store is to decouple feature calculation from feature consumption, so that complex feature-engineering logic does not have to run on demand.

Is there a need to share and reuse features among various teams in your organization?

The third consideration is whether there is a need to share and reuse features among various teams in your organization.

One of the key functions of a feature store is a centralized feature registry, where users can persist feature definitions and relevant metadata about the features. Users can discover registered features by interacting with the registry, which acts as the single source of truth for all ML features in an organization.
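
As a concrete example, here is a minimal sketch of how feature definitions and their metadata might be declared with Feast, one of the open-source options mentioned earlier. The entity, source path, and field names are hypothetical; running feast apply persists these definitions to the registry.

    # A minimal Feast sketch: an entity and a feature view whose definitions and
    # metadata are persisted to the central registry via `feast apply`.
    # Entity, path, and field names are hypothetical.
    from datetime import timedelta

    from feast import Entity, FeatureView, Field, FileSource
    from feast.types import Float32, Int64

    customer = Entity(name="customer", join_keys=["customer_id"])

    customer_source = FileSource(
        path="data/customer_features.parquet",
        timestamp_field="event_timestamp",
    )

    customer_features = FeatureView(
        name="customer_features",
        entities=[customer],
        ttl=timedelta(days=1),  # how long feature values remain valid
        schema=[
            Field(name="age", dtype=Int64),
            Field(name="avg_order_value", dtype=Float32),
        ],
        source=customer_source,
    )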

For organizations with multiple data science teams, particularly where these teams are likely to duplicate effort producing similar features, a centralized feature store that allows teams to publish, share, and reuse ML features can significantly improve collaboration and productivity. Building and maintaining the data engineering pipelines that curate features for ML applications generally takes a significant amount of engineering effort; if one team can reuse features already curated by another, it avoids duplicated work and saves a great deal of engineering time.

Additionally, a feature store gives enterprises a mechanism to govern the use of ML features, which are among the most highly curated and refined data assets in a business.

Is training-serving skew often an issue that negatively impacts ML model performance?

The next consideration is training-serving skew, an issue that often negatively impacts ML model performance. Training-serving skew is the situation where an ML model deployed in production performs worse than the one data scientists developed and tested in their local notebook environment. The key reason is that the feature engineering logic in the production environment is implemented differently (perhaps only slightly differently) from the original feature engineering logic the data scientists created and used in their notebooks.

A feature store can fix training-serving skew by providing a consistent feature interface, so that both model training and model serving use the same feature engineering implementation, as shown in the chart below.

If training-serving skew is a common reason your ML applications perform worse than expected in production, a feature store can come to the rescue.

Training-Serving Skew | Image by Author
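
One common way to achieve this consistency, sketched below, is to define the feature logic once in a shared module and import that same function from both the training and the serving pipeline. Column and function names are hypothetical.

    # A minimal sketch of a single feature implementation shared by training
    # and serving. Column and function names are hypothetical.
    import numpy as np
    import pandas as pd

    def add_transaction_features(df: pd.DataFrame) -> pd.DataFrame:
        """The one and only definition of this feature logic."""
        out = df.copy()
        out["amount_log"] = np.log1p(out["amount"])
        out["is_weekend"] = (out["event_time"].dt.dayofweek >= 5).astype(int)
        return out

    # Training pipeline:
    #   train_df = add_transaction_features(raw_history_df)
    # The serving pipeline imports the very same function, so the two
    # environments cannot drift apart:
    #   request_df = add_transaction_features(raw_request_df)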

So, where does a feature store fit in an end-to-end ML lifecycle?

Based on the above analysis, if you have decided that a feature store is useful for your ML applications and you are going to include it as a new component of your ML infrastructure, here is how a feature store is used across an end-to-end ML lifecycle.

  • Feature definition — data scientists define the required features from the raw data. A feature definition includes the source data, feature entities, feature name, feature schema, feature metadata, and time-to-live (TTL), as in the Feast sketch earlier.
  • Feature retrieval for ML model training — most feature store solutions provide functions that let data scientists construct a training dataset from defined features. A single training dataset may need to draw features from multiple feature tables.
  • Feature retrieval for ML model serving — there are two types of ML model serving: batch scoring and real-time prediction. Getting features for batch scoring is similar to getting features for a training dataset; the only difference is that batch-scoring features are retrieved as of the most recent timestamp. Fetching features for a real-time prediction means getting a feature vector for a particular request, and that vector is generally very small, as it contains only the latest feature values for the requested entity. Both retrieval paths are sketched below.
Using a feature store in the ML lifecycle | Image by Author
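
Continuing the hypothetical Feast example from earlier, here is a minimal sketch of both retrieval paths: a point-in-time-correct training dataset drawn from the offline store, and a low-latency lookup from the online store.

    # A minimal Feast retrieval sketch, reusing the hypothetical
    # customer_features view defined earlier.
    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")

    # Training: join features onto labeled events, point-in-time correct.
    entity_df = pd.DataFrame(
        {
            "customer_id": [1001, 1002],
            "event_timestamp": pd.to_datetime(["2023-03-01", "2023-03-02"]),
        }
    )
    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "customer_features:age",
            "customer_features:avg_order_value",
        ],
    ).to_df()

    # Serving: fetch the latest feature vector for a single entity.
    feature_vector = store.get_online_features(
        features=[
            "customer_features:age",
            "customer_features:avg_order_value",
        ],
        entity_rows=[{"customer_id": 1001}],
    ).to_dict()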

Summary

If you are rolling out real-time prediction use cases that require lots of streaming features, a feature store can help you meet low-latency serving requirements by decoupling feature computation from feature serving.

If your organization’s data science teams have expanded quickly and there is a need to share and reuse work among various ML teams, a feature store serves as a central registry for publishing and reusing features.

I hope this article serves as guidance for deciding whether a feature store is really needed in your organization. Please feel free to leave a comment if you have any questions. I generally publish one article related to building an efficient data and AI stack every week. Feel free to follow me on Medium so that you get notified when these articles are published.

If you want to see more guides, deep dives, and insights around modern and efficient data+AI stack, please subscribe to my free newsletter — Efficient Data+AI Stack, thanks!

Note: Just in case you haven’t become a Medium member yet, and you really should, as you’ll get unlimited access to Medium, you can sign up using my referral link!

Thanks so much for your support!

