Feature stores for machine learning (ML) are a new class of data platform for the organization, governance, and sharing of features within enterprises. A typical feature store is a dual-database architecture: pre-computed features for training are stored in a scalable SQL platform (Delta Lake, Apache Hudi, Apache Hive), while features served to online applications are stored in a low-latency database or key-value store (MySQL Cluster (NDB), Cassandra, or Redis). Feature stores, however, do not provide a solution for real-time features (such as user-entered or machine-generated data) that cannot be pre-computed or cached. If the feature engineering code that transforms raw data into features is embedded in applications, it may need to be duplicated outside the application in the pipelines that generate training data, making it hard to keep training and serving consistent.
2. Feature stores
▪ Repository for curated ML features ready to be used
▪ Ensure consistency between features used for training and features
used for serving
▪ Centralized place to collect:
▪ Metadata
▪ Statistics
▪ Labels/Tags
▪ Spark Summit 2020 talk:
https://databricks.com/session_na20/building-a-feature-store-around-dataframes-and-apache-spark
3. Real-Time Feature Engineering
▪ Data arrives at the clients making inference requests
▪ Features cannot be pre-computed and cached in the online feature store
▪ Data needs to be featurized before being sent to the model for prediction
▪ One-hot encoding
▪ Normalization and scaling of numerical features
▪ Window aggregates
▪ Real-time features need to be augmented using the feature store
▪ Not all features are provided by the client
▪ Construct the feature vector with features retrieved from the online feature store
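To make these transformations concrete, here is an illustrative sketch (not Hof code) of featurizing a raw client payload with one-hot encoding and min-max scaling; the feature names and training-time statistics are invented:

import pandas as pd

# Raw data as it arrives with the inference request (invented payload)
raw = pd.DataFrame([{"country": "SE", "amount": 120.0}])

# One-hot encode the categorical feature
features = pd.get_dummies(raw, columns=["country"])

# Min-max scale the numerical feature; in practice the min/max come
# from statistics computed over the training data (assumed values here)
AMOUNT_MIN, AMOUNT_MAX = 0.0, 1000.0
features["amount"] = (features["amount"] - AMOUNT_MIN) / (AMOUNT_MAX - AMOUNT_MIN)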
4. Real-Time Requirements
▪ Hide complexity from clients
▪ Strict response-time SLAs
▪ Use cases are usually user-facing
▪ Avoid feature engineering in the client
▪ Feature engineering needs to be implemented for each client using the model
▪ Hard to maintain consistency between training and inference
5. Approach 1: Preprocessing with tf.Transform
▪ Write feature engineering in preprocessing_fn
▪ Transformation is specific to a model
▪ Hard to reuse / keep track of transformations at scale
▪ No support for window aggregations
▪ Doesn’t scale with number of features/requests
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    # Analyzers like tft.mean and the vocabulary are computed over the
    # full training dataset, not per request
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = x_centered * y_normalized
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
        's_integerized': s_integerized,
    }
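For completeness, a preprocessing_fn like the one above is applied with the tf.Transform Beam APIs; a minimal sketch following the tf.Transform documentation (the inline dataset is invented):

import tempfile

import tensorflow as tf
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [{'x': 1.0, 'y': 1.0, 's': 'hello'},
            {'x': 2.0, 'y': 2.0, 's': 'world'}]
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32),
        'y': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

# Analyzers (tft.mean, the vocabulary) require a full pass over the
# dataset, which is why this runs as a batch Beam job
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))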
6. Approach 2: KFServing Transformer
▪ Deployed as a separate service
▪ Duplicated feature engineering code
▪ No support for window aggregations
▪ No support for feature enrichment from the online feature store
▪ Not easily extended to save featurized data
import kfserving

class ImageTransformer(kfserving.KFModel):
    def __init__(self, name, predictor_host):
        super().__init__(name)
        self.predictor_host = predictor_host

    def preprocess(self, inputs):
        # image_transform is the model-specific feature engineering
        # function, defined elsewhere in the example
        return {'instances': [image_transform(instance)
                              for instance in inputs['instances']]}

    def postprocess(self, inputs):
        return inputs
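Such a transformer runs as a separate service in front of the predictor. A minimal sketch of starting it with the KFServing Python SDK (the model name and predictor host are invented):

import kfserving

if __name__ == "__main__":
    transformer = ImageTransformer(
        "my-model", predictor_host="my-model-predictor.default.svc.cluster.local")
    kfserving.KFServer().start(models=[transformer])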
7. Approach 3*: Hof
▪ Independent of the model
▪ Pandas UDFs and Spark 3 to scale feature engineering
▪ First-class support for online feature store integration
▪ Pluggable to save requests and inference vectors
*Third time’s a charm
*Third time lucky
*Great things come in threes
8. Hof
▪ gRPC/HTTP endpoint to submit feature engineering requests
▪ Mostly stateless
▪ Forward request to a message queue (Kafka)
▪ Messages are consumed/processed by Spark Streaming application(s)
▪ Messages are sent back on another queue
▪ Response is forwarded back to the user
9. Hof architecture
Message queue setup
▪ One input topic
▪ N output topics
▪ N is the number of Hof instances running
▪ Message:
▪ Key: the topic to send the response back on
▪ Value: the data to be feature engineered
▪ Topic lifecycle is managed automatically
▪ Hof instances talk to the Hopsworks REST APIs to create/destroy topics
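A minimal sketch of the round trip described on the last two slides, written with kafka-python; the topic names, JSON envelope, and endpoint wiring are assumptions, not Hof's actual code:

import json
import uuid

from kafka import KafkaConsumer, KafkaProducer

INPUT_TOPIC = "hof-requests"            # the single input topic (assumed name)
REPLY_TOPIC = "hof-replies-instance-0"  # this instance's output topic (assumed name)

producer = KafkaProducer(bootstrap_servers="broker:9092",
                         key_serializer=str.encode,
                         value_serializer=lambda v: json.dumps(v).encode())
consumer = KafkaConsumer(REPLY_TOPIC, bootstrap_servers="broker:9092",
                         value_deserializer=json.loads)

def featurize(raw_request):
    """Forward raw data to the input topic and block until the Spark
    Streaming application publishes the featurized result."""
    request_id = str(uuid.uuid4())
    # The key names the reply topic, so the consumer knows where to answer
    producer.send(INPUT_TOPIC, key=REPLY_TOPIC,
                  value={"id": request_id, "data": raw_request})
    producer.flush()
    for msg in consumer:
        if msg.value.get("id") == request_id:
            return msg.value["features"]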
10. Hof architecture
Spark Application setup
▪ Hof does not enforce the schema in the request:
▪ Avoid additional deserialization
▪ If requests are self-contained, multiple Spark applications can run in parallel
▪ Increase availability and throughput
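On the processing side, this setup maps naturally onto Spark Structured Streaming's Kafka source and sink; a sketch under the same assumed topic names, with the feature engineering step elided:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-feature-engineering").getOrCreate()

# Read raw requests; since Hof does not enforce a schema, the value
# column stays as opaque bytes until the feature functions need it
requests = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "hof-requests")
            .load())

# ... feature engineering on the value column would go here ...

# The Kafka sink routes each row to the topic named in its "topic"
# column, i.e. the reply topic carried in the request key
responses = requests.selectExpr("CAST(key AS STRING) AS topic", "value")
query = (responses.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("checkpointLocation", "/tmp/hof-checkpoint")
         .start())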
12. Hof architecture
Addons
▪ Additional Spark applications can be plugged in
▪ Save incoming data on HopsFS/S3:
▪ Make it available for future feature engineering
▪ Save feature engineering output:
▪ Auditing
▪ Model training
▪ Detect skews in incoming data
▪ Trigger alerts and model re-training
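As one example of such an add-on, a second, independent Spark application could subscribe to the same input topic and archive the raw requests; a sketch with assumed paths and topic names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-archive-addon").getOrCreate()

# Archive incoming requests for future feature engineering, auditing,
# and skew detection (the path and topic name are assumptions)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "hof-requests")
       .load()
       .selectExpr("CAST(value AS STRING) AS request", "timestamp"))

(raw.writeStream.format("parquet")
    .option("path", "hdfs:///Projects/demo/Resources/hof_raw_requests")
    .option("checkpointLocation", "/tmp/hof-archive-checkpoint")
    .start())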
14. Application code
# Feature group definition
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

def stream_function(df):
    # aggregations
    return df

fg = fs.create_streaming_feature_group("example_streaming", version=1)
fg.save(stream_function)

# Processing
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
fg = fs.get_streaming_feature_group("example_streaming", version=1)
fg.apply()
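A stream_function would typically use pandas UDFs, as slide 7 notes, for scalable aggregations; a sketch with a Spark 3 series-to-scalar pandas UDF (the column names and aggregation are invented):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Series-to-scalar pandas UDF: averages a numeric column per group
@pandas_udf("double")
def avg_amount(amount: pd.Series) -> float:
    return amount.mean()

def stream_function(df):
    # Aggregations run at scale inside Spark rather than in the client
    return df.groupBy("customer_id").agg(
        avg_amount(df["amount"]).alias("avg_amount"))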
15. Hof architecture
Streaming + Online Feature Store
▪ Not all of the inference vector has to be computed in real time
▪ Features can be fetched from the online feature store
▪ Features are referenced using the training dataset
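A sketch of that enrichment step with the hsfs client, where the training dataset defines which features the model expects and missing ones are fetched by primary key from the online feature store (the names are invented and the exact API depends on the hsfs version):

import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
td = fs.get_training_dataset("example_model", version=1)

# Features not supplied by the client are looked up by primary key
# in the online feature store
feature_vector = td.get_serving_vector({"customer_id": 42})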