Feature stores for machine learning (ML) are a new class of data platform for the organization, governance, and sharing of features within enterprises. A typical feature store is a dual-database architecture: pre-computed features for training are stored in a scalable SQL platform (Delta Lake, Apache Hudi, Apache Hive), while features served to online applications are stored in a low-latency database or key-value store (MySQL Cluster (NDB), Cassandra, or Redis). Feature stores, however, do not provide a solution for real-time features (such as user-entered or machine-generated data) that cannot be pre-computed or cached. If the feature engineering code that transforms raw data into features is embedded in applications, it may need to be duplicated outside the application in the pipelines that generate training data, making it hard to keep training and serving consistent.
2. Feature stores
▪ Repository for curated ML features ready to be used
▪ Ensure consistency between features used for training and features
used for serving
▪ Centralized place to collect:
▪ Metadata
▪ Statistics
▪ Labels/Tags
▪ Spark Summit 2020 talk:
https://databricks.com/session_na20/building-a-feature-store-around-dataframes-and-apache-spark
3. Real-Time Feature Engineering
▪ Data arrives at the clients making inference requests
▪ Features cannot be pre-computed and cached in the online feature store
▪ Data needs to be featurized before being sent to the model for prediction
▪ One-hot encoding
▪ Normalization and scaling of numerical features
▪ Window aggregates
▪ Real-time features need to be augmented using the feature store
▪ Not all features are provided by the client
▪ Construct the feature vector with features retrieved from the online feature store
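To make these transformations concrete, here is an illustrative sketch (not Hof code) of featurizing a raw client payload with one-hot encoding and min-max scaling; the feature names and training-time statistics are invented:

import pandas as pd

# Raw data as it arrives with the inference request (invented payload)
raw = pd.DataFrame([{"country": "SE", "amount": 120.0}])

# One-hot encode the categorical feature
features = pd.get_dummies(raw, columns=["country"])

# Min-max scale the numerical feature; in practice the min/max come
# from statistics computed over the training data (assumed values here)
AMOUNT_MIN, AMOUNT_MAX = 0.0, 1000.0
features["amount"] = (features["amount"] - AMOUNT_MIN) / (AMOUNT_MAX - AMOUNT_MIN)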
4. Real-Time Requirements
▪ Hide complexity from clients
▪ Strict response-time SLAs
▪ Use cases are usually user-facing
▪ Avoid feature engineering in the client
▪ Feature engineering needs to be implemented for each client using the model
▪ Hard to maintain consistency between training and inference
5. Approach 1: Preprocessing with tf.Transform
▪ Write feature engineering in preprocessing_fn
▪ Transformation is specific to a model
▪ Hard to reuse / keep track of transformations at scale
▪ No support for window aggregations
▪ Doesn’t scale with number of features/requests
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    # Analyzers like tft.mean and the vocabulary are computed over the
    # full training dataset, not per request
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = x_centered * y_normalized
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
        's_integerized': s_integerized,
    }
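For completeness, a preprocessing_fn like the one above is applied with the tf.Transform Beam APIs; a minimal sketch following the tf.Transform documentation (the inline dataset is invented):

import tempfile

import tensorflow as tf
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [{'x': 1.0, 'y': 1.0, 's': 'hello'},
            {'x': 2.0, 'y': 2.0, 's': 'world'}]
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32),
        'y': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

# Analyzers (tft.mean, the vocabulary) require a full pass over the
# dataset, which is why this runs as a batch Beam job
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))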
6. Approach 2: KFServing Transformer
▪ Deployed as a separate service
▪ Duplicated feature engineering code
▪ No support for window aggregations
▪ No support for feature enrichment from the online feature store
▪ Not easily extended to save featurized data
import kfserving

class ImageTransformer(kfserving.KFModel):
    def __init__(self, name, predictor_host):
        super().__init__(name)
        self.predictor_host = predictor_host

    def preprocess(self, inputs):
        # image_transform is the model-specific feature engineering
        # function, defined elsewhere in the example
        return {'instances': [image_transform(instance)
                              for instance in inputs['instances']]}

    def postprocess(self, inputs):
        return inputs
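Such a transformer runs as a separate service in front of the predictor. A minimal sketch of starting it with the KFServing Python SDK (the model name and predictor host are invented):

import kfserving

if __name__ == "__main__":
    transformer = ImageTransformer(
        "my-model", predictor_host="my-model-predictor.default.svc.cluster.local")
    kfserving.KFServer().start(models=[transformer])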
7. Approach 3*: Hof
▪ Independent of the model
▪ Pandas UDFs and Spark 3 to scale feature engineering
▪ First-class support for online feature store integration
▪ Pluggable to save requests and inference vectors
*Third time’s a charm
*Third time lucky
*Great things come in threes
8. Hof
▪ gRPC/HTTP endpoint to submit feature engineering requests
▪ Mostly stateless
▪ Forward request to a message queue (Kafka)
▪ Messages are consumed/processed by Spark Streaming application(s)
▪ Messages are sent back on another queue
▪ Response is forwarded back to the user
9. Hof architecture
Message queue setup
▪ One input topic
▪ N output topics
▪ N is the number of Hof instances running
▪ Message:
▪ Key: the topic to send the response back on
▪ Value: the data to be feature engineered
▪ Topic lifecycle is managed automatically
▪ Hof instances talk to the Hopsworks REST APIs to create/destroy topics
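A minimal sketch of the round trip described on the last two slides, written with kafka-python; the topic names, JSON envelope, and endpoint wiring are assumptions, not Hof's actual code:

import json
import uuid

from kafka import KafkaConsumer, KafkaProducer

INPUT_TOPIC = "hof-requests"            # the single input topic (assumed name)
REPLY_TOPIC = "hof-replies-instance-0"  # this instance's output topic (assumed name)

producer = KafkaProducer(bootstrap_servers="broker:9092",
                         key_serializer=str.encode,
                         value_serializer=lambda v: json.dumps(v).encode())
consumer = KafkaConsumer(REPLY_TOPIC, bootstrap_servers="broker:9092",
                         value_deserializer=json.loads)

def featurize(raw_request):
    """Forward raw data to the input topic and block until the Spark
    Streaming application publishes the featurized result."""
    request_id = str(uuid.uuid4())
    # The key names the reply topic, so the consumer knows where to answer
    producer.send(INPUT_TOPIC, key=REPLY_TOPIC,
                  value={"id": request_id, "data": raw_request})
    producer.flush()
    for msg in consumer:
        if msg.value.get("id") == request_id:
            return msg.value["features"]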
10. Hof architecture
Spark Application setup
▪ Hof does not enforce the schema in the request:
▪ Avoid additional deserialization
▪ If requests are self-contained, multiple Spark applications can run in parallel
▪ Increase availability and throughput
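On the processing side, this setup maps naturally onto Spark Structured Streaming's Kafka source and sink; a sketch under the same assumed topic names, with the feature engineering step elided:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-feature-engineering").getOrCreate()

# Read raw requests; since Hof does not enforce a schema, the value
# column stays as opaque bytes until the feature functions need it
requests = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "hof-requests")
            .load())

# ... feature engineering on the value column would go here ...

# The Kafka sink routes each row to the topic named in its "topic"
# column, i.e. the reply topic carried in the request key
responses = requests.selectExpr("CAST(key AS STRING) AS topic", "value")
query = (responses.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("checkpointLocation", "/tmp/hof-checkpoint")
         .start())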
12. Hof architecture
Addons
▪ Additional Spark applications can be plugged in
▪ Save incoming data on HopsFS/S3:
▪ Make it available for future feature engineering
▪ Save feature engineering output:
▪ Auditing
▪ Model training
▪ Detect skews in incoming data
▪ Trigger alerts and model re-training
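As one example of such an add-on, a second, independent Spark application could subscribe to the same input topic and archive the raw requests; a sketch with assumed paths and topic names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-archive-addon").getOrCreate()

# Archive incoming requests for future feature engineering, auditing,
# and skew detection (the path and topic name are assumptions)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "hof-requests")
       .load()
       .selectExpr("CAST(value AS STRING) AS request", "timestamp"))

(raw.writeStream.format("parquet")
    .option("path", "hdfs:///Projects/demo/Resources/hof_raw_requests")
    .option("checkpointLocation", "/tmp/hof-archive-checkpoint")
    .start())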
14. Application code
# Feature group definition
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

def stream_function(df):
    # aggregations
    return df

fg = fs.create_streaming_feature_group("example_streaming", version=1)
fg.save(stream_function)

# Processing
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
fg = fs.get_streaming_feature_group("example_streaming", version=1)
fg.apply()
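A stream_function would typically use pandas UDFs, as slide 7 notes, for scalable aggregations; a sketch with a Spark 3 series-to-scalar pandas UDF (the column names and aggregation are invented):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Series-to-scalar pandas UDF: averages a numeric column per group
@pandas_udf("double")
def avg_amount(amount: pd.Series) -> float:
    return amount.mean()

def stream_function(df):
    # Aggregations run at scale inside Spark rather than in the client
    return df.groupBy("customer_id").agg(
        avg_amount(df["amount"]).alias("avg_amount"))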
15. Hof architecture
Streaming + Online Feature Store
▪ Not all of the inference vector has to be computed in real time
▪ Features can be fetched from the online feature store
▪ Features are referenced using the training dataset
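A sketch of that enrichment step with the hsfs client, where the training dataset defines which features the model expects and missing ones are fetched by primary key from the online feature store (the names are invented and the exact API depends on the hsfs version):

import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
td = fs.get_training_dataset("example_model", version=1)

# Features not supplied by the client are looked up by primary key
# in the online feature store
feature_vector = td.get_serving_vector({"customer_id": 42})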