Real-time Feature Engineering
with Apache Spark Streaming
and Hof
Fabio Buso
Software Engineer @ Logical Clocks AB
Feature stores
▪ Repository for curated ML features ready to be used
▪ Ensure consistency between features used for training and features
used for serving
▪ Centralized place to collect:
▪ Metadata
▪ Statistics
▪ Labels/Tags
▪ Spark Summit 2020 talk:
https://databricks.com/session_na20/building-a-feature-store-around-dataframes-and-apache-spark
Real-Time Feature Engineering
▪ Data arrives at the clients making inference requests
▪ Features cannot be pre-computed and cached in the online feature store
▪ Data needs to be featurized before being sent to the model for prediction
▪ One-hot encoding
▪ Normalization and scaling of numerical features
▪ Window aggregates
▪ Real-time features need to be augmented using the feature store
▪ Not all features are provided by the client
▪ Construct the feature vector using features retrieved from the online feature store
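The featurization steps listed above (one-hot encoding and scaling of a numerical feature) could look roughly like the following minimal sketch. The category list, the value range, and the `payment_type` field are hypothetical stand-ins for statistics and columns that would come from the training data; they are not part of the talk.

```python
from typing import Any, Dict, List

# Hypothetical statistics that would be computed on the training data.
CATEGORIES: List[str] = ["card", "cash", "transfer"]
AMOUNT_MIN, AMOUNT_MAX = 0.0, 1000.0


def one_hot(value: str, categories: List[str]) -> List[int]:
    """Return a one-hot vector; unknown values map to all zeros."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Scale value into [0, 1], clipping values outside the training range."""
    clipped = max(lo, min(hi, value))
    return (clipped - lo) / (hi - lo)


def featurize(request: Dict[str, Any]) -> List[float]:
    """Turn a raw request into a numeric feature vector."""
    encoded = one_hot(str(request["payment_type"]), CATEGORIES)
    scaled = min_max_scale(float(request["transaction_amount"]), AMOUNT_MIN, AMOUNT_MAX)
    return [float(v) for v in encoded] + [scaled]


vector = featurize({"payment_type": "cash", "transaction_amount": 145})
print(vector)
```

The same statistics must be used at training and at inference time, which is the consistency problem the feature store is meant to address.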
Real-Time Requirements
▪ Hide complexity from clients
▪ Strict response time SLA
▪ Use-cases are usually user facing
▪ Avoid feature engineering in the client
▪ Feature engineering needs to be implemented for each client using the model
▪ Hard to maintain consistency between training and inference
Approach 1: Preprocessing with tf.Transform
▪ Write feature engineering in preprocessing_fn
▪ Transformation is specific to a model
▪ Hard to reuse / keep track of transformations at scale
▪ No support for window aggregations
▪ Doesn’t scale with number of features/requests

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    # Center x around its mean over the whole dataset
    x_centered = x - tft.mean(x)
    # Scale y into the range [0, 1]
    y_normalized = tft.scale_to_0_1(y)
    # Map each string in s to an integer index
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = x_centered * y_normalized
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
        's_integerized': s_integerized
    }
Approach 2: KFServing Transformer
▪ Deployed as a separate service
▪ Duplicated feature engineering code
▪ No support for window aggregations
▪ No support for feature enrichment from the online feature store
▪ Not easily extended to save featurized data

import kfserving

class ImageTransformer(kfserving.KFModel):
    def __init__(self, name, predictor_host):
        super().__init__(name)
        self.predictor_host = predictor_host

    def preprocess(self, inputs):
        # image_transform is a user-defined function, not shown on the slide
        return {'instances': [image_transform(instance)
                              for instance in inputs['instances']]}

    def postprocess(self, inputs):
        return inputs
Approach 3*: Hof
▪ Independent of the model
▪ Pandas UDFs and Spark 3 to scale feature engineering
▪ First-class support for online feature store integration
▪ Pluggable to save requests and inference vectors
*Third time’s a charm
*Third time lucky
*Great things come in threes
Hof
▪ gRPC/HTTP endpoint to submit feature engineering requests
▪ Mostly stateless
▪ Forward request to a message queue (Kafka)
▪ Messages are consumed/processed by Spark Streaming application(s)
▪ Messages are sent back on another queue
▪ Response is forwarded back to the user
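The request round trip described above can be sketched with in-memory queues. This is only an illustration under stated assumptions: `queue.Queue` stands in for the Kafka topics, a thread stands in for the Spark Streaming application, and the function names and the toy `amount_is_large` feature are hypothetical.

```python
import json
import queue
import threading

input_queue = queue.Queue()   # stands in for the Kafka input topic
output_queue = queue.Queue()  # stands in for the Kafka output topic


def processing_worker() -> None:
    """Stands in for the Spark Streaming application."""
    while True:
        raw = input_queue.get()
        if raw is None:  # shutdown signal used only by this sketch
            break
        data = json.loads(raw)
        data["amount_is_large"] = data["transaction_amount"] > 100  # toy feature
        output_queue.put(json.dumps(data))


def handle_request(payload: dict, timeout: float = 5.0) -> dict:
    """Stands in for the endpoint: forward the request, wait for the reply."""
    input_queue.put(json.dumps(payload))
    return json.loads(output_queue.get(timeout=timeout))


worker = threading.Thread(target=processing_worker, daemon=True)
worker.start()
response = handle_request({"customer_id": 1, "transaction_amount": 145})
input_queue.put(None)
worker.join()
print(response)
```

The endpoint keeps no feature state of its own; it only waits for the reply, which is consistent with the "mostly stateless" description.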
Hof architecture
Message queue setup
▪ One input topic
▪ N output topics
▪ N is the number of Hof instances running
▪ Message:
▪ Key: topic to send the response back to
▪ Value: data to be feature engineered
▪ Topic lifecycle is managed automatically
▪ Hof instances talk to Hopsworks REST APIs to create/destroy topics
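To illustrate the topic layout (one shared input topic, one output topic per Hof instance, reply topic carried in the message key), here is a small in-memory sketch. The topic naming scheme `hof-output-<id>` and the helper names are assumptions made for the example, and plain lists stand in for Kafka topics.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, dict]  # (key = reply topic, value = data to featurize)

INPUT_TOPIC: List[Message] = []                            # the single input topic
OUTPUT_TOPICS: Dict[str, List[dict]] = defaultdict(list)   # one per Hof instance


def submit(instance_id: int, data: dict) -> None:
    """A Hof instance publishes to the shared input topic, keyed by its reply topic."""
    reply_topic = f"hof-output-{instance_id}"  # hypothetical naming scheme
    INPUT_TOPIC.append((reply_topic, data))


def process_all(transform: Callable[[dict], dict]) -> None:
    """The processing application routes each result to the topic named in the key."""
    while INPUT_TOPIC:
        reply_topic, data = INPUT_TOPIC.pop(0)
        OUTPUT_TOPICS[reply_topic].append(transform(data))


submit(0, {"customer_id": 1, "transaction_amount": 145})
submit(1, {"customer_id": 2, "transaction_amount": 20})
submit(0, {"customer_id": 3, "transaction_amount": 60})
process_all(lambda d: {**d, "amount_doubled": d["transaction_amount"] * 2})
print(dict(OUTPUT_TOPICS))
```

Because the reply topic travels with the message, the processing side does not need to know which instances exist; each instance reads only its own output topic.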
Hof architecture
Spark Application setup
▪ Hof does not enforce the schema in the request:
▪ Avoid additional deserialization
▪ If requests are self-contained, multiple Spark applications can run in parallel
▪ Increase availability and throughput
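A rough way to see why self-contained requests allow parallel applications: if processing a message reads nothing but the message itself, splitting the messages across applications gives the same results as processing them in one place. In the sketch below the payload stays as opaque bytes until the processing side parses it; the `process` function and its toy feature are hypothetical.

```python
import json


def process(raw: bytes) -> dict:
    """Parse and featurize one self-contained request; no shared state is read."""
    data = json.loads(raw)
    data["amount_is_large"] = data["transaction_amount"] > 100  # toy feature
    return data


# The forwarding side never parses these bytes; it only passes them along.
raw_requests = [
    json.dumps({"customer_id": i, "transaction_amount": 50 * i}).encode()
    for i in range(1, 5)
]

# Two hypothetical applications each take a disjoint share of the messages.
app_a = [process(r) for r in raw_requests[0::2]]
app_b = [process(r) for r in raw_requests[1::2]]
single = [process(r) for r in raw_requests]

merged = sorted(app_a + app_b, key=lambda d: d["customer_id"])
print(merged == single)
```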
Hof architecture
Add-ons
▪ Additional Spark applications can be plugged in
▪ Save incoming data on HopsFS/S3:
▪ Make it available for future feature engineering
▪ Save feature engineering output:
▪ Auditing
▪ Model training
▪ Detect skews in incoming data
▪ Trigger alerts and model re-training
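One simple way an add-on could flag skew in incoming data is to compare the mean of a recent batch against statistics recorded at training time. The talk does not specify a detection method, so the mean-shift check, the threshold of 3 standard deviations, and the sample numbers below are all assumptions chosen for illustration.

```python
from statistics import mean
from typing import Sequence


def mean_shift_in_std(incoming: Sequence[float], train_mean: float, train_std: float) -> float:
    """How many training standard deviations the incoming mean has moved."""
    if train_std <= 0:
        raise ValueError("train_std must be positive")
    return abs(mean(incoming) - train_mean) / train_std


def is_skewed(incoming: Sequence[float], train_mean: float, train_std: float,
              threshold: float = 3.0) -> bool:
    """Flag a batch whose mean moved more than `threshold` standard deviations."""
    return mean_shift_in_std(incoming, train_mean, train_std) > threshold


TRAIN_MEAN, TRAIN_STD = 100.0, 20.0  # hypothetical training statistics
normal_batch = [90.0, 110.0, 105.0, 95.0]
shifted_batch = [400.0, 380.0, 420.0, 390.0]

print(is_skewed(normal_batch, TRAIN_MEAN, TRAIN_STD))
print(is_skewed(shifted_batch, TRAIN_MEAN, TRAIN_STD))
```

A positive result would be the point at which an alert or a re-training job is triggered.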
Client request
{
  "streaming": {
    "transformation": "fraud",
    "data": {
      "customer_id": 1,
      "transaction_amount": 145
    }
  }
}
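The `transformation` field in the request names the feature engineering to apply. One possible way to resolve that name on the processing side is a registry of functions, sketched below. The registry, the decorator, and the `amount_over_100` feature are hypothetical and not taken from Hof itself.

```python
import json
from typing import Callable, Dict

TRANSFORMATIONS: Dict[str, Callable[[dict], dict]] = {}


def transformation(name: str):
    """Register a function under the name clients use in their requests."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        TRANSFORMATIONS[name] = fn
        return fn
    return register


@transformation("fraud")
def fraud_features(data: dict) -> dict:
    # Toy feature for illustration only.
    return {
        "customer_id": data["customer_id"],
        "amount_over_100": data["transaction_amount"] > 100,
    }


def handle(raw_request: str) -> dict:
    """Look up the requested transformation and apply it to the request data."""
    request = json.loads(raw_request)["streaming"]
    fn = TRANSFORMATIONS.get(request["transformation"])
    if fn is None:
        raise KeyError(f"unknown transformation: {request['transformation']}")
    return fn(request["data"])


raw = json.dumps({
    "streaming": {
        "transformation": "fraud",
        "data": {"customer_id": 1, "transaction_amount": 145},
    }
})
result = handle(raw)
print(result)
```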
Application code

# Feature group definition
import hsfs

# Assumes a default hsfs connection to the feature store
connection = hsfs.connection()
fs = connection.get_feature_store()

def stream_function(df):
    # aggregations (for example, implemented with a pandas_udf)
    return df

fg = fs.create_streaming_feature_group("example_streaming", version=1)
fg.save(stream_function)

# Processing
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
fg = fs.get_streaming_feature_group("example_streaming", version=1)
fg.apply()
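The `# aggregations` placeholder in `stream_function` is where window aggregates would go. As a rough illustration of the logic only, the sketch below keeps a per-customer sliding sum in plain Python; in a Spark application this would more likely be expressed with Spark windowing or a pandas UDF. The class name, the 60-second window, and the sample events are assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class SlidingWindowSum:
    """Per-key sum of amounts seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[int, Deque[Tuple[float, float]]] = defaultdict(deque)
        self._totals: Dict[int, float] = defaultdict(float)

    def add(self, key: int, timestamp: float, amount: float) -> float:
        """Record an event and return the current window sum for its key.

        Assumes timestamps arrive in non-decreasing order for each key.
        """
        events = self._events[key]
        events.append((timestamp, amount))
        self._totals[key] += amount
        # Evict events that have fallen out of the window (t - w, t].
        while events and events[0][0] <= timestamp - self.window_seconds:
            _, old_amount = events.popleft()
            self._totals[key] -= old_amount
        return self._totals[key]


window = SlidingWindowSum(window_seconds=60)
sums = [
    window.add(1, 0, 100.0),   # customer 1 at t=0
    window.add(1, 30, 45.0),   # customer 1 at t=30
    window.add(2, 40, 10.0),   # customer 2 at t=40
    window.add(1, 70, 5.0),    # customer 1 at t=70, the t=0 event expires
]
print(sums)
```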
Hof architecture
Streaming + Online Feature Store
▪ Not all of the inference vector has to be computed in real time
▪ Features can be fetched from the online feature store
▪ Features are referenced using the training dataset
Client request
{
  "streaming": {
    "transformation": "fraud",
    "data": { … }
  },
  "online": {
    "training_dataset": {
      "name": "fraud model",
      "version": 1
    },
    "filter": {"customer_id": 3}
  }
}
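The `online` part of the request identifies a training dataset and a filter, which together select the precomputed features to add to the real-time ones. A minimal sketch of that merge follows, with a dictionary standing in for the online feature store; the stored feature names and values, and the rule that real-time features win on a name clash, are assumptions for the example.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for the online feature store, indexed by
# (training dataset name, version) and then by customer_id.
ONLINE_STORE: Dict[Tuple[str, int], Dict[int, Dict[str, float]]] = {
    ("fraud model", 1): {
        3: {"avg_amount_30d": 82.5, "num_chargebacks": 0.0},
    }
}


def lookup(online: dict) -> Dict[str, float]:
    """Fetch the precomputed features selected by the request's online section."""
    td = online["training_dataset"]
    rows = ONLINE_STORE[(td["name"], td["version"])]
    return dict(rows[online["filter"]["customer_id"]])


def build_vector(streaming_features: Dict[str, float], online: dict) -> Dict[str, float]:
    """Merge both sources; real-time features win if a name appears in both."""
    return {**lookup(online), **streaming_features}


request_online = {
    "training_dataset": {"name": "fraud model", "version": 1},
    "filter": {"customer_id": 3},
}
vector = build_vector({"amount_scaled": 0.145}, request_online)
print(vector)
```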
DEMO
github.com/logicalclocks
hopsworks.ai
@logicalclocks
