MLOps and the Feature Store with Hopsworks, DC Data Science Meetup
1. MLOps and the Feature Store
with Hopsworks
Jim Dowling
CEO, Hopsworks
DC Data Science Meetup,
Sep 14th 2021
2. We all take different Journeys to arrive at the Feature Store
Data Engineer: "Gotta feed those data 'scientists' with data"
Data Scientist: "Hello!?! Hello!?! Is there any data out there?"
ML Engineer: And then she said "productionize this notebook"
All roads lead to the Feature Store.
3. We all take different Journeys to arrive at MLOps
Data Engineer: Orchestrated Pipelines, baby!
Data Scientist: Notebooks as Jobs, yay!
ML Engineer: Containerize, kubernetize, observerize!
The Feature Store triggers them.
5. SQL or Python or Spark for Feature Engineering?
SQL features (tables): extract, aggregate, and transform data from databases, e.g. with DBT.
Python/Spark features (DataFrames): extract, aggregate, and transform data from databases, message buses, and files.
6. What Feature Engineering do we typically perform where?
Aggregations and data validation are applied to raw data before it enters the Feature Store.
Transformations are applied to input data read from the Feature Store, both when creating training data and when serving; trained models land in the Model Repo.
You need to ensure there is no skew between the training and serving transformations.
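The skew risk above comes from implementing the same transformation twice. A minimal sketch in plain Python of the remedy, sharing one function between both pipelines (the `scale_amount` function and its statistics are illustrative, not part of Hopsworks):

```python
# One transformation definition shared by both pipelines avoids
# training/serving skew: both paths execute the same code.
def scale_amount(amount, mean=50.0, std=10.0):
    """Standardize a transaction amount (illustrative statistics)."""
    return (amount - mean) / std

# Training pipeline: transform a batch of historical values.
training_rows = [40.0, 50.0, 60.0]
training_features = [scale_amount(a) for a in training_rows]

# Serving path: transform one live value with the SAME function.
serving_feature = scale_amount(60.0)

# The serving transformation matches the training one exactly.
assert serving_feature == training_features[2]
```

If the serving path instead re-implemented the scaling inline, a later change to the training statistics would silently skew predictions.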
7. Feature Group
A Feature Group is a table of features (Feature 1 ... Feature M) in which a Primary Key identifies each row (0 ... N).
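The keyed-table layout above maps naturally onto a row-per-key structure; a minimal sketch in plain Python (column names and values are illustrative):

```python
# A Feature Group is conceptually a table keyed by a primary key:
# rows 0..N, columns feature_1 .. feature_m.
feature_group = {
    0: {"feature_1": 3.2, "feature_m": "a"},
    1: {"feature_1": 1.7, "feature_m": "b"},
    2: {"feature_1": 4.4, "feature_m": "c"},
}

# The primary key makes point lookups (the online serving case) trivial.
row = feature_group[1]
assert row["feature_1"] == 1.7
```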
9. Batch insert/backfilling features into the Feature Store
sales_fg = fs.get_feature_group("sales_fg", version=1)
df = # featurize some data to ingest into the feature store
sales_fg.insert(df)
10. Spark Streaming insertion of features into the Feature Store
sales_fg = fs.get_feature_group("sales_fg", version=1)
streaming_df = # get streaming dataframe to ingest into the feature store
sales_fg.insert_stream(streaming_df)
11. Data Validation for Feature Groups (using Deequ)
expectation_sales = fs.create_expectation(..,
    rules=[Rule(name="HAS_MIN", level="WARNING", min=0),
           Rule(name="HAS_MAX", level="ERROR", max=1000000)])
sales_fg = fs.get_feature_group("sales_fg", version=1)
sales_fg.attach_expectation(expectation_sales)

df = # get some dataframe to ingest into the feature store
# Data validation rules run when the data is written
sales_fg.insert(df)
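Hopsworks delegates these checks to Deequ; their semantics can be sketched in plain Python. The rule names and levels mirror the snippet above, but this checker is purely illustrative:

```python
def validate(values, rules):
    """Apply HAS_MIN / HAS_MAX style rules; return (level, message) findings."""
    findings = []
    for rule in rules:
        if rule["name"] == "HAS_MIN" and min(values) < rule["min"]:
            findings.append((rule["level"], f"min {min(values)} below {rule['min']}"))
        if rule["name"] == "HAS_MAX" and max(values) > rule["max"]:
            findings.append((rule["level"], f"max {max(values)} above {rule['max']}"))
    return findings

rules = [{"name": "HAS_MIN", "level": "WARNING", "min": 0},
         {"name": "HAS_MAX", "level": "ERROR", "max": 1_000_000}]

# A clean batch passes; a negative sale triggers the WARNING rule.
assert validate([10, 500, 9_999], rules) == []
assert validate([-5, 500], rules)[0][0] == "WARNING"
```

In the real API the ERROR level aborts the write, while WARNING only logs, which is why both levels exist.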
12. On-Demand Feature Groups (External Tables)
snowflake_conn = fs.get_storage_connector("telco_snowflake_cluster")
telco_on_dmd = fs.create_on_demand_feature_group(name="telco_snowflake",
    version=latest_version,
    query="select * from telco",
    description="On-demand FG",
    storage_connector=snowflake_conn,
    statistics_config=True)
telco_on_dmd.save()
You can also use connectors for any JDBC source, S3, or ADLS on Azure.
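The key idea above is that an on-demand feature group stores a query, not data; rows are pulled from the external table at read time. A minimal stand-in using stdlib sqlite3 in place of Snowflake/JDBC (table and column names are illustrative):

```python
import sqlite3

# External source standing in for a Snowflake/JDBC system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telco (customer_id INTEGER, minutes REAL)")
conn.executemany("INSERT INTO telco VALUES (?, ?)", [(1, 120.5), (2, 63.0)])

# The on-demand feature group persists only the query string;
# the rows are fetched fresh from the external table on each read.
on_demand_query = "select * from telco"
rows = conn.execute(on_demand_query).fetchall()
assert rows == [(1, 120.5), (2, 63.0)]
```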
13. JOIN, Transform, Filter Features to create Training Datasets
Feature Group A (Primary Key, Feature 1 ... Feature M) and Feature Group B (Primary Key, Feature 1 ... Feature J) are joined on the Primary Key, then transformed and filtered to create a Training Dataset whose rows carry the joined features plus a LABEL column (CHURN_weekly, 0/1).
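The primary-key join above can be sketched with plain Python dicts (toy data; the column names `f1`, `g1`, `label` are illustrative):

```python
# Join two feature groups on the primary key, then attach the label
# to form training rows.
fg_a = {0: {"f1": 1.0}, 1: {"f1": 2.0}}   # Feature Group A
fg_b = {0: {"g1": 9.0}, 1: {"g1": 8.0}}   # Feature Group B
labels = {0: 1, 1: 0}                     # label, e.g. CHURN_weekly

# Inner join: only primary keys present in all three tables survive.
training_dataset = [
    {**fg_a[pk], **fg_b[pk], "label": labels[pk]}
    for pk in fg_a.keys() & fg_b.keys() & labels.keys()
]
assert {"f1": 1.0, "g1": 9.0, "label": 1} in training_dataset
```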
14. HSFS API - Transformation Functions
# Store in a Python module. More than 1 transformation fn per file is allowed.
from datetime import datetime

def date_string_to_timestamp(date):
    date_format = "%Y%m%d%H%M%S"
    return int(float(datetime.strptime(date, date_format).timestamp()) * 1000)
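The function above is easy to check locally. Since `timestamp()` on a naive datetime uses the local timezone, the assertion compares against an independently constructed datetime rather than a hard-coded value:

```python
from datetime import datetime

def date_string_to_timestamp(date):
    date_format = "%Y%m%d%H%M%S"
    return int(float(datetime.strptime(date, date_format).timestamp()) * 1000)

ts = date_string_to_timestamp("20210601210400")
# Same instant, computed independently, in epoch milliseconds.
assert ts == int(datetime(2021, 6, 1, 21, 4, 0).timestamp() * 1000)
```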
15. HSFS API - Create Training Datasets with Transformations
date_string_2_ts = fs.create_transformation_function(
    transformation_function=python_file.date_string_to_timestamp,
    output_type="long", version=1)

# JOIN the features together, then filter to one state
query = sales_fg.select_all() \
    .join(exogeneous_fg.select(['fuel_price', 'cpi'])) \
    .filter(sales_fg.state == "DC")

td = fs.create_training_dataset(name="sales_dc_td",
    description="Dataset to train the Sales model for DC",
    data_format="tfrecord",
    transformation_functions={"sale_ts": date_string_2_ts},
    version=1,
    label=["label_col"])
td.save(query)
17. Models retrieve pre-computed features (Feature Vectors) from the Feature Store
The application supplies only the Primary Key ("ID"); the features (Feature 1 ... Feature N) are looked up from the Feature Store using that key. There is no label (CHURN_weekly) in the serving vector; the model predicts it.
Note: these are the same features as in the Training Dataset, minus the label.
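The lookup pattern above can be sketched with a dict-backed online store (key, feature names, and values are illustrative):

```python
# Online store: one precomputed feature vector per primary key.
# The label is never stored for serving, since the model predicts it.
online_store = {
    "id-42": {"feature_1": 0.7, "feature_n": 3},
}

def get_serving_vector(key):
    """Look up the precomputed features for one entity, in training-column order."""
    row = online_store[key]
    return [row["feature_1"], row["feature_n"]]

# The app sends only the key; the feature store supplies the rest.
assert get_serving_vector("id-42") == [0.7, 3]
```

Keeping the column order identical to the training dataset matters: the model consumes the vector positionally.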
18. HSFS API - Serving
td = fs.get_training_dataset("sales_dc_td", version=1)
td.init_prepared_statement()

# online transformation functions are transparently applied before returning
prediction_array = td.get_serving_vector({"date": "2021-06-01 21:04:00"})
# call model with 'prediction_array' as input
20. Provenance Graph of Dependencies
Feature Groups -> Training Datasets -> Models (upstream to downstream).
Changes in upstream entities trigger actions that can cause downstream computations to run.
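The trigger propagation can be sketched as a walk over the dependency graph (entity names are illustrative):

```python
# Provenance as a DAG: feature groups -> training datasets -> models.
# A change upstream marks everything downstream for recomputation.
downstream = {
    "sales_fg": ["sales_td"],
    "sales_td": ["sales_model"],
    "sales_model": [],
}

def affected_by(entity):
    """Return all downstream entities to recompute, in trigger order (BFS)."""
    out, queue = [], list(downstream[entity])
    while queue:
        node = queue.pop(0)
        out.append(node)
        queue.extend(downstream[node])
    return out

# Changing the feature group cascades to the dataset, then the model.
assert affected_by("sales_fg") == ["sales_td", "sales_model"]
```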
21. MLOps is Feature Pipelines, Training Pipelines, and Model Monitoring
Feature Pipeline: transaction events (transaction_type, transaction_amount, user_id) and user data (user_nationality, user_gender) are written to the transactions_fg and users_fg Feature Groups.
Training Pipeline: transactions_fg and users_fg are joined on the primary key to create the fraud_td Training Dataset, which trains the fraud_classifier model.
Model Monitoring: descriptive statistics, feature correlations, histograms, etc. computed over the training data are used for drift detection.
23. CI/CD Triggers and Orchestration of Pipelines in MLOps
Enterprise Data -> Feature Pipeline -> Feature Store -> Training Pipeline -> Model Registry -> Model Serving -> Model Monitoring.
Orchestrator: Airflow, Github Actions, Jenkins.
CI/CD Triggers: code commit, new data, time trigger (e.g., daily).
24. Orchestrate Feature and Training Pipelines with Airflow in Hopsworks
Feature Engineering Notebook/Job -> Select Features, File Format and Create Training Data -> Run Experiment to Train Model -> Validate on Data Slices & Deploy Model, all backed by the Feature Store.
25. Hopsworks is an Open, Modular Feature Store that can Plug into ML Pipelines
Data Science, Data Engineering, and Compliance & Regulatory teams use the tools of their choice, integrated with the Hopsworks Feature Store and Model Serving.
28. KubeFlow Model Serving (KFServing), the Feature Store, and Logging to Kafka
1. Prediction Request (from the AI-Enabled Application to KFServing)
2. Request Features (from inside the KFServing Transformer)
3. Return Enriched Feature Vector (from the Feature Store)
4. Predict, Log to Kafka, & Return Result

class Transformer:
    def __init__(self):
        self.fs = # connect to feature store
        self.td = self.fs.get_training_dataset("sales_dc_td")

    def preprocess(self, inputs):
        return self.td.get_serving_vector(inputs["some-key"])
30. New Training Data from Prediction Logs and the Evaluation Store
Prediction requests are logged via Kafka to the Evaluation Store, alongside the Feature Store.
ML Engineer:
● Interactive Queries to debug the Model
● Interactive Queries to debug Inference Data
● Inspect Model KPI Charts
● Inspect Model Serving Performance Charts
● Identify Model/Data Drift
● Interactive Queries to Audit Logs
Data Scientist:
● Understand Live Model Performance
● Use new Training Data
32. CUSTOMER CASE STUDY: SWEDBANK - ANTI-MONEY LAUNDERING (AML) WITH HOPSWORKS
THE CHALLENGE: Increase the detection rate and reduce false positives and costs for AML.
GANs were trained on a ~40 TB transaction dataset, using Spark for feature engineering (including graph embeddings) and TensorFlow/GPUs to train the GAN; Hopsworks provided features, scale-out training, models, and model serving.
Webinar, Thursday 16th, 9am PT:
https://info.nvidia.com/accelerate-financial-fraud-detection-webinar.html?ncid=so-link-610204-vt09&linkId=100000063386013
With Hopsworks, Swedbank decreased false positives by 99% compared to their previous rule-based system.
34. CUSTOMER CASE STUDY: SWEDBANK - ANTI-MONEY LAUNDERING (AML) WITH HOPSWORKS
Kafka (real-time financial features), Teradata (customer credit score / KYC), and Cloudera (historical financial transactions) feed the Hopsworks Feature Store, which trained the model on 40 TB of data.
The AML Application retrieves features (<10 ms) to answer: "Is this Money Transfer Suspicious?"
The Hopsworks Feature Store is the central location where all the data (features) is stored and manipulated for use by the AML application.
35. Anti-Money Laundering End-to-End Example
Feature Groups: transactions, alert_transactions, party, trans_embeddings, alert_trans_embeddings.
user_id is the join key for party and (alert_)transactions; trans_id is the join key for (alert_)transactions and (alert_)trans_embeddings.
The joined features form the training_data and test_data datasets.
36. MLOps Lifecycle with Hopsworks
Enterprise Data flows in via CDC to Feature Engineering, which writes features and feature statistics to the Feature Store (RonDB). Model Training runs experiments and records model metadata in the Model Registry; Model Deploy pushes models (with A/B tests) to Model Serving, which serves feature vectors and logs predictions and serving statistics to Model Monitoring.
Free-text search and the Provenance API are backed by Elasticsearch; the metastore and feature vectors are served from RonDB.
41. Feature serving both online and batch
fg.insert(df): the user/application writes a DataFrame via HSFS, which encodes it using the feature group's metadata (Avro schema) and produces to one Kafka topic per Online Feature Group (FG 1, FG 2, FG 3).
A scalable, stateless Online FS upsert ingestion service (OnlineFS-ClusterJ instances) consumes and decodes the messages, then upserts them into the Online Feature Store based on the Primary Key.
The Offline Feature Store serves the same feature groups for batch access.
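The upsert-by-primary-key step above can be sketched in plain Python, with a list standing in for the Kafka topic (keys and values are illustrative):

```python
# Messages consumed from a Kafka topic (a list here) are upserted into
# the online store by primary key, so the store always holds the
# latest value per key.
messages = [
    {"pk": 1, "sales": 10.0},
    {"pk": 2, "sales": 7.5},
    {"pk": 1, "sales": 12.0},   # later message for pk=1 wins
]

online_store = {}
for msg in messages:            # the OnlineFS service consumes and decodes
    online_store[msg["pk"]] = {k: v for k, v in msg.items() if k != "pk"}

assert online_store[1] == {"sales": 12.0}
assert len(online_store) == 2
```

Upsert (rather than append) semantics is what keeps the online store small: one row per entity, not one per event.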
42. RonDB powers the Hopsworks Platform
RonDB makes Hopsworks the only LATS Feature Store:
< 1 ms KV lookup
> 10M KV lookups/sec
> 99.999% availability