MLOps and the Feature Store with Hopsworks, DC Data Science Meetup
1. MLOps and the Feature Store
with Hopsworks
Jim Dowling
CEO, Hopsworks
DC Data Science Meetup,
Sep 14th 2021
2. We all take different Journeys to arrive at the Feature Store
Data Engineer: "Gotta feed those data 'scientists' with data"
Data Scientist: "Hello!?! Hello!?! Is there any data out there?"
ML Engineer: And then she said "productionize this notebook"
All roads lead to the Feature Store.
3. We all take different Journeys to arrive at MLOps
Data Engineer: Orchestrated Pipelines, baby!
Data Scientist: Notebooks as Jobs, yay!
ML Engineer: Containerize, kubernetize, observerize!
The Feature Store triggers them.
5. SQL or Python or Spark for Feature Engineering?
SQL features (tables): extract, aggregate, and transform data from databases, e.g. with DBT.
Python/Spark features (DataFrames): extract, aggregate, and transform data from databases, message buses, and files.
6. What Feature Engineering do we typically perform where?
Aggregations and data validation are applied to raw data before it enters the Feature Store.
Transformations are applied to input data read from the Feature Store, both when creating training data and when serving; trained models land in the Model Repo.
You need to ensure there is no skew between the training and serving transformations.
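The skew risk above comes from implementing the same transformation twice. A minimal sketch in plain Python of the remedy, sharing one function between both pipelines (the `scale_amount` function and its statistics are illustrative, not part of Hopsworks):

```python
# One transformation definition shared by both pipelines avoids
# training/serving skew: both paths execute the same code.
def scale_amount(amount, mean=50.0, std=10.0):
    """Standardize a transaction amount (illustrative statistics)."""
    return (amount - mean) / std

# Training pipeline: transform a batch of historical values.
training_rows = [40.0, 50.0, 60.0]
training_features = [scale_amount(a) for a in training_rows]

# Serving path: transform one live value with the SAME function.
serving_feature = scale_amount(60.0)

# The serving transformation matches the training one exactly.
assert serving_feature == training_features[2]
```

If the serving path instead re-implemented the scaling inline, a later change to the training statistics would silently skew predictions.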
7. Feature Group
A Feature Group is a table of features (Feature 1 ... Feature M) in which a Primary Key identifies each row (0 ... N).
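The keyed-table layout above maps naturally onto a row-per-key structure; a minimal sketch in plain Python (column names and values are illustrative):

```python
# A Feature Group is conceptually a table keyed by a primary key:
# rows 0..N, columns feature_1 .. feature_m.
feature_group = {
    0: {"feature_1": 3.2, "feature_m": "a"},
    1: {"feature_1": 1.7, "feature_m": "b"},
    2: {"feature_1": 4.4, "feature_m": "c"},
}

# The primary key makes point lookups (the online serving case) trivial.
row = feature_group[1]
assert row["feature_1"] == 1.7
```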
9. Batch insert/backfilling features into the Feature Store
sales_fg = fs.get_feature_group("sales_fg", version=1)
df = # featurize some data to ingest into the feature store
sales_fg.insert(df)
10. Spark Streaming insertion of features into the Feature Store
sales_fg = fs.get_feature_group("sales_fg", version=1)
streaming_df = # get streaming dataframe to ingest into the feature store
sales_fg.insert_stream(streaming_df)
11. Data Validation for Feature Groups (using Deequ)
expectation_sales = fs.create_expectation(..,
    rules=[Rule(name="HAS_MIN", level="WARNING", min=0),
           Rule(name="HAS_MAX", level="ERROR", max=1000000)])
sales_fg = fs.get_feature_group("sales_fg", version=1)
sales_fg.attach_expectation(expectation_sales)

df = # get some dataframe to ingest into the feature store
# Data validation rules run when the data is written
sales_fg.insert(df)
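Hopsworks delegates these checks to Deequ; their semantics can be sketched in plain Python. The rule names and levels mirror the snippet above, but this checker is purely illustrative:

```python
def validate(values, rules):
    """Apply HAS_MIN / HAS_MAX style rules; return (level, message) findings."""
    findings = []
    for rule in rules:
        if rule["name"] == "HAS_MIN" and min(values) < rule["min"]:
            findings.append((rule["level"], f"min {min(values)} below {rule['min']}"))
        if rule["name"] == "HAS_MAX" and max(values) > rule["max"]:
            findings.append((rule["level"], f"max {max(values)} above {rule['max']}"))
    return findings

rules = [{"name": "HAS_MIN", "level": "WARNING", "min": 0},
         {"name": "HAS_MAX", "level": "ERROR", "max": 1_000_000}]

# A clean batch passes; a negative sale triggers the WARNING rule.
assert validate([10, 500, 9_999], rules) == []
assert validate([-5, 500], rules)[0][0] == "WARNING"
```

In the real API the ERROR level aborts the write, while WARNING only logs, which is why both levels exist.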
12. On-Demand Feature Groups (External Tables)
snowflake_conn = fs.get_storage_connector("telco_snowflake_cluster")
telco_on_dmd = fs.create_on_demand_feature_group(name="telco_snowflake",
    version=latest_version,
    query="select * from telco",
    description="On-demand FG",
    storage_connector=snowflake_conn,
    statistics_config=True)
telco_on_dmd.save()
You can also use connectors for any JDBC source, S3, or ADLS on Azure.
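The key idea above is that an on-demand feature group stores a query, not data; rows are pulled from the external table at read time. A minimal stand-in using stdlib sqlite3 in place of Snowflake/JDBC (table and column names are illustrative):

```python
import sqlite3

# External source standing in for a Snowflake/JDBC system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telco (customer_id INTEGER, minutes REAL)")
conn.executemany("INSERT INTO telco VALUES (?, ?)", [(1, 120.5), (2, 63.0)])

# The on-demand feature group persists only the query string;
# the rows are fetched fresh from the external table on each read.
on_demand_query = "select * from telco"
rows = conn.execute(on_demand_query).fetchall()
assert rows == [(1, 120.5), (2, 63.0)]
```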
13. JOIN, Transform, Filter Features to create Training Datasets
Feature Group A (Primary Key, Feature 1 ... Feature M) and Feature Group B (Primary Key, Feature 1 ... Feature J) are joined on the Primary Key, then transformed and filtered to create a Training Dataset whose rows carry the joined features plus a LABEL column (CHURN_weekly, 0/1).
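The primary-key join above can be sketched with plain Python dicts (toy data; the column names `f1`, `g1`, `label` are illustrative):

```python
# Join two feature groups on the primary key, then attach the label
# to form training rows.
fg_a = {0: {"f1": 1.0}, 1: {"f1": 2.0}}   # Feature Group A
fg_b = {0: {"g1": 9.0}, 1: {"g1": 8.0}}   # Feature Group B
labels = {0: 1, 1: 0}                     # label, e.g. CHURN_weekly

# Inner join: only primary keys present in all three tables survive.
training_dataset = [
    {**fg_a[pk], **fg_b[pk], "label": labels[pk]}
    for pk in fg_a.keys() & fg_b.keys() & labels.keys()
]
assert {"f1": 1.0, "g1": 9.0, "label": 1} in training_dataset
```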
14. HSFS API - Transformation Functions
# Store in a Python module. More than 1 transformation fn per file is allowed.
from datetime import datetime

def date_string_to_timestamp(date):
    date_format = "%Y%m%d%H%M%S"
    return int(float(datetime.strptime(date, date_format).timestamp()) * 1000)
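The function above is easy to check locally. Since `timestamp()` on a naive datetime uses the local timezone, the assertion compares against an independently constructed datetime rather than a hard-coded value:

```python
from datetime import datetime

def date_string_to_timestamp(date):
    date_format = "%Y%m%d%H%M%S"
    return int(float(datetime.strptime(date, date_format).timestamp()) * 1000)

ts = date_string_to_timestamp("20210601210400")
# Same instant, computed independently, in epoch milliseconds.
assert ts == int(datetime(2021, 6, 1, 21, 4, 0).timestamp() * 1000)
```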
15. HSFS API - Create Training Datasets with Transformations
date_string_2_ts = fs.create_transformation_function(
    transformation_function=python_file.date_string_to_timestamp,
    output_type="long", version=1)

# JOIN the features together, then filter to one state
query = sales_fg.select_all() \
    .join(exogeneous_fg.select(['fuel_price', 'cpi'])) \
    .filter(sales_fg.state == "DC")

td = fs.create_training_dataset(name="sales_dc_td",
    description="Dataset to train the Sales model for DC",
    data_format="tfrecord",
    transformation_functions={"sale_ts": date_string_2_ts},
    version=1,
    label=["label_col"])
td.save(query)
17. Models retrieve pre-computed features (Feature Vectors) from the Feature Store
The application supplies only the Primary Key ("ID"); the features (Feature 1 ... Feature N) are looked up from the Feature Store using that key. There is no label (CHURN_weekly) in the serving vector; the model predicts it.
Note: these are the same features as in the Training Dataset, minus the label.
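The lookup pattern above can be sketched with a dict-backed online store (key, feature names, and values are illustrative):

```python
# Online store: one precomputed feature vector per primary key.
# The label is never stored for serving, since the model predicts it.
online_store = {
    "id-42": {"feature_1": 0.7, "feature_n": 3},
}

def get_serving_vector(key):
    """Look up the precomputed features for one entity, in training-column order."""
    row = online_store[key]
    return [row["feature_1"], row["feature_n"]]

# The app sends only the key; the feature store supplies the rest.
assert get_serving_vector("id-42") == [0.7, 3]
```

Keeping the column order identical to the training dataset matters: the model consumes the vector positionally.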
18. HSFS API - Serving
td = fs.get_training_dataset("sales_dc_td", version=1)
td.init_prepared_statement()

# online transformation functions are transparently applied before returning
prediction_array = td.get_serving_vector({"date": "2021-06-01 21:04:00"})
# call model with 'prediction_array' as input
20. Provenance Graph of Dependencies
Feature Groups -> Training Datasets -> Models (upstream to downstream).
Changes in upstream entities trigger actions that can cause downstream computations to run.
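The trigger propagation can be sketched as a walk over the dependency graph (entity names are illustrative):

```python
# Provenance as a DAG: feature groups -> training datasets -> models.
# A change upstream marks everything downstream for recomputation.
downstream = {
    "sales_fg": ["sales_td"],
    "sales_td": ["sales_model"],
    "sales_model": [],
}

def affected_by(entity):
    """Return all downstream entities to recompute, in trigger order (BFS)."""
    out, queue = [], list(downstream[entity])
    while queue:
        node = queue.pop(0)
        out.append(node)
        queue.extend(downstream[node])
    return out

# Changing the feature group cascades to the dataset, then the model.
assert affected_by("sales_fg") == ["sales_td", "sales_model"]
```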
21. MLOps is Feature Pipelines, Training Pipelines, and Model Monitoring
Feature Pipeline: transaction events (transaction_type, transaction_amount, user_id) and user data (user_nationality, user_gender) are written to the transactions_fg and users_fg Feature Groups.
Training Pipeline: transactions_fg and users_fg are joined on the primary key to create the fraud_td Training Dataset, which trains the fraud_classifier model.
Model Monitoring: descriptive statistics, feature correlations, histograms, etc. computed over the training data are used for drift detection.
23. CI/CD Triggers and Orchestration of Pipelines in MLOps
Enterprise Data -> Feature Pipeline -> Feature Store -> Training Pipeline -> Model Registry -> Model Serving -> Model Monitoring.
Orchestrator: Airflow, Github Actions, Jenkins.
CI/CD Triggers: code commit, new data, time trigger (e.g., daily).
24. Orchestrate Feature and Training Pipelines with Airflow in Hopsworks
Feature Engineering Notebook/Job -> Select Features, File Format and Create Training Data -> Run Experiment to Train Model -> Validate on Data Slices & Deploy Model, all backed by the Feature Store.
25. Hopsworks is an Open, Modular Feature Store that can Plug into ML Pipelines
Data Science, Data Engineering, and Compliance & Regulatory teams use the tools of their choice, integrated with the Hopsworks Feature Store and Model Serving.
28. KubeFlow Model Serving (KFServing), the Feature Store, and Logging to Kafka
1. Prediction Request (from the AI-Enabled Application to KFServing)
2. Request Features (from inside the KFServing Transformer)
3. Return Enriched Feature Vector (from the Feature Store)
4. Predict, Log to Kafka, & Return Result

class Transformer:
    def __init__(self):
        self.fs = # connect to feature store
        self.td = self.fs.get_training_dataset("sales_dc_td")

    def preprocess(self, inputs):
        return self.td.get_serving_vector(inputs["some-key"])
30. New Training Data from Prediction Logs and the Evaluation Store
Prediction requests are logged via Kafka to the Evaluation Store, alongside the Feature Store.
ML Engineer:
● Interactive Queries to debug the Model
● Interactive Queries to debug Inference Data
● Inspect Model KPI Charts
● Inspect Model Serving Performance Charts
● Identify Model/Data Drift
● Interactive Queries to Audit Logs
Data Scientist:
● Understand Live Model Performance
● Use new Training Data
32. CUSTOMER CASE STUDY: SWEDBANK - ANTI-MONEY LAUNDERING (AML) WITH HOPSWORKS
THE CHALLENGE: Increase the detection rate and reduce false positives and costs for AML.
GANs were trained on a ~40 TB transaction dataset, using Spark for feature engineering (including graph embeddings) and TensorFlow/GPUs to train the GAN; Hopsworks provided features, scale-out training, models, and model serving.
Webinar, Thursday 16th, 9am PT:
https://info.nvidia.com/accelerate-financial-fraud-detection-webinar.html?ncid=so-link-610204-vt09&linkId=100000063386013
With Hopsworks, Swedbank decreased false positives by 99% compared to their previous rule-based system.
34. CUSTOMER CASE STUDY: SWEDBANK - ANTI-MONEY LAUNDERING (AML) WITH HOPSWORKS
Kafka (real-time financial features), Teradata (customer credit score / KYC), and Cloudera (historical financial transactions) feed the Hopsworks Feature Store, which trained the model on 40 TB of data.
The AML Application retrieves features (<10 ms) to answer: "Is this Money Transfer Suspicious?"
The Hopsworks Feature Store is the central location where all the data (features) is stored and manipulated for use by the AML application.
35. Anti-Money Laundering End-to-End Example
Feature Groups: transactions, alert_transactions, party, trans_embeddings, alert_trans_embeddings.
user_id is the join key for party and (alert_)transactions; trans_id is the join key for (alert_)transactions and (alert_)trans_embeddings.
The joined features form the training_data and test_data datasets.
36. MLOps Lifecycle with Hopsworks
Enterprise Data flows in via CDC to Feature Engineering, which writes features and feature statistics to the Feature Store (RonDB). Model Training runs experiments and records model metadata in the Model Registry; Model Deploy pushes models (with A/B tests) to Model Serving, which serves feature vectors and logs predictions and serving statistics to Model Monitoring.
Free-text search and the Provenance API are backed by Elasticsearch; the metastore and feature vectors are served from RonDB.
41. Feature serving both online and batch
fg.insert(df): the user/application writes a DataFrame via HSFS, which encodes it using the feature group's metadata (Avro schema) and produces to one Kafka topic per Online Feature Group (FG 1, FG 2, FG 3).
A scalable, stateless Online FS upsert ingestion service (OnlineFS-ClusterJ instances) consumes and decodes the messages, then upserts them into the Online Feature Store based on the Primary Key.
The Offline Feature Store serves the same feature groups for batch access.
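The upsert-by-primary-key step above can be sketched in plain Python, with a list standing in for the Kafka topic (keys and values are illustrative):

```python
# Messages consumed from a Kafka topic (a list here) are upserted into
# the online store by primary key, so the store always holds the
# latest value per key.
messages = [
    {"pk": 1, "sales": 10.0},
    {"pk": 2, "sales": 7.5},
    {"pk": 1, "sales": 12.0},   # later message for pk=1 wins
]

online_store = {}
for msg in messages:            # the OnlineFS service consumes and decodes
    online_store[msg["pk"]] = {k: v for k, v in msg.items() if k != "pk"}

assert online_store[1] == {"sales": 12.0}
assert len(online_store) == 2
```

Upsert (rather than append) semantics is what keeps the online store small: one row per entity, not one per event.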
42. RonDB powers the Hopsworks Platform
RonDB makes Hopsworks the only LATS Feature Store:
< 1 ms KV lookup
> 10M KV lookups/sec
> 99.999% availability