The ‘feature store’ is an emerging concept in data architecture, motivated by the challenge of productionizing ML applications. Rapid iteration in experimental, data-driven research creates new challenges for data management and application deployment.
3. Agenda
• Definitions of a Feature Store: a clear need, many approaches.
• The Feature Flow Algorithm: ML pipeline orchestration.
• The ML Pipeline Mesh: governance and automation.
5. ML LIFECYCLE
[Diagram: success criteria for an experiment — "validate business hypothesis" and "new business insight".]
A positive experimental result creates KPI lift in production. Regardless of production results, new business insights are captured and made discoverable (with a feature store). This accelerates future experimentation.
7. FEATURE STORES
THREE APPROACHES TO AUTOMATION
Feature Store Approaches:
• Feature “Ops”: automating feature data delivery to ML pipelines.
• Feature “Modelling”: automating ETL/feature engineering.
• Feature “Orchestration”: automating ML pipeline construction.
8. FEATURE STORES
THREE APPROACHES TO AUTOMATION
Feature “Ops” (automating feature data delivery to ML pipelines):
• The most common approach.
• A data access pattern for ML pipelines.
• Generally applied after feature engineering.
• Supplements data governance with data science semantics.
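The Feature “Ops” access pattern can be sketched with a minimal, hypothetical client; the `FeatureStoreClient`, `register`, and `get_features` names are illustrative, not a real library's API:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStoreClient:
    """Hypothetical 'Feature Ops' client: serves already-engineered
    feature values to ML pipelines, keyed by entity id."""
    # feature name -> {entity_id -> value}
    _tables: dict = field(default_factory=dict)

    def register(self, feature: str, values: dict) -> None:
        # In a real system this happens downstream of feature engineering.
        self._tables[feature] = values

    def get_features(self, entity_ids, features):
        # Assemble a training-ready row per entity from stored features.
        return [
            {"entity_id": eid, **{f: self._tables[f].get(eid) for f in features}}
            for eid in entity_ids
        ]

store = FeatureStoreClient()
store.register("total_spend", {1: 120.0, 2: 45.5})
store.register("num_orders", {1: 3, 2: 1})
rows = store.get_features([1, 2], ["total_spend", "num_orders"])
# rows[0] == {"entity_id": 1, "total_spend": 120.0, "num_orders": 3}
```

The point of the pattern is that the ML pipeline only names the features it needs; where and how they were computed stays behind the store's interface.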
10. TRAIN/TEST Data Science Semantics
Extending the Data Governance Framework: An Example
[Diagram: “preprocessed” sales data → Train/Test Split → training data + test data → … ML … → customer segment features (Customer Segmentation).]
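A sketch of what extending data governance with data science semantics could look like as catalog metadata; the `DatasetRecord` schema and its field names are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry: standard governance lineage fields
    plus data-science semantics (split role, split seed)."""
    name: str
    source: str                # upstream governed dataset
    role: str                  # "train" | "test" | "features"
    split_seed: Optional[int] = None

catalog = [
    DatasetRecord("sales_train", source="preprocessed_sales", role="train", split_seed=42),
    DatasetRecord("sales_test", source="preprocessed_sales", role="test", split_seed=42),
    DatasetRecord("customer_segment_features", source="sales_train", role="features"),
]

# Governance tooling can now answer data-science questions, e.g.
# "which datasets are held-out test data and must not leak into training?"
held_out = [r.name for r in catalog if r.role == "test"]
```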
11. TRAIN/TEST Data Semantics
Extending the Data Governance Framework: An Example
[Diagram: “preprocessed” sales data → Train/Test Split → training data + test data → … ML … → prospect segment features (Sales Prospect Segmentation). The prospect segment features then feed a second pipeline: Assemble Features → Train/Test Split → training data + test data (Next Best Action).]
12. TRAIN/TEST Data Semantics
Extending the Data Governance Framework: An Example
[Diagram repeated from the previous slide: Sales Prospect Segmentation feeding Next Best Action.]
WHAT’S WRONG WITH THIS PICTURE?
13. FEATURE STORES
THREE APPROACHES TO AUTOMATION
[The three approaches repeated: Feature “Ops”, Feature “Modelling”, Feature “Orchestration”.]
14. FEATURE STORES
THREE APPROACHES TO AUTOMATION
Feature “Modelling” (automating ETL/feature engineering): AutoML. Key stakeholder: the citizen scientist.
15. FEATURE STORES
THREE APPROACHES TO AUTOMATION
Feature “Orchestration” (automating ML pipeline construction): “Feature Flow”. Key stakeholder: the ML engineer.
18. ML Pipeline Review
source: https://spark.apache.org/docs/latest/ml-pipeline.html
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([…])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Fit the pipeline to training documents.
model = pipeline.fit(training)
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([…], ["id", "text"])
# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
19. ML Pipeline Review
source: https://spark.apache.org/docs/latest/ml-pipeline.html
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([…])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Fit the pipeline to training documents.
model = pipeline.fit(training)
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([…], ["id", "text"])
# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
What does this line do for me (as an engineer)?
20. FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE INFERENCE
Feature Flow takes pipeline stages as input, builds a graph, then sorts the stages topologically. First, we iteratively infer the stages that need to be added to the pipeline to produce the necessary features. Then, we sort the stages topologically.
THE “MONOLITHIC” PIPELINE (THE OLD WAY)
[Diagram: Tokenize → TFIDF → Sentiment Est.]
tokenize = ...
tfidf = ...
sentiment = ...
pipeline = Pipeline(stages=[tokenize, tfidf, sentiment])

[Diagram: Tokenize → TFIDF → Toxicity Est.]
tokenize = ...
tfidf = ...
toxicity = ...
pipeline = Pipeline(stages=[tokenize, tfidf, toxicity])
FEATURE STAGE DEPLOYMENTS (THE NEW WAY)
[Diagram, built up across several frames: Tokenize → tokens; TFIDF: tokens → vectors; Sentiment Est.: vectors → sentiment; Toxicity Est.: vectors → toxicity. The shared Tokenize and TFIDF stages appear once and feed both estimators.]
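The feature-inference step described above can be sketched as follows. The stage declarations and the `infer_pipeline` helper are hypothetical, assuming each deployed stage declares the features it consumes and produces:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical stage declarations: the features each stage
# consumes ("in") and produces ("out").
STAGES = {
    "Tokenize":     {"in": {"text"},    "out": {"tokens"}},
    "TFIDF":        {"in": {"tokens"},  "out": {"vectors"}},
    "SentimentEst": {"in": {"vectors"}, "out": {"sentiment"}},
    "ToxicityEst":  {"in": {"vectors"}, "out": {"toxicity"}},
}

def infer_pipeline(requested, available):
    """Work backwards from the requested features, pulling in every
    stage needed to produce them, then topologically sort the stages."""
    needed, stages = set(requested) - set(available), set()
    while needed:
        feature = needed.pop()
        producer = next(n for n, s in STAGES.items() if feature in s["out"])
        if producer not in stages:
            stages.add(producer)
            needed |= STAGES[producer]["in"] - set(available)
    # Map each stage to the stages that must run before it.
    deps = {
        n: {m for m in stages if STAGES[n]["in"] & STAGES[m]["out"]}
        for n in stages
    }
    return list(TopologicalSorter(deps).static_order())

plan = infer_pipeline({"sentiment", "toxicity"}, available={"text"})
# e.g. ["Tokenize", "TFIDF", "SentimentEst", "ToxicityEst"]
```

Note that the two estimators share the inferred Tokenize and TFIDF stages, which is exactly what the monolithic pipelines above duplicate.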
21. THEN, ELIMINATE ALL NODES WITH MULTIPLE INCOMING EDGES PER FEATURE, and replace them with nodes for the product of all incoming features. (Feature: vectors)
FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE LINEAGE
Feature Flow gives us the tools to experiment with subsets of our pipeline. The graph gets more complex when we are evaluating multiple strategies that create the same features. To manage multiple possible traversals of the graph, we maintain a lineage of each feature.
AN EXAMPLE STAGE WITH MULTIPLE STRATEGIES
[Diagram: Tokenize → tokens; TFIDF: tokens → vectors; Word2Vec: tokens → vectors; Sentiment Est.: vectors → sentiment; Toxicity Est.: vectors → toxicity.]
FIRST, BUILD THE GRAPH
[Diagram: Tokenize feeds both TFIDF and Word2Vec; each downstream estimator is expanded once per upstream strategy: Sentiment Est. (TFIDF), Sentiment Est. (Word2Vec), Toxicity Est. (TFIDF), Toxicity Est. (Word2Vec).]
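The graph-building step with competing strategies can be sketched like this; `expand_by_lineage` is a hypothetical helper that creates one node per choice of upstream strategy, mirroring the Sentiment Est. (TFIDF) / Sentiment Est. (Word2Vec) expansion on the slide:

```python
from itertools import product

# Hypothetical stage declarations: TFIDF and Word2Vec are competing
# strategies that both produce the "vectors" feature.
STAGES = {
    "Tokenize":     {"in": {"text"},    "out": {"tokens"}},
    "TFIDF":        {"in": {"tokens"},  "out": {"vectors"}},
    "Word2Vec":     {"in": {"tokens"},  "out": {"vectors"}},
    "SentimentEst": {"in": {"vectors"}, "out": {"sentiment"}},
    "ToxicityEst":  {"in": {"vectors"}, "out": {"toxicity"}},
}

def expand_by_lineage(stages):
    """Create one node per choice of upstream strategy for each
    ambiguous input feature, so every traversal of the graph is a
    distinct, identifiable experiment."""
    producers = {}
    for name, spec in stages.items():
        for feat in spec["out"]:
            producers.setdefault(feat, []).append(name)
    expanded = []
    for name, spec in stages.items():
        # Only inputs with more than one producer need a lineage label.
        ambiguous = [f for f in spec["in"] if len(producers.get(f, [])) > 1]
        if not ambiguous:
            expanded.append(name)
            continue
        for combo in product(*(producers[f] for f in ambiguous)):
            expanded.append(f"{name}({','.join(combo)})")
    return expanded

nodes = expand_by_lineage(STAGES)
# ["Tokenize", "TFIDF", "Word2Vec", "SentimentEst(TFIDF)",
#  "SentimentEst(Word2Vec)", "ToxicityEst(TFIDF)", "ToxicityEst(Word2Vec)"]
```

Each lineage-labelled node can then be scheduled and evaluated independently, which is what makes experimenting with subsets of the pipeline tractable.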
23. SEPARATE CONCERNS OF ALGORITHMIC DESIGN FROM OPERATIONS
• Deployment Automation and Runtime Management
• Metadata Management and Discovery
• ML Pipeline Governance