Cameron Joannidis, Simple Machines
Productionizing a Machine Learning System at a Large Australian Telco
#SAISML6
Who Am I?
• I'm Cameron
• LinkedIn - https://www.linkedin.com/in/cameron-joannidis/
• Twitter - @camjo89

• Principal Machine Learning Engineer at Simple Machines
• Open source contributor - https://github.com/camjo
Data Science
Data Engineering
Mostly Distributed
Mostly Centralised
Data Science Workflow
Data Engineering
Feature Engineering
What is Machine Learning?
What is Machine Learning - Reality…
Luxury Segment
Knitting Enthusiasts Segment
???
???
Churn
Not Churn
???
???
Features
• Since we can’t directly measure churn, we instead need to predict it
using something we can measure.
• These measurements are called features and are typically associated with an
entity and an instant in time (e.g. customer, 2018-01-01)
• A feature can be as simple as some raw data, or as complex as a
transformation across multiple systems that captures some domain knowledge
within your business
• Most of the performance of ML algorithms comes from well-engineered features
Training vs. Scoring
• The training dataset requires large volumes of historical data and many more
features than will likely end up in the finished model.
• Scoring typically uses a subset of the training features and is only calculated
for the most recent values of those features.
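As a rough illustration of the difference, the sketch below builds both data sets from one small in-memory feature table. The table contents, the feature names (`f1`, `f2`, `f3`) and the `training_set` / `scoring_set` helpers are hypothetical, made up for this example; they are not taken from the system described in the talk.

```python
from datetime import date

# Hypothetical feature table: one row per (customer, date). Values are made up.
FEATURE_TABLE = [
    {"customer": "c1", "date": date(2018, 1, 1), "f1": 33, "f2": 5, "f3": 0.1},
    {"customer": "c1", "date": date(2018, 1, 2), "f1": 37, "f2": 6, "f3": 0.2},
    {"customer": "c2", "date": date(2018, 1, 1), "f1": 12, "f2": 9, "f3": 0.7},
    {"customer": "c2", "date": date(2018, 1, 2), "f1": 15, "f2": 8, "f3": 0.9},
]


def training_set(rows, features):
    """Training: every historical row, with a wide set of candidate features."""
    return [
        {"customer": r["customer"], "date": r["date"], **{f: r[f] for f in features}}
        for r in rows
    ]


def scoring_set(rows, model_features):
    """Scoring: only the latest row per customer, restricted to the model's features."""
    latest = {}
    for r in rows:
        current = latest.get(r["customer"])
        if current is None or r["date"] > current["date"]:
            latest[r["customer"]] = r
    return [
        {"customer": c, "date": r["date"], **{f: r[f] for f in model_features}}
        for c, r in sorted(latest.items())
    ]


train = training_set(FEATURE_TABLE, ["f1", "f2", "f3"])  # all history, all features
score = scoring_set(FEATURE_TABLE, ["f1", "f3"])         # latest values, model subset
```

Here the training set keeps all four rows and three features, while the scoring set keeps one row per customer and two features.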
Feature Store
• A feature store is a data store that holds features for some given scope of the
business (e.g. customer, household) at a particular time granularity (e.g.
daily, weekly, yearly).
EAVT
• Entity Attribute Value Timestamp
Entity      | Attribute | Value | Time
Cathy Smith | Feature 1 | 33    | 2018-01-01
Cathy Smith | Feature 1 | 37    | 2018-01-02
Jason Poser | Feature 2 | 164   | 2018-01-01
Cathy Smith | Feature 3 | “yes” | 2018-01-01
• Requires an expensive pivot to turn into the tabular format used in training/scoring*
Customer    | Date       | Feature 1 | Feature 2 | Feature 3
Cathy Smith | 2018-01-01 | 33        | null      | 1
Cathy Smith | 2018-01-02 | 37        | null      | 7
Jason Poser | 2018-01-01 | null      | 164       | 13
*Naively on HDFS
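To make the pivot concrete, here is a minimal in-memory sketch that turns EAVT rows into one wide row per (entity, time). The rows are the ones from the EAVT example; the `pivot` helper is hypothetical and only illustrates the shape of the transformation. On a distributed store the same grouping generally means shuffling every row by its (entity, time) key, which is a plausible reason for the cost noted above.

```python
# EAVT rows from the example: (entity, attribute, value, time).
EAVT_ROWS = [
    ("Cathy Smith", "Feature 1", 33, "2018-01-01"),
    ("Cathy Smith", "Feature 1", 37, "2018-01-02"),
    ("Jason Poser", "Feature 2", 164, "2018-01-01"),
    ("Cathy Smith", "Feature 3", "yes", "2018-01-01"),
]


def pivot(eavt_rows):
    """Turn EAVT rows into one wide row per (entity, time); missing values are None."""
    attributes = sorted({attr for _, attr, _, _ in eavt_rows})
    wide = {}
    for entity, attr, value, time in eavt_rows:
        # Start each new (entity, time) row with every attribute set to None.
        row = wide.setdefault((entity, time), dict.fromkeys(attributes))
        row[attr] = value
    return [
        {"Customer": entity, "Date": time, **row}
        for (entity, time), row in sorted(wide.items())
    ]


table = pivot(EAVT_ROWS)
for row in table:
    print(row)
```

Note that the wide table is sparse: every (entity, time) row carries a slot for every attribute, whether or not a value was ever recorded.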
Tabular
• Relational database table structure
Customer    | Date       | Feature 1 | Feature 2
Cathy Smith | 2018-01-01 | 33        | null
Cathy Smith | 2018-01-02 | 37        | null
Jason Poser | 2018-01-01 | null      | 164
• Adding a new feature requires a join
Customer    | Date       | Feature 1 | Feature 2
Cathy Smith | 2018-01-01 | 33        | null
Cathy Smith | 2018-01-02 | 37        | null
Jason Poser | 2018-01-01 | null      | 164
+
Customer    | Date       | Feature 3
Cathy Smith | 2018-01-01 | 1
Cathy Smith | 2018-01-02 | 7
Jason Poser | 2018-01-01 | 13
=
Customer    | Date       | Feature 1 | Feature 2 | Feature 3
Cathy Smith | 2018-01-01 | 33        | null      | 1
Cathy Smith | 2018-01-02 | 37        | null      | 7
Jason Poser | 2018-01-01 | null      | 164       | 13
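The join in the tables above can be sketched in a few lines. This is an in-memory illustration only: the `add_feature` helper is hypothetical, and treating the operation as a left join (keys with no new value get `None`) is an assumption, not something the slides state.

```python
# Existing tabular feature rows, keyed by (customer, date), from the tables above.
existing = {
    ("Cathy Smith", "2018-01-01"): {"Feature 1": 33, "Feature 2": None},
    ("Cathy Smith", "2018-01-02"): {"Feature 1": 37, "Feature 2": None},
    ("Jason Poser", "2018-01-01"): {"Feature 1": None, "Feature 2": 164},
}

# Newly computed feature, keyed the same way.
feature_3 = {
    ("Cathy Smith", "2018-01-01"): 1,
    ("Cathy Smith", "2018-01-02"): 7,
    ("Jason Poser", "2018-01-01"): 13,
}


def add_feature(table, name, new_values):
    """Left-join a new feature onto the table; keys missing from new_values get None."""
    return {
        key: {**row, name: new_values.get(key)}
        for key, row in table.items()
    }


joined = add_feature(existing, "Feature 3", feature_3)
```

Every new feature rewrites every row of the table, so the cost of adding a column grows with the size of the whole table rather than with the size of the new feature.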
Context
• On-premises data centre (often fighting for cluster resources)
• Hortonworks 2.5 (Spark 1.6.3, Hive 1.2.1, HBase 1.1.2)
Back to the Drawing Board
• Too complex to use
• Too many ways that users can make mistakes
• Need a higher level declarative language
• Need to globally optimise the set of features that are being run and utilise
domain knowledge to help Spark optimise properly without lots of fine-tuning
• Needs to support SQL (or some subset) without losing the ability to optimise
things ourselves
[Diagram: Domain Optimisations and Spark DAG Optimisations]
Optimisation
• Most optimisation techniques are about reducing IO, sharing compute and
sharing caching where possible.
• Expose interfaces to make smart use of HDFS partitioning to avoid reading
data that is not needed
• Combine multiple features that read from the same table into a single big
read rather than one read per feature
• Cache common aggregates across features to avoid unnecessary
re-computation
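One way to picture the "single big read" idea is a small planning step over declarative feature definitions. The sketch below is hypothetical: the feature names, table names and the `plan_reads` helper are invented for illustration and are not the talk's actual implementation.

```python
from collections import defaultdict

# Hypothetical declarative feature definitions: each names its source table
# and the columns it needs. All names are illustrative only.
FEATURES = [
    {"name": "calls_last_7d", "table": "call_records", "columns": ["customer", "date", "duration"]},
    {"name": "avg_call_cost", "table": "call_records", "columns": ["customer", "date", "cost"]},
    {"name": "bill_total", "table": "billing", "columns": ["customer", "date", "amount"]},
]


def plan_reads(features):
    """Group features by source table so each table is read once with the union of columns."""
    plan = defaultdict(lambda: {"columns": set(), "features": []})
    for f in features:
        entry = plan[f["table"]]
        entry["columns"].update(f["columns"])
        entry["features"].append(f["name"])
    return {
        table: {"columns": sorted(e["columns"]), "features": e["features"]}
        for table, e in plan.items()
    }


reads = plan_reads(FEATURES)
for table, spec in reads.items():
    print(table, spec)
```

Three features become two reads here. Because each feature is declared independently, the planner is free to merge reads without the feature authors having to coordinate.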
UI
• Features
  • Creation
  • Discovery
  • Metadata
  • Management
  • Promotion between environments (Dev/UAT/Prod)
• Projects
  • Define training + scoring feature sets
[Diagram: Users and Our Cluster]
Cost
• Questions to ask
  • Does the feature store get a budget of its own?
  • Do the projects pay for usage?
The Null Problem
Optimisation
• Investigate the actual usage patterns of the feature table
• Spark DAG optimisation
• General Spark Performance Tuning
• Spark Version
Back of the Envelope
• Full Table:
  • 30M users * 365 days * 2 years * 1000 features ~= 22 trillion data points
• Per Project:
  • 2M users * 365 days * 2 years * 300 features ~= 0.44 trillion data points
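The arithmetic behind these estimates can be checked directly. The sketch assumes one data point per user, per day, per feature; the `data_points` helper is only for illustration.

```python
DAYS_PER_YEAR = 365


def data_points(users, years, features):
    """One data point per user, per day, per feature."""
    return users * DAYS_PER_YEAR * years * features


full_table = data_points(users=30_000_000, years=2, features=1000)
per_project = data_points(users=2_000_000, years=2, features=300)

print(f"full table:    {full_table:,}")                   # 21,900,000,000,000
print(f"per project:   {per_project:,}")                  # 438,000,000,000
print(f"project share: {per_project / full_table:.1%}")   # 2.0%
```

A typical project therefore touches about 2% of what the full table would hold.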
Existing headaches
• Generating all features for all customers over the full time range is an enormous
data challenge.
• You are doing far more work than is necessary for a lot of use cases.
• Your compute burden keeps growing as you add new features and history.
• You need to join new features onto the table and keep it highly available (which
typically means a copy of the table on HDFS)
• Users generally need to repartition the table to optimise it for their use anyway
Crazy Idea?
• Generate training data sets on demand
• Removes an enormous engineering cost of building and maintaining a
monstrous table that is mostly useless
• Allows teams to get started on the modelling much faster since we only
generate the data they actually need
• Training data set is clean and efficiently partitioned on delivery
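A minimal sketch of the on-demand idea follows: rather than reading from a precomputed table of every feature for every customer, the request itself names the customers, date range and features, and only those values are computed. The feature functions here are stand-ins invented for the example (real features would read from source systems), and `generate_training_set` is a hypothetical name.

```python
from datetime import date, timedelta

# Hypothetical feature functions: each computes one value for (customer, day).
# These stand-ins are made up; they only exist to make the sketch runnable.
FEATURE_FUNCTIONS = {
    "day_of_week": lambda customer, day: day.weekday(),
    "name_length": lambda customer, day: len(customer),
    "is_month_start": lambda customer, day: day.day == 1,
}


def generate_training_set(customers, start, end, features):
    """Compute only the requested features, for the requested customers and dates."""
    rows = []
    day = start
    while day <= end:
        for customer in customers:
            row = {"customer": customer, "date": day}
            for name in features:
                row[name] = FEATURE_FUNCTIONS[name](customer, day)
            rows.append(row)
        day += timedelta(days=1)
    return rows


rows = generate_training_set(
    customers=["c1", "c2"],
    start=date(2018, 1, 1),
    end=date(2018, 1, 3),
    features=["day_of_week", "is_month_start"],
)
```

The unrequested `name_length` feature is never evaluated, so the work done scales with the project's request rather than with the full feature catalogue.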
[Diagram: total time = 2T vs. total time = T]
Spark Migration
• Spark 1.6 => 2.2
• Reasons to Migrate:
  • 1.6 is now several years old
  • Performance is much better in 2.2: we observed up to 17x
    speedups for certain workloads (e.g. windowed features)
  • Stability and logging are far better in 2.2. Spark 1.6 had a lot of painful bugs
    that we had to work around
Relative Performance Gains
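[Bar chart: run time in hours (axis 0 to 30) for the Standard, Declarative, and Declarative + Optimised approaches across three workloads: 3 months with 50 features, 3 months with 250 features, and 2 years with 350 features. Some bars are labelled "Never Completed" and "Crashed Every Time".]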
Takeaways
• Naïve Centralisation is hard and expensive. Work out your usage patterns
before you go too far in.
• Work out how to manage resource usage and cost early on.
• Constrain the surface area of the optimisation problem you solve in order to
scale. (80/20 rule)
• Features should be independently declarable but globally optimised.
• The more declarative the interaction, the more you can do for your users
transparently.
Thank You
Questions?
@camjo89
@simplemach

 

Productionizing a Machine Learning System at a Large Australian Telco with Cameron Joannidis

  • 1. Cameron Joannidis, Simple Machines Productionizing a Machine Learning System at a Large Australian Telco #SAISML6
  • 2. Who Am I? • I'm Cameron • LinkedIn - https://www.linkedin.com/in/cameron-joannidis/ • Twitter - @camjo89 • Principal Machine Learning Engineer at Simple Machines • Open source contributor - https://github.com/camjo
  • 4. Data Science • Data Engineering • Mostly Distributed • Mostly Centralised
  • 13. What is Machine Learning - Reality…
  • 17. Features • Since we can’t directly measure churn, we instead need to predict churn using something we can measure. • These measurements are called features and are typically associated with an entity and an instant in time (e.g. customer, 2018-01-01) • Can be as simple as some raw data, or as complex as a transformation across multiple systems that captures some domain knowledge within your business • Most of the performance of ML algorithms comes from well-engineered features
  • 18. Training vs. Scoring • A training dataset requires large volumes of historical data and many more features than will likely end up in the finished model. • Scoring typically uses a subset of the training features and is only calculated for the most recent values of those features.
  • 22. Feature Store • A feature store is a data store that holds features for some given scope of the business (e.g. customer, household) at a particular time granularity (e.g. daily, weekly, yearly).
  • 24. EAVT • Entity Attribute Value Timestamp

      Entity       Attribute  Value  Time
      Cathy Smith  Feature 1  33     2018-01-01
      Cathy Smith  Feature 1  37     2018-01-02
      Jason Poser  Feature 2  164    2018-01-01

  • 25. EAVT • Entity Attribute Value Timestamp

      Entity       Attribute  Value  Time
      Cathy Smith  Feature 1  33     2018-01-01
      Cathy Smith  Feature 1  37     2018-01-02
      Jason Poser  Feature 2  164    2018-01-01
      Cathy Smith  Feature 3  “yes”  2018-01-01

  • 26. EAVT • Entity Attribute Value Timestamp • Requires expensive pivot to turn into tabular format used in training/scoring*

      Entity       Attribute  Value  Time
      Cathy Smith  Feature 1  33     2018-01-01
      Cathy Smith  Feature 1  37     2018-01-02
      Jason Poser  Feature 2  164    2018-01-01
      Cathy Smith  Feature 3  “yes”  2018-01-01

      Customer     Date        Feature 1  Feature 2  Feature 3
      Cathy Smith  2018-01-01  33         null       1
      Cathy Smith  2018-01-02  37         null       7
      Jason Poser  2018-01-01  null       164        13

      *Naively on HDFS

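The pivot mentioned on this slide can be illustrated on a toy scale. The following is only a minimal in-memory sketch in plain Python, using the EAVT rows shown on the slide; the function name `pivot_eavt` is hypothetical, and this is not the Spark/HDFS job the talk refers to, where the expense comes from shuffling very large volumes of data.

```python
from collections import defaultdict

# Rows in EAVT form, copied from the slide: (entity, attribute, value, time).
eavt_rows = [
    ("Cathy Smith", "Feature 1", 33, "2018-01-01"),
    ("Cathy Smith", "Feature 1", 37, "2018-01-02"),
    ("Jason Poser", "Feature 2", 164, "2018-01-01"),
    ("Cathy Smith", "Feature 3", "yes", "2018-01-01"),
]


def pivot_eavt(rows):
    """Pivot EAVT rows into one wide record per (entity, time) pair."""
    # Every distinct attribute becomes a column of the wide table.
    attributes = sorted({attr for _, attr, _, _ in rows})
    # Each new record starts with every column set to None (the "null" cells).
    wide = defaultdict(lambda: dict.fromkeys(attributes))
    for entity, attr, value, time in rows:
        wide[(entity, time)][attr] = value
    return attributes, dict(wide)


columns, table = pivot_eavt(eavt_rows)
for (entity, time), record in table.items():
    print(entity, time, record)
```

Every row has to be regrouped by its (entity, time) key, which is the step that becomes a large shuffle once the data no longer fits on one machine.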
  • 28. Tabular • Relational database table structure

      Customer     Date        Feature 1  Feature 2
      Cathy Smith  2018-01-01  33         null
      Cathy Smith  2018-01-02  37         null
      Jason Poser  2018-01-01  null       164

  • 29. Tabular • Relational database table structure • Adding new feature requires a join

      Customer     Date        Feature 1  Feature 2
      Cathy Smith  2018-01-01  33         null
      Cathy Smith  2018-01-02  37         null
      Jason Poser  2018-01-01  null       164
    +
      Customer     Date        Feature 3
      Cathy Smith  2018-01-01  1
      Cathy Smith  2018-01-02  7
      Jason Poser  2018-01-01  13
    =
      Customer     Date        Feature 1  Feature 2  Feature 3
      Cathy Smith  2018-01-01  33         null       1
      Cathy Smith  2018-01-02  37         null       7
      Jason Poser  2018-01-01  null       164        13

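As a rough illustration of why a new feature means a join in the tabular layout, the sketch below left-joins a new column onto the existing table by its (customer, date) key. It is plain Python over the slide's example values; `left_join_feature` is a hypothetical helper, not something from the talk.

```python
# Existing wide table keyed by (customer, date), values taken from the slide.
base = {
    ("Cathy Smith", "2018-01-01"): {"Feature 1": 33, "Feature 2": None},
    ("Cathy Smith", "2018-01-02"): {"Feature 1": 37, "Feature 2": None},
    ("Jason Poser", "2018-01-01"): {"Feature 1": None, "Feature 2": 164},
}

# The new feature arrives as its own table with the same key.
feature_3 = {
    ("Cathy Smith", "2018-01-01"): 1,
    ("Cathy Smith", "2018-01-02"): 7,
    ("Jason Poser", "2018-01-01"): 13,
}


def left_join_feature(table, name, values):
    """Return a new table with column `name` joined on (customer, date).

    Keys missing from `values` get None, mirroring a SQL left join.
    """
    return {key: {**row, name: values.get(key)} for key, row in table.items()}


joined = left_join_feature(base, "Feature 3", feature_3)
```

The join produces a whole new table rather than appending a row, which is the contrast with the EAVT layout.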
  • 30. Context • On premises data centre (often fighting for cluster resources) • Hortonworks 2.5 (Spark 1.6.3, Hive 1.2.1, HBase 1.1.2)
  • 38. Back to the Drawing Board • Too complex to use • Too many ways that users can make mistakes • Need a higher level declarative language • Need to globally optimise the set of features that are being run and utilise domain knowledge to help Spark optimise properly without lots of fine tuning • Needs to support SQL (or some subset) without losing the ability to optimise things ourselves
  • 42. Optimisation • Most optimisation techniques are about reducing IO, sharing compute and sharing caching where possible. • Expose interfaces to make smart use of HDFS partitioning to avoid reading data • Combine multiple features that read from the same table into a single big read rather than one read per feature • Cache common aggregates across features to avoid unnecessary recomputation
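One way to picture the "single big read" idea is to group declared features by the table they read from before planning any IO. The sketch below is an assumption-laden illustration in plain Python: `FeatureDef`, `plan_reads`, and the table, column and feature names are all hypothetical and are not taken from the system described in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical declarative feature definition."""
    name: str
    source_table: str  # table the feature reads from
    column: str        # column the feature needs


def plan_reads(features):
    """Group features by source table so each table is read only once."""
    plan = defaultdict(lambda: {"columns": set(), "features": []})
    for feature in features:
        plan[feature.source_table]["columns"].add(feature.column)
        plan[feature.source_table]["features"].append(feature.name)
    return dict(plan)


# Hypothetical features: two share the "calls" table, one reads "invoices".
features = [
    FeatureDef("avg_call_minutes", "calls", "duration"),
    FeatureDef("call_count", "calls", "call_id"),
    FeatureDef("total_billed", "invoices", "amount"),
]

plan = plan_reads(features)
for table_name, read in plan.items():
    print(table_name, sorted(read["columns"]), read["features"])
```

Here three features result in two reads instead of three, and each read knows exactly which columns it needs.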
  • 49. UI • Features: Creation, Discovery, Metadata, Management, Promotion between environments (Dev/UAT/Prod) • Projects: Define training + scoring feature sets
  • 53. Cost • Questions to ask • Does the feature store get a budget of its own? • Do the projects pay for usage?
  • 57. Optimisation • Investigate the actual usage patterns of the feature table • Spark DAG optimisation • General Spark Performance Tuning • Spark Version
  • 64. Back of the Envelope • Full Table: • 30M users * 365 days * 2 years * 1000 features ~= 22 trillion data points • Per Project: • 2M users * 365 days * 2 years * 300 features ~= 0.44 trillion data points
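The back-of-the-envelope figures can be checked by multiplying them out. The short sketch below uses only the user, day and feature counts stated on the slide; the helper name `data_points` is illustrative.

```python
DAYS = 365 * 2  # two years of daily values


def data_points(users, features, days=DAYS):
    """Number of (user, day, feature) cells in a fully populated table."""
    return users * days * features


full_table = data_points(30_000_000, 1000)   # 21.9 trillion
per_project = data_points(2_000_000, 300)    # 0.438 trillion
ratio = full_table / per_project             # full table is 50x a project

print(f"{full_table:.3e} {per_project:.3e} {ratio:.0f}x")
```

On these numbers a single project needs about one fiftieth of the full table.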
  • 65. Existing headaches • Generating all features for all customers over the full time range is an enormous data challenge. • You are doing way more work than is necessary for a lot of use cases. • You are always increasing your compute burden as you add new features/history. • You need to join new features onto the table and keep it highly available (which typically means a copy of the table on HDFS) • Users generally need to do a repartition to optimise their table for use anyway
  • 66. Crazy Idea? • Generate training data sets on demand • Removes the enormous engineering cost of building and maintaining a monstrous table that is mostly useless • Allows teams to get started on the modelling much faster, since we only generate the data they actually need • Training data set is clean and efficiently partitioned on delivery
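A toy version of on-demand generation might look like the sketch below: only the customers, dates and features a project asks for are ever computed. This is plain Python with hypothetical names (`build_training_set`, `date_range`) and stand-in feature functions; it is meant to show the shape of the idea, not the Spark implementation from the talk.

```python
from datetime import date, timedelta


def date_range(start, end):
    """Yield each day from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def build_training_set(customers, feature_fns, start, end):
    """Compute only the requested features for the requested customers and dates."""
    return [
        {
            "customer": customer,
            "date": day,
            **{name: fn(customer, day) for name, fn in feature_fns.items()},
        }
        for customer in customers
        for day in date_range(start, end)
    ]


# Hypothetical stand-ins for real feature computations.
feature_fns = {
    "name_length": lambda customer, day: len(customer),
    "day_of_week": lambda customer, day: day.weekday(),
}

rows = build_training_set(
    ["Cathy Smith", "Jason Poser"],
    feature_fns,
    date(2018, 1, 1),
    date(2018, 1, 2),
)
```

The output size is customers × days × requested features, so the cost follows what each project asks for and not the size of a global table.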
  • 83. Spark Migration • 1.6 => 2.2 • Reasons to Migrate: • 1.6 is now several years old • Performance is astronomically better in 2.2 - we observed up to 17x speedups for certain workloads (e.g. windowed features) • Stability and logging are far better in 2.2. Spark 1.6 had a lot of painful bugs that we had to work around
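  • 84. Relative Performance Gains [Bar chart: run time in hours (axis 0 to 30) for the Standard, Declarative, and Declarative + Optimised approaches across three workloads: 3 months/50 features, 3 months/250 features, and 2 years/350 features. Two bars are annotated “Never Completed” and “Crashed Every Time”.]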
  • 85. Takeaways • Naïve Centralisation is hard and expensive. Work out your usage patterns before you go too far in. • Work out how to manage resource usage and cost early on. • Constrain the surface area of the optimisation problem you solve in order to scale. (80/20 rule) • Features should be independently declarable but globally optimised. • The more declarative the interaction, the more you can do for your users transparently.