SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
Building an ML tool to predict
article quality scores using Delta
& MLflow
Ivana Pejeva
Data Scientist @ element61
What’s the challenge
For past articles, Roularta wants to…
▪ analyze which article did well in
bringing traffic, engagement &
conversions
Quality scoring of an article Predicting quality of an article
Roularta wants to know the aricle quality...
... but there is an infinite amount of possible KPIs one
needs to track
...and the goal is to simplify/standardize/automate an
objective measurement
For new articles, Roularta wants to…
▪ predict which article will bring good
conversions, traffic and engagement
What we offer editorial teams
▪ Calculates article scores on historical articles
▪ Predicts article scores on new articles
CMS
Content Behavior
Reader behavior
Structured streaming
Structured streaming
Azure Data Lake Gen2
Feature extraction
Predictive Model
Article Score
Score calculation
Data tool which
What do we want to predict
How much traffic is an
article bringing to the
site?
Is this article going to
bring good
conversions?
Will this article keep
people engaged?
Traffic Score Conversion Score Engagement Score
The flow for calculating and predicting article
scores
Gather all pageviews
and content data
• Number of pageviews
• Pageviews from Social
media
• Article author
• Publishing date
• …
Calculate series of
traffic, engagement
and conversion scores
Traffic Score: 0.7
Engagement Score:
0.2
Conversion Score: 0.5
Calculate quality
stars for each
article
Traffic Score:
Engagement Score:
Conversion Score:
Calculate quality
score
Calculate overall
quality score based on
traffic, engagement
and conversion score
Predict scores
Predict the traffic,
engagement and
conversion scores on
new articles
What’s the data
▪ Reading Behavior
▪ Coming from a CDP tracker
▪ Pageviews
▪ Content Behavior
▪ Coming from the CMS
▪ Details on the content
CMS
Content Management System
How we prepared the data
▪ Number of pageviews
▪ Time on page
▪ Social media shares
▪ Registrations
▪ Subscriptions
▪ Bounce rate
▪ …
▪ Article text
▪ Publication time
▪ Author
▪ Tags in article
▪ How long is the article
▪ Topic of article
▪ …
Calculate article scores Predict article scores
Calculation of article scores is a computationally
intensive job
▪ A lot of the metrics require looking
at a specific window of data from
millions of rows
▪ e.g. calculating the number of visitors
to an article where the user was not
seen on the site for the past 30 days
▪ This would require to always look
at a specific window of data for
each article
▪ Very expensive operation as we
are looking over millions of rows
▪ Keep intermediate delta lake
tables
▪ e.g. Keep the visitors from the last 30
days in an intermediate delta table
▪ Avoid using windowing function over
a huge amount of data
▪ > 10x improve in performance
The role of Delta
▪ ~2 million pageviews per day we need to process in streaming mode
▪ ~ 250k number of articles
▪ 50 brands
▪ Delta in silver and gold layer for data ingestion
▪ Intermediate delta tables for extracted features
▪ Time travel to recreate ML experiments
Delta to accelerate data ingestion and feature engineering pipelines
The feature extraction process
CMS
Content Behavior
Reader behavior
Calculate traffic,
engagement and
conversion scores
Create features for
each article Add labels to articles
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
NLP BERT language model
▪ BERTje – a dutch based BERT model
▪ Extract representations from article text
▪ Leverage pandasUDF for better performance
▪ Highly improved ML model performance
For extracting features from articles text
How we build the predictive model
CMS
Content Behavior
Reader behavior
Calculate traffic,
engagement and
conversion scores
Create features
for each article
Add labels to
articles
• Is it a short/long article?
• Does the teaser have images?
• Tags in article
• Authors of article
• When was it published?
• Teaser text
ML Pipeline for
feature
transformation
• Binarize features
• Encode string column to label indices
• TF-IDF to reflects importance of words
in tags, topics and collections of articles
• Scale data with mean and std
• Create feature vectors
Train Model
• Multi-class classification problem
(predict the number of stars (1-5)
• Random Forest Classifier
• Cross Validation and log model
metrics with
• Multiclass Classification Evaluator
Register
model
The role of MLflow
▪ Train ML models per brand and per score
▪ We need to
▪ Track performance for all models
▪ Promote new model versions fast
▪ Serve models to score new data
For model tracking, register and serve
Results
Method:
▪ Predictions made on articles from one brand
▪ Total number of articles 2368
Results:
▪ In 93% of the cases a quality article (4 or 5 stars) is
given 4 or 5 stars by the model
▪ A low-quality article (1 or 2 stars) is given 4 or 5 stars in
only 2% of the cases
Details:
▪ Most important features are Teaser text and topics
Summary
▪ Use Delta to accelerate data ingestion and feature extraction
▪ Use NLP BERT for text representations from articles
▪ Use Mlflow for tracking, registering and serving ML models
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Contenu connexe

Tendances

Pre trained language model
Pre trained language modelPre trained language model
Pre trained language modelJiWenKim
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)Eva Tse
 
Processamento de Linguagem Natural
Processamento de Linguagem NaturalProcessamento de Linguagem Natural
Processamento de Linguagem NaturalGustavo Gattass Ayub
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Learning a Personalized Homepage
Learning a Personalized HomepageLearning a Personalized Homepage
Learning a Personalized HomepageJustin Basilico
 
Python & jupyter notebook installation
Python & jupyter notebook installationPython & jupyter notebook installation
Python & jupyter notebook installationAnamta Sayyed
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareJustin Basilico
 
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer: Enhanced Transformer with Rotary Position EmbeddingRoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer: Enhanced Transformer with Rotary Position Embeddingtaeseon ryu
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Faisal Siddiqi
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialXavier Amatriain
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In IndustryXavier Amatriain
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and TransformerArvind Devaraj
 
Incorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemIncorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemJacek Wasilewski
 
Calibrated Recommendations
Calibrated RecommendationsCalibrated Recommendations
Calibrated RecommendationsHarald Steck
 
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Kishor Datta Gupta
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
 

Tendances (20)

Pre trained language model
Pre trained language modelPre trained language model
Pre trained language model
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
Processamento de Linguagem Natural
Processamento de Linguagem NaturalProcessamento de Linguagem Natural
Processamento de Linguagem Natural
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Learning a Personalized Homepage
Learning a Personalized HomepageLearning a Personalized Homepage
Learning a Personalized Homepage
 
Python & jupyter notebook installation
Python & jupyter notebook installationPython & jupyter notebook installation
Python & jupyter notebook installation
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce Setting
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
 
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer: Enhanced Transformer with Rotary Position EmbeddingRoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer: Enhanced Transformer with Rotary Position Embedding
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019
 
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
 
Recommender Systems In Industry
Recommender Systems In IndustryRecommender Systems In Industry
Recommender Systems In Industry
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
Temporal based Recommendation System
Temporal based Recommendation SystemTemporal based Recommendation System
Temporal based Recommendation System
 
Incorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender SystemIncorporating Diversity in a Learning to Rank Recommender System
Incorporating Diversity in a Learning to Rank Recommender System
 
Calibrated Recommendations
Calibrated RecommendationsCalibrated Recommendations
Calibrated Recommendations
 
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
Deep Reinforcement Learning based Recommendation with Explicit User-ItemInter...
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
 

Similaire à Building an ML Tool to predict Article Quality Scores using Delta & MLFlow

Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Best Practices with Sitecore
Best Practices with SitecoreBest Practices with Sitecore
Best Practices with SitecoreAnant Corporation
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
Oracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your OrganizationOracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your OrganizationAXIA Consulting Inc.
 
Structured Authoring for Business-Critical Content
Structured Authoring for Business-Critical ContentStructured Authoring for Business-Critical Content
Structured Authoring for Business-Critical ContentLavaCon
 
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachUsing OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachKent Graziano
 
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificEnabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificDatabricks
 
Building modern intranets with share point communication sites aug 2018 kloud
Building modern intranets with share point communication sites aug 2018   kloudBuilding modern intranets with share point communication sites aug 2018   kloud
Building modern intranets with share point communication sites aug 2018 kloudAsish Padhy
 
Out With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS MigrationOut With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS MigrationAcquia
 
Modernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart ContentModernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart ContentGavin Drake
 
Advanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLAdvanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLDatabricks
 
Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)MrJ1971
 
SEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed CodeSEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed CodeMarc D Anderson
 
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...OpenSource Connections
 
Moving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databasesMoving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databasesEnrico van de Laar
 
Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...Sarah Khan
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 

Similaire à Building an ML Tool to predict Article Quality Scores using Delta & MLFlow (20)

Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Best Practices with Sitecore
Best Practices with SitecoreBest Practices with Sitecore
Best Practices with Sitecore
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
NLP and the Web
NLP and the WebNLP and the Web
NLP and the Web
 
Oracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your OrganizationOracle PIM: Phantasmal Item Descriptions in your Organization
Oracle PIM: Phantasmal Item Descriptions in your Organization
 
Structured Authoring for Business-Critical Content
Structured Authoring for Business-Critical ContentStructured Authoring for Business-Critical Content
Structured Authoring for Business-Critical Content
 
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile ApproachUsing OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach
 
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher ScientificEnabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific
 
Building modern intranets with share point communication sites aug 2018 kloud
Building modern intranets with share point communication sites aug 2018   kloudBuilding modern intranets with share point communication sites aug 2018   kloud
Building modern intranets with share point communication sites aug 2018 kloud
 
Webinar: The Slippery Slope of Migrating to SharePoint Online or On-Premise
Webinar: The Slippery Slope of Migrating to SharePoint Online or On-PremiseWebinar: The Slippery Slope of Migrating to SharePoint Online or On-Premise
Webinar: The Slippery Slope of Migrating to SharePoint Online or On-Premise
 
Out With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS MigrationOut With the Old, in With the Open-source: Brainshark's Complete CMS Migration
Out With the Old, in With the Open-source: Brainshark's Complete CMS Migration
 
Modernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart ContentModernize Your Content Publishing Process with Smart Content
Modernize Your Content Publishing Process with Smart Content
 
Advanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLAdvanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using ML
 
Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)Prospectus Editing Tool (PET)
Prospectus Editing Tool (PET)
 
SEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed CodeSEF2013 - Create a Business Solution, Step by Step, with No Managed Code
SEF2013 - Create a Business Solution, Step by Step, with No Managed Code
 
SharePoint Custom Development
SharePoint Custom DevelopmentSharePoint Custom Development
SharePoint Custom Development
 
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
 
Moving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databasesMoving advanced analytics to your sql server databases
Moving advanced analytics to your sql server databases
 
Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...Shaking hands with the developer: How IT Communications can help you build a ...
Shaking hands with the developer: How IT Communications can help you build a ...
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 

Dernier (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 

Building an ML Tool to predict Article Quality Scores using Delta & MLFlow

  • 1. Building an ML tool to predict article quality scores using Delta & MLflow Ivana Pejeva Data Scientist @ element61
  • 2. What’s the challenge For past articles, Roularta wants to… ▪ analyze which article did well in bringing traffic, engagement & conversions Quality scoring of an article Predicting quality of an article Roularta wants to know the aricle quality... ... but there is an infinite amount of possible KPIs one needs to track ...and the goal is to simplify/standardize/automate an objective measurement For new articles, Roularta wants to… ▪ predict which article will bring good conversions, traffic and engagement
  • 3. What we offer editorial teams ▪ Calculates article scores on historical articles ▪ Predicts article scores on new articles CMS Content Behavior Reader behavior Structured streaming Structured streaming Azure Data Lake Gen2 Feature extraction Predictive Model Article Score Score calculation Data tool which
  • 4. What do we want to predict How much traffic is an article bringing to the site? Is this article going to bring good conversions? Will this article keep people engaged? Traffic Score Conversion Score Engagement Score
  • 5. The flow for calculating and predicting article scores Gather all pageviews and content data • Number of pageviews • Pageviews from Social media • Article author • Publishing date • … Calculate series of traffic, engagement and conversion scores Traffic Score: 0.7 Engagement Score: 0.2 Conversion Score: 0.5 Calculate quality stars for each article Traffic Score: Engagement Score: Conversion Score: Calculate quality score Calculate overall quality score based on traffic, engagement and conversion score Predict scores Predict the traffic, engagement and conversion scores on new articles
  • 6. What’s the data ▪ Reading Behavior ▪ Coming from a CDP tracker ▪ Pageviews ▪ Content Behavior ▪ Coming from the CMS ▪ Details on the content CMS Content Management System
  • 7. How we prepared the data ▪ Number of pageviews ▪ Time on page ▪ Social media shares ▪ Registrations ▪ Subscriptions ▪ Bounce rate ▪ … ▪ Article text ▪ Publication time ▪ Author ▪ Tags in article ▪ How long is the article ▪ Topic of article ▪ … Calculate article scores Predict article scores
  • 8. Calculation of article scores is a computationally intensive job ▪ A lot of the metrics require looking at a specific window of data from millions of rows ▪ e.g. calculating the number of visitors to an article where the user was not seen on the site for the past 30 days ▪ This would require to always look at a specific window of data for each article ▪ Very expensive operation as we are looking over millions of rows ▪ Keep intermediate delta lake tables ▪ e.g. Keep the visitors from the last 30 days in an intermediate delta table ▪ Avoid using windowing function over a huge amount of data ▪ > 10x improve in performance
  • 9. The role of Delta ▪ ~2 million pageviews per day we need to process in streaming mode ▪ ~ 250k number of articles ▪ 50 brands ▪ Delta in silver and gold layer for data ingestion ▪ Intermediate delta tables for extracted features ▪ Time travel to recreate ML experiments Delta to accelerate data ingestion and feature engineering pipelines
  • 10. The feature extraction process CMS Content Behavior Reader behavior Calculate traffic, engagement and conversion scores Create features for each article Add labels to articles • Is it a short/long article? • Does the teaser have images? • Tags in article • Authors of article • When was it published? • Teaser text
  • 11. NLP BERT language model ▪ BERTje – a dutch based BERT model ▪ Extract representations from article text ▪ Leverage pandasUDF for better performance ▪ Highly improved ML model performance For extracting features from articles text
  • 12. How we build the predictive model CMS Content Behavior Reader behavior Calculate traffic, engagement and conversion scores Create features for each article Add labels to articles • Is it a short/long article? • Does the teaser have images? • Tags in article • Authors of article • When was it published? • Teaser text ML Pipeline for feature transformation • Binarize features • Encode string column to label indices • TF-IDF to reflects importance of words in tags, topics and collections of articles • Scale data with mean and std • Create feature vectors Train Model • Multi-class classification problem (predict the number of stars (1-5) • Random Forest Classifier • Cross Validation and log model metrics with • Multiclass Classification Evaluator Register model
  • 13. The role of MLflow ▪ Train ML models per brand and per score ▪ We need to ▪ Track performance for all models ▪ Promote new model versions fast ▪ Serve models to score new data For model tracking, register and serve
  • 14. Results Method: ▪ Predictions made on articles from one brand ▪ Total number of articles 2368 Results: ▪ In 93% of the cases a quality article (4 or 5 stars) is given 4 or 5 stars by the model ▪ A low-quality article (1 or 2 stars) is given 4 or 5 stars in only 2% of the cases Details: ▪ Most important features are Teaser text and topics
  • 15. Summary ▪ Use Delta to accelerate data ingestion and feature extraction ▪ Use NLP BERT for text representations from articles ▪ Use Mlflow for tracking, registering and serving ML models
  • 16. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.