Towards increasing predictability
of machine-learning research
Artem Vorozhtsov
Yandex LLC
System for displaying ads
on Yandex’s search result pages
and partners’ websites
Ad Targeting Group
Automation of Machine Learning Research
Research with profit
Introduction
R&D best practices
— Modularity
— Computational Measurability
— Transparency and Sharing
— Automation
— Modularity
Units, reuse, abstraction
R&D best practices
— Computational Measurability
Metrics Driven Development
R&D best practices
— Transparency and Sharing
Collaboration and reproducibility
R&D best practices
— Automation
…
R&D best practices
R&D best practices
— Modularity: units, reuse, abstractions
— Computational Measurability: MDD
— Transparency and Sharing: collaboration
— Automation: …
Happy life principles
— Kindness
— Wholeheartedness
— Love
— Discipline
— Self-development
This is a list of global things,
not local (everyday) rules
clipart from Scrappindoodles
— Where does automation stop?
— Story of automating
— Everyday rules
— Questions
Plan
Automation is not
the use of machines, control systems
and information technologies to reduce
the need for human work and to optimize productivity
in the production of goods and services.
Automation is
the use of information technologies
to optimize productivity and to increase
predictability in research, development
and other projects.
Complex KPIs
Where does automation stop?
KPI stands for Key Performance Indicator:
— Money, Clicks on Ads
— Comparison with rivals (# of segments where we are better)
— Number of Nobel Prizes
— User & Government Loyalty
— Loglikelihood of prediction
Where does automation stop?
— Strategy Thinking, Complex KPIs
— Research
Where does automation stop?
.. where real research starts
Where does automation stop?
Intuition
Research
Creativity
Science
Tools
Complex Maths
Automated pipelines
PDEs
Metrics
Validators
SVM
PCA
1. Imagine how simple and agile research
work could be.
2. Believe it is possible, automate as much as you can,
and find the place for research.
Recipe
Task:
Ad click probability prediction
(binary classification problem)
KPI:
Profit, Clicks, Conversions, Loglikelihood
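As a minimal sketch of the last KPI, the mean log-likelihood of a click predictor could be computed as below. The function name and the sample data are illustrative assumptions and are not taken from the Yandex system.

```python
import math

def mean_log_likelihood(labels, probs, eps=1e-12):
    """Average log-likelihood of binary labels under predicted click probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)

# Hypothetical data: 1 = click, 0 = no click
clicks = [1, 0, 0, 1]
predicted = [0.8, 0.1, 0.3, 0.6]
score = mean_log_likelihood(clicks, predicted)
```

The value is always non-positive; the closer it is to zero, the better the predicted probabilities match the observed clicks.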
Yandex LLC
Story of automation
Story of automation
Diagram labels: Classifier (matrixnet), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE
clipart from http://www.stoneys.ch
clipart from http://clipartov.net
Story of automation
Diagram labels: Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE, ML Infrastructure, Report, Idea
Pipeline (no automation)
— Prepare raw data set for ML
— Apply filters (cuts) and mappers
— Calculate features
— Assign weights
— Split into training and test sets
— Train the classifier on the training set
— Look at the learning curve and check for overfitting
— Apply the resulting classifier model to the test set
— Calculate metrics and compare with the current best
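The manual steps above can be sketched as one function. This is a toy illustration under stated assumptions: the record fields (`age`, `pos`, `click`), the per-value click-rate "classifier" and all function names are hypothetical stand-ins, not the matrixnet pipeline from the talk.

```python
import math
import random

def run_pipeline(rows, keep, featurize, weigh, train, seed=0, test_share=0.5):
    """Run the manual steps end to end and return (model, train_ll, test_ll)."""
    rows = [r for r in rows if keep(r)]                           # filters (cuts)
    data = [(featurize(r), r["click"], weigh(r)) for r in rows]   # features, weights
    random.Random(seed).shuffle(data)                             # reproducible split
    cut = int(len(data) * (1 - test_share))
    train_set, test_set = data[:cut], data[cut:]
    model = train(train_set)
    # Comparing the two scores is a crude overfitting check
    return model, log_likelihood(model, train_set), log_likelihood(model, test_set)

def log_likelihood(model, data, eps=1e-6):
    """Weighted mean log-likelihood of the clicks under the model."""
    total = weight_sum = 0.0
    for x, y, w in data:
        p = min(max(model(x), eps), 1 - eps)
        total += w * math.log(p if y else 1 - p)
        weight_sum += w
    return total / weight_sum

def train_rate_table(train_set):
    """Toy 'classifier': weighted click rate per feature value."""
    clicks, shows = {}, {}
    for x, y, w in train_set:
        clicks[x] = clicks.get(x, 0.0) + w * y
        shows[x] = shows.get(x, 0.0) + w
    overall = sum(clicks.values()) / sum(shows.values())
    return lambda x: clicks[x] / shows[x] if x in shows else overall

# Hypothetical raw data: clicks happen only on position 0, about half the time
rows = [{"age": 20 + i % 40, "pos": i % 3, "click": int(i % 6 == 0)}
        for i in range(600)]
model, ll_train, ll_test = run_pipeline(
    rows,
    keep=lambda r: r["age"] > 10,
    featurize=lambda r: r["pos"],
    weigh=lambda r: 1.0,
    train=train_rate_table,
)
```

Every new idea (a filter, a feature, a weighting scheme, a training option) changes only one argument of `run_pipeline`, which is the property the automated version of the pipeline relies on.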
Story of automation
Pipeline (no automation)
— Prepare raw data set for ML
— Apply filters (cuts) and mappers (add new filter)
— Calculate features (add new feature)
— Assign weights (new idea for weighting)
— Split into training and test sets
— Train the classifier on the training set (new train options)
— Look at the learning curve and check for overfitting
— Apply the resulting classifier model to the test set
— Calculate metrics and compare with the current best
Story of automation
— Create and commit YAML file
— Read the report
Story of automation
Engine: "matrixnet" # options: VW, TMVA (TODO!)
Mappers: |
  [
    Join('PLACE FOR NEW FEATURES'),
    Grep('r.Age > 10 and PLACE FOR GREP IDEA'),
    Mapper('r.Weight = PLACE FOR WEIGHT IDEA'),
    yabs.matrixnet.factor.DefaultFactors(),
  ]
MailTo: ml-reports@yandex-team.ru
Options: 'PLACE FOR NEW OPTIONS'
Tables: 'EFHFactors:last_14_days'
Pipeline (with automation)
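The slides do not show how `Grep` and `Mapper` are implemented. The following is a guess at how such expression-string mappers could work, with hypothetical classes that only share their names with the ones in the YAML file. It uses `eval` and `exec`, so it assumes the configuration comes from a trusted repository.

```python
from types import SimpleNamespace

class Grep:
    """Keep only records for which the expression over `r` is true."""
    def __init__(self, expr):
        self.code = compile(expr, "<grep>", "eval")

    def __call__(self, records):
        for r in records:
            if eval(self.code, {}, {"r": r}):
                yield r

class Mapper:
    """Run a statement such as 'r.Weight = ...' on every record."""
    def __init__(self, stmt):
        self.code = compile(stmt, "<mapper>", "exec")

    def __call__(self, records):
        for r in records:
            exec(self.code, {}, {"r": r})
            yield r

def run_mappers(records, mappers):
    """Chain the mappers lazily, in the order they are listed."""
    for m in mappers:
        records = m(records)
    return list(records)

# Hypothetical records and ideas filled into the placeholders
records = [SimpleNamespace(Age=5, Clicks=0), SimpleNamespace(Age=30, Clicks=2)]
mappers = [
    Grep("r.Age > 10"),
    Mapper("r.Weight = 1.0 + r.Clicks"),
]
result = run_mappers(records, mappers)
```

With this design a new research idea is a one-line string in the configuration rather than a new script.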
Story of automation
Diagram labels: Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE, ML Infrastructure, Report, YAML-file
Story of automation
metric | learn | test | test cur.
---------------------------------------
ll_p | 0.38171 | 0.36074 | 0.14527
ll_r | 0.38966 | 0.37151 | 0.33247
f1_p | 0.44869 | 0.44430 | 0.43266
fom_p | 0.91526 | 0.90580 | 0.88528
kl_p | 0.31143 | 0.29581 | 0.13186
log_loss | 0.39965 | 0.40354 | 0.44178
mcc_p | 0.30788 | 0.30159 | 0.28512
q10_p | 2.6632 | 2.5994 | 2.5261
q2_p | 1.6315 | 1.6212 | 1.5886
q_p | 1.6244 | 1.6089 | 1.5777
Report
Story of automation
Diagram labels: ML Infrastructure, Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE, Production, YAML-file, Report (llp), Experiment (1%), Report (Money, Clicks), Deploy new model
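One common way to run an experiment on about 1% of traffic is deterministic hash bucketing. The sketch below is an assumption about how such a split could be done, not a description of the production system; the salt and the user identifiers are made up.

```python
import hashlib

def in_experiment(user_id, salt="new-model", share=0.01):
    """Deterministically route about `share` of users to the experiment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < share

# Hypothetical user identifiers
users = [f"user-{i}" for i in range(100_000)]
exposed = sum(in_experiment(u) for u in users)
```

Because the assignment depends only on the salt and the user identifier, the same user always sees the same model for the duration of the experiment.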
Challenges (scientific)
— Multi-armed bandit problem
• A banner is a black box with an estimated CTR
• Historical data is used for prediction
— Default model bias
• The training set is generated by the default model
— Move from KPIs to metrics and cost functions
• Business strategy → (approximate) metrics
— Balancing between different cost functions
• Clicks, Money, Conversions, CPA
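To make the first challenge concrete, here is a minimal epsilon-greedy policy, one standard textbook approach to the multi-armed bandit problem. The talk does not say which approach was used; the class, the banner names and the true CTR values are illustrative assumptions.

```python
import random

class EpsilonGreedy:
    """Pick banners by estimated CTR, exploring at random with probability epsilon."""
    def __init__(self, banners, epsilon=0.1, seed=0):
        self.shows = {b: 0 for b in banners}
        self.clicks = {b: 0 for b in banners}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def ctr(self, banner):
        # Estimated CTR from history; unseen banners get an optimistic 1.0
        shows = self.shows[banner]
        return self.clicks[banner] / shows if shows else 1.0

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))  # explore
        return max(self.shows, key=self.ctr)          # exploit

    def update(self, banner, clicked):
        self.shows[banner] += 1
        self.clicks[banner] += int(clicked)

# Hypothetical true CTRs, unknown to the policy
true_ctr = {"A": 0.02, "B": 0.05}
policy = EpsilonGreedy(true_ctr, epsilon=0.1)
env = random.Random(1)
for _ in range(10_000):
    banner = policy.choose()
    policy.update(banner, env.random() < true_ctr[banner])
```

The random exploration step also collects data on banners the current model would not have shown, which is one way to reduce the bias of a training set generated by the default model.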
Challenge (automation):
Graphical Pipelines Framework
Diagram labels: Simulation data, Experimental data, map, train, classify, map, Cut by threshold, Filter background, Estimate mixture parameters, Show mass distribution, Run
Automation for me is:
— Tools
What is Automation?
Automation for me is:
— Tools (in TMVA)
What is Automation?
Normalization
Rectangular Cuts
SVM
Boosted Trees
Gaussianisation
PCA
PDE
Decorrelation
Genetic Algorithms
Automation for me is:
— Tools
• Macro language (high level language)
for expressing ideas
What is Automation?
Diagram labels: Simulation data, Experimental data, map, train, classify, map, Filter by threshold, Filter background, Estimate mixture parameters, Show mass distribution
Automation for me is:
— Tools
• Macro language (high level language)
for expressing ideas
— Infrastructure
• Connecting with arrows
• Whole pipeline coverage
What is Automation?
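"Connecting with arrows" can be read as running a dependency graph of boxes. A minimal sketch, assuming Python 3.9+ for `graphlib`; the box names and the toy functions are hypothetical and only loosely follow the diagram.

```python
from graphlib import TopologicalSorter

def run_graph(steps, inputs):
    """Run each box after the boxes its arrows come from.

    steps maps a box name to (function, [names of upstream boxes]).
    """
    order = TopologicalSorter({name: deps for name, (_, deps) in steps.items()})
    results = dict(inputs)
    for name in order.static_order():
        if name in results:  # raw input, nothing to compute
            continue
        func, deps = steps[name]
        results[name] = func(*(results[d] for d in deps))
    return results

steps = {
    # Toy "model": the mean of the simulated values, used as a threshold
    "train": (lambda sim: sum(sim) / len(sim), ["simulation"]),
    "classify": (lambda model, data: [x > model for x in data],
                 ["train", "experimental"]),
    "selected": (lambda flags, data: [x for f, x in zip(flags, data) if f],
                 ["classify", "experimental"]),
}
out = run_graph(steps, {"simulation": [1.0, 2.0, 3.0],
                        "experimental": [0.5, 2.5, 4.0]})
```

Covering the whole pipeline then means that every box, from raw data to the final plot, is a node in one such graph.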
Automation for me is:
— Tools
• Macro language (high level language)
for expressing ideas
— Infrastructure
• Connecting with arrows
• Whole pipeline coverage
— Specialization
• Collaboration and delegation
What is Automation?
Automation for me is:
—…
— Specialization
• Collaboration and delegation
What is Automation?
classifier
train set model
parameters
Parameters
What is Automation?
Comp. Complexity
Model
Proper
Defective
Cost Function
Learning rate
Tree depth
Regularization
Features Types
Number of trees
Automation for me is:
— Tools
• Macro language (high level language) for
expressing ideas
— Infrastructure
• Connecting with arrows
• Whole pipeline coverage
— Specialization
• Collaboration and delegation
What is Automation?
(1) Copy and paste data
— Add new boxes to automated pipeline
— Automate transport between all boxes
— Do not use strange software
Everyday rules: anti-patterns
(2) Execute data pipeline steps manually in a cycle.
— Define new command for this pipeline
— Use standard formats for data streams
— Define the ‘mappers’ and ‘reducers’ that the data
stream needs and use them
Everyday rules: anti-patterns
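A minimal sketch of the remedy for anti-pattern (2): a mapper and a reducer chained over a stream in a standard format. JSON lines are assumed here as the format, and the record fields are hypothetical.

```python
import json
from collections import defaultdict

def read_stream(lines):
    """Standard format assumed here: one JSON record per line."""
    for line in lines:
        yield json.loads(line)

def map_clicks(records):
    """Mapper: emit (key, (shows, clicks)) pairs."""
    for r in records:
        yield r["banner"], (1, r["click"])

def reduce_ctr(pairs):
    """Reducer: aggregate per key and return the click-through rate."""
    totals = defaultdict(lambda: [0, 0])
    for key, (shows, clicks) in pairs:
        totals[key][0] += shows
        totals[key][1] += clicks
    return {key: clicks / shows for key, (shows, clicks) in totals.items()}

lines = [
    '{"banner": "A", "click": 1}',
    '{"banner": "A", "click": 0}',
    '{"banner": "B", "click": 0}',
]
ctr = reduce_ctr(map_clicks(read_stream(lines)))
```

Once the steps are functions over a stream, the whole chain becomes a single command instead of a sequence of manual actions.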
(3) Your code is >3 times longer than its natural-language
description
— Start working on new tools (macro languages, DSL)
Everyday rules: anti-patterns
(4) It takes >1 man-hour to recalculate the final graph of
your research
— Automate the whole pipeline
Everyday rules: anti-patterns
(5) You write a line of code that has no chance of being
executed >10,000 times
Everyday rules: anti-patterns
Code (>10000 times)
    from numpy import argsort, corrcoef, linalg, mean, transpose

    def pca(data, reduce_dims=0, corr=True,
            normalise=False, subtract_mean=True):
        data_mean = None
        if subtract_mean:
            data_mean = mean(data, axis=0)
            # Centre a copy so the caller's array is not modified
            data = data - data_mean
        transposed = transpose(data)
        cov_matrix = corrcoef(transposed)
        # Compute eigenvalues and sort into
        # descending order
        eigen_vals, eigen_vecs = linalg.eig(cov_matrix)
        indices = argsort(eigen_vals)
        indices = indices[::-1]
        eigen_vecs = eigen_vecs[:, indices]
        eigen_vals = eigen_vals[indices]
        # ... (rest of the function not shown)
Interactive Data Analysis (once)
    data = filter(data, "RegionID = 213")
    data1, data2 = split_random(data)
    data2ext = decorrelate(data1, data2,
                           fields=["age", "income", …])
    report = check_features(data2ext)
    show_report(report)
(5) You write a line of code that has no chance of being
executed >10,000 times
Choose one action at a time, (A) or (B):
A. Interactive data analysis using high-level tools
B. Coding: extending/improving the tools library or
infrastructure. Delegate it?
There are no other options.
Everyday rules: anti-patterns
(6) Your colleagues think that you are doing something
useless
— Stop doing questionable things
Everyday rules: anti-patterns
(7) You have a dream, and it hasn’t come true yet
— Tell Yandex about your dream
Everyday rules: anti-patterns
Artem Vorozhtsov
Head of Ads Targeting Group
avorozhtsov@yandex-team.ru
Thank you!

How AI, OpenAI, and ChatGPT impact business and software.
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Automating ML Research for Increased Predictability

  • 2. Towards increasing predictability of machine-learning research Artem Vorozhtsov
  • 5. System for displaying ads on Yandex’s search result pages and partners’ websites Ad Targeting Group Automation of Machine Learning Research Research with profit Introduction
  • 6. R&D best practices — Modularity — Computational Measurability — Transparency and Sharing — Automation
  • 7. — Modularity Units, reuse, abstraction R&D best practices
  • 8. — Computational Measurability Metrics Driven Development R&D best practices
  • 9. — Transparency and Sharing Collaboration and reproducibility R&D best practices
  • 11. R&D best practices — Modularity: units, reuse, abstractions — Computational Measurability: MDD — Transparency and Sharing: collaboration — Automation: …
  • 12. Happy life principles — Kindness — Wholeheartedness — Love — Discipline — Self-development
  • 13. Happy life principles — Kindness — Wholeheartedness — Love — Discipline — Self-development This is a list of global things, not local (everyday) rules
  • 15. — Where does automation stop? — Story of automating — Everyday rules — Questions Plan
  • 17. Automation – is the use of machines, control systems and information technologies to reduce the need for human work to optimize productivity in the production of goods and services.
  • 18. Automation – is the use of information technologies to optimize productivity and to increase predictability in the research, development and other projects.
  • 19. Complex KPIs Where does automation stop?
  • 20. KPI stands for Key Performance Indicators: — Money, Clicks on Ads — Comparison with rivals (# of segments where we are better) — Number of Nobel Prizes — Users & Government Loyalty — Loglikelihood of prediction Where does automation stop?
  • 21. — Strategy Thinking, Complex KPIs — Research Where does automation stop?
  • 22. .. where real research starts Where does automation stop? [Diagram labels: Intuition, Research, Creativity, Science; Tools, Complex Maths, Automated pipelines, PDEs, Metrics, Validators, SVM, PCA]
  • 23. 1. Imagine how simple and agile research work could be. 2. Believe it is possible, automate as much as you can, and find the place for research. Recipe
  • 24. Task: Ad click probability prediction (binary classification problem) KPI: Profit, Clicks, Conversions, Loglikelihood Yandex LLC Story of automation
  • 27. Story of automation [Diagram labels: Idea; ML Infrastructure: Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE; Report]
  • 28. Pipeline (no automation) — Prepare raw data set for ML — Apply filters (cuts) and mappers — Calculate features — Assign weights — Split into train and test — Train classifier on the training set — Look at the learning curve and check for overfitting — Apply the resulting classifier model to the test set — Calculate metrics and compare with current best Story of automation
  • 29. Pipeline (no automation) — Prepare raw data set for ML — Apply filters (cuts) and mappers (add new filter) — Calculate features (add new feature) — Assign weights (new idea for weighting) — Split into train and test — Train classifier on the training set (new train options) — Look at the learning curve and check for overfitting — Apply the resulting classifier model to the test set — Calculate metrics and compare with current best Story of automation
  • 30. Story of automation Pipeline (with automation) — Create and commit YAML file — Read the report
        Engine: "matrixnet"  # options: VW, TMVA (TODO!)
        Mappers: |
          [
            Join('PLACE FOR NEW FEATURES'),
            Grep('r.Age > 10 and PLACE FOR GREP IDEA'),
            Mapper('r.Weight = PLACE FOR WEIGHT IDEA'),
            yabs.matrixnet.factor.DefaultFactors(),
          ]
        MailTo: ml-reports@yandex-team.ru
        Options: 'PLACE FOR NEW OPTIONS'
        Tables: 'EFHFactors:last_14_days'
  • 31. Story of automation [Diagram labels: YAML-file; ML Infrastructure: Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE; Report]
  • 32. Story of automation Report
        metric   | learn   | test    | test cur.
        ---------------------------------------
        ll_p     | 0.38171 | 0.36074 | 0.14527
        ll_r     | 0.38966 | 0.37151 | 0.33247
        f1_p     | 0.44869 | 0.44430 | 0.43266
        fom_p    | 0.91526 | 0.90580 | 0.88528
        kl_p     | 0.31143 | 0.29581 | 0.13186
        log_loss | 0.39965 | 0.40354 | 0.44178
        mcc_p    | 0.30788 | 0.30159 | 0.28512
        q10_p    | 2.6632  | 2.5994  | 2.5261
        q2_p     | 1.6315  | 1.6212  | 1.5886
        q_p      | 1.6244  | 1.6089  | 1.5777
  • 35. Story of automation [Diagram labels: YAML-file; ML Infrastructure: Classifier (TMVA, …), filters, reducers, metrics, simulators, GnuPlot, MapReduce, STORAGE; Report (llp); Experiment (1%); Production; Report (Money, Clicks), shown twice; Deploy new model]
  • 37. Challenges (scientific) — Multi-armed bandit problem • Banner is a black box with estimated CTR • Historical data is used for prediction — Default model bias • Training set is generated by the default model — Move from KPIs to metrics and cost functions • Business Strategy → (approx.) metrics — Balancing between different cost functions • Clicks, Money, Conversions, CPA
  • 38. Challenge (automation): Graphical Pipelines Framework Simulation data Experimental data map train Cut by threshold Show mass distribution Filter background Estimate mixture parameters classify map Run
  • 39. Automation for me is: — Tools What is Automation?
  • 40. Automation for me is: — Tools (in TMVA) What is Automation? Normalization Rectangular Cuts SVM Boosted Trees Gaussianisation PCA PDE Decorrelation Genetic Algorithms
  • 41. Automation for me is: — Tools • Macro language (high level language) for expressing ideas What is Automation? Simulation data Experimental data map train Filter by threshold Show mass distribution Filter background Estimate mixture parameters classify map
  • 42. Automation for me is: — Tools • Macro language (high level language) for expressing ideas — Infrastructure • Connecting with arrows • Whole pipeline coverage What is Automation?
  • 43. Automation for me is: — Tools • Macro language (high level language) for expressing ideas — Infrastructure • Connecting with arrows • Whole pipeline coverage — Specialization • Collaboration and delegation What is Automation?
  • 44. Automation for me is: —… — Specialization • Collaboration and delegation What is Automation? classifier train set model parameters
  • 45. Parameters What is Automation? Comp. Complexity Model Proper Defective Cost Function Learning rate Tree depth Regularization Features Types Number of trees
  • 46. Automation for me is: — Tools • Macro language (high level language) for expressing ideas — Infrastructure • Connecting with arrows • Whole pipeline coverage — Specialization • Collaboration and delegation What is Automation?
  • 47. (1) Copy and paste data — Add new boxes to automated pipeline — Automate transport between all boxes — Do not use strange software Everyday rules: anti-patterns
  • 48. (2) Execute data pipeline steps manually in a cycle. — Define new command for this pipeline — Use standard formats for data streams — Define needed ‘mappers’ and ‘reducers’ for data stream and use them Everyday rules: anti-patterns
  • 49. (3) Your code is >3 times longer than natural language description — Start working on new tools (macro languages, DSL) Everyday rules: anti-patterns
  • 50. (4) It takes >1 man-hour to recalculate the final graph of your research — Automate the whole pipeline Everyday rules: anti-patterns
  • 51. (5) You write a line of code that has no chance of being executed >10,000 times Everyday rules: anti-patterns
  • 52. (5) You write a line of code that has no chance of being executed >10,000 times Everyday rules: anti-patterns
        Code (>10000 times):
            # Fragment as shown on the slide; NumPy names (mean, transpose,
            # corrcoef, linalg, argsort) are assumed to be imported.
            def pca(data, reduce_dims=0, corr=True, normalise=False, subtract_mean=True):
                data_mean = None
                if subtract_mean:
                    data_mean = mean(data, axis=0)
                    data -= data_mean
                transposed = transpose(data)
                cov_matrix = corrcoef(transposed)
                # Compute eigenvalues and sort into
                # descending order
                eigen_vals, eigen_vecs = linalg.eig(cov_matrix)
                indices = argsort(eigen_vals)
                indices = indices[::-1]
                eigen_vecs = eigen_vecs[:, indices]
                eigen_vals = eigen_vals[indices]
        Interactive Data Analysis (once):
            data = filter(data, "RegionID = 213")
            data1, data2 = split_random(data)
            data2ext = decorrelate(data1, data2, fields=["age", "income", …])
            report = check_features(data2ext)
            show_report(report)
  • 53. (5) You write a line of code that has no chance of being executed >10,000 times Choose one action at a time, (A) or (B): A. Interactive data analysis using high level tools B. Coding: extending/improving tools library or infrastructure. Delegate it? There are no other options. Everyday rules: anti-patterns
  • 54. (6) Your colleagues think that you are doing something useless — Stop doing questionable things Everyday rules: anti-patterns
  • 55. (7) You have a dream, and it hasn’t come true yet — Tell Yandex about your dream Everyday rules: anti-patterns
  • 56. Artem Vorozhtsov Head of Ads Targeting Group avorozhtsov@yandex-team.ru Thank you!
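
The two-step workflow from slides 28 to 32 (describe an idea in a config, then read a report that compares it with the current best) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `CONFIG` keys, the toy data generator, the constant-rate "classifier" and the `current_best_rate` baseline are all hypothetical stand-ins, not the Yandex pipeline or matrixnet.

```python
import math
import random

# Hypothetical experiment description, standing in for the committed YAML file.
# Every key and value here is an illustrative assumption.
CONFIG = {
    "grep": lambda r: r["age"] > 10,   # filter idea
    "weight": lambda r: 1.0,           # weighting idea
    "test_fraction": 0.3,
    "seed": 0,
    "current_best_rate": 0.5,          # stand-in for the current best model
}


def make_toy_rows(n, seed):
    """Generate toy click records; real rows would come from storage."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(5, 80)
        p_click = 0.1 + 0.5 * age / 100.0
        rows.append({"age": age, "click": 1 if rng.random() < p_click else 0})
    return rows


def log_loss(labels, probs, weights):
    """Weighted mean negative log-likelihood of binary labels."""
    eps = 1e-12
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


def run_experiment(config, rows):
    """Run filter -> weight -> split -> train -> metrics and return a report."""
    rng = random.Random(config["seed"])
    kept = [(r, config["weight"](r)) for r in rows if config["grep"](r)]
    rng.shuffle(kept)
    n_test = int(len(kept) * config["test_fraction"])
    test, train = kept[:n_test], kept[n_test:]

    # Stand-in "classifier": predicts the weighted click rate of the training set.
    rate = sum(w * r["click"] for r, w in train) / sum(w for _, w in train)

    def score(part, p):
        labels = [r["click"] for r, _ in part]
        weights = [w for _, w in part]
        return log_loss(labels, [p] * len(part), weights)

    return {
        "learn": score(train, rate),
        "test": score(test, rate),
        "test_current": score(test, config["current_best_rate"]),
    }


REPORT = run_experiment(CONFIG, make_toy_rows(2000, seed=1))
for name, value in REPORT.items():
    print(f"log_loss {name:>12}: {value:.5f}")
```

The point of the design is that a researcher only edits `CONFIG`; the fixed steps (filtering, weighting, splitting, training, scoring) live in one function that is run the same way for every idea.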

Editor's notes

  1. Hello, everybody. My name is Artem Vorozhtsov. I work at Yandex. Today I would like to talk about increasing the predictability of ML research. The 42nd slide contains the ultimate recipe for this problem, so your task is to wake up in twenty minutes.
  2. Before I start, let me introduce myself and say a few words about Yandex. Yandex is one of the few companies that provide nationwide search services. In 2011 we managed to go public with a capitalization of 8bn, and just two years later we celebrated a capitalization of 10bn. Google is our nearest rival.
  3. Yandex’s search engine has some features that Google doesn’t have, for instance “Islands” on the search result page. Yandex also provides some services that Google does not, for instance Yandex.Taxi and Yandex.Market.
  4. One month ago my colleague Andrey Ustyuzhanin talked about some buzzwords in IT. These words are software development best practices. They increase your project’s chances of success. In fact, one might treat them as must-haves. But the problem is that they do not work. I mean, even if you know these practices, understand their importance, and try to use them, you still have no guarantee of success.
  5. Anyhow, let’s look at them briefly. Modularity is the basic one. It is about packaging your functionality in units so that others can reuse them. Besides, modularity is about raising the abstraction level: existing units can be packaged again into higher-level units.
  6. Measurability is about metrics and validation. Common metrics for binary classification are loglikelihood and AUC. The quality of a user interface can be measured by usability, which can be computed as the average number of clicks required to complete a common user task. All these metrics, and the idea of looking at them, are not a big deal. But IT specialists even invented a new buzzword, MDD, which stands for Metrics Driven Development. This is a practice where each development step and decision is validated by metrics. So why did they give a special name to such a simple must-have? The reason is that people just … forget to pick the RIGHT metrics, or do not express them in a COMPUTABLE form, or do not monitor their values. And sometimes it’s hard to express a real-world thing in computable form, but it should be done.
  7. Transparency and Sharing are about collaboration, standards and reproducibility. The ability to reproduce the results of a piece of research is very important. And one should be able to dive into any black box, unless one wants to stay at a certain abstraction level.
  8. The last and most obscure one is automation. It is the main buzzword, and I will talk about it in detail later.
  9. Advice like “Be kind to people”, “Don’t be angry”, “Think before doing” does not work at all. You know.
  10. So this list should be converted into a list of everyday rules.
  11. So I am going to give you everyday rules that a researcher-developer should follow. But first, I want to talk about automation itself. Then I will tell you the story of Machine Learning research automation at Yandex.
  12. So, let’s talk about automation itself. One may think that automation is something like this. But in fact, automation is not necessarily about fixed, immovable industrial pipelines.
  13. Let’s look at Wikipedia. Automation is the use of machines to optimize productivity. Automation still allows research processes to be agile, and it does not necessarily exclude humans from pipelines. Moreover, there was an interesting change in Wikipedia: the phrase “reduce the need for human work” was replaced with “optimize productivity”. The real goal of automation is not to replace humans but to optimize production.
  14. Let’s rewrite this definition, just a little bit. Automation is the use of information technologies to optimize productivity and to increase predictability in research, development and other projects. Now this definition gives the key to the main question of the presentation. BTW, programmers are big fans of automation. They try to automate everything they meet, so automation spreads all over the system .. like an infection, really, … until it hits a border. And that is an interesting thing: the place where the spread of automation stops. I have a story about how I found this place at Yandex.
  15. Just two weeks ago I came to my boss and started asking him how he makes his big-boss decisions. You know, things like “close one department, start a new project, make the background yellow for all ads on the Yandex search page”, etc. I wanted to know much more about the internal cost function that drives his decisions. And he told me, “Stop. Don’t try to virtualize me. I like the real live version of me much more.” So big-boss strategic thinking cannot be digitized. It’s something incomputable; it is based on intuition.
  16. KPI stands for Key Performance Indicator. Typical company KPIs are profit, number of clicks on ads, variety of products, etc., or comparison with the nearest rivals (for Yandex it’s Google, BTW). For the LHC, one of the KPIs is the number of Nobel Prizes.
  17. Now I have an answer to the question. Automation stops when it meets the boss’s strategic thinking. And the second answer is relevant to CERN: automation stops where the real research work starts.
  18. And it’s a good reason to figure out where automation stops. The circle in the picture is the whole of machine learning work. The left, blue part is the one that can be automated. It’s about computable metrics, automated pipelines, ready-to-use state-of-the-art algorithms, and infrastructure. The red part is about science; it’s about using these tools. Which one do you choose? The skills of a person doing the right-hand (red) part differ from the skills of a person who creates tools.
  19. So, the recipe. It has two steps: 1) imagine how simple and agile it could be; think about what an ideal research and development process could look like; 2) believe it is possible to build this system; automate as much as possible, and the rest is the place for real research.
  20. My story of automation is not a fairy tale. It is a story where a researcher’s dream came true. It’s about aggressive automation of different ML tasks, and about making ML research as simple as it can be. The task is predicting the probability of a click on an ad on search result pages and partners’ websites. This is a typical binary classification problem.
  21. This dream may come true. Let’s look at a researcher’s pipeline before automation. Nothing is automated and there are many steps. I listed only the first nine steps.
  22. There are some places in the pipeline where a researcher could make changes. He or she could add new features to the data set, or use special options for the binary classification algorithm (TMVA boosted trees or genetic algorithms). Or he or she could change the pre- or post-processing of the data (different filters and mappers). And the researcher had to know the current default parameters of each step.
  23. Now, in my research group, the pipeline consists of only two steps: create and commit a YAML file, and read the report. The green words are placeholders where a researcher can put his or her ideas.
  24. The final process looks like this. We are geeks, so we chose YAML for expressing our ideas. And it is not just YAML, it is YAML with Python code inside. We express our ideas in YAML && Python and it works well for us.
  25. The report tells how good the idea is. The first lines are metric values for the training and test sets. The first metric is llp; it is based on the loglikelihood of the prediction.
  26. The first figure in the report is the learning curve. It shows how the quality of the classifier changes with iterations. The number of iterations in our classifier is the number of trees in the forest of boosted trees. Sometimes the number of iterations is several hundred and sometimes it is several thousand.
  27. The second figure has details about the strength of each feature. And that’s all.
  28. In fact there is room for more automation. If a report shows good quality metrics, then the trained prediction model should be checked in production. A 1% experiment is started automatically. After a week the researcher gets a report with online, real-world metrics of the experiment, like money, clicks, conversions, CPC, CTR, and others. If all these metrics are better in the experiment than for the default model, then the new model replaces the default one. The last report is about the inverse experiment (the old default). It should be worse than the new default.
  29. So, again: what is automation for me? Firstly, it is a set of tools, and some of them are very high-level tools.
  30. Software infrastructure is responsible for how easily all these boxes (and the researcher) can be connected into a pipeline. An important property of infrastructure is coverage; it is an intrinsic, defining property of infrastructure. Infrastructure should embrace data delivery and storage, machine learning software, tools for manipulating and visualizing data, tools for searching and indexing your data, and reporting tools.
  31. Finally, automation brings specialization.
  32. There should be a person or group of people responsible for the classifier. We have such a classifier at Yandex. It’s called matrixnet, and there is a group of programmers responsible for it. They add new cost functions, improve performance and remove imperfections from it. I have a strong opinion that delegation is important in research too. It improves collaboration and increases end KPIs.
  33. Some final slides with everyday rules. I express them as anti-patterns. If you match an anti-pattern, then you should do something to get rid of the match. The first one is very simple. The green text contains recommendations. So if you copy and paste data, then something is wrong. Maybe there is a step that is not yet included in your pipeline. Then just do it: add a new box to your pipeline and automate the transport of data to this box (I mean input data). Maybe you use software with a graphical user interface and there is no other way to input data into the program. Then just don’t use this program.
  34. This anti-pattern gives an interesting criterion. Suppose you have an idea to apply a new filter to the raw data right at the beginning of the research pipeline. How long will it take to get the final numbers or figures of your research? The answer has two components: human time and machine time. It’s not good if it takes more than ten minutes of your time.
  35. This one is not about meta-transition. It’s about the motivation of your current action. What are you doing now? Are you writing code? Is it your job to write this code? Who will use it? What is the purpose of this code? And how many times will it be executed in the future? (Just guess.)
  36. And the crucial point here is the difference between code and interactive data analysis. When you do an interactive experiment you use really high-level tools, and most of your lines are executed only once. If it is not an interactive experiment, then it is coding.
  37. And if it’s coding, then your code should be about extending or improving the tools library or the infrastructure. Writing a line of code is a great responsibility. It is like a brick in a cathedral. You should know that a line of code is born to be executed many, many times. So there are only two options: interactive analysis, or coding tools and software infrastructure. Anything else is just a waste of time (probably).
  38. This one is questionable, but it is not a joke. Among programmers there are those who are 10 times more productive than others. The secret of their productivity is not typing speed, and they are not 10 times cleverer than others. They just don’t do questionable things. The situation is as follows: there are a lot of ideas and tickets to do, maybe a hundred or so. This rule is about working only on an idea or ticket whose importance has been confirmed multiple times, whose results will have multiple applications, and which no chief programmer or researcher thinks is just a waste of time.
  39. And the final rule is about doing me a favor. I want to know more about your needs and research work related to data mining and Machine Learning. I will stay at CERN for a week.
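
Note 28 describes an automatic decision after the 1% experiment: the new model replaces the default only if all online metrics are better, and the inverse experiment should then show the old default as worse. A minimal sketch of that rule is given below. The metric names, the "higher is better" directions and all numbers are assumptions made for illustration; CPC is left out because the talk does not say which direction counts as better, and a real system would also need a statistical significance check, which is not modelled here.

```python
# Direction of "better" for each online metric. These directions are an
# assumption made for illustration; the talk does not state them.
HIGHER_IS_BETTER = {"money": True, "clicks": True, "conversions": True, "ctr": True}


def is_better(candidate, baseline, higher_is_better=HIGHER_IS_BETTER):
    """Return True only if the candidate beats the baseline on every metric."""
    for name, up_is_good in higher_is_better.items():
        delta = candidate[name] - baseline[name]
        # A tie, or a move in the wrong direction, on any metric blocks promotion.
        if delta == 0 or (delta > 0) != up_is_good:
            return False
    return True


def decide(experiment, default):
    """Decision for the 1% experiment: promote the new model or keep the default."""
    return "promote" if is_better(experiment, default) else "keep default"


def inverse_check(old_default, new_default):
    """After promotion, the old default (run as the experiment) should be worse."""
    return is_better(new_default, old_default)


# Made-up numbers, purely illustrative.
default_metrics = {"money": 100.0, "clicks": 1000, "conversions": 50, "ctr": 0.020}
experiment_metrics = {"money": 103.0, "clicks": 1015, "conversions": 52, "ctr": 0.021}
mixed_metrics = {"money": 104.0, "clicks": 990, "conversions": 52, "ctr": 0.021}

print(decide(experiment_metrics, default_metrics))
print(decide(mixed_metrics, default_metrics))
```

Requiring every metric to improve is a deliberately strict rule: a model that earns more money but loses clicks, like `mixed_metrics` here, is not promoted automatically and is left for a human decision.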