Bayesian Optimization
of ML Hyper-parameters
Maksym Bevza
Research Engineer at Grammarly
ML solves complex problems
Computer vision
Machine translation
Speech recognition
Game playing
Other complex problems
● And many more
○ Recommender systems
○ Natural language understanding
○ Robotics
○ Grammatical error correction
Growth in the number of parameters
● The number of parameters grows tremendously
○ Number of layers
○ Convolution kernel size
○ Number of neurons
○ Dropout drop rate
○ Learning rate
○ Batch size
● Preprocessing params
Tuning parameters is magic
● Complex systems are hard to analyse
● Impact of parameters on success is obscure
● Success of an ML algorithm depends on
○ Data
○ A good algorithm/architecture
○ Good parameter settings
Tuning parameters is crucial
Goals
● Introduce Bayesian Optimization to the audience
● Share personal experience
○ Results on the digit recognition problem
○ Toolkits for Bayesian Optimization
Overview
● Tuning ML hyper-parameters
● Bayesian Optimization
● Available software
● Experiments in research field
● My experiments
Tuning ML hyper-parameters
Tuning ML hyper-parameters
● Grid search
● Random search
● Grad student descent
Grid Search
1. Define a search space
2. Try all 4*3=12 configurations
Search space for SVM Classifier
{
    'C': [1, 10, 100, 1000],
    'gamma': [1e-2, 1e-3, 1e-4],
    'kernel': ['rbf'],
}
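For reference, a minimal sketch of this kind of grid search with scikit-learn's GridSearchCV; the SVC model, the toy digits dataset and 3-fold CV are my assumptions, not part of the slides.

# Minimal grid-search sketch (assumed setup: sklearn SVC on the toy digits dataset).
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_grid = {
    'C': [1, 10, 100, 1000],
    'gamma': [1e-2, 1e-3, 1e-4],
    'kernel': ['rbf'],
}

# 4 * 3 * 1 = 12 configurations, each evaluated with 3-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)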
Random Search
1. Define the search space
2. Sample the search space and run the ML algorithm
Search space for SVM Classifier
{
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf'],
}
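Likewise, a hedged sketch of random search over the same kind of space with scikit-learn's RandomizedSearchCV; the model, dataset and iteration budget are assumptions of this sketch.

# Minimal random-search sketch (assumed setup: sklearn SVC on the toy digits dataset).
import scipy.stats
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_distributions = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf'],
}

# The number of iterations is fixed upfront; each sampled point is cross-validated.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)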
Grid Search: pros & cons
● Fully automatic
● Parallelizable
● Number of experiments grows exponentially with the number of params
● Waste of time on unimportant parameters
● Some points in the search space are not reachable
● Does not learn from previous iterations
Random Search: pros & cons
● Fully automatic
● Parallelizable
● Number of iterations is set upfront
● No time wasted on unimportant parameters
● All points in the search space are reachable
● Does not learn from previous iterations
● Does not take into account evaluation cost
● f(x, y) = g(x) + h(y)
● h(y) contributes much less than g(x), i.e. y is a low-impact parameter
Grid Search vs Random Search
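A tiny simulation (mine, not from the slides) of why random search wins in this setting: with a budget of 9 evaluations, a 3x3 grid probes only 3 distinct values of the important parameter x, while random sampling probes 9.

# Toy illustration (assumed g, h): f(x, y) = g(x) + h(y), where h contributes far less than g.
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: (x - 0.3) ** 2          # important direction
h = lambda y: 0.01 * (y - 0.7) ** 2   # unimportant direction
f = lambda x, y: g(x) + h(y)

# Grid search: 3x3 grid -> only 3 distinct values of x are ever tried.
grid = [(x, y) for x in np.linspace(0, 1, 3) for y in np.linspace(0, 1, 3)]
best_grid = min(f(x, y) for x, y in grid)

# Random search: 9 samples -> 9 distinct values of x.
points = rng.uniform(0, 1, size=(9, 2))
best_random = min(f(x, y) for x, y in points)

print(best_grid, best_random)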
Grad Student Descent
● A researcher fiddles around with the parameters until it works
The method name is due to Ryan Adams
Grad Student Descent: pros & cons
● Learns from previous iterations
● Takes into account evaluation cost
● Parallelizable
● Benefits from understanding the semantics of hyper-parameters
● Search is biased
● Requires a lot of manual work
Comparison of all methods

                                 | Grid Search | Random Search | Grad Student Descent
Fully automatic                  | Yes         | Yes           | No
Learns from previous iterations  | No          | No            | Yes
Takes into account eval. cost    | No          | No            | Yes
Parallelizable                   | Yes         | Yes           | Yes
Reasonable search time           | No          | Yes           | Yes
Handles unimportant parameters   | No          | Yes           | Yes
Search is NOT biased             | Yes         | Yes           | No
Good software                    | Yes         | Yes           | N/A
Bayesian Optimization: the goal
● Fully automatic
● Learns from previous iterations
● Takes into account evaluation cost
● Search is not biased
● Parallelizable
● Available software is non-free and not stable
Bayesian Optimization (BO)
What is it?
● Let’s treat our ML algorithm as a function f : X -> Y
● X is our search space for hyper-parameters
● Y is the set of scores that we want to optimize
● Let’s consider other parameters to be fixed (e.g. the dataset)
Background
● X - a search space
{
'C': [1, 1000],
'gamma': [0.0001, 0.1],
'kernel': ['rbf'],
}
Background: Examples
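To make the framing concrete, a hedged sketch of such an f for the search space above; the SVC model and cross-validated accuracy are my assumptions, not the talk's.

# f : X -> Y, where X is the search space above and Y is a validation score.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_data, y_data = load_digits(return_X_y=True)

def f(C, gamma, kernel='rbf'):
    """Evaluate one point of the search space; returns mean 3-fold CV accuracy."""
    model = SVC(C=C, gamma=gamma, kernel=kernel)
    return cross_val_score(model, X_data, y_data, cv=3).mean()

print(f(C=10, gamma=0.001))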
● We can optimize towards any score (even non-differentiable)
○ Validation error rate
○ AUC
○ Recall at fixed FPR
○ Many more
Background: Examples
● Our ML algorithm f gets similar scores for similar settings
● We can leverage this to try settings that are more promising
● For custom scores this condition should hold
Intuition
● Let’s consider a one-dimensional function f : R -> R
● Let’s suppose we want to minimize f
An example
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
● Build all possible functions
● Less smooth functions are less probable
An example
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
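In practice the "distribution over functions" is usually a Gaussian process surrogate. A minimal sketch with scikit-learn, where the toy objective, the observed points and the Matern kernel are all assumptions of this sketch:

# GP surrogate over a 1-D toy objective: posterior mean and uncertainty per candidate point.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

objective = lambda x: np.sin(3 * x) + 0.1 * x  # toy f : R -> R to minimize

X_obs = np.array([[0.2], [1.0], [2.5]])        # settings tried so far
y_obs = objective(X_obs).ravel()               # their observed scores

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

X_cand = np.linspace(0, 3, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)  # surrogate mean and std per candidate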
Which point to try next?
● Exploration: Try places with high variance
● Exploitation: Try places with low mean
Exploration / Exploitation tradeoff
● Probability of Improvement (PI)
● Expected Improvement (EI)
● Other, more elaborate criteria
Strategies of choosing next point
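For illustration, the closed-form Expected Improvement under a Gaussian posterior (minimization convention; the xi exploration margin is a common but assumed default), applied to the surrogate mean and std from the GP sketch above:

# Expected Improvement for minimization, given GP posterior mean/std at candidate points.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y, xi=0.01):
    """EI(x) = (best_y - mu - xi) * Phi(z) + sigma * phi(z), with z = (best_y - mu - xi) / sigma."""
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    improvement = best_y - mu - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# Next point to try: the candidate with the highest EI (mu, sigma, X_cand, y_obs from the GP sketch).
# x_next = X_cand[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]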
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
Let’s go step by step
[Step-by-step sequence of figures showing successive Bayesian Optimization iterations; images from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf]
What about cost of evaluation?
● Hyper-parameters often impact the evaluation time
● Number of hidden layers, neurons per layer (Deep Learning)
● Number and depth of trees (Random Forest)
● Number of estimators (Gradient Boosting)
Time limits vs evaluation limits
● In practice we deal with time limits
● E.g. what’s the best set-up we can get in 7 days?
● Try cheap evaluations first
● Given a rough characterisation of f, try expensive evaluations
How to account for cost of evaluation?
● Let’s estimate two functions at a time:
○ The function f itself
○ The cost of evaluation (duration) of function f
● We can use BO to estimate both functions
● We chose the point with the highest Expected Improvement
● Pick the point with the highest EI per second instead
Strategy of choosing next point with cost
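Continuing the toy sketches above (it reuses X_obs, y_obs, mu, sigma, X_cand and expected_improvement), a cost-aware variant might fit a second surrogate for the (log) duration and rank candidates by EI per second; the observed durations here are made up for illustration.

# EI per second: trade off predicted improvement against predicted evaluation time.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Durations (seconds) observed for the same settings as X_obs / y_obs above (made-up values).
durations = np.array([12.0, 95.0, 430.0])

gp_cost = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp_cost.fit(X_obs, np.log(durations))              # model log-duration for stability

ei = expected_improvement(mu, sigma, y_obs.min())  # from the sketch above
predicted_seconds = np.exp(gp_cost.predict(X_cand))
x_next = X_cand[np.argmax(ei / predicted_seconds)]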
[Illustrative figures; images from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf]
Comparison of all methods

                                 | Grid Search | Random Search | Grad Student Descent | Bayesian Optimization
Fully automatic                  | Yes         | Yes           | No                   | Yes
Learns from previous iterations  | No          | No            | Yes                  | Yes
Takes into account eval. cost    | No          | No            | Yes                  | Yes
Parallelizable                   | Yes         | Yes           | Yes                  | Tricky
Reasonable search time           | No          | Yes           | Yes                  | Yes
Handles unimportant parameters   | No          | Yes           | Yes                  | Yes
Search is NOT biased             | Yes         | Yes           | No                   | Yes
Good software                    | Yes         | Yes           | N/A                  | No
What’s the catch?
● Bayesian optimization software is tricky to build
● Leveraging clusters for parallelization is hard
● No hype around it
What’s the catch?
Available software
● The toolkits built by researchers are not supported well
○ Spearmint
○ SMAC
○ HyperOpt
○ BayesOpt
● Non-bayesian alternatives
○ TPE (Tree-structured Parzen Estimator)
○ PSO (Particle Swarm Optimization)
Available software
● SigOpt provides Bayesian Optimization as a service
● Claims state-of-the-art Bayesian Optimization
● Their customers
○ Prudential
○ Huawei
○ MIT
○ Hotwire
○ ...
SigOpt
def evaluate_model(assignments):
    return train_and_evaluate_cv(**assignments)
SigOpt API
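train_and_evaluate_cv is not shown in the slides; one hypothetical implementation (sklearn SVC with cross-validation is my assumption) could be:

# Hypothetical helper assumed by evaluate_model above; not part of the SigOpt API.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_data, y_data = load_digits(return_X_y=True)

def train_and_evaluate_cv(C, gamma):
    """Train an SVC for the suggested (C, gamma) and return mean CV accuracy as the metric."""
    model = SVC(C=C, gamma=gamma, kernel='rbf')
    return cross_val_score(model, X_data, y_data, cv=3).mean()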
from sigopt import Connection
conn = Connection(client_token='TOKEN')
SigOpt API
experiment = conn.experiments().create(
    name='Some Optimization (Python)',
    parameters=[
        dict(name='C', type='double', bounds=dict(min=0.0, max=1.0)),
        dict(name='gamma', type='double', bounds=dict(min=0.0, max=1.0)),
    ],
)
SigOpt API
for _ in range(30):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    value = evaluate_model(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
SigOpt API
Experiments in research field
Snoek et al. (2012)
● CIFAR-10
○ 60000 images
○ 32x32 colour
○ 10 classes
● Error rate: 14.98%
○ New state-of-the-art result (in 2012)
Snoek et al. (2012)
● Error rate: 14.98%
● Previous: 18%
Extensive analysis by Clark et al. (2016)
● Extensive analysis of BO and other search methods
● Different types of functions
○ Oscillatory
○ Discrete values
○ Boring
○ ...
Extensive analysis by Clark et al. (2016)
● Comparison method
○ Best found
○ AUC
Extensive analysis by Clark et al. (2016)
● For each function
○ First placed
○ Top three
○ Borda
Extensive analysis by Clark et al. (2016)
[Result charts comparing the search methods; images omitted]
My experiments
Task
● Digit recognition
● MNIST dataset
○ 70000 images
○ 28x28 grayscale
○ 10 classes
Model
Conv → Pool → Dropout
Conv → Pool → Dropout
Fully Connected → Dropout
Fully Connected → Dropout
Output (10)
● 6 parameters tuned
○ Number of filters per layer (1)
○ Number of convolution layers (1)
○ Dense layers size (2)
○ Batch size (1)
○ Learning rate (1)
Parameters of the model
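The exact training code is not in the slides; a hedged Keras sketch of a model builder exposing these tuned hyper-parameters (the dropout rates, kernel size and optimizer are my assumptions) might look like:

# Hypothetical Keras builder for the tuned CNN; dropout rates and kernel size are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_filters, n_conv_layers, dense1, dense2, learning_rate):
    model = keras.Sequential([keras.Input(shape=(28, 28, 1))])
    for _ in range(n_conv_layers):                     # Conv -> Pool -> Dropout blocks
        model.add(layers.Conv2D(n_filters, 3, activation='relu', padding='same'))
        model.add(layers.MaxPooling2D())
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense1, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(dense2, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation='softmax'))  # Output (10)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# model = build_model(n_filters=32, n_conv_layers=2, dense1=128, dense2=64, learning_rate=1e-3)
# model.fit(x_train, y_train, batch_size=64, epochs=10)  # batch size is the sixth tuned parameter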
● Features
○ Parameter types: INT, FLOAT, ENUM
○ Evaluation data stored in MongoDB
○ Works with noisy functions
● License: non-commercial usage
Spearmint
Results
MNIST Results: Random Search
MNIST Results: Bayesian Optimized
MNIST Results: Random vs Bayesian
● Best Random (avg): 1.20%
● Best Bayesian (avg): 0.86%
● Relative decrease in error rate: 28%
MNIST results: Random vs Bayesian
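As a quick check of the relative figure: (1.20 - 0.86) / 1.20 ≈ 0.283, i.e. roughly a 28% relative reduction in error rate.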
Final points
● Spearmint tries boundaries first
○ Be cautious in setting up your search space
● Use logarithmic scales when it makes sense (see the sketch below)
● Recommendations on iteration limit
○ 10-20 iterations per parameter
Gotchas
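For the logarithmic-scale point, one common pattern (my example, not from the talk) is to sample scale-type parameters log-uniformly:

# Log-uniform sampling for scale-type hyper-parameters such as learning rate or SVM C
# (requires SciPy >= 1.4 for scipy.stats.loguniform).
from scipy.stats import loguniform

learning_rate_dist = loguniform(1e-5, 1e-1)   # uniform in log space between 1e-5 and 1e-1
samples = learning_rate_dist.rvs(size=5, random_state=0)
print(samples)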
Conclusions
● Bayesian Optimization leads to better results
● SigOpt is, hopefully, the first stable implementation of BO
Thanks!
Maksym Bevza
Research Engineer at Grammarly
maksym.bevza@grammarly.com
www.grammarly.com
DataScienceLab, May 13, 2017
Optimizing machine-learning hyper-parameters with Bayesian Optimization
Maksym Bevza (Research Engineer at Grammarly)
All machine-learning algorithms need tuning. We often use Grid Search, Randomized Search, or our intuition to pick hyper-parameters. Bayesian Optimization helps steer Randomized Search towards the most promising regions, so that we get the same (or a better) result in fewer iterations.
All materials: http://datascience.in.ua/report2017
