
Bayesian Global Optimization


by Scott Clark, Co-founder & CEO, SigOpt



  1. BAYESIAN GLOBAL OPTIMIZATION Using Optimal Learning to Tune Predictive Models Scott Clark scott@sigopt.com
  2. OUTLINE 1. Why is Tuning Models Hard? 2. Comparison of Tuning Methods 3. Bayesian Global Optimization 4. Deep Learning Examples 5. Evaluating Optimization Strategies
  3. Machine Learning is extremely powerful
  4. Machine Learning is extremely powerful Tuning Machine Learning systems is extremely non-intuitive
  5. What is the most important unresolved problem in machine learning? “...we still don't really know why some configurations of deep neural networks work in some case and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.” Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix) https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3
  6. Photo: Joe Ross
  7-8. TUNABLE PARAMETERS IN DEEP LEARNING (figure-only slides)
  9. Photo: Tammy Strobel
  10. STANDARD METHODS FOR HYPERPARAMETER SEARCH
  11. STANDARD TUNING METHODS (diagram): Grid Search / Random Search / Manual Search → Parameter Configuration (weights, thresholds, window sizes, transformations) → ML / AI Model with Training Data, Cross Validation, and Testing Data
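To make the grid / random / manual comparison above concrete, here is a minimal sketch of grid search versus random search over two hyperparameters; `train_and_validate` is a hypothetical stand-in for an expensive training-plus-cross-validation run, not code from the talk.

```python
# Minimal sketch: grid search vs. random search over two hyperparameters.
# `train_and_validate` is a hypothetical stand-in that returns validation accuracy.
from itertools import product
import random

def train_and_validate(learning_rate, dropout):
    # Stand-in for training the model and measuring cross-validation accuracy.
    return 1.0 - (learning_rate - 0.01) ** 2 - (dropout - 0.5) ** 2

# Grid search: exhaustively evaluate a fixed lattice of configurations.
grid = list(product([0.001, 0.01, 0.1], [0.25, 0.5, 0.75]))
best_grid = max(grid, key=lambda cfg: train_and_validate(*cfg))

# Random search: sample the same number of configurations at random.
random.seed(0)
samples = [(10 ** random.uniform(-3, -1), random.uniform(0.1, 0.9)) for _ in range(len(grid))]
best_random = max(samples, key=lambda cfg: train_and_validate(*cfg))

print("grid best:", best_grid, "random best:", best_random)
```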
  12. OPTIMIZATION FEEDBACK LOOP (diagram): ML / AI Model (Training Data, Cross Validation, Testing Data) → Objective Metric → REST API → New Configurations → Better Results
  13. BAYESIAN GLOBAL OPTIMIZATION
  14. OPTIMAL LEARNING “…the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive.” Prof. Warren Powell, Princeton “What is the most efficient way to collect information?” Prof. Peter Frazier, Cornell “How do we make the most money, as fast as possible?” Scott Clark, CEO, SigOpt
  15. BAYESIAN GLOBAL OPTIMIZATION ● Optimize objective function ○ Loss, Accuracy, Likelihood ● Given parameters ○ Hyperparameters, feature/architecture params ● Find the best hyperparameters ○ Sample function as few times as possible ○ Training on big data is expensive
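In symbols (my notation, not from the deck), the setup on slide 15 is

    x^{*} = \arg\max_{x \in \mathcal{X}} f(x)

where x are the tunable hyperparameters, \mathcal{X} is the parameter domain, and f is the objective metric (loss, accuracy, likelihood). Each evaluation of f costs a full training run, so the goal is to get close to x^{*} with as few samples as possible.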
  16. HOW DOES IT WORK? SMBO: Sequential Model-Based Optimization
  17. GP/EI SMBO 1. Build a Gaussian Process (GP) with the points sampled so far 2. Optimize the fit of the GP (covariance hyperparameters) 3. Find the point(s) of highest Expected Improvement within the parameter domain 4. Return the optimal next best point(s) to sample
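A minimal, runnable sketch of those four steps on a 1-D toy problem, using scikit-learn's Gaussian process and the standard Expected Improvement formula. This illustrates GP/EI SMBO in general, not SigOpt's implementation; the `objective` function and all constants are stand-ins for an expensive model-training run.

```python
# GP/EI SMBO sketch: build GP -> fit covariance hyperparameters -> maximize EI -> sample.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive training run that returns a validation metric.
    return -(x - 2.0) ** 2 + np.sin(5.0 * x)

domain = np.linspace(0.0, 4.0, 400).reshape(-1, 1)   # parameter domain
X = np.array([[0.5], [3.5]])                          # initial samples
y = objective(X).ravel()

for _ in range(10):
    # Steps 1-2: build the GP on the points sampled so far and fit its
    # covariance hyperparameters (done internally by scikit-learn).
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(X, y)
    # Step 3: find the point of highest Expected Improvement over the domain.
    mu, sigma = gp.predict(domain, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = domain[np.argmax(ei)]
    # Step 4: sample the suggested point and add the observation.
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next[0]))

print("best x:", X[np.argmax(y)].item(), "best value:", y.max())
```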
  18-25. GAUSSIAN PROCESSES (figure-only slides)
  26. GAUSSIAN PROCESSES: overfit / good fit / underfit (figure slide)
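For readers without the figures, a small numpy sketch of the GP posterior those slides plot. The RBF kernel, length scale, and sample points are illustrative; informally, too small a length scale overfits the sampled points and too large a length scale underfits them, which is the overfit / good fit / underfit comparison on slide 26.

```python
# Minimal Gaussian Process posterior with an RBF kernel (numpy only).
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2))
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale=1.0, noise=1e-6):
    # Posterior mean and standard deviation of f(x_test) given the observations.
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale)
    K_ss = rbf_kernel(x_test, x_test, length_scale)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

x_train = np.array([0.0, 1.0, 2.5, 4.0])
y_train = np.sin(x_train)
x_test = np.linspace(0.0, 5.0, 50)
mean, std = gp_posterior(x_train, y_train, x_test, length_scale=1.0)
```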
  27-32. EXPECTED IMPROVEMENT (figure-only slides)
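For reference, the standard closed form of Expected Improvement plotted on those slides, for maximization with GP posterior mean \mu(x), standard deviation \sigma(x), and best observed value f^{+}, is

    \mathrm{EI}(x) = \mathbb{E}\big[\max(f(x) - f^{+}, 0)\big] = \big(\mu(x) - f^{+}\big)\,\Phi(z) + \sigma(x)\,\varphi(z), \qquad z = \frac{\mu(x) - f^{+}}{\sigma(x)}

with \mathrm{EI}(x) = 0 when \sigma(x) = 0, where \Phi and \varphi are the standard normal CDF and PDF. Both a high predicted mean and a high predictive uncertainty raise EI, which is how the acquisition function trades off exploitation and exploration.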
  33. DEEP LEARNING EXAMPLES
  34. SIGOPT + MXNET ● Classify movie reviews using a CNN in MXNet
  35. TEXT CLASSIFICATION PIPELINE (diagram): the REST API sends hyperparameter configurations and feature transformations to the ML / AI model (MXNet), which is trained on training text and evaluated on testing text; the validation accuracy is reported back as the objective metric for better results
  36. TUNABLE PARAMETERS IN DEEP LEARNING
  37. STOCHASTIC GRADIENT DESCENT ● Comparison of several RMSProp SGD parametrizations
  38. ARCHITECTURE PARAMETERS
  39. TUNING METHODS: Grid Search, Random Search, ?
  40. MULTIPLICATIVE TUNING SPEED UP
  41. SPEED UP #1: CPU -> GPU
  42. SPEED UP #2: RANDOM/GRID -> SIGOPT
  43. CONSISTENTLY BETTER AND FASTER
  44. SIGOPT + TENSORFLOW ● Classify house numbers in an image dataset (SVHN)
  45. COMPUTER VISION PIPELINE (diagram): the REST API sends hyperparameter configurations and feature transformations to the ML / AI model (TensorFlow), which is trained on training images and evaluated on testing images; the cross-validation accuracy is reported back for better results
  46. METRIC OPTIMIZATION
  47. SIGOPT + NEON (http://arxiv.org/pdf/1412.6806.pdf) ● All-convolutional neural network ● Multiple convolutional and dropout layers ● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
  48. COMPARATIVE PERFORMANCE ● Expert baseline: 0.8995 ○ (using neon) ● SigOpt best: 0.9011 ○ 1.6% relative reduction in error rate ○ No expert time wasted in tuning
  49. SIGOPT + NEON (http://arxiv.org/pdf/1512.03385v1.pdf) ● Explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions ● Variable depth ● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
  50. COMPARATIVE PERFORMANCE Standard Method ● Expert baseline: 0.9339 ○ (from paper) ● SigOpt best: 0.9436 ○ 15% relative reduction in error rate ○ No expert time wasted in tuning
  51. EVALUATING THE OPTIMIZER
  52. OUTLINE ● Metric Definitions ● Benchmark Suite ● Eval Infrastructure ● Visualization Tool ● Baseline Comparisons
  53. METRIC: BEST FOUND. What is the best value found after optimization completes? BEST_FOUND: BLUE 0.7225, RED 0.8949
  54. METRIC: AUC. How quickly is the optimum found? (area under curve) BEST_FOUND: BLUE 0.9439, RED 0.9435; AUC: BLUE 0.8299, RED 0.9358
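A small sketch of how these two metrics can be computed from a single optimization trace (the sequence of reported objective values). The slide does not give the exact AUC normalization used in the benchmark, so the version below is one plausible variant.

```python
# BEST_FOUND and AUC metrics from an optimization trace; trace values are made up.
import numpy as np

def best_found(trace):
    # Best value seen once optimization completes.
    return max(trace)

def auc_best_seen(trace):
    # Area under the "best seen so far" curve, normalized by the final best value;
    # rewards optimizers that find good values early.
    best_so_far = np.maximum.accumulate(np.asarray(trace, dtype=float))
    return best_so_far.mean() / best_so_far[-1]

trace = [0.61, 0.70, 0.68, 0.85, 0.84, 0.88]
print(best_found(trace), auc_best_seen(trace))
```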
  55. STOCHASTIC OPTIMIZATION
  56. BENCHMARK SUITE ● Optimization functions from the literature ● ML datasets: LIBSVM, Deep Learning, etc. Test function types (count): Continuous Params 184, Noisy Observations 188, Parallel Observations 45, Integer Params 34, Categorical Params / ML 47, Failure Observations 30; TOTAL 489
  57. INFRASTRUCTURE ● On-demand cluster in AWS for parallel eval-function optimization ● A full eval consists of ~20,000 optimizations, taking ~30 min
  58. RANKING OPTIMIZERS 1. Mann-Whitney U tests using BEST_FOUND 2. Tied results are then partially ranked using AUC 3. Any remaining ties stay as ties in the final ranking
  59. RANKING AGGREGATION ● Aggregate partial rankings across all eval functions using Borda count (sum of methods ranked lower)
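A rough sketch of this ranking procedure. The significance level, the use of sample means to pick the winner of a significant pair, and the omission of the AUC tie-break are my simplifications, not details taken from the slides.

```python
# Mann-Whitney U tests on BEST_FOUND samples, then Borda-style aggregation
# (sum of methods ranked lower) across evaluation functions.
from scipy.stats import mannwhitneyu

def pairwise_wins(results, alpha=0.05):
    # results: {method_name: [BEST_FOUND values over repeated runs]} for one eval function.
    wins = {m: 0 for m in results}
    methods = list(results)
    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            stat, p = mannwhitneyu(results[a], results[b], alternative="two-sided")
            if p < alpha:
                better = a if sum(results[a]) / len(results[a]) > sum(results[b]) / len(results[b]) else b
                wins[better] += 1   # Borda credit: one more method ranked below `better`
    return wins

def borda_aggregate(per_function_wins):
    # Sum Borda scores across all evaluation functions.
    totals = {}
    for wins in per_function_wins:
        for method, score in wins.items():
            totals[method] = totals.get(method, 0) + score
    return totals

# Made-up BEST_FOUND samples (5 repeated runs) for three methods on two eval functions.
f1 = {"sigopt": [0.941, 0.952, 0.935, 0.960, 0.948],
      "random": [0.902, 0.915, 0.893, 0.887, 0.908],
      "grid":   [0.861, 0.872, 0.858, 0.865, 0.879]}
f2 = {"sigopt": [0.721, 0.733, 0.715, 0.728, 0.719],
      "random": [0.701, 0.688, 0.695, 0.706, 0.692],
      "grid":   [0.642, 0.655, 0.631, 0.649, 0.660]}
print(borda_aggregate([pairwise_wins(f1), pairwise_wins(f2)]))
```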
  60. SHORT RESULTS SUMMARY
  61. BASELINE COMPARISONS
  62. SIGOPT SERVICE
  63. OPTIMIZATION FEEDBACK LOOP (diagram): ML / AI Model (Training Data, Cross Validation, Testing Data) → Objective Metric → REST API → New Configurations → Better Results
  64. SIMPLIFIED OPTIMIZATION Client Libraries ● Python ● Java ● R ● Matlab ● And more... Framework Integrations ● TensorFlow ● scikit-learn ● xgboost ● Keras ● Neon ● And more... Live Demo
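The live demo is not reproduced in the transcript, but the suggest / observe loop against the REST API looks roughly like the sketch below with the SigOpt Python client. `evaluate_model` is a hypothetical stand-in for training and validating a model, and the exact client calls may vary by client version, so treat this as a sketch rather than canonical usage.

```python
# Sketch of the suggest / observe loop against the SigOpt REST API via the Python client.
# Parameter names, bounds, the API token placeholder, and `evaluate_model` are illustrative.
from sigopt import Connection

def evaluate_model(assignments):
    # Hypothetical stand-in: train the model with these hyperparameters and
    # return the validation accuracy (the objective metric).
    return 1.0 - (assignments["learning_rate"] - 0.01) ** 2

conn = Connection(client_token="YOUR_SIGOPT_API_TOKEN")
experiment = conn.experiments().create(
    name="Text CNN tuning",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-4, max=1e-1)),
        dict(name="dropout", type="double", bounds=dict(min=0.1, max=0.9)),
    ],
)

for _ in range(30):
    suggestion = conn.experiments(experiment.id).suggestions().create()   # new configuration
    value = evaluate_model(suggestion.assignments)                        # objective metric
    conn.experiments(experiment.id).observations().create(                # report back over the REST API
        suggestion=suggestion.id, value=value
    )
```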
  65. DISTRIBUTED TRAINING ● SigOpt serves as a distributed scheduler for training models across workers ● Workers access the SigOpt API for the latest parameters to try for each model ● Enables easy distributed training of non-distributed algorithms across any number of models
  66. Try it yourself! https://sigopt.com/getstarted
  67. Questions? contact@sigopt.com https://sigopt.com @SigOpt
