Derivative Free Optimization
1. DERIVATIVE-FREE OPTIMIZATION
http://www.lri.fr/~teytaud/dfo.pdf
(or Quentin's web page ?)
Olivier Teytaud
Inria Tao, visiting the beautiful city of Liège; also using slides from A. Auger
2. The next slide is the most important
of all.
3. In case of trouble,
Interrupt me.
4. In case of trouble,
Interrupt me.
Further discussion needed:
- R82A, Montefiore institute
- olivier.teytaud@inria.fr
- or after the lessons (the 25th, not the 18th)
6. I. Optimization and DFO
II. Evolutionary algorithms
III. From math. programming
IV. Using machine learning
V. Conclusions
9. Derivative-free optimization of f
No gradient !
Only depends on the x's and f(x)'s
13. Derivative-free optimization of f
Why derivative-free optimization?
OK, it's slower
But sometimes you have no derivative
It's simpler (by far) ==> fewer bugs
14. Derivative-free optimization of f
Why derivative-free optimization?
OK, it's slower
But sometimes you have no derivative
It's simpler (by far)
It's more robust (to noise, to strange functions...)
15. Derivative-free optimization of f
Optimization algorithms:
==> Newton optimization
==> Quasi-Newton (BFGS)
==> Gradient descent
==> ...
Why derivative-free optimization?
OK, it's slower
But sometimes you have no derivative
It's simpler (by far)
It's more robust (to noise, to strange functions...)
16. Derivative-free optimization of f
Optimization algorithms:
Derivative-free optimization (don't need gradients)
Why derivative-free optimization?
OK, it's slower
But sometimes you have no derivative
It's simpler (by far)
It's more robust (to noise, to strange functions...)
17. Derivative-free optimization of f
Optimization algorithms:
Derivative-free optimization
Comparison-based optimization (coming soon), just needing comparisons, including evolutionary algorithms
Why derivative-free optimization?
OK, it's slower
But sometimes you have no derivative
It's simpler (by far)
It's more robust (to noise, to strange functions...)
18. I. Optimization and DFO
II. Evolutionary algorithms
III. From math. programming
IV. Using machine learning
V. Conclusions
19. II. Evolutionary algorithms
a. Fundamental elements
b. Algorithms
c. Math. analysis
22. Preliminaries:
- Gaussian distribution: density K exp( -p(x) ) with
  - p(x) a degree-2 polynomial (negative dominant coefficient in the exponent)
  - K a normalization constant
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
- Markov chains
23. Preliminaries:
- Gaussian distribution: density K exp( -p(x) ) with
  - p(x) a degree-2 polynomial
  - K a normalization constant
  (figure annotations: translation of the Gaussian, size of the Gaussian)
- Multivariate Gaussian distribution
- Non-isotropic Gaussian distribution
- Markov chains
25. Preliminaries:
- Gaussian distribution
- Multivariate Gaussian distribution
  Isotropic case ==> general case: density = K exp( -||x - μ||² / (2σ²) )
  ==> level sets are rotationally invariant ==> "isotropic" Gaussian
  ==> completely defined by μ and σ
  (do you understand why K is fixed by σ?)
- Non-isotropic Gaussian distribution
- Markov chains
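As a concrete illustration of the isotropic case (my own sketch, not from the slides; plain NumPy, with hypothetical helper names):

```python
import numpy as np

def sample_isotropic_gaussian(mu, sigma, n, rng=np.random.default_rng()):
    """Draw n points from N(mu, sigma^2 * Identity)."""
    mu = np.asarray(mu, dtype=float)
    return mu + sigma * rng.standard_normal((n, mu.size))

def isotropic_density(x, mu, sigma):
    """Density K * exp(-||x - mu||^2 / (2 sigma^2)), with K = (2 pi sigma^2)^(-d/2):
    K is fixed by sigma and the dimension d (mu only translates the density)."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    d = mu.size
    K = (2 * np.pi * sigma ** 2) ** (-d / 2)
    return K * np.exp(-np.sum((x - mu) ** 2) / (2 * sigma ** 2))

# The density only depends on ||x - mu||: level sets are spheres around mu.
pts = sample_isotropic_gaussian(mu=[0.0, 0.0], sigma=0.5, n=5)
print(pts)
print(isotropic_density([0.1, -0.2], mu=[0.0, 0.0], sigma=0.5))
```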
34. Population-based comparison-based algorithms ?
Abstract notations: x(i) is a population, I(i) is an
internal state of the algorithm.
x(1),I(1) = Opt()
x(2),I(2) = Opt(x(1), sign(y(1,1)-y(1,2)), I(1) )
… … ...
x(n),I(n) = Opt(x(n-1), sign(y(n-1,1)-y(n-1,2)), I(n-1))
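As a reading aid (my own sketch, not from the slides), here is that abstract loop in code: the optimizer only ever receives the sign of a fitness comparison, never the fitness values themselves. The toy `opt_init` / `opt_update` below are illustrative placeholders, not an algorithm from the deck.

```python
import numpy as np

rng = np.random.default_rng(0)

def opt_init(dim=2):
    """Initial population of two points, plus internal state I = (center, step-size)."""
    center, sigma = np.zeros(dim), 1.0
    x = center + sigma * rng.standard_normal((2, dim))
    return x, (center, sigma)

def opt_update(x, s, state):
    """Toy update: recenter on whichever of the two points won the comparison."""
    center, sigma = state
    center = x[0] if s <= 0 else x[1]          # s = sign(y(1)-y(2)); lower fitness wins
    x_new = center + sigma * rng.standard_normal((2, center.size))
    return x_new, (center, sigma)

def comparison_based_run(f, n_iters=100):
    """x(i), I(i) = Opt(x(i-1), sign(y(i-1,1)-y(i-1,2)), I(i-1))."""
    x, I = opt_init()
    for _ in range(n_iters):
        y = [f(xi) for xi in x]
        x, I = opt_update(x, np.sign(y[0] - y[1]), I)   # Opt never sees y itself
    return I[0]

print(comparison_based_run(lambda z: np.sum(z ** 2)))    # toy sphere function
```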
35. Population-based comparison-based algorithms ?
Abstract notations: x(i) is a population, I(i) is an
internal state of the algorithm.
x(1),I(1) = Opt()
x(2),I(2) = Opt(x(1), (1), I(1) )
… … ...
x(n),I(n) = Opt(x(n-1),(n-1) ,I(n-1))
36. Comparison-based optimization
An algorithm is comparison-based if its iterates depend on the fitness values only through comparisons (their signs / the induced ranking).
==> Same behavior on many functions: replacing f by g∘f, with g increasing, changes nothing.
37. Comparison-based optimization
An algorithm is comparison-based if it only uses comparisons of fitness values.
==> Same behavior on many functions (on f and on g∘f with g increasing).
Quasi-Newton methods are very poor on this.
38. Why comparison-based algorithms?
==> more robust
==> this can be mathematically formalized: comparison-based optimizers
are slow ( d · log ||x(n) - x*|| / n ~ constant )
but robust (optimal for some worst-case analysis)
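A gloss on that rate statement (my reading, not on the slide): it says comparison-based methods converge at best linearly, with a constant that degrades like 1/d in the dimension d.

```latex
\[
  \frac{d \,\log \lVert x_n - x^\ast \rVert}{n} \;\approx\; -c
  \quad\Longleftrightarrow\quad
  \lVert x_n - x^\ast \rVert \;\approx\; e^{-c\, n / d},
  \qquad c > 0 \text{ constant.}
\]
```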
39. II. Evolutionary algorithms
a. Fundamental elements
b. Algorithms
c. Math. analysis
40. Basic schema of an Evolution Strategy
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
41. Basic schema of an Evolution Strategy
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
Compute their fitness values
42. Basic schema of an Evolution Strategy
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
Compute their fitness values
Select the best
43. Basic schema of an Evolution Strategy
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
Compute their fitness values
Select the best
Let x = average of these best
44. Basic schema of an Evolution Strategy
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
Compute their fitness values
Select the best
Let x = average of these best
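Below is a minimal, self-contained sketch of this basic scheme (my own illustration, not from the slides): the population size, the number of selected points, the fixed step-size σ, and the toy sphere test function are all arbitrary choices.

```python
import numpy as np

def basic_es(f, x0, sigma=0.3, lam=20, mu=5, n_iters=100, rng=np.random.default_rng(0)):
    """Basic Evolution Strategy sketch:
    generate lam points around x, evaluate, select the mu best, average them."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        # Generate points around x: x + sigma * N, N a standard Gaussian
        pop = x + sigma * rng.standard_normal((lam, x.size))
        # Compute their fitness values (lower is better) -- embarrassingly parallel
        fitness = np.array([f(p) for p in pop])
        # Select the best, and let x = average of these best
        best = pop[np.argsort(fitness)[:mu]]
        x = best.mean(axis=0)
    return x

print(basic_es(lambda z: np.sum(z ** 2), x0=np.ones(5)))   # toy sphere function
```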
45. Obviously parallel
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
Compute their fitness values ==> Multi-cores, Clusters, Grids...
Select the best
Let x = average of these best
46. Obviously parallel. Really simple.
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
Compute their fitness values
Select the best
Let x = average of these best
47. Obviously parallel. Really simple. Not a negligible advantage.
Parameters: x, σ
Generate points around x ( x + σN where N is a standard Gaussian)
Compute their fitness values
Select the best
Let x = average of these best
When I accessed, for the first time, a crucial industrial code of an important company, I believed that it would be clean and bug-free. (I was young.)
48. The (1+1)-ES with the 1/5th rule
Parameters: x, σ
Generate 1 point x' around x ( x + σN where N is a standard Gaussian)
Compute its fitness value
Keep the best (x or x'): x = best(x, x')
with σ = 2σ if x' is best, σ = 0.84σ otherwise
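A minimal sketch of this (1+1)-ES with the multiplicative step-size update shown on the slide (factors 2 and 0.84); the iteration budget, starting point, and test function are my own choices.

```python
import numpy as np

def one_plus_one_es(f, x0, sigma=1.0, n_iters=500, rng=np.random.default_rng(0)):
    """(1+1)-ES with the multiplicative step-size rule from the slide."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(n_iters):
        xp = x + sigma * rng.standard_normal(x.size)   # one point x' around x
        fxp = f(xp)
        if fxp <= fx:                                   # keep the best of x and x'
            x, fx = xp, fxp
            sigma *= 2.0                                # x' is best: enlarge the step
        else:
            sigma *= 0.84                               # otherwise: shrink the step
    return x, fx

print(one_plus_one_es(lambda z: np.sum(z ** 2), x0=np.full(10, 5.0)))
```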
61. Estimation of Multivariate Normal Algorithm
62. Estimation of Multivariate Normal Algorithm
63. Estimation of Multivariate Normal Algorithm
64. Estimation of Multivariate Normal Algorithm
65. EMNA is usually non-isotropic
66. EMNA is usually non-isotropic
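The figures of slides 61-66 are not reproduced here. As a rough illustration of one EMNA iteration (my sketch; population sizes and the test function are assumptions): the selected individuals are used to re-estimate a full mean and covariance, which is why the fitted Gaussian is usually non-isotropic.

```python
import numpy as np

def emna_step(pop_mean, pop_cov, f, lam=50, mu=15, rng=np.random.default_rng(0)):
    """One iteration of an Estimation of Multivariate Normal Algorithm (EMNA) sketch:
    sample lam points, keep the mu best, refit mean and covariance on them."""
    pop = rng.multivariate_normal(pop_mean, pop_cov, size=lam)
    fitness = np.array([f(p) for p in pop])
    best = pop[np.argsort(fitness)[:mu]]
    new_mean = best.mean(axis=0)
    # Full covariance of the selected points: in general non-isotropic
    new_cov = np.cov(best, rowvar=False) + 1e-12 * np.eye(len(pop_mean))
    return new_mean, new_cov

mean, cov = np.zeros(3), np.eye(3)
for _ in range(30):
    mean, cov = emna_step(mean, cov, lambda z: np.sum((z - 1.0) ** 2))
print(mean)   # should approach [1, 1, 1]
```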
67. Self-adaptation (works in many frameworks)
68. Self-adaptation (works in many frameworks)
Can be used for non-isotropic
multivariate Gaussian
distributions.
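A minimal sketch of self-adaptation as described on the slide, in the simple case of a single step-size per individual (the log-normal mutation rate tau and all sizes are my assumptions; the same idea extends to non-isotropic covariances):

```python
import numpy as np

def self_adaptive_es(f, x0, sigma0=1.0, lam=30, mu=8, n_iters=200, rng=np.random.default_rng(0)):
    """Self-adaptive ES sketch: the step-size is part of each individual, is itself
    mutated (log-normally), and is selected together with the point."""
    dim = len(x0)
    tau = 1.0 / np.sqrt(2 * dim)                       # usual learning rate, assumed
    x, sigma = np.asarray(x0, dtype=float), sigma0
    for _ in range(n_iters):
        sigmas = sigma * np.exp(tau * rng.standard_normal(lam))       # mutate step-sizes
        pop = x + sigmas[:, None] * rng.standard_normal((lam, dim))   # mutate points
        order = np.argsort([f(p) for p in pop])[:mu]
        x = pop[order].mean(axis=0)                    # recombine the best points
        sigma = np.exp(np.mean(np.log(sigmas[order]))) # and their step-sizes (geometric mean)
    return x, sigma

print(self_adaptive_es(lambda z: np.sum(z ** 2), x0=np.full(5, 3.0)))
```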
69. Let's generalize.
We have seen algorithms which work as follows:
- we keep one search point in memory
(and one step-size)
- we generate individuals
- we evaluate these individuals
- we regenerate a search point and a step-size
Maybe we could keep more than one search point ?
70. Let's generalize.
We have seen algorithms which work as follows:
- we keep one search point in memory (and one step-size) ==> mu search points
- we generate individuals ==> lambda individuals
- we evaluate these generated individuals
- we regenerate a search point and a step-size
Maybe we could keep more than one search point?
71. Parameters: x1,...,xμ
Generate points around x1,...,xμ
(e.g. each point randomly generated from two points)
Compute their fitness values
Select the best
Don't average...
72. Generate points around x1,...,xμ
(e.g. each point randomly generated from two points)
73. Generate points around x1,...,xμ
(e.g. each point randomly generated from two points)
This is a cross-over.
74. Generate points around x1,...,xμ
(e.g. each point randomly generated from two points) This is a cross-over.
Example of procedure for generating a point:
- Randomly draw k parents x1,...,xk
  (truncation selection: drawn at random among the selected individuals)
- For generating the i-th coordinate of the new individual z:
  u = random(1,k)
  z(i) = x(u)(i)   (the i-th coordinate of parent x(u))
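A small sketch of that generating procedure (uniform cross-over over k parents drawn from the already-selected individuals; the function name and the example pool are mine):

```python
import numpy as np

def crossover_child(selected, k=2, rng=np.random.default_rng()):
    """Generate one child: draw k parents from the selected individuals, then take
    each coordinate from a uniformly random parent (uniform cross-over)."""
    selected = np.asarray(selected, dtype=float)
    parents = selected[rng.integers(0, len(selected), size=k)]   # truncation-selection pool
    dim = parents.shape[1]
    u = rng.integers(0, k, size=dim)        # u = random(1, k), once per coordinate i
    return parents[u, np.arange(dim)]       # z(i) = x(u)(i)

pool = np.array([[0., 0., 0.], [1., 1., 1.], [2., 2., 2.]])   # already-selected parents
print(crossover_child(pool, k=2))
```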
75. Let's summarize:
We have seen a general scheme for optimization:
- generate a population (e.g. from some distribution, or from
a set of search points)
- select the best = new search points
==> Small difference between an
Evolutionary Algorithm (EA) and an
Estimation of Distribution Algorithm (EDA).
==> Some EAs (older than the EDA acronym) are EDAs.
76. Let's summarize:
We have seen a general scheme for optimization:
- generate a population (e.g. from some distribution [EDA], or from a set of search points [EA])
- select the best = new search points
==> Small difference between an
Evolutionary Algorithm (EA) and an
Estimation of Distribution Algorithm (EDA).
==> Some EAs (older than the EDA acronym) are EDAs.
77. Gives a lot of freedom:
- choose your representation and operators (depending on the problem)
- if you have a step-size, choose the adaptation rule
- choose your population size λ (depending on your computer/grid)
- choose μ (carefully), e.g. μ = min(dimension, λ/4)
78. Gives a lot of freedom:
- choose your operators (depending on the problem)
- if you have a step-size, choose the adaptation rule
- choose your population size λ (depending on your computer/grid)
- choose μ (carefully), e.g. μ = min(dimension, λ/4)
Can handle strange things:
- optimize a physical structure ?
- structure represented as a Voronoi diagram
- cross-over makes sense, benefits from local structure
- not so many algorithms can work on that
82. Voronoi representation:
- a family of points
- their labels
==> cross-over makes sense
==> you can optimize a shape
83. Voronoi representation:
- a family of points
- their labels
==> cross-over makes sense
==> you can optimize a shape
==> not that mathematical;
but really useful
Mutations: each label is changed with proba 1/n
Cross-over: each point/label is randomly drawn from one of
the two parents
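A rough sketch of this representation and of the operators described above (my own minimal version, assuming binary labels and a nearest-center decoding of the Voronoi diagram):

```python
import numpy as np

rng = np.random.default_rng(0)

def label_at(shape_points, shape_labels, query):
    """Voronoi decoding: a query point gets the label of the nearest cell center."""
    d = np.linalg.norm(shape_points - query, axis=1)
    return shape_labels[np.argmin(d)]

def mutate(points, labels):
    """Each label is flipped with probability 1/n (binary labels assumed here)."""
    n = len(labels)
    flip = rng.random(n) < 1.0 / n
    return points.copy(), np.where(flip, 1 - labels, labels)

def crossover(parent1, parent2):
    """Each (point, label) cell is drawn at random from one of the two parents."""
    (p1, l1), (p2, l2) = parent1, parent2
    take_first = rng.random(len(l1)) < 0.5
    points = np.where(take_first[:, None], p1, p2)
    labels = np.where(take_first, l1, l2)
    return points, labels

# Tiny example: 4 cells in 2D, binary labels (e.g. material / no material).
pts, labs = rng.random((4, 2)), np.array([0, 1, 1, 0])
child = crossover((pts, labs), mutate(pts, labs))
print(label_at(*child, query=np.array([0.5, 0.5])))
```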
84. Voronoi representation:
- a family of points
- their labels
==> cross-over makes sense
==> you can optimize a shape
==> not that mathematical;
but really useful
Mutations: each label is changed with proba 1/n
Cross-over: randomly pick one split in the representation:
- left part from parent 1
- right part from parent 2
==> related to biology
85. Gives a lot of freedom:
- choose your operators (depending on the problem)
- if you have a step-size, choose the adaptation rule
- choose your population size λ (depending on your computer/grid)
- choose μ (carefully), e.g. μ = min(dimension, λ/4)
Can handle strange things:
- optimize a physical structure ?
- structure represented as a Voronoi diagram
- cross-over makes sense, benefits from local structure
- not so many algorithms can work on that
86. II. Evolutionary algorithms
a. Fundamental elements
b. Algorithms
c. Math. Analysis
87. Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1)·N
We want to maximize:
- E log || x(n) - x* ||
88. Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1)·N
We want to maximize:
  - E log || x(n) - x* ||
 ----------------------------
  - E log || x(n-1) - x* ||
89. Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1)·N
We want to maximize:
  - E log || x(n) - x* ||
 ----------------------------
  - E log || x(n-1) - x* ||
We don't know x*. How can we optimize this?
We will observe the acceptance rate, and we will deduce if σ is too large or too small.
90. ON THE NORM FUNCTION:
  - E log || x(n) - x* ||
 ----------------------------
  - E log || x(n-1) - x* ||
(figure: rejected mutations vs. accepted mutations)
91. For each step-size, evaluate this "expected progress rate"
  - E log || x(n) - x* ||
 ----------------------------
  - E log || x(n-1) - x* ||
and evaluate "P(acceptance)".
(figure: rejected mutations vs. accepted mutations)
98. 1/5th rule
Based on maths showing that
a good step-size <==> success rate ≈ 1/5
(so: increase σ when the observed success rate exceeds 1/5, decrease it otherwise)
99. I. Optimization and DFO
II. Evolutionary algorithms
III. From math. programming
IV. Using machine learning
V. Conclusions
100. III. From math. programming
==> pattern search method
Comparison with ES:
- code more complicated
- same rate
- deterministic
- less robust
101. III. From math. programming
Also:
- Nelder-Mead algorithm (similar to pattern search,
better constant in the rate)
102. III. From math. programming
Also:
- Nelder-Mead algorithm (similar to pattern search,
better constant in the rate)
- NEWUOA (using function values and
not only comparisons)
103. I. Optimization and DFO
II. Evolutionary algorithms
III. From math. programming
IV. Using machine learning
V. Conclusions
104. IV. Using machine learning
What if computing f takes days ?
==> parallelism
==> and “learn” an approximation of f
105. IV. Using machine learning
Statistical tools: f'(x) = approximation( x; x1, f(x1), x2, f(x2), ..., xn, f(xn) )
y(n+1) = f'( x(n+1) )
e.g. f' = the quadratic function closest to f on the x(i)'s.
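A minimal sketch of such a surrogate (my own illustration): fit the quadratic function closest, in the least-squares sense, to the observed values f(x(i)), then use it as f'.

```python
import numpy as np

def quadratic_features(X):
    """Features 1, x_j, x_j*x_k for a full quadratic model in dimension d."""
    n, d = X.shape
    cross = np.array([X[:, j] * X[:, k] for j in range(d) for k in range(j, d)]).T
    return np.hstack([np.ones((n, 1)), X, cross])

def fit_quadratic_surrogate(X, y):
    """Least-squares quadratic closest to the observed values y = f(X)."""
    coef, *_ = np.linalg.lstsq(quadratic_features(X), y, rcond=None)
    return lambda x: quadratic_features(np.atleast_2d(x)) @ coef

# Usage sketch: learn f' from already-evaluated points, then query it cheaply.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = np.array([np.sum((x - 1.0) ** 2) for x in X])        # expensive f, evaluated 50 times
f_prime = fit_quadratic_surrogate(X, y)
print(f_prime(np.ones(3)))                               # should be close to 0
```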
106. IV. Using machine learning
==> keyword “surrogate models”
==> use f' instead of f
==> periodically, re-use the real f
107. I. Optimization and DFO
II. Evolutionary algorithms
III. From math. programming
IV. Using machine learning
V. Conclusions
108. Derivative free optimization is fun.
==> nice maths
==> nice applications + easily parallel algorithms
==> can handle really complicated domains
(mixed continuous / integer, optimization
on sets of programs)
Yet,
often suboptimal on highly structured problems (when
BFGS is easy to use, thanks to fast gradients)
109. Keywords, readings
==> cross-entropy (so close to evolution strategies)
==> genetic programming (evolutionary algorithms for
automatically building programs)
==> H.-G. Beyer's book on ES = good starting point
==> many resources on the web
==> keep in mind that representation / operators are
often the key
==> we only considered isotropic algorithms; sometimes not
a good idea at all