DIRECT POLICY SEARCH


0. What is Direct Policy Search?

1. Direct Policy Search:
   Parametric Policies for Financial Applications

2. Parametric Bellman values for Stock Problems

3. Direct Policy Search: Optimization Tools
First, you need to know what
              direct policy search (DPS) is.

                  Principle of DPS:

 (1) Define a parametric policy pi
     with parameters t1,...,tk.

 (2) Maximize
     (t1,...,tk) → average reward when applying
     policy pi(t1,...,tk) to the problem.

                ==> You must define pi
 ==> You must choose a noisy optimization algorithm
==> There is a pi by default (an actor neural network),
      but it's only a default solution (overload it)
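
A minimal, self-contained sketch of this loop in C++ (the one-episode simulator and its toy reward are hypothetical placeholders, not the framework's API):

#include <vector>
#include <random>
#include <iostream>

static std::mt19937 rng(42);

// Hypothetical one-episode simulator: the reward obtained when
// policy pi(t1,t2) is applied; noisy, as in real simulations.
double simulateEpisode(const std::vector<double>& t) {
    std::normal_distribution<double> noise(0.0, 0.1);
    double d0 = t[0] - 1.0, d1 = t[1] + 2.0;   // toy optimum at (1,-2)
    return -(d0 * d0 + d1 * d1) + noise(rng);
}

// The DPS objective: (t1,...,tk) -> average reward.
double averageReward(const std::vector<double>& t, int nEpisodes) {
    double sum = 0.0;
    for (int i = 0; i < nEpisodes; ++i) sum += simulateEpisode(t);
    return sum / nEpisodes;
}

int main() {
    // Crude noisy optimizer (random search); in practice use
    // evolution strategies, cross-entropy, NewUoa, ...
    std::vector<double> best = {0.0, 0.0};
    double bestValue = averageReward(best, 30);
    std::normal_distribution<double> step(0.0, 0.5);
    for (int iter = 0; iter < 1000; ++iter) {
        std::vector<double> cand = {best[0] + step(rng), best[1] + step(rng)};
        double value = averageReward(cand, 30);
        if (value > bestValue) { bestValue = value; best = cand; }
    }
    std::cout << "best t = (" << best[0] << ", " << best[1] << ")\n";
}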
Strengths of DPS:

- Good warm start:
     if I have a solution for problem A and
     I switch to a problem B close to A, then I quickly
     get good results.

- Benefits from expert knowledge on the structure

- No constraint on the structure of the objective function

- Anytime (i.e. not that bad under a restricted time budget)

                          Drawbacks:
            - needs a structured direct policy search
         - not directly applicable to partial observation
virtual MashDecision computeDecision(MashState & state,
             const Vector<double> params)

                ==> “params” = t1,...,tk
        ==> returns the decision pi(t1,...,tk,state)

                  Does it make sense?

    Overload this function, and DPS is ready to work.

    Well, DPS (somewhere between alpha and beta)
                might be full of bugs :-)
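
For instance, a possible overload might look like the sketch below; MashState, MashDecision and Vector are stand-ins here, and their members are assumptions (the real MASH types will differ):

#include <cstddef>

// Stand-ins for the MASH types (members are assumptions).
template <typename T> struct Vector {
    const T* data; std::size_t n;
    const T& operator[](std::size_t i) const { return data[i]; }
};
struct MashState    { double price; };
struct MashDecision { double quantity; };

// Threshold policy: params = (t1, t2) = (price threshold, quantity).
MashDecision computeDecision(MashState& state,
                             const Vector<double> params) {
    if (state.price < params[0])        // price is cheap:
        return MashDecision{params[1]}; //   buy t2 units
    return MashDecision{0.0};           // otherwise: do nothing
}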
Direct Policy Search:
Parametric Policies for Financial
          Applications
Bengio et al.'s papers on DPS for financial applications

Stocks (various assets) + Cash

       decision = tradingUnit(A, prevision(B, data))

Where:
- tradingUnit is designed by human experts
- prevision's outputs are chosen by human experts
- prevision is a neural network
- A and B are parameters

Then:
- B is optimized by LMS (the prevision criterion)
     ==> poor results, little correlation between
         LMS and financial performance
- A and B are optimized on the expected return (by DPS)
     ==> much better

Notes:
- Can be applied on data sets (no simulator, no elasticity
  model), because the policy has no impact on prices
- 22 params in the first paper; reduced weight sharing
  in the other paper ==> ~ 800 parameters
  (if I understand correctly)
- There exist much bigger DPS (Sigaud et al., 27 000)
- NB: noisy optimization
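
Schematically, that two-level parameterization might look as follows (the sizes and roles of A and B are assumptions made only to illustrate the composition):

#include <cmath>
#include <cstddef>
#include <vector>

// prevision: a tiny neural network with parameters B
// (a single linear unit + tanh here, for illustration only).
double prevision(const std::vector<double>& B,
                 const std::vector<double>& data) {
    double z = B.back();                          // bias
    for (std::size_t i = 0; i < data.size(); ++i)
        z += B[i] * data[i];                      // weights
    return std::tanh(z);                          // predicted signal
}

// tradingUnit: expert-designed rule with parameters A,
// here a simple signal threshold (assumption).
double tradingUnit(const std::vector<double>& A, double signal) {
    if (signal >  A[0]) return  A[1];             // go long
    if (signal < -A[0]) return -A[1];             // go short
    return 0.0;                                   // stay flat
}

// decision = tradingUnit(A, prevision(B, data));
// DPS optimizes (A, B) jointly on the expected return.
double decision(const std::vector<double>& A,
                const std::vector<double>& B,
                const std::vector<double>& data) {
    return tradingUnit(A, prevision(B, data));
}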
An alternative solution:

parametric Bellman values

   for Stock Problems
What is a Bellman function?

V(s): expected benefit, in the future,
  if playing optimally from state s.

V(s) is useful for playing optimally.
Rule for an optimal decision:

  d(s) = argmax_d [ V(s') + r(s,d) ]

where:
- s' = nextState(s,d)
- d(s): optimal decision in state s
- V(s'): Bellman value in state s'
- r(s,d): reward associated with
          decision d in state s
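
With a finite set of candidate decisions, this rule is a plain argmax (sketch; V, r and nextState are assumed given):

#include <functional>
#include <vector>

// Pick the d maximizing V(nextState(s,d)) + r(s,d) over a finite
// list of candidate decisions.
template <typename State, typename Decision>
Decision optimalDecision(
    const State& s,
    const std::vector<Decision>& candidates,
    const std::function<double(const State&)>& V,
    const std::function<double(const State&, const Decision&)>& r,
    const std::function<State(const State&, const Decision&)>& nextState) {
    Decision best = candidates.front();
    double bestValue = V(nextState(s, best)) + r(s, best);
    for (const Decision& d : candidates) {
        double value = V(nextState(s, d)) + r(s, d);
        if (value > bestValue) { bestValue = value; best = d; }
    }
    return best;
}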
Remark 1: knowing V(s) up to
an additive constant is enough

       Remark 2: ∂V(s)/∂si
       is the price of stock i

  Example with one stock, soon.
Q-rule for an optimal decision:

      d(s) = argmax_d Q(s,d)

- d(s): optimal decision in state s
- Q(s,d): optimal future reward if
   decision d is taken in state s

==> approximate Q instead of V
==> we need neither r(s,d)
       nor nextState(s,d)
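
The corresponding code needs neither r nor nextState, only Q itself (sketch; Q would be a parametric approximation, e.g. tuned by DPS):

#include <functional>
#include <vector>

// Model-free rule: pick the d maximizing Q(s,d); no reward model
// and no transition model are needed.
template <typename State, typename Decision>
Decision optimalDecisionQ(
    const State& s,
    const std::vector<Decision>& candidates,
    const std::function<double(const State&, const Decision&)>& Q) {
    Decision best = candidates.front();
    double bestValue = Q(s, best);
    for (const Decision& d : candidates) {
        double q = Q(s, d);
        if (q > bestValue) { bestValue = q; best = d; }
    }
    return best;
}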
[Figure: concave curve of V(stock) (in euros) vs. stock (in kWh).
 At low stock: “I need a lot of stock! I accept to pay a lot.”
 At high stock: “I have enough stock; I pay only if it's cheap.”
 Slope = marginal price (euros/kWh).]
Examples:
For one stock:
   - very simple: constant price
   - piecewise linear (can ensure convexity)
   - “tanh” function
   - neural network, SVM, sum of Gaussians...


For several stocks:
   - each stock separately
   - 2-dimensional terms: V(s1,s2,s3) = V'(s1,S) + V''(s2,S) + V'''(s3,S)
                   where S = a1·s1 + a2·s2 + a3·s3
   - neural network, SVM, sum of Gaussians...
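
For instance, a one-stock “tanh” parameterization, whose slope directly gives the marginal price of Remark 2, might be (the three parameters and their roles are assumptions of this sketch):

#include <cmath>

// One-stock parametric Bellman value, V(stock) in euros.
// params = (scale a, steepness b, inflection c): V saturates for
// large stocks, so the marginal price tends to zero.
double parametricV(double stock, const double params[3]) {
    double a = params[0], b = params[1], c = params[2];
    return a * std::tanh(b * (stock - c));
}

// Marginal price (euros/kWh): the slope dV/dstock.
double marginalPrice(double stock, const double params[3]) {
    double a = params[0], b = params[1], c = params[2];
    double t = std::tanh(b * (stock - c));
    return a * b * (1.0 - t * t);
}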
How to choose the coefficients?
- dynamic programming: robust, but slow in high dimension
- direct policy search:
     - initializing coefficients from expert advice
     - or: supervised machine learning to approximate
             expert advice
     ==> and then optimize
Conclusions:

V: very convenient representation of a policy:
   we can view prices.
Q: some advantages (model-free)

Yet, less readable than direct rules.

And expensive: we need one optimization for making
  the decision, at each time step of a simulation.
  ==> but this optimization can be
        a simple sort (as a first approximation).

Simpler? Adrien has a parametric strategy for stocks
   ==> we should see how to generalize it
   ==> transformation “constants → parameters” ==> DPS
Questions (strategic decisions for the DPS):
     - start with Adrien's policy, improve it, generalize it,
           parametrize it? interface with ARM?
     - or another strategy?
     - or a parametric V function, assuming we have
           r(s,d) and nextState(s,d) (often true)?
     - or a parametric Q function?
         (more generic, unusual but appealing,
         but neglects the existing knowledge
         r(s,d) and nextState(s,d))

Further work:
   - finish the validation of Adrien's policy on stocks
       (better than random as a policy; better than random
            as a UCT-Monte-Carlo)
   - generalize? variants?
   - introduce it into DPS, compare to the baseline (neural net)
   - introduce DPS's result into MCTS
Direct Policy Search:

 Optimization Tools

& Optimization Tricks
- Classical tools: Evolution Strategies,
   Cross-Entropy, PSO, ...
   ==> more or less supposed to be
          robust to local minima
   ==> no gradient
   ==> robust to noisy objective functions
   ==> weak in high dimension (but: see locality, next slide)

- Hopefully:
   - good initialization: nearly convex
   - fixed random seeds: no noise

==> NewUoa is my favorite choice
   - no gradient
   - can “really” work in high dimension
   - surprisingly fast update rule
   - people who try to show that their
       algorithm is better than NewUoa
       suffer a lot in the noise-free case
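
For reference, one freely available NewUoa implementation is NLopt's NLOPT_LN_NEWUOA algorithm; a sketch of plugging the DPS objective into it (NLopt is not mentioned in the slides, and the quadratic objective is a placeholder for the average reward):

#include <nlopt.h>
#include <cstdio>

// Placeholder for "average reward of pi(t1,...,tk)"; fix the
// simulation seeds inside it so NewUoa sees a noise-free function.
double averageReward(unsigned /*n*/, const double* t,
                     double* /*grad: unused, derivative-free*/,
                     void* /*data*/) {
    double d0 = t[0] - 1.0, d1 = t[1] + 2.0;
    return -(d0 * d0 + d1 * d1);       // maximized at t = (1,-2)
}

int main() {
    nlopt_opt opt = nlopt_create(NLOPT_LN_NEWUOA, 2);
    nlopt_set_max_objective(opt, averageReward, nullptr);
    nlopt_set_xtol_rel(opt, 1e-8);

    double t[2] = {0.0, 0.0};          // initial parameters
    double reward;
    if (nlopt_optimize(opt, t, &reward) > 0)
        std::printf("t = (%g, %g), reward = %g\n", t[0], t[1], reward);
    nlopt_destroy(opt);
}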
Improvements of optimization algorithms:

     - active learning: when optimizing on scenarios,
            choose “good” scenarios

           ==> maybe “quasi-randomization”?
                Just choosing a representative sample of
                scenarios. ==> simple, robust...

     - local improvement: when a gradient step/update
            is performed, only update the variables concerned
            by the simulation used for generating
            the update

           ==> difficult to use in NewUoa
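
A minimal sketch of the “quasi-randomization” idea: draw scenario quantiles from a low-discrepancy sequence instead of i.i.d. sampling, so that a few scenarios already cover the scenario space evenly (the van der Corput sequence is one standard choice, an assumption of this sketch):

#include <cstdio>
#include <vector>

// Van der Corput sequence (base 2): a low-discrepancy sequence on
// [0,1), used to pick a representative set of scenario quantiles.
double vanDerCorput(unsigned n) {
    double q = 0.0, bk = 0.5;
    while (n > 0) {
        q += (n & 1) * bk;
        n >>= 1;
        bk *= 0.5;
    }
    return q;
}

int main() {
    // 8 quantiles covering [0,1) evenly; in practice, map each one
    // through the inverse CDF of the scenario distribution.
    std::vector<double> quantiles;
    for (unsigned i = 1; i <= 8; ++i)
        quantiles.push_back(vanDerCorput(i));
    for (double q : quantiles) std::printf("%g ", q);
    std::printf("\n");  // 0.5 0.25 0.75 0.125 0.625 0.375 0.875 0.0625
}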
Roadmap:

- default policy for energy management problems:
      test, generalize, formalize, simplify...

- this default policy ==> a parametric policy

- test in DPS: strategy A

- interface DPS with NewUoa and/or others (openDP opt?)

- Strategy A: test inside MCTS ==> Strategy B

==> IMHO, strategy A = a good tool for fast,
        readable, non-myopic results

==> IMHO, strategy B = good for combining A with
   the efficiency of MCTS for short-term combinatorial effects.

- Also, validating the partial observation (sounds good).
