Towards a Theoretical Framework for LCS

Jan Drugowitsch
Dr Alwyn Barry

May 2006. © A.M. Barry and J. Drugowitsch, 2006
Overview

Big Question: Why use LCS?

1. Aims and Approach
2. Function Approximation
3. Dynamic Programming
4. Classifier Replacement
5. Future Work
Research Aims

- To establish a formal model for accuracy-based LCS
- To use established models, methods and notation
- To transfer method knowledge (analysis / improvement)
- To create links to other ML approaches and disciplines
Approach

Reduce LCS to its components:
- Function Approximation
- Dynamic Programming
- Classifier Replacement

Then:
- Model the components
- Model their interactions
- Verify the models by empirical studies
- Extend the models from related domains
Function Approximation

Architecture outline:
- A set of $K$ classifiers, with classifier $k$ matching the states $S_k \subseteq S$
- Every classifier $k$ approximates the value function $V$ over $S_k$ independently

Assumptions:
- The function $V : S \to \Re$ is time-invariant
- A time-invariant set of classifiers $\{S_1, \ldots, S_K\}$
- A fixed sampling distribution over $S$, given by $\pi : S \to [0, 1]$
- Each classifier minimises the MSE

$$\sum_{i \in S_k} \pi(i) \left( V(i) - \tilde{V}_k(i) \right)^2$$
Function Approximation

Linear approximation architecture:
- Feature vector $\phi : S \to \Re^L$
- Approximation parameter vector $\omega_k \in \Re^L$
- Approximation $\tilde{V}_k(i) = \phi(i)^T \omega_k$

Two special cases (see the sketch below):
- Averaging classifiers: $\phi(i) = (1)$, $\forall i \in S$
- Linear estimators: $\phi(i) = (1, i)'$, $\forall i \in S$
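A minimal NumPy sketch of this linear architecture (not part of the original slides); the feature maps correspond to the two special cases above, and the parameter values are purely illustrative.

```python
import numpy as np

def phi_avg(i):
    """Averaging classifier: phi(i) = (1), so L = 1."""
    return np.array([1.0])

def phi_lin(i):
    """Linear estimator: phi(i) = (1, i)', so L = 2."""
    return np.array([1.0, float(i)])

def classifier_estimate(phi, w_k, i):
    """Local estimate of classifier k:  V~_k(i) = phi(i)^T w_k."""
    return phi(i) @ w_k

# Example: a linear classifier with (hypothetical) parameters w_k = (0.5, 2.0).
w_k = np.array([0.5, 2.0])
print(classifier_estimate(phi_lin, w_k, i=3))  # 0.5 + 2.0 * 3 = 6.5
```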
Function Approximation

Approximation to the MSE for classifier $k$ at time $t$:

$$\varepsilon_{k,t} = \frac{1}{c_{k,t} - 1} \sum_{m=0}^{t} I_{S_k}(i_m) \left( V(i_m) - \omega_{k,t}' \phi(i_m) \right)^2$$

- Matching function: $I_{S_k} : S \to \{0, 1\}$
- Match count: $c_{k,t} = \sum_{m=0}^{t} I_{S_k}(i_m)$

Since the sequence of states $i_m$ depends on $\pi$, the MSE is automatically weighted by this distribution.
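A direct Python transcription of this estimate, assuming at least two matched states have been observed (so $c_{k,t} > 1$); `V`, `phi`, and `match` stand in for the value function, feature map, and matching indicator.

```python
import numpy as np

def approx_mse(states, V, phi, w_k, match):
    """Sample-based MSE estimate eps_{k,t} for classifier k.

    states: observed sequence i_0, ..., i_t (drawn from pi, so the estimate
            is automatically weighted by the sampling distribution)
    match : matching indicator I_{S_k}
    """
    sq_err = sum(match(i) * (V(i) - phi(i) @ w_k) ** 2 for i in states)
    c_kt = sum(match(i) for i in states)  # match count c_{k,t}; must exceed 1
    return sq_err / (c_kt - 1)            # normalised as a sample variance
```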
Function Approximation

Mixing classifiers:
- A classifier $k$ covers only $S_k \subseteq S$
- We need to mix the estimates to recover $\tilde{V}(i)$
- This requires a mixing parameter (which will be 'accuracy' ... see later!)

$$\tilde{V}(i) = \sum_{k=1}^{K} \psi_k(i)\, \omega_k' \phi(i), \qquad \text{where } \psi_k : S \to [0, 1] \text{ and } \sum_{k=1}^{K} \psi_k(i) = 1$$

- $\psi_k(i) = 0$ where $i \notin S_k$ ... classifiers that do not match do not contribute
Function Approximation

Matrix notation (with $N = |S|$; a NumPy construction follows):

$$V = (V(1), \ldots, V(N))' \qquad \Phi = \begin{pmatrix} \phi(1)' \\ \vdots \\ \phi(N)' \end{pmatrix} \qquad \tilde{V}_k = \Phi \omega_k$$

$$D = \mathrm{diag}(\pi(1), \ldots, \pi(N)) \qquad I_{S_k} = \mathrm{diag}(I_{S_k}(1), \ldots, I_{S_k}(N)) \qquad D_k = I_{S_k} D$$

$$\Psi_k = \mathrm{diag}(\psi_k(1), \ldots, \psi_k(N)) \qquad \tilde{V} = \sum_{k=1}^{K} \Psi_k \tilde{V}_k = \sum_{k=1}^{K} \Psi_k \Phi \omega_k$$
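A toy NumPy construction of these matrices (all sizes and values hypothetical), with two disjoint classifiers so that the binary matching indicators themselves give valid mixing weights.

```python
import numpy as np

N = 4                                                    # toy state space S = {0, 1, 2, 3}
Phi = np.stack([[1.0, float(i)] for i in range(N)])      # N x L matrix with rows phi(i)'
D = np.diag(np.full(N, 1.0 / N))                         # uniform sampling distribution pi

S_k = [np.array([1., 1., 0., 0.]), np.array([0., 0., 1., 1.])]  # matching indicators
I_Sk = [np.diag(m) for m in S_k]
D_k = [I @ D for I in I_Sk]                              # D_k = I_{S_k} D

w = [np.array([0.0, 1.0]), np.array([2.0, -1.0])]        # per-classifier parameters w_k
Psi = [np.diag(m) for m in S_k]                          # disjoint cover => sum_k Psi_k = I

# Global estimate:  V~ = sum_k Psi_k Phi w_k
V_tilde = sum(P @ Phi @ wk for P, wk in zip(Psi, w))
print(V_tilde)                                           # -> [ 0.  1.  0. -1.]
```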
Function Approximation

Achievements:
- Proof that the MSE of a classifier $k$ at time $t$ is the normalised sample variance of the value function over the states matched up to time $t$
- Proof that the expression developed for the sample-based approximated MSE is a suitable approximation to the actual MSE, for finite state spaces
- Proof of optimality of the approximated MSE from the vector-space viewpoint (relating classifier error to the Principle of Orthogonality)
Function Approximation

Achievements (cont'd):
- Elaboration of algorithms using the model:
  - Steepest Gradient Descent
  - Least Mean Square (LMS)
    - Showing that its sensitivity to step size and input range also holds in LCS
  - Normalised LMS
  - Kalman Filter
    - ... and its inverse covariance form
    - Equivalence to Recursive Least Squares
    - Derivation of a computationally cheap implementation
- For each, derivation of:
  - Stability criteria
    - For LMS, identification of the excess mean squared estimation error
  - Time constant bounds

(A sketch of the LMS and Normalised LMS updates follows.)
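A hedged sketch, assuming a scalar step size `alpha`; this is the textbook LMS/NLMS form rather than the authors' exact derivation, shown only to make the step-size sensitivity concrete.

```python
import numpy as np

def lms_update(w_k, phi_i, target, alpha=0.1):
    """One LMS step for classifier k on a matched state:
    w <- w + alpha * (target - phi' w) * phi.
    Stability and convergence speed depend on alpha and on the range of phi,
    which is the sensitivity the slides refer to."""
    err = target - phi_i @ w_k
    return w_k + alpha * err * phi_i

def nlms_update(w_k, phi_i, target, alpha=0.5, eps=1e-8):
    """Normalised LMS: the step is scaled by ||phi||^2, removing most of the
    sensitivity to the input range."""
    err = target - phi_i @ w_k
    return w_k + (alpha / (phi_i @ phi_i + eps)) * err * phi_i
```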
Function Approximation

Achievements (cont'd):
- Empirical investigation [figure omitted]
Function Approximation

How should we set the mixing weights $\psi_k(i)$?
- Accuracy-based mixing (YCS-like!)
- Divorces mixing for function approximation from fitness

$$\psi_k(i) = \frac{I_{S_k}(i)\, \varepsilon_k^{-\nu}}{\sum_{p=1}^{K} I_{S_p}(i)\, \varepsilon_p^{-\nu}}$$

- Investigating the power factor $\nu \in \Re^+$ (see the sketch below)
- Demonstrated, using the model, that we obtain the MLE of the mixed classifiers when $\nu = 1$
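A small Python sketch of this mixing rule for a single state $i$, assuming all classifiers have strictly positive error estimates and that at least one classifier matches.

```python
import numpy as np

def mixing_weights(match_i, errors, nu=1.0):
    """Accuracy-based mixing weights for one state i:
    psi_k(i) = I_{S_k}(i) * eps_k^-nu / sum_p I_{S_p}(i) * eps_p^-nu.

    match_i: length-K 0/1 vector of matching indicators I_{S_k}(i)
    errors : length-K vector of MSE estimates eps_k (assumed > 0)
    """
    raw = match_i * errors ** (-nu)   # unmatched classifiers contribute zero
    return raw / raw.sum()

# Two matching classifiers with errors 0.1 and 0.4: the more accurate dominates.
print(mixing_weights(np.array([1, 1, 0]), np.array([0.1, 0.4, 0.2]), nu=1.0))
# -> [0.8 0.2 0. ]
```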
Function Approximation

Mixing - Empirical Results:
- The higher $\nu$, the more importance is given to classifiers with low error
- Found $\nu = 1$ to be the best compromise

[Plots omitted. Left: classifiers covering $0..\pi/2$, $\pi/2..\pi$, $0..\pi$. Right: classifiers covering $0..0.5$, $0.5..1$, $0..1$.]
Dynamic Programming

Dynamic programming:

$$V^*(i) = \max_{\mu} E\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r(i_t, a_t, i_{t+1}) \,\Big|\, i_0 = i,\ a_t = \mu(i_t) \right]$$

Bellman's equation:

$$V^*(i) = \max_{a \in A} E\!\left[ r(i, a, j) + \gamma V^*(j) \,\big|\, j = p(i, a) \right]$$

Applying the DP operator $T$ to a value estimate (see the sketch below):

$$(TV)(i) = \max_{a \in A} E\!\left[ r(i, a, j) + \gamma V(j) \,\big|\, j = p(i, a) \right]$$
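A compact NumPy sketch of the operator $T$ for a finite MDP; the tensor layout (`P[a, i, j]`, `R[a, i, j]`) is an assumption of this sketch, not something fixed by the slides.

```python
import numpy as np

def dp_operator(V, P, R, gamma=0.9):
    """Bellman optimality operator T for a finite MDP.

    P[a, i, j] = transition probability p(j | i, a)
    R[a, i, j] = reward r(i, a, j)
    Returns the vector (TV)(i) = max_a E[ r(i, a, j) + gamma * V(j) ].
    """
    # Expected one-step return Q[a, i], then maximise over actions.
    Q = np.einsum('aij,aij->ai', P, R + gamma * V[None, None, :])
    return Q.max(axis=0)
```

Iterating `V = dp_operator(V, P, R)` is exactly value iteration; the contraction property on the next slide is what guarantees its convergence.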
Dynamic Programming

$T$ is a contraction w.r.t. the maximum norm:
- Iterating it converges to the unique fixed point $V^* = TV^*$

With function approximation:
- Approximation after every update: $\tilde{V}_{t+1} = f(T\tilde{V}_t)$
- This might not converge for more complex approximation architectures

Is LCS function approximation a non-expansion w.r.t. the maximum norm?
Dynamic Programming

LCS Value Iteration:
- Assume:
  - a constant population
  - averaging classifiers, $\phi(i) = (1)$, $\forall i \in S$
- Each classifier approximates the target $V_{t+1} = T\tilde{V}_t$
- Based on minimising the MSE:

$$\sum_{i \in S_k} \left( V_{t+1}(i) - \tilde{V}_{k,t+1}(i) \right)^2 = \left\| V_{t+1} - \tilde{V}_{k,t+1} \right\|^2_{I_{S_k}} = \left\| T\tilde{V}_t - \tilde{V}_{k,t+1} \right\|^2_{I_{S_k}}$$
Dynamic Programming

LCS Value Iteration (cont'd):
- The minimum is the orthogonal projection onto the approximation space:

$$\tilde{V}_{k,t+1} = \Pi_{I_{S_k}} V_{t+1} = \Pi_{I_{S_k}} T\tilde{V}_t$$

- We also need to determine the new mixing weights by evaluating the new approximation error:

$$\varepsilon_{k,t+1} = \frac{1}{\mathrm{Tr}(I_{S_k})} \left\| V_{t+1} - \tilde{V}_{k,t+1} \right\|^2_{I_{S_k}} = \frac{1}{\mathrm{Tr}(I_{S_k})} \left\| (I - \Pi_{I_{S_k}})\, T\tilde{V}_t \right\|^2_{I_{S_k}}$$

- ∴ the only time dependency is on $\tilde{V}_t$
- Thus (see the sketch below): $\tilde{V}_{t+1} = \sum_{k=1}^{K} \Psi_{k,t+1} \Pi_{I_{S_k}} T\tilde{V}_t$
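A sketch of one such step for averaging classifiers, where $\Pi_{I_{S_k}}$ reduces to taking the mean of the target over the matched states; it assumes every state is matched by at least one classifier and guards against zero error estimates. The target vector $T\tilde{V}_t$ can come from the `dp_operator` sketch above.

```python
import numpy as np

def lcs_value_iteration_step(target, S_k, nu=1.0):
    """One LCS value-iteration step with averaging classifiers.

    target: the vector T V~_t (e.g. from dp_operator above)
    S_k   : list of 0/1 match vectors I_{S_k}; every state must be matched
            by at least one classifier.
    """
    K, N = len(S_k), len(target)
    V_k = np.zeros((K, N))
    eps = np.zeros(K)
    for k, m in enumerate(S_k):
        mean_k = (m * target).sum() / m.sum()       # Pi_{I_{S_k}}: matched-state mean
        V_k[k] = m * mean_k
        eps[k] = (m * (target - mean_k) ** 2).sum() / m.sum()  # new approx. error
    raw = np.stack(S_k) * (eps[:, None] + 1e-12) ** (-nu)      # accuracy-based mixing
    Psi = raw / raw.sum(axis=0)
    return (Psi * V_k).sum(axis=0)                  # V~_{t+1}
```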
Dynamic Programming

LCS Asynchronous Value Iteration:
- Only update $\tilde{V}_k$ for the matched state
- Minimise: $\sum_{m=0}^{t} I_{S_k}(i_m) \left( (T\tilde{V}_m)(i_m) - \omega_{k,t}' \phi(i_m) \right)^2$
- Thus the approximation costs are weighted by the state distribution:

$$\sum_{i \in S_k} \pi(i) \left( (T\tilde{V})(i) - \omega_k' \phi(i) \right)^2 = \left\| T\tilde{V} - \Phi \omega_k \right\|^2_{D_k}$$

- ∴ $\tilde{V}_{k,t+1} = \Pi_{D_k} T\tilde{V}_t$ (see the sketch below)

$$\tilde{V}_{t+1} = \sum_{k=1}^{K} \Psi_{k,t+1} \Pi_{D_k} T\tilde{V}_t$$
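A sketch of the weighted projection $\Pi_{D_k}$ as a weighted least-squares problem; the rescaling by $\sqrt{d_k}$ is a standard trick, not something taken from the slides.

```python
import numpy as np

def weighted_projection(target, Phi, d_k):
    """Pi_{D_k}: project `target` onto span(Phi) in the D_k-weighted norm.

    Solves  min_w  sum_i d_k(i) * (target(i) - phi(i)' w)^2,
    where d_k(i) = I_{S_k}(i) * pi(i), by rescaling rows with sqrt(d_k)
    and solving an ordinary least-squares problem.
    """
    sw = np.sqrt(d_k)
    w_k, *_ = np.linalg.lstsq(sw[:, None] * Phi, sw * target, rcond=None)
    return Phi @ w_k, w_k
```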
Dynamic Programming

Also derived:
- LCS Q-Learning
  - LMS implementation
  - Kalman Filter implementation
  - Demonstration that the weight update uses the normalised form of local gradient descent
- Model-based Policy Iteration:

$$\tilde{V}_{t+1} = \sum_{k=1}^{K} \Psi_k \Pi_{D_k} T_\mu \tilde{V}_t$$

- Step-wise Policy Iteration
- TD(λ) is not possible in LCS due to the re-evaluation of the mixing weights
Dynamic Programming

Is LCS function approximation a non-expansion w.r.t. the maximum norm?
- We derive that:
  - Constant mixing: yes
  - Arbitrary mixing: not necessarily
  - Accuracy-based mixing: a non-expansion (dependent upon a conjecture)

Is LCS function approximation a non-expansion w.r.t. a weighted norm?
- We know $T_\mu$ is a contraction
- We show $\Pi_{D_k}$ is a non-expansion
- We demonstrate:
  - Single classifier: convergence to a fixed point
  - Disjoint classifiers: convergence to a fixed point
  - Arbitrary mixing: the spectral radius can be ≥ 1 even for some fixed weightings
Classifier Replacement

How can we define the optimal population formally?
- What do we want to reach?
- How does the global fitness of one classifier differ from its local fitness (i.e. accuracy)?
- What is the quality of a population?

Methodology:
- Examine methods to define the optimal population
- Relate it to generalised hill-climbing
- Map it to other search criteria
Future Work

- Formalising and analysing classifier replacement
- Interaction of replacement with function approximation
- Interaction of replacement with DP
- Handling stochastic environments
- Error bounds and rates of convergence
- ...

Big Answer: LCS is an integrative framework
... but we need the framework definition!
