This document presents intrusion response formulated as an optimal stopping problem, specifically a Dynkin game. The system evolves in discrete time steps with an attacker and a defender. The defender observes the infrastructure and aims to determine, based on its observations, the optimal times to stop and take defensive actions. The interaction is modeled as a zero-sum partially observed stochastic game, and reinforcement learning is used to learn optimal strategies for both players.
Intrusion Response through Optimal Stopping
1. 1/43
Intrusion Response through Optimal Stopping
New York University - Invited Talk
Kim Hammar
kimham@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
[Figure: episode timeline from time-step t = 1 to t = T, showing the intrusion event, the period when the intrusion is ongoing, early stopping times, and stopping times that affect the intrusion.]
2. 2/43
Use Case: Intrusion Response
- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - The defender defends the infrastructure by monitoring and active defense
  - Has partial observability
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation and pivoting

[Figure: IT infrastructure with a gateway, an IPS that raises alerts to the defender, clients, the attacker, and 31 networked components.]
3. 3/43
Intrusion Response from the Defender’s Perspective
[Figure: time series over t = 1, …, 200 of the defender's belief b, the number of IDS alerts, and the number of logins.]
When to take a defensive action?
Which action to take?
12. 4/43
Formulating Intrusion Response as a Stopping Problem
[Figure: episode timeline from t = 1 to t = T, showing the intrusion event, the period when the intrusion is ongoing, early stopping times versus stopping times that affect the intrusion; and the IT infrastructure with gateway, IDS alerts, defender, attacker, and clients.]

- The system evolves in discrete time-steps.
- The defender observes the infrastructure (IDS, log files, etc.).
- An intrusion occurs at an unknown time.
- The defender can make L stops.
- Each stop is associated with a defensive action.
- The final stop shuts down the infrastructure.
- Based on the observations, when is it optimal to stop?
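The stopping formulation above can be sketched in code: a toy episode in which the intrusion starts at a geometrically distributed time and a naive defender spends its L stops whenever alerts exceed a fixed threshold. All parameters (alert levels, threshold, p) are invented for illustration; they are not the talk's fitted model.

```python
import random

def run_episode(L=3, p=0.2, threshold=100, T=200, seed=0):
    """One illustrative episode of the defender's L-stop problem.

    The intrusion starts at a geometrically distributed time (one Ber(p)
    trial per step) and alert counts rise once it is ongoing. The
    defender follows a naive rule: spend a stop whenever alerts exceed
    a fixed threshold. All numbers are made up for illustration.
    """
    rng = random.Random(seed)
    stops, intrusion = [], False
    for t in range(1, T + 1):
        if not intrusion and rng.random() < p:
            intrusion = True                      # intrusion event
        alerts = (150 if intrusion else 50) + rng.randint(-20, 20)
        if alerts >= threshold and len(stops) < L:
            stops.append(t)                       # take a defensive action
            if len(stops) == L:                   # final stop: shut down
                break
    return stops

print(run_episode())
```

With these toy numbers the alert level jumps well above the threshold once the intrusion starts, so the naive rule stops shortly after the intrusion event; the interesting question, addressed next, is when stopping is *optimal*.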
16. 5/43
Formulating Network Intrusion as a Stopping Problem
[Figure: episode timeline with the attacker's two stops, the intrusion event (1st stop) and the end of the intrusion (2nd stop), together with the IT infrastructure.]

- The system evolves in discrete time-steps.
- The attacker observes the infrastructure (IDS, log files, etc.).
- The first stop action decides when to intrude.
- The attacker can make 2 stops.
- The second stop action terminates the intrusion.
- Based on the observations & the defender's belief, when is it optimal to stop?
19. 6/43
A Dynkin Game Between the Defender and the Attacker [1]

[Figure: game episode timeline from t = 1 to t = T showing the defender's stopping times τ1,1, τ1,2, τ1,3, the attacker's stopping time τ2,1, the intrusion period, and the time the game is stopped.]

- We formalize the Dynkin game as a zero-sum partially observed one-sided stochastic game.
- The defender is the maximizing player.
- The attacker is the minimizing player.

[1] E.B. Dynkin. "A game-theoretic version of an optimal stopping problem". In: Dokl. Akad. Nauk SSSR 385 (1969), pp. 16–19.
20. 7/43
Our Approach for Solving the Dynkin Game
[Figure: framework diagram. A target infrastructure is selectively replicated in an emulation system, which is used for strategy evaluation and model estimation. Model creation and system identification yield a simulation system, where strategies are obtained through reinforcement learning and generalization. Learned strategies π are mapped back and implemented in the emulation system, enabling automation and self-learning systems.]
28. 8/43
Outline
- Use Case & Approach
  - Use case: Intrusion response
  - Approach: Optimal stopping
- Theoretical Background & Formal Model
  - Optimal stopping problem definition
  - Formulating the Dynkin game as a one-sided POSG
- Structure of π*
  - Stopping sets Sl are connected and nested; S1 is convex.
  - Existence of multi-threshold best response strategies π̃1, π̃2.
- Efficient Algorithms for Learning π*
  - T-SPSA: A stochastic approximation algorithm to learn π*
  - T-FP: A fictitious-play algorithm to approximate (π*1, π*2)
- Evaluation Results
  - Target system, digital twin, system identification, & results
- Conclusions & Future Work
34. 9/43
Optimal Stopping: A Brief History
- History:
  - Studied in the 18th century to analyze a gambler's fortune
  - Formalized by Abraham Wald in 1947 [2]
  - Since then it has been generalized and developed by Chow [3], Shiryaev & Kolmogorov [4], Bather [5], Bertsekas [6], and others.

[2] Abraham Wald. Sequential Analysis. Wiley and Sons, New York, 1947.
[3] Y. Chow, H. Robbins, and D. Siegmund. Great Expectations: The Theory of Optimal Stopping. 1971.
[4] Albert N. Shiryaev. Optimal Stopping Rules. Reprint of the 1969 Russian edition. Springer-Verlag Berlin, 2007.
[5] John Bather. Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. John Wiley and Sons, USA, 2000. ISBN: 0471976490.
[6] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Athena Scientific, Belmont, MA, USA, 2005.
35. 10/43
The Optimal Stopping Problem
- The General Problem:
  - A stochastic process (s_t)_{t=1}^T is observed sequentially
  - Two options per t: (i) continue to observe; or (ii) stop
  - Find the optimal stopping time τ*:

    τ* = argmax_τ E_τ [ Σ_{t=1}^{τ−1} γ^{t−1} R^C_{s_t s_{t+1}} + γ^{τ−1} R^S_{s_τ s_τ} ]   (1)

    where R^S_{ss'} & R^C_{ss'} are the stop/continue rewards
- The l-th of the L stopping times, τ_l, is:

  τ_l = inf{t : t > τ_{l−1}, a_t = S},  l ∈ 1, …, L,  τ_{L+1} = 0

- τ_l is a random variable from the sample space Ω to N; it depends on h_τ = (ρ_1, a_1, o_1, …, a_{τ−1}, o_τ) and is independent of a_τ, o_{τ+1}, …
- We consider the class of stopping times T_t = {τ ≤ t} ∈ F_t, where F_t is the natural filtration on h_t.
- Solution approaches: the Markovian approach and the martingale approach.
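Eq. (1) can be made concrete with a Monte Carlo estimate on a toy instance (a house-selling-style problem, not the talk's model): i.i.d. offers arrive, continuing pays nothing, and stopping pays the current offer, discounted. Sweeping candidate thresholds approximates the optimal stopping time τ*.

```python
import random

def threshold_value(alpha, gamma=0.95, episodes=5000, seed=1):
    """Monte Carlo estimate of the objective in Eq. (1) for a threshold rule.

    Toy instance: offers s_t ~ U(0, 1) arrive sequentially, R^C = 0
    (continuing pays nothing) and R^S = s_tau (the accepted offer),
    discounted by gamma. The rule stops at the first offer >= alpha.
    Purely illustrative.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        t = 1
        while True:
            s = rng.random()
            if s >= alpha:
                total += gamma ** (t - 1) * s   # gamma^(tau-1) * R^S
                break
            t += 1
    return total / episodes

# Sweep candidate thresholds; the best one approximates tau*.
values = {a: threshold_value(a) for a in (0.0, 0.3, 0.8, 0.95)}
print(max(values, key=values.get))
```

Stopping immediately (α = 0) yields roughly the mean offer 0.5, while a patient threshold trades waiting (discount) against a better offer; the sweep exhibits the interior optimum characteristic of stopping problems.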
37. 10/43
Optimal Stopping: Solution Approaches
- The Markovian approach:
  - Model the problem as an MDP or POMDP
  - A policy π* that satisfies the Bellman-Wald equation is optimal:

    π*(s) = argmax_{S,C} [ E[R^S_s] (stop),  E[R^C_s + γV*(s')] (continue) ]   ∀s ∈ S

  - Solve by backward induction, dynamic programming, or reinforcement learning
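The backward-induction route through the Bellman-Wald equation can be sketched on a two-state toy chain (state 0 = no intrusion, state 1 = intrusion); the transition probabilities and rewards below are illustrative stand-ins, not the talk's values.

```python
import numpy as np

# Backward induction for a finite-horizon optimal stopping problem.
T = 50
gamma = 1.0
P = np.array([[0.8, 0.2],    # P^C: intrusion starts w.p. 0.2 per step
              [0.0, 1.0]])   # and is absorbing
R_stop = np.array([-50.0, 100.0])   # stopping: false alarm vs. catch
R_cont = np.array([1.0, -10.0])     # continuing: service vs. damage

V = R_stop.copy()                   # at the horizon one must stop
policy = []
for t in range(T - 1, 0, -1):
    Q_cont = R_cont + gamma * P @ V           # Bellman-Wald comparison
    V = np.maximum(R_stop, Q_cont)
    policy.append(np.where(R_stop >= Q_cont, "S", "C"))
policy.reverse()
print(policy[0])                    # optimal action per state at t = 1
```

As expected, the computed policy continues while no intrusion is believed present (state 0) and stops once the intrusion state is reached (state 1).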
38. 10/43
Optimal Stopping: Solution Approaches
- The Markovian approach:
  - Assume all rewards are received upon stopping: R^∅_s
  - V*(s) majorizes R^∅_s if V*(s) ≥ R^∅_s for all s ∈ S
  - V*(s) is excessive if V*(s) ≥ Σ_{s'} P^C_{ss'} V*(s') for all s ∈ S
  - V*(s) is the minimal excessive function which majorizes R^∅_s.

[Figure: plot over s of R^∅_s, Σ_{s'} P^C_{ss'} V*(s'), and V*(s), illustrating that V* is the smallest function lying above both.]
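The minimal excessive majorant can be computed by fixed-point iteration: V_{k+1}(s) = max(R^∅_s, Σ_{s'} P^C_{ss'} γ V_k(s')). The sketch below uses a discounted variant and invented two-state numbers; the asserted properties are exactly the majorization and excessivity conditions above.

```python
import numpy as np

# Fixed-point iteration for the (discounted) minimal excessive majorant
# of R^∅. Transition matrix and rewards are illustrative toy numbers.
gamma = 0.9
P = np.array([[0.7, 0.3],
              [0.1, 0.9]])
R = np.array([5.0, 20.0])       # all reward received upon stopping

V = np.zeros(2)
for _ in range(500):
    V = np.maximum(R, gamma * P @ V)

assert np.all(V >= R)                       # V majorizes R^∅
assert np.all(V >= gamma * P @ V - 1e-9)    # V is (gamma-)excessive
print(V)
```

In state 1 stopping immediately is optimal (V equals the stop reward 20), while in state 0 it pays to continue toward state 1, so V exceeds the stop reward 5.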
39. 10/43
Optimal Stopping: Solution Approaches
- The martingale approach:
  - Model the state process as an arbitrary stochastic process
  - The reward of the optimal stopping time is given by the smallest supermartingale that stochastically dominates the process, called the Snell envelope [7].

[7] J. L. Snell. "Applications of martingale system theorems". In: Transactions of the American Mathematical Society 73 (1952), pp. 293–312.
40. 11/43
The Defender’s Optimal Stopping problem as a POMDP
- States:
  - Intrusion state s_t ∈ {0, 1}, terminal state ∅.
- Observations:
  - IDS alerts weighted by priority o_t, stops remaining l_t ∈ {1, …, L}, observation distribution f(o_t | s_t)
- Actions:
  - "Stop" (S) and "Continue" (C)
- Rewards:
  - Reward: security and service. Penalty: false alarms and intrusions.
- Transition probabilities:
  - Bernoulli process (Q_t)_{t=1}^T ∼ Ber(p) defines the intrusion start I_t ∼ Ge(p)
- Objective and horizon:
  - max E_π [ Σ_{t=1}^{T_∅} r(s_t, a_t) ],  horizon T_∅

[Figure: state-transition diagram. State 0 (t ≥ 1, no intrusion) transitions to state 1 when the intrusion starts (Q_t = 1, t ≥ 2); the final stop (l_t = 0) leads to the terminal state ∅, either preventing the intrusion or ending the episode. Inset: CDF of the intrusion start time I_t ∼ Ge(p = 0.2).]
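Since the defender acts on its belief b(1) = P(s_t = 1 | history), the POMDP above implies a standard Bayesian filter: predict with the intrusion-start dynamics, then weight by the observation likelihood f(o | s). The likelihoods below are toy Gaussians over alert counts, not the distributions fitted in the talk.

```python
import numpy as np

# Belief update b'(1) for the two-state intrusion POMDP.
p = 0.2                         # per-step intrusion start probability
P = np.array([[1 - p, p],       # geometric start time, absorbing state 1
              [0.0, 1.0]])

def f(o, s):
    """Likelihood of observing o alerts in state s (toy Gaussians)."""
    mu = 50.0 if s == 0 else 150.0
    return np.exp(-0.5 * ((o - mu) / 30.0) ** 2)

def belief_update(b1, o):
    """b'(1) = f(o|1) * P(s' = 1 | b) / normalizer (predict, then Bayes)."""
    b = np.array([1 - b1, b1])
    pred = P.T @ b                              # prediction step
    post = np.array([f(o, 0), f(o, 1)]) * pred  # filtering step
    return post[1] / post.sum()

b = 0.0
for o in [55, 60, 140, 160, 155]:   # alerts jump: intrusion becomes likely
    b = belief_update(b, o)
print(b)
```

The belief stays near zero while alerts look nominal and jumps toward one once the alert counts shift, which is exactly the signal a threshold stopping policy acts on.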
47. 12/43
The Attacker’s Optimal Stopping problem as an MDP
- States:
  - Intrusion state s_t ∈ {0, 1}, terminal state ∅, defender belief b ∈ [0, 1].
- Actions:
  - "Stop" (S) and "Continue" (C)
- Rewards:
  - Reward: denial of service and intrusion. Penalty: detection.
- Transition probabilities:
  - The intrusion starts and ends when the attacker takes stop actions
- Objective and horizon:
  - max E_π [ Σ_{t=1}^{T_∅} r(s_t, a_t) ],  horizon T_∅

[Figure: state-transition diagram. State 0 (t ≥ 1) transitions to state 1 when the intrusion starts (a_t = 1, t ≥ 2); the attacker's final stop or detection leads to the terminal state ∅.]
48. 13/43
The Dynkin Game as a One-Sided POSG
- Players:
  - Player 1 is the defender and player 2 is the attacker; hence N = {1, 2}.
- Actions:
  - A1 = A2 = {S, C}.
- Rewards:
  - Zero-sum game. The defender maximizes, the attacker minimizes.
- Observability:
  - The defender has partial observability. The attacker has full observability.
- Objective functions:

  J1(π1, π2) = E_(π1,π2) [ Σ_{t=1}^T γ^{t−1} R(s_t, a_t) ]   (3)

  J2(π1, π2) = −J1(π1, π2)   (4)

[Figure: state-transition diagram. The intrusion starts when the attacker stops (a_t^(2) = S). On the defender's final stop (l = 1, a_t^(1) = S) the intrusion is stopped w.p. φ_l or the game is terminated by the attacker's stop (a_t^(2) = S); both lead to the terminal state ∅.]
49. 14/43
Outline
- Use Case & Approach
  - Use case: Intrusion response
  - Approach: Optimal stopping
- Theoretical Background & Formal Model
  - Optimal stopping problem definition
  - Formulating the Dynkin game as a one-sided POSG
- Structure of π*
  - Stopping sets Sl are connected and nested; S1 is convex.
  - Existence of multi-threshold best response strategies π̃1, π̃2.
- Efficient Algorithms for Learning π*
  - T-SPSA: A stochastic approximation algorithm to learn π*
  - T-FP: A fictitious-play algorithm to approximate (π*1, π*2)
- Evaluation Results
  - Target system, digital twin, system identification, & results
- Conclusions & Future Work
50. 15/43
Structural Result: Optimal Multi-Threshold Policy
Theorem
Given the intrusion response POMDP, the following holds:
1. S_{l−1} ⊆ S_l for l = 2, …, L.
2. If L = 1, there exists an optimal threshold α* ∈ [0, 1] and an optimal defender policy of the form:

   π*_L(b(1)) = S ⟺ b(1) ≥ α*   (5)

3. If L ≥ 1 and f_X is totally positive of order 2 (TP2), there exist L optimal thresholds α*_l ∈ [0, 1] and an optimal defender policy of the form:

   π*_l(b(1)) = S ⟺ b(1) ≥ α*_l,  l = 1, …, L   (6)

   where α*_l is decreasing in l.
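The multi-threshold structure of the theorem is compact enough to state directly as code: with l stops remaining the defender stops iff b(1) ≥ α_l, with α_l decreasing in l. The threshold values below are illustrative, not learned ones.

```python
# Sketch of the theorem's multi-threshold policy pi*_l. With more stops
# remaining (larger l) the threshold is lower, so the defender acts
# earlier. Threshold values are invented for illustration.
alphas = {3: 0.5, 2: 0.7, 1: 0.9}   # alpha_l, decreasing in l

def pi(b1, l):
    """Stopping rule pi*_l(b(1)) = S iff b(1) >= alpha_l."""
    return "S" if b1 >= alphas[l] else "C"

# At belief 0.6 the defender spends an early stop but saves the last one:
print(pi(0.6, l=3), pi(0.6, l=1))
```

This captures the operational meaning of "α*_l decreasing in l": early, cheap defensive actions are triggered at lower suspicion levels than the final shutdown.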
58. 17/43
Lemma: V*(b) is Piece-wise Linear and Convex

Lemma
V*(b) is piece-wise linear and convex.

- The belief space B is the (|S| − 1)-dimensional unit simplex.
- |B| = ∞: beliefs are high-dimensional, (|S| − 1)-dimensional continuous vectors
- Infinite set of deterministic policies: max_{π:B→A} E_π [ Σ_t r_t ]

[Figure: B(2), the 1-dimensional unit simplex with example point (0.4, 0.6); B(3), the 2-dimensional unit simplex with example point (0.25, 0.55, 0.2).]
59. 17/43
Lemma: V*(b) is Piece-wise Linear and Convex

- Only a finite set of belief points b ∈ B is "reachable".
- Finite horizon =⇒ finite set of "conditional plans" H → A
  - i.e., the set of pure strategies in an extensive game against nature

[Figure: extensive-form game tree against nature with |A| = 2, |O| = 3, T = 2; action nodes alternate with nature's observation nodes, and each pure strategy is a conditional plan β.]
61. 18/43
Lemma: V*(b) is Piece-wise Linear and Convex

- For each conditional plan β ∈ Γ:
  - Define the vector α^β ∈ R^{|S|} such that α^β_i = V^β(i)
  - =⇒ V^β(b) = b^T α^β (linear in b).
- Thus, V*(b) = max_{β∈Γ} b^T α^β (piece-wise linear and convex)

[Figure: V*(b) as the upper envelope of the linear functions b^T α^1, …, b^T α^5 over the belief b.]
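The upper-envelope representation V*(b) = max_β b^T α^β is easy to evaluate numerically. The α-vectors below are made-up values for a two-state belief space, just to exhibit the piece-wise linear, convex shape.

```python
import numpy as np

# PWLC value function as the max over alpha-vectors (one per conditional
# plan). The vectors are invented for illustration.
alphas = np.array([[10.0, 0.0],
                   [6.0, 6.0],
                   [0.0, 12.0]])

def V_star(b1):
    """V*(b) = max_beta b^T alpha^beta for belief b = (1 - b1, b1)."""
    b = np.array([1 - b1, b1])
    return float(np.max(alphas @ b))

# Convexity check: V* at a midpoint never exceeds the chord average.
mid_ok = all(
    V_star((x + y) / 2) <= (V_star(x) + V_star(y)) / 2 + 1e-9
    for x, y in [(0.0, 1.0), (0.2, 0.8), (0.1, 0.5)]
)
print(V_star(0.0), V_star(1.0), mid_ok)
```

Because V* is a maximum of linear functions of b, convexity holds automatically; this is the property the convexity proof of S1 relies on next.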
63. 19/43
Proofs: S1 is convex [10]

- S1 is convex if:
  - for any two belief states b1, b2 ∈ S1,
  - any convex combination of b1 and b2 is also in S1,
  - i.e., b1, b2 ∈ S1 =⇒ λb1 + (1 − λ)b2 ∈ S1 for λ ∈ [0, 1].
- Since V*(b) is convex:

  V*(λb1 + (1 − λ)b2) ≤ λV*(b1) + (1 − λ)V*(b2)

- Since b1, b2 ∈ S1:

  V*(b1) = Q*(b1, S)    (S = stop)
  V*(b2) = Q*(b2, S)    (S = stop)

[10] Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing. Cambridge University Press, 2016. doi: 10.1017/CBO9781316471104.
66. 20/43
Proofs: S1 is convex [13]

Proof.
Assume b1, b2 ∈ S1. Then for any λ ∈ [0, 1]:

V*(λb1(1) + (1 − λ)b2(1))
  ≤ λV*(b1(1)) + (1 − λ)V*(b2(1))
  = λQ*(b1, S) + (1 − λ)Q*(b2, S)
  = λR^∅_{b1} + (1 − λ)R^∅_{b2}
  = Σ_s (λb1(s) + (1 − λ)b2(s)) R^∅_s
  = Q*(λb1 + (1 − λ)b2, S)
  ≤ V*(λb1(1) + (1 − λ)b2(1))

The last inequality holds because V* is optimal; the preceding equalities hold because there is just a single stop, so Q*(b, S) = R^∅_b. Hence:

Q*(λb1 + (1 − λ)b2, S) = V*(λb1(1) + (1 − λ)b2(1))

so b1, b2 ∈ S1 =⇒ (λb1 + (1 − λ)b2) ∈ S1. Therefore S1 is convex. ∎

[13] Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing. Cambridge University Press, 2016. doi: 10.1017/CBO9781316471104.
71. 20/43
Proofs: S1 is convex
[Figure: the belief space B = [0, 1] for b(1), with the convex stopping set S1 = [α∗_1, β∗_1] marked as an interval.]
18
Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing.
Cambridge University Press, 2016. doi: 10.1017/CBO9781316471104.
72. 21/43
Proofs: Single-threshold policy is optimal if L = 1
I In our case, B = [0, 1]. We know S1 is a convex subset of B.
I Consequently, S1 = [α∗, β∗]. We show that β∗ = 1.
I If b(1) = 1, using our definition of the reward function, the Bellman equation states:
π∗(1) ∈ arg max{S,C} [ 150 + V∗(∅) (a = S), −90 + Σ_{o∈O} Z(o, 1, C) V∗(b^o_C(1)) (a = C) ]
= arg max{S,C} [ 150 (a = S), −90 + V∗(1) (a = C) ]
= S, i.e. π∗(1) = Stop
I Hence 1 ∈ S1. It follows that S1 = [α∗, 1] and:
π∗(b(1)) = S if b(1) ≥ α∗, C otherwise
19
Kim Hammar and Rolf Stadler. “Learning Intrusion Prevention Policies through Optimal Stopping”. In:
International Conference on Network and Service Management (CNSM 2021).
http://dl.ifip.org/db/conf/cnsm/cnsm2021/1570732932.pdf. Izmir, Turkey, 2021.
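The argument at b(1) = 1 can be checked numerically: iterating the Bellman backup V ← max(150, −90 + V) from any starting value converges to the fixed point V∗(1) = 150, so stopping dominates continuing. A minimal sketch (the rewards 150 and −90 are the slide's example values):

```python
def bellman_at_one(v):
    # Backup at belief b(1) = 1: stopping yields 150 (and ends the episode),
    # continuing yields -90 plus the value of remaining at b(1) = 1.
    return max(150, -90 + v)

v = 0.0
for _ in range(50):          # fixed-point iteration
    v = bellman_at_one(v)

print(v)  # prints 150, so pi*(1) = Stop
```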
http://dl.ifip.org/db/conf/cnsm/cnsm2021/1570732932.pdf. Izmir, Turkey, 2021.
76. 23/43
Proofs: Nested stopping sets Sl ⊆ Sl+1
I We want to show that Sl ⊆ Sl+1
I Bellman equation:
π∗_{l−1}(b(1)) ∈ arg max{S,C} [ R^S_{b(1),l−1} + Σ_o P^o_{b(1)} V∗_{l−2}(b^o(1)) (Stop), R^C_{b(1),l−1} + Σ_o P^o_{b(1)} V∗_{l−1}(b^o(1)) (Continue) ]
I =⇒ optimal to stop if:
R^S_{b(1),l−1} − R^C_{b(1),l−1} ≥ Σ_o P^o_{b(1)} [ V∗_{l−1}(b^o(1)) − V∗_{l−2}(b^o(1)) ]   (13)
I Hence, if b(1) ∈ Sl−1, then (13) holds.
22
T. Nakai. “The problem of optimal stopping in a partially observable Markov chain”. In: Journal of
Optimization Theory and Applications 45.3 (1985), pp. 425–442. issn: 1573-2878. doi: 10.1007/BF00938445.
url: https://doi.org/10.1007/BF00938445.
79. 23/43
Proofs: Nested stopping sets Sl ⊆ Sl+1
I R^S_{b(1)} − R^C_{b(1)} ≥ Σ_o P^o_{b(1)} [ V∗_{l−1}(b^o(1)) − V∗_{l−2}(b^o(1)) ]
I We want to show that b(1) ∈ Sl−2 =⇒ b(1) ∈ Sl−1.
I It suffices to show that the LHS above is non-decreasing in l and the RHS is non-increasing in l.
I The LHS is non-decreasing by definition of the reward function.
I We show that the RHS is non-increasing by induction on k = 0, 1, . . ., where k is the iteration of value iteration.
I We know limk→∞ V^k(b) = V∗(b).
I Define W^k_l(b(1)) = V^k_l(b(1)) − V^k_{l−1}(b(1))
T. Nakai. “The problem of optimal stopping in a partially observable Markov chain”. In: Journal of
Optimization Theory and Applications 45.3 (1985), pp. 425–442. issn: 1573-2878. doi: 10.1007/BF00938445.
url: https://doi.org/10.1007/BF00938445.
85. 23/43
Proofs: Nested stopping sets Sl ⊆ Sl+1
Proof.
W^0_l(b(1)) = 0 ∀l. Assume W^{k−1}_{l−1}(b(1)) − W^{k−1}_l(b(1)) ≥ 0.
W^k_{l−1}(b(1)) − W^k_l(b(1)) = 2V^k_{l−1} − V^k_{l−2} − V^k_l
= 2R^{a^k_{l−1}}_{b(1)} − R^{a^k_l}_{b(1)} − R^{a^k_{l−2}}_{b(1)} + Σ_{o∈O} P^o_{b(1)} [ 2V^{k−1}_{l−1−a^k_{l−1}}(b^o(1)) − V^{k−1}_{l−a^k_l}(b^o(1)) − V^{k−1}_{l−2−a^k_{l−2}}(b^o(1)) ]
We want to show that the above is non-negative. This depends on a^k_l, a^k_{l−1}, a^k_{l−2}.
There are four cases to consider: (1) b(1) ∈ S^k_l ∩ S^k_{l−1} ∩ S^k_{l−2}; (2) b(1) ∈ S^k_l ∩ C^k_{l−1} ∩ C^k_{l−2}; (3) b(1) ∈ S^k_l ∩ S^k_{l−1} ∩ C^k_{l−2}; (4) b(1) ∈ C^k_l ∩ C^k_{l−1} ∩ C^k_{l−2}.
The other cases, e.g. b(1) ∈ S^k_l ∩ C^k_{l−1} ∩ S^k_{l−2}, can be discarded due to the induction assumption.
88. 23/43
Proofs: Nested stopping sets Sl ⊆ Sl+1
Proof.
If b(1) ∈ S^k_l ∩ S^k_{l−1} ∩ S^k_{l−2}, then:
W^k_{l−1}(b(1)) − W^k_l(b(1)) = Σ_{o∈O} P^o_{b(1)} [ W^{k−1}_{l−2}(b^o(1)) − W^{k−1}_{l−1}(b^o(1)) ]
which is non-negative by the induction hypothesis.
If b(1) ∈ S^k_l ∩ C^k_{l−1} ∩ C^k_{l−2}, then:
W^k_l(b(1)) − W^k_{l−1}(b(1)) = R^C_{b(1)} − R^S_{b(1)} + Σ_{o∈O} P^o_{b(1)} W^{k−1}_{l−1}(b^o(1))
The Bellman equation implies that if b(1) ∈ C_{l−1}, then:
R^C_{b(1)} − R^S_{b(1)} + Σ_{o∈O} P^o_{b(1)} W^{k−1}_{l−1}(b^o(1)) ≥ 0
91. 23/43
Proofs: Nested stopping sets Sl ⊆ Sl+1
Proof.
If b(1) ∈ S^k_l ∩ S^k_{l−1} ∩ C^k_{l−2}, then:
W^k_{l−1}(b(1)) − W^k_l(b(1)) = R^S_{b(1)} − R^C_{b(1)} − Σ_{o∈O} P^o_{b(1)} W^{k−1}_{l−1}(b^o(1))
The Bellman equation implies that if b(1) ∈ S^k_{l−1}, then:
R^S_{b(1)} − R^C_{b(1)} − Σ_{o∈O} P^o_{b(1)} W^{k−1}_{l−1}(b^o(1)) ≥ 0
If b(1) ∈ C^k_l ∩ C^k_{l−1} ∩ C^k_{l−2}, then:
W^k_{l−1}(b(1)) − W^k_l(b(1)) = Σ_{o∈O} P^o_{b(1)} [ W^{k−1}_{l−1}(b^o(1)) − W^{k−1}_l(b^o(1)) ]
which is non-negative by the induction hypothesis.
37
T. Nakai. “The problem of optimal stopping in a partially observable Markov chain”. In: Journal of
Optimization Theory and Applications 45.3 (1985), pp. 425–442. issn: 1573-2878. doi: 10.1007/BF00938445.
92. 24/43
Proofs: Nested stopping sets Sl ⊆ Sl+1
S1 ⊆ S2 still allows:
[Figure: the belief space [0, 1] for b(1), where S1 is a single interval but S2 consists of two disjoint intervals.]
We need to show that Sl is connected, for all l ∈ {1, . . . , L}.
38
T. Nakai. “The problem of optimal stopping in a partially observable Markov chain”. In: Journal of
Optimization Theory and Applications 45.3 (1985), pp. 425–442. issn: 1573-2878. doi: 10.1007/BF00938445.
url: https://doi.org/10.1007/BF00938445.
93. 25/43
Proofs: Connected stopping sets Sl
I Sl is connected if b(1) ∈ Sl, b′(1) ≥ b(1) =⇒ b′(1) ∈ Sl
I If b(1) ∈ Sl we use the Bellman eq. to obtain:
R^S_{b(1)} − R^C_{b(1)} + Σ_o P^o_{b(1)} [ V∗_{l−1}(b^o(1)) − V∗_l(b^o(1)) ] ≥ 0
I We need to show that the above inequality holds for any b′(1) ≥ b(1)
39
Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on
Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781.
94. 26/43
Proofs: Monotone belief update
Lemma (Monotone belief update)
Given two beliefs b1(1) ≥ b2(1), if the transition probabilities and the observation probabilities are Totally Positive of Order 2 (TP2), then b^o_{a,1}(1) ≥ b^o_{a,2}(1), where b^o_{a,1}(1) and b^o_{a,2}(1) denote the beliefs updated with the Bayesian filter after taking action a ∈ A and observing o ∈ O.
See Theorem 10.3.1 and its proof on pp. 225 and 238 in the reference below.
40
Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing.
Cambridge University Press, 2016. doi: 10.1017/CBO9781316471104.
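The lemma can be illustrated numerically. A minimal sketch, assuming a hypothetical two-state chain with TP2 transition and observation matrices (the matrices below are illustrative, not the paper's):

```python
import numpy as np

# Hypothetical TP2 matrices for a two-state chain
# (state 0 = no intrusion, 1 = intrusion).
T = np.array([[0.9, 0.1],
              [0.0, 1.0]])          # TP2: det = 0.9 >= 0
Z = np.array([[0.8, 0.2],
              [0.3, 0.7]])          # TP2: det = 0.5 >= 0

def belief_update(b1, o):
    """Bayesian filter: predict with T, then correct with observation o."""
    pred = np.array([1.0 - b1, b1]) @ T
    post = Z[:, o] * pred
    return post[1] / post.sum()

# Monotonicity: a higher prior belief yields a higher posterior belief.
grid = np.linspace(0.0, 1.0, 101)
for o in (0, 1):
    updated = [belief_update(b, o) for b in grid]
    assert all(x <= y + 1e-12 for x, y in zip(updated, updated[1:]))
print("belief update is monotone for both observations")
```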
95. 27/43
Proofs: Necessary Condition, Total Positivity of Order 2
I A row-stochastic matrix is totally positive of order 2 (TP2) if:
I The rows of the matrix are stochastically monotone
I Equivalently, all second-order minors are non-negative.
I Example:
A =
[0.3 0.5 0.2]
[0.2 0.4 0.4]
[0.1 0.2 0.7]   (14)
There are (3 choose 2)² = 9 second-order minors:
det [0.3 0.5; 0.2 0.4] = 0.02, det [0.2 0.4; 0.1 0.2] = 0, . . . etc. (15)
Since all minors are non-negative, the matrix is TP2
41
Samuel Karlin. “Total positivity, absorption probabilities and applications”. In: Transactions of the American
Mathematical Society 111 (1964).
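The minor check in the example above is mechanical and easy to automate; a minimal sketch:

```python
import numpy as np
from itertools import combinations

def is_tp2(M, tol=1e-12):
    """Check total positivity of order 2: every 2x2 minor is non-negative."""
    rows, cols = M.shape
    for r1, r2 in combinations(range(rows), 2):
        for c1, c2 in combinations(range(cols), 2):
            minor = M[r1, c1] * M[r2, c2] - M[r1, c2] * M[r2, c1]
            if minor < -tol:
                return False
    return True

# The example matrix from equation (14):
A = np.array([[0.3, 0.5, 0.2],
              [0.2, 0.4, 0.4],
              [0.1, 0.2, 0.7]])
print(is_tp2(A))  # True: all 9 second-order minors are non-negative
```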
96. 28/43
Proofs: Connected stopping sets Sl
I Since the transition probabilities are TP2 by definition, and we assume the observation probabilities are TP2, the condition for showing that the stopping sets are connected reduces to the following.
I Show that the expression below is weakly increasing in b(1):
R^S_{b(1)} − R^C_{b(1)} + V∗_{l−1}(b(1)) − V∗_l(b(1))
I We prove this by induction on k.
42
Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on
Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781.
99. 29/43
Proofs: Connected stopping sets Sl
Assume R^S_{b(1)} − R^C_{b(1)} + V^{k−1}_{l−1}(b(1)) − V^{k−1}_l(b(1)) is weakly increasing in b(1).
R^S_{b(1)} − R^C_{b(1)} + V^k_{l−1}(b(1)) − V^k_l(b(1))
= R^S_{b(1)} − R^C_{b(1)} + R^{a^k_{l−1}}_{b(1)} − R^{a^k_l}_{b(1)} + Σ_{o∈O} P^o_{b(1)} [ V^{k−1}_{l−1−a^k_{l−1}}(b^o(1)) − V^{k−1}_{l−a^k_l}(b^o(1)) ]
We want to show that the above is weakly increasing in b(1). This depends on a^k_l and a^k_{l−1}.
There are three cases to consider:
1. b(1) ∈ S^k_l ∩ S^k_{l−1}
2. b(1) ∈ S^k_l ∩ C^k_{l−1}
3. b(1) ∈ C^k_l ∩ C^k_{l−1}
45
Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on
Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781.
101. 30/43
Proofs: Connected stopping sets Sl
Proof.
If b(1) ∈ S^k_l ∩ S^k_{l−1}, then:
R^S_{b(1)} − R^C_{b(1)} + V^k_{l−1}(b(1)) − V^k_l(b(1)) = R^S_{b(1)} − R^C_{b(1)} + Σ_{o∈O} P^o_{b(1)} [ V^{k−1}_{l−2}(b^o(1)) − V^{k−1}_{l−1}(b^o(1)) ]
which is weakly increasing in b(1) by the induction hypothesis.
If b(1) ∈ S^k_l ∩ C^k_{l−1}, then:
R^S_{b(1)} − R^C_{b(1)} + V^k_{l−1}(b(1)) − V^k_l(b(1)) = Σ_{o∈O} P^o_{b(1)} [ V^{k−1}_{l−1}(b^o(1)) − V^{k−1}_{l−1}(b^o(1)) ] = 0
which is trivially weakly increasing in b(1).
47
Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on
Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781.
102. 30/43
Proofs: Connected stopping sets Sl
Proof.
If b(1) ∈ C^k_l ∩ C^k_{l−1}, then:
R^S_{b(1)} − R^C_{b(1)} + V^k_{l−1}(b(1)) − V^k_l(b(1)) = R^S_{b(1)} − R^C_{b(1)} + Σ_{o∈O} P^o_{b(1)} [ V^{k−1}_{l−1}(b^o(1)) − V^{k−1}_l(b^o(1)) ]
which is weakly increasing in b(1) by the induction hypothesis.
Hence, if b(1) ∈ Sl and b′(1) ≥ b(1), then b′(1) ∈ Sl. Therefore, Sl is connected.
48
Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on
Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781.
103. 31/43
Proofs: Optimal multi-threshold policy π∗_l
We have shown that:
I S1 = [α∗_1, 1]
I Sl ⊆ Sl+1
I Sl is connected (convex) for l = 1, . . . , L
It follows that Sl = [α∗_l, 1] and α∗_1 ≥ α∗_2 ≥ . . . ≥ α∗_L.
[Figure: the belief space B = [0, 1] with nested stopping sets S1 ⊆ S2 ⊆ . . . ⊆ SL and thresholds α∗_1 ≥ α∗_2 ≥ . . . ≥ α∗_L.]
49
Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on
Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781.
104. 32/43
Structural Result: Best Response Multi-Threshold Attacker Strategy
Theorem
Given the intrusion MDP, the following holds:
1. Given a defender strategy π1 ∈ Π1 where π1(S|b(1)) is non-decreasing in b(1) and π1(S|1) = 1, there exist values β̃_{0,1}, β̃_{1,1}, . . ., β̃_{0,L}, β̃_{1,L} ∈ [0, 1] and a best response strategy π̃2 ∈ B2(π1) for the attacker that satisfies
π̃_{2,l}(0, b(1)) = C ⇐⇒ π_{1,l}(S|b(1)) ≥ β̃_{0,l}   (16)
π̃_{2,l}(1, b(1)) = S ⇐⇒ π_{1,l}(S|b(1)) ≥ β̃_{1,l}   (17)
for l ∈ {1, . . . , L}.
Proof.
Follows the same idea as the proof for the defender case; see the reference below.
50
Kim Hammar and Rolf Stadler. Learning Near-Optimal Intrusion Responses Against Dynamic Attackers. 2023.
105. 33/43
Outline
I Use Case & Approach
I Use case: Intrusion response
I Approach: Optimal stopping
I Theoretical Background & Formal Model
I Optimal stopping problem definition
I Formulating the Dynkin game as a one-sided POSG
I Structure of π∗
I Stopping sets Sl are connected and nested; S1 is convex.
I Existence of multi-threshold best response strategies π̃1, π̃2.
I Efficient Algorithms for Learning π∗
I T-SPSA: a stochastic approximation algorithm to learn π∗
I T-FP: a fictitious-play algorithm to approximate (π∗_1, π∗_2)
I Evaluation & Results
I Target system, digital twin, system identification, results
I Conclusions & Future Work
106. 34/43
Threshold-SPSA to Learn Best Responses
[Figure: a mixed threshold strategy π̃_{1,l,θ̃(1)}(S | b(1)) as a smooth step function of the belief b(1), with threshold σ(θ̃(1)_l).]
I Parameterizes π̃_i through threshold vectors according to Theorem 1:
ϕ(a, b) ≜ (1 + (b(1 − σ(a)) / (σ(a)(1 − b)))^{−20})^{−1}   (18)
π̃_{i,θ̃(i)}(S | b(1)) ≜ ϕ(θ̃(i)_l, b(1))   (19)
I The parameterized strategies are mixed (and differentiable) strategies that approximate threshold strategies.
I Update the threshold vectors θ(i) iteratively using SPSA.
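A minimal sketch of the parameterization and one SPSA step. The smoothed threshold ϕ follows equations (18)-(19); the objective J and its parameters are hypothetical stand-ins for the episodic-reward estimate obtained from rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def phi(a, b, kappa=20.0):
    """Smoothed threshold of eq. (18): ~1 when b >= sigma(a), ~0 below."""
    b = np.clip(b, 1e-6, 1 - 1e-6)
    ratio = (b * (1.0 - sigmoid(a))) / (sigmoid(a) * (1.0 - b))
    return 1.0 / (1.0 + ratio ** (-kappa))

def J(theta):
    """Hypothetical objective: a placeholder for the average episodic
    reward of the mixed threshold strategy (here, peaked at threshold 0.7)."""
    return -np.sum((sigmoid(theta) - 0.7) ** 2)

def spsa_step(theta, a_k=0.5, c_k=0.1):
    """One simultaneous-perturbation gradient ascent step."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    g_hat = (J(theta + c_k * delta) - J(theta - c_k * delta)) / (2 * c_k * delta)
    return theta + a_k * g_hat

theta = np.zeros(3)               # one threshold parameter per stop l = 1..L
for _ in range(200):
    theta = spsa_step(theta)
print(np.round(sigmoid(theta), 2))  # thresholds approach the optimum 0.7
```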
107. 35/43
Threshold-Fictitious Play to Approximate an Equilibrium
[Figure: fictitious play iterations: π̃2 ∈ B2(π1) and π̃1 ∈ B1(π2), then π̃′2 ∈ B2(π′1) and π̃′1 ∈ B1(π′2), . . ., converging to the equilibrium with π∗_2 ∈ B2(π∗_1) and π∗_1 ∈ B1(π∗_2).]
Fictitious play: iterative averaging of best responses.
I Learn best response strategies iteratively through T-SPSA
I Average best responses to approximate the equilibrium
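The averaging principle can be illustrated on a small zero-sum matrix game (matching pennies, a stand-in for the intrusion game; in T-FP the best responses would instead come from T-SPSA): each player best-responds to the opponent's empirical average strategy, and the empirical averages converge to the mixed equilibrium (0.5, 0.5).

```python
import numpy as np

# Zero-sum payoff matrix for the row player (matching pennies).
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

counts1 = np.ones(2)   # empirical action counts (uniform prior)
counts2 = np.ones(2)
for _ in range(5000):
    avg1 = counts1 / counts1.sum()   # empirical average strategies
    avg2 = counts2 / counts2.sum()
    br1 = np.argmax(A @ avg2)        # best response to the opponent's average
    br2 = np.argmin(avg1 @ A)        # column player minimizes row payoff
    counts1[br1] += 1                # fictitious play: average best responses
    counts2[br2] += 1

print(np.round(counts1 / counts1.sum(), 2),
      np.round(counts2 / counts2.sum(), 2))
```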
108. 36/43
Comparison against State-of-the-art Algorithms
[Figure: reward per episode vs. training time (0–60 min) against Novice, Experienced, and Expert attackers, comparing PPO, Threshold-SPSA, Shiryaev's algorithm (α = 0.75), and the HSVI upper bound.]
111. 38/43
Creating a Digital Twin of the Target Infrastructure
[Figure: the method pipeline. The target infrastructure is selectively replicated in an emulation system; model creation and system identification yield a simulation system; reinforcement learning and generalization produce strategies π, which are evaluated in the emulation and implemented back in the target infrastructure, enabling automation and self-learning systems.]
112. 39/43
Creating a Digital Twin of the Target Infrastructure
I Emulate hosts with Docker containers
I Emulate the IPS and vulnerabilities with software
I Network isolation and traffic shaping through NetEm in the Linux kernel
I Enforce resource constraints using cgroups
I Emulate client arrivals with a Poisson process
I Internal connections are full-duplex and lossless, with bit capacities of 1000 Mbit/s
I External connections are full-duplex, with bit capacities of 100 Mbit/s, 0.1% packet loss in normal operation, and random bursts of 1% packet loss
[Figure: the emulated infrastructure topology with the attacker, clients, the defender, a gateway IPS that raises alerts, and 31 internal hosts.]
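The link characteristics above can be approximated with NetEm via tc; a hedged sketch (the interface names eth0/eth1 and the loss-correlation value are placeholders, not the testbed's actual configuration):

```shell
# An "external" link: 100 Mbit/s with 0.1% steady-state packet loss;
# the correlation parameter makes the losses bursty.
tc qdisc add dev eth0 root netem rate 100mbit loss 0.1% 25%

# An "internal" link: 1000 Mbit/s, no loss.
tc qdisc add dev eth1 root netem rate 1000mbit

# Inspect and remove the configuration.
tc qdisc show dev eth0
tc qdisc del dev eth0 root
```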
118. 39/43
Our Approach for Automated Network Security
[Figure: the method pipeline. The target infrastructure is selectively replicated in an emulation system; model creation and system identification yield a simulation system; reinforcement learning and generalization produce strategies π, which are evaluated in the emulation and implemented back in the target infrastructure, enabling automation and self-learning systems.]
119. 40/43
System Identification
[Figure: fitted probability distributions f̂_O(o_t | 0) and f̂_O(o_t | 1) of the number of IPS alerts weighted by priority, o_t ∈ [0, 9000], for states s_t = 0 and s_t = 1, together with the fitted model.]
I The distribution f_O of defender observations (system metrics) is unknown.
I We fit a Gaussian mixture distribution f̂_O as an estimate of f_O in the target infrastructure.
I For each state s, we obtain the conditional distribution f̂_{O|s} through expectation-maximization.
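A minimal expectation-maximization sketch for a one-dimensional, two-component Gaussian mixture (the data here are synthetic stand-ins for per-state alert observations, not the testbed's measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic observations: a mixture of two Gaussians standing in
# for alert counts under s = 0 and s = 1.
x = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(5.0, 1.5, 600)])

# Initialize two components from data quantiles.
mu = np.quantile(x, [0.25, 0.75])
var = np.array([x.var(), x.var()])
w = np.array([0.5, 0.5])

for _ in range(100):                         # EM iterations
    # E-step: posterior responsibilities of each component.
    dens = w * np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances.
    n_k = resp.sum(axis=0)
    w = n_k / x.size
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    var = (resp * (x[:, None] - mu)**2).sum(axis=0) / n_k

print(np.round(np.sort(mu), 1))   # component means near 0.0 and 5.0
```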
120. 41/43
Learning Curves in Simulation and Digital Twin
[Figure: learning curves against Novice, Experienced, and Expert attackers: reward per episode, episode length (steps), P[intrusion interrupted], P[early stopping], and duration of intrusion vs. training time (0–60 min), comparing π_θ,l in simulation, π_θ,l in emulation, a (∆x + ∆y) ≥ 1 baseline, the Snort IPS baseline, and the upper bound.]
121. 42/43
Conclusions
I We develop a method to
automatically learn security strategies.
I We apply the method to an intrusion
response use case.
I We design a solution framework guided
by the theory of optimal stopping.
I We present several theoretical results
on the structure of the optimal
solution.
I We show numerical results in a
realistic emulation environment.
[Figure: the emulation–simulation pipeline: the target system is selectively replicated in an emulation; model creation and system identification produce a simulation where strategies are learned; strategies π are mapped back and implemented in the emulation.]
122. 43/43
Current and Future Work
[Figure: an MDP trajectory s_t, s_{t+1}, s_{t+2}, s_{t+3}, . . . over time, with rewards r_{t+1}, r_{t+2}, r_{t+3}, . . ., r_T.]
1. Extend use case
I Additional defender actions
I Utilize SDN controller and NFV-based defenses
I Increase observation space and attacker model
I More heterogeneous client population
2. Extend solution framework
I Model-predictive control
I Rollout-based techniques
I Extend system identification algorithm
3. Extend theoretical results
I Exploit symmetries and causal structure
I Utilize theory to improve sample efficiency
I Decompose solution framework hierarchically