IAC 2024 - IA Fast Track to Search Focused AI Solutions
Self-learning systems for cyber security
1. 1/30
Self-Learning Systems for Cyber Security
NSE Seminar
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
April 9, 2021
4. 4/30
Challenges: Evolving and Automated Attacks
I Challenges:
I Evolving & automated attacks
I Complex infrastructures
Attacker Client 1 Client 2 Client 3
Defender
R1
5. 4/30
Goal: Automation and Learning
I Challenges
I Evolving & automated attacks
I Complex infrastructures
I Our Goal:
I Automate security tasks
I Adapt to changing attack methods
Attacker Client 1 Client 2 Client 3
Defender
R1
6. 4/30
Approach: Game Model & Reinforcement Learning
I Challenges:
I Evolving & automated attacks
I Complex infrastructures
I Our Goal:
I Automate security tasks
I Adapt to changing attack methods
I Our Approach:
I Model network attack and defense as
games.
I Use reinforcement learning to learn
policies.
I Incorporate learned policies in
self-learning systems.
Attacker Client 1 Client 2 Client 3
Defender
R1
7. 5/30
State of the Art
I Game-Learning Programs:
I TD-Gammon, AlphaGo Zero1
, OpenAI Five etc.
I =⇒ Impressive empirical results of RL and self-play
I Attack Simulations:
I Automated threat modeling2
, automated intrusion detection
etc.
I =⇒ Need for automation and better security tooling
I Mathematical Modeling:
I Game theory3
I Markov decision theory
I =⇒ Many security operations involves
strategic decision making
1
David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550 (Oct. 2017),
pp. 354–. url: http://dx.doi.org/10.1038/nature24270.
2
Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. “A Meta Language for Threat Modeling and
Attack Simulations”. In: Proceedings of the 13th International Conference on Availability, Reliability and Security.
ARES 2018. Hamburg, Germany: Association for Computing Machinery, 2018. isbn: 9781450364485. doi:
10.1145/3230833.3232799. url: https://doi.org/10.1145/3230833.3232799.
3
Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st. USA:
Cambridge University Press, 2010. isbn: 0521119324.
8. 5/30
State of the Art
I Game-Learning Programs:
I TD-Gammon, AlphaGo Zero4
, OpenAI Five etc.
I =⇒ Impressive empirical results of RL and self-play
I Attack Simulations:
I Automated threat modeling5
, automated intrusion detection
etc.
I =⇒ Need for automation and better security tooling
I Mathematical Modeling:
I Game theory6
I Markov decision theory
I =⇒ Many security operations involves
strategic decision making
4
David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550 (Oct. 2017),
pp. 354–. url: http://dx.doi.org/10.1038/nature24270.
5
Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. “A Meta Language for Threat Modeling and
Attack Simulations”. In: Proceedings of the 13th International Conference on Availability, Reliability and Security.
ARES 2018. Hamburg, Germany: Association for Computing Machinery, 2018. isbn: 9781450364485. doi:
10.1145/3230833.3232799. url: https://doi.org/10.1145/3230833.3232799.
6
Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st. USA:
Cambridge University Press, 2010. isbn: 0521119324.
9. 5/30
State of the Art
I Game-Learning Programs:
I TD-Gammon, AlphaGo Zero7
, OpenAI Five etc.
I =⇒ Impressive empirical results of RL and self-play
I Attack Simulations:
I Automated threat modeling8
, automated intrusion detection
etc.
I =⇒ Need for automation and better security tooling
I Mathematical Modeling:
I Game theory9
I Markov decision theory
I =⇒ Many security operations involves
strategic decision making
7
David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550 (Oct. 2017),
pp. 354–. url: http://dx.doi.org/10.1038/nature24270.
8
Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. “A Meta Language for Threat Modeling and
Attack Simulations”. In: Proceedings of the 13th International Conference on Availability, Reliability and Security.
ARES 2018. Hamburg, Germany: Association for Computing Machinery, 2018. isbn: 9781450364485. doi:
10.1145/3230833.3232799. url: https://doi.org/10.1145/3230833.3232799.
9
Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st. USA:
Cambridge University Press, 2010. isbn: 0521119324.
10. 6/30
Our Work
I Use Case: Intrusion Prevention
I Our Method:
I Emulating computer infrastructures
I System identification and model creation
I Reinforcement learning and generalization
I Results:
I Learning to Capture The Flag
I Learning to Detect Network Intrusions
I Conclusions and Future Work
11. 7/30
Use Case: Intrusion Prevention
I A Defender owns an infrastructure
I Consists of connected components
I Components run network services
I Defender defends the infrastructure
by monitoring and patching
I An Attacker seeks to intrude on the
infrastructure
I Has a partial view of the
infrastructure
I Wants to compromise specific
components
I Attacks by reconnaissance,
exploitation and pivoting
Attacker Client 1 Client 2 Client 3
Defender
R1
12. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
13. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
14. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
15. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
16. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
17. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
18. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
19. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
20. 8/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation &
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning &
Generalization
Policy evaluation &
Model estimation
Automation &
Self-learning systems
21. 9/30
Emulation System Σ Configuration Space
σi
*
* *
172.18.4.0/24
172.18.19.0/24
172.18.61.0/24
Emulated Infrastructures
R1 R1 R1
Emulation
A cluster of machines that runs a virtualized infrastructure
which replicates important functionality of target systems.
I The set of virtualized configurations define a
configuration space Σ = hA, O, S, U, T , Vi.
I A specific emulation is based on a configuration σi ∈ Σ.
22. 9/30
Emulation System Σ Configuration Space
σi
*
* *
172.18.4.0/24
172.18.19.0/24
172.18.61.0/24
Emulated Infrastructures
R1 R1 R1
Emulation
A cluster of machines that runs a virtualized infrastructure
which replicates important functionality of target systems.
I The set of virtualized configurations define a
configuration space Σ = hA, O, S, U, T , Vi.
I A specific emulation is based on a configuration σi ∈ Σ.
23. 10/30
Emulation: Execution Times of Replicated Operations
0 500 1000 1500 2000
Time Cost (s)
10−5
10−4
10−3
10−2
Normalized
Frequency
Action execution times (costs)
|N| = 25
0 500 1000 1500 2000
Time Cost (s)
10−5
10−4
10−3
10−2
Action execution times (costs)
|N| = 50
0 500 1000 1500 2000
Time Cost (s)
10−5
10−4
10−3
10−2
Action execution times (costs)
|N| = 75
0 500 1000 1500 2000
Time Cost (s)
10−5
10−4
10−3
10−2
Action execution times (costs)
|N| = 100
I Fundamental issue: Computational methods for policy
learning typically require samples on the order of 100k − 10M.
I =⇒ Infeasible to optimize in the emulation system
24. 11/30
From Emulation to Simulation: System Identification
R1
m1
m2 m3
m4
m5
m6
m7
m1,1 . . . m1,k
h i
m5,1 . . . m5,k
h i
m6,1 . . . m6,k
h i
m2,1
.
.
.
m2,k
m3,1
.
.
.
m3,k
m7,1
.
.
.
m7,k
m4,1
.
.
.
m4,k
Emulated Network Abstract Model POMDP Model
hS, A, P, R, γ, O, Zi
a1 a2 a3 . . .
s1 s2 s3 . . .
o1 o2 o3 . . .
I Abstract Model Based on Domain Knowledge: Models
the set of controls, the objective function, and the features of
the emulated network.
I Defines the static parts a POMDP model.
I Dynamics Model (P, Z) Identified using System
Identification: Algorithm based on random walks and
maximum-likelihood estimation.
M(b0
|b, a) ,
n(b, a, b0)
P
j0 n(s, a, j0)
25. 11/30
From Emulation to Simulation: System Identification
R1
m1
m2 m3
m4
m5
m6
m7
m1,1 . . . m1,k
h i
m5,1 . . . m5,k
h i
m6,1 . . . m6,k
h i
m2,1
.
.
.
m2,k
m3,1
.
.
.
m3,k
m7,1
.
.
.
m7,k
m4,1
.
.
.
m4,k
Emulated Network Abstract Model POMDP Model
hS, A, P, R, γ, O, Zi
a1 a2 a3 . . .
s1 s2 s3 . . .
o1 o2 o3 . . .
I Abstract Model Based on Domain Knowledge: Models
the set of controls, the objective function, and the features of
the emulated network.
I Defines the static parts a POMDP model.
I Dynamics Model (P, Z) Identified using System
Identification: Algorithm based on random walks and
maximum-likelihood estimation.
M(b0
|b, a) ,
n(b, a, b0)
P
j0 n(s, a, j0)
26. 11/30
From Emulation to Simulation: System Identification
R1
m1
m2 m3
m4
m5
m6
m7
m1,1 . . . m1,k
h i
m5,1 . . . m5,k
h i
m6,1 . . . m6,k
h i
m2,1
.
.
.
m2,k
m3,1
.
.
.
m3,k
m7,1
.
.
.
m7,k
m4,1
.
.
.
m4,k
Emulated Network Abstract Model POMDP Model
hS, A, P, R, γ, O, Zi
a1 a2 a3 . . .
s1 s2 s3 . . .
o1 o2 o3 . . .
I Abstract Model Based on Domain Knowledge: Models
the set of controls, the objective function, and the features of
the emulated network.
I Defines the static parts a POMDP model.
I Dynamics Model (P, Z) Identified using System
Identification: Algorithm based on random walks and
maximum-likelihood estimation.
M(b0
|b, a) ,
n(b, a, b0)
P
j0 n(s, a, j0)
28. 13/30
System Identification: Estimated Dynamics Model
−5 0 5 10 15 20
# New TCP/UDP Connections
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
New TCP/UDP Connections
0 10 20 30 40 50
# New Failed Login Attempts
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
New Failed Login Attempts
−30 −20 −10 0 10 20 30
# Created User Accounts
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
Created User Accounts
−0.04 −0.02 0.00 0.02 0.04
# New Logged in Users
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
New Logged in Users
−0.04 −0.02 0.00 0.02 0.04
# Login Events
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
New Login Events
0 20 40 60 80 100 120 140
# Created Processes
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
Created Processes
Node IP: 172.18.4.2
(b0, a0) (b1, a0) ...
29. 14/30
System Identification: Estimated Dynamics Model
0 20 40 60 80 100 120
# IDS Alerts
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
IDS Alerts
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0
# Severe IDS Alerts
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
Severe IDS Alerts
0 20 40 60 80 100
# Warning IDS Alerts
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
Warning IDS Alerts
0 50 100 150 200 250
# IDS Alert Priorities
0.0
0.2
0.4
0.6
0.8
1.0
P[·|(b
i
,
a
i
)]
IDS Alert Priorities
IDS Dynamics
(b0, a0) (b1, a0) ...
30. 15/30
Policy Optimization in the Simulation System
using Reinforcement Learning
I Goal:
I Approximate π∗
= arg maxπ E
hPT
t=0 γt
rt+1
i
I Learning Algorithm:
I Represent π by πθ
I Define objective J(θ) = Eo∼ρπθ ,a∼πθ
[R]
I Maximize J(θ) by stochastic gradient ascent with
gradient
∇θJ(θ) = Eo∼ρπθ ,a∼πθ
[∇θ log πθ(a|o)Aπθ
(o, a)]
I Domain-Specific Challenges:
I Partial observability
I Large state space |S| = (w + 1)|N|·m·(m+1)
I Large action space |A| = |N| · (m + 1)
I Non-stationary Environment due to presence of
adversary
I Generalization
Agent
Environment
at
st+1
rt+1
31. 15/30
Policy Optimization in the Simulation System
using Reinforcement Learning
I Goal:
I Approximate π∗
= arg maxπ E
hPT
t=0 γt
rt+1
i
I Learning Algorithm:
I Represent π by πθ
I Define objective J(θ) = Eo∼ρπθ ,a∼πθ
[R]
I Maximize J(θ) by stochastic gradient ascent with
gradient
∇θJ(θ) = Eo∼ρπθ ,a∼πθ
[∇θ log πθ(a|o)Aπθ
(o, a)]
I Domain-Specific Challenges:
I Partial observability
I Large state space |S| = (w + 1)|N|·m·(m+1)
I Large action space |A| = |N| · (m + 1)
I Non-stationary Environment due to presence of
adversary
I Generalization
Agent
Environment
at
st+1
rt+1
32. 15/30
Policy Optimization in the Simulation System
using Reinforcement Learning
I Goal:
I Approximate π∗
= arg maxπ E
hPT
t=0 γt
rt+1
i
I Learning Algorithm:
I Represent π by πθ
I Define objective J(θ) = Eo∼ρπθ ,a∼πθ
[R]
I Maximize J(θ) by stochastic gradient ascent with
gradient
∇θJ(θ) = Eo∼ρπθ ,a∼πθ
[∇θ log πθ(a|o)Aπθ
(o, a)]
I Domain-Specific Challenges:
I Partial observability
I Large state space |S| = (w + 1)|N|·m·(m+1)
I Large action space |A| = |N| · (m + 1)
I Non-stationary Environment due to presence of
adversary
I Generalization
Agent
Environment
at
st+1
rt+1
33. 15/30
Policy Optimization in the Simulation System
using Reinforcement Learning
I Goal:
I Approximate π∗
= arg maxπ E
PT
t=0
γt
rt+1
I Learning Algorithm:
I Represent π by πθ
I Define objective J(θ) = Eo∼ρπθ ,a∼πθ
[R]
I Maximize J(θ) by stochastic gradient ascent with gradient
∇θJ(θ) = Eo∼ρπθ ,a∼πθ
[∇θ log πθ(a|o)Aπθ (o, a)]
I Domain-Specific Challenges:
I Partial observability
I Large state space |S| = (w + 1)|N |·m·(m+1)
I Large action space |A| = |N | · (m + 1)
I Non-stationary Environment due to presence of adversary
I Generalization
I Finding Effective Security Strategies through
Reinforcement Learning and Self-Playa
a
Kim Hammar and Rolf Stadler. “Finding Effective Security Strategies through Reinforcement Learning and
Self-Play”. In: International Conference on Network and Service Management (CNSM 2020) (CNSM 2020). Izmir,
Turkey, Nov. 2020.
Agent
Environment
at
st+1
rt+1
34. 16/30
Our Method for Finding Effective Security Strategies
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Real world
Infrastructure
Model Creation
System Identification
Policy Mapping
π
Selective
Replication
Policy
Implementation π
Simulation System
Reinforcement Learning
Generalization
Policy evaluation
Model estimation
Automation
Self-learning systems
35. 17/30
Learning Capture-the-Flag Strategies: Target Infrastructure
I Topology:
I 32 Servers, 1 IDS (Snort), 3 Clients
I Services
I 1 SNMP, 1 Cassandra, 2 Kafka, 8 HTTP, 1 DNS, 1 SMTP, 2 NTP, 5
IRC, 1 Teamspeak, 1 MongoDB, 1 Samba, 1 RethinkDB, 1
CockroachDB, 2 Postgres, 3 FTP, 15 SSH, 2 FTP
I Vulnerabilities
I 2 CVE-2010-0426, 2 CVE-2010-0426, 1 CVE-2015-3306, 1
CVE-2015-5602, 1 CVE-2016-10033, 1 CVE-2017-7494, 1
CVE-2014-6271
I 5 Brute-force vulnerabilities
I Operating Systems
I 14 Ubuntu-20, 9 Ubuntu-14, 1 Debian 9:2, 2 Debian Wheezy, 5
Debian Jessie, 1 Kali
I Traffic
I FTP, SSH, IRC, SNMP, HTTP, Telnet, IRC, Postgres, MongoDB,
Samba
I curl, ping, tracerotue, nmap..
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
36. 18/30
Learning Capture-the-Flag Strategies: System Model 1/3
I A hacker (pentester) has T time-periods to
collect flags hidden in the infrastructure.
I The hacker is located at a dedicated starting
position N0 and can connect to a gateway
that exposes public-facing services in the
infrastructure.
I The hacker has a pre-defined set
(cardinality ∼ 200) of network/shell
commands available.
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
37. 19/30
Learning Capture-the-Flag Strategies: System Model 2/3
I By execution of commands, the hacker
collects information
I Open ports, failed/successful exploits,
vulnerabilities, costs, OS, ...
I Sequences of commands can yield
shell-access to nodes
I Given shell access, the hacker can search
for flags
I Associated with each command is a cost c
(execution time) and noise n (IDS alerts).
I The objective is to capture all flags with the
minimal cost within the fixed time horizon T.
What strategy achieves this end?
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
38. 19/30
Learning Capture-the-Flag Strategies: System Model 2/3
I By execution of commands, the hacker
collects information
I Open ports, failed/successful exploits,
vulnerabilities, costs, OS, ...
I Sequences of commands can yield
shell-access to nodes
I Given shell access, the hacker can search
for flags
I Associated with each command is a cost c
(execution time) and noise n (IDS alerts).
I The objective is to capture all flags with the
minimal cost within the fixed time horizon T.
What strategy achieves this end?
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
39. 20/30
Learning Capture-the-Flag Strategies: System Model 3/3
I Contextual Stochastic CTF with Partial
Information
I Model infrastructure as a graph G = hN, Ei
I There are k flags at nodes C ⊆ N
I Ni ∈ N has a node state si of m attributes
I Network state
s = {sA, si | i ∈ N} ∈ R|N|×m+|N|
I Hacker observes oA
⊂ s
I Action space: A = {aA
1 , . . . , aA
k }, aA
i
(commands)
I ∀(b, a) ∈ A × S, there is a probability ~
w
A,(x)
i,j of
failure a probability of detection
ϕ(det(si ) · n
A,(x)
i,j )
I State transitions s → s0
are decided by a
discrete dynamical system s0
= F(s, a)
I Exact dynamics (F, cA
, nA
, wA
, det(·), ϕ(·)),
are unknown to us!
N0, ~
S0
N1, ~
S1
N2, ~
S2
N3, ~
S3
N4, ~
S4
N5, ~
S5 N6, ~
S6
N7, ~
S7
~
cA
2,3, ~
nA
2,3, ~
wA
2,3
~
cA
3,2, ~
nA
3,2, ~
wA
3,2
~
cA
2,1, ~
nA
2,1, ~
wA
2,1
~
cA
1,2, ~
nA
1,2, ~
wA
1,2
~
c
A
0,1
,
~
n
A
0,1
,
~
w
A
0,1
~
cA
0,2, ~
nA
0,2, ~
wA
0,2
~
cA
3,6, ~
nA
3,6, ~
wA
3,6
~
cA
6,7, ~
nA
6,7, ~
wA
6,7
~
cA
7,4, ~
nA
7,4, ~
wA
7,4
~
c
A
4
,6
,
~
n
A
4
,6
,
~
w
A
4
,6
~
c
A
4
,5
,
~
n
A
4
,5
,
~
w
A
4
,5
~
c
A
5
,4
,
~
n
A
5
,4
,
~
w
A
5
,4
~
cA
1,5, ~
nA
1,5, ~
wA
1,5
~
c
A
5,2
,
~
n
A
5,2
,
~
w
A
5,2
~
c
A
2,
4
,
~
n
A
2,
4
,
~
w
A
2,
4
~
c A
1,4 , ~
n A
1,4 , ~
w A
1,4
Si ∈ Rm, ~
cA
i,j ∈ Rk
+
~
nA
i,j ∈ Rk
+, ~
wA
i,j ∈ [0, 1]k
~
cD ∈ Rl
+, ~
wD ∈ [0, 1]l
Graphical Model.
40. 20/30
Learning Capture-the-Flag Strategies: System Model 3/3
I Contextual Stochastic CTF with Partial
Information
I Model infrastructure as a graph G = hN, Ei
I There are k flags at nodes C ⊆ N
I Ni ∈ N has a node state si of m attributes
I Network state
s = {sA, si | i ∈ N} ∈ R|N|×m+|N|
I Hacker observes oA
⊂ s
I Action space: A = {aA
1 , . . . , aA
k }, aA
i
(commands)
I ∀(b, a) ∈ A × S, there is a probability ~
w
A,(x)
i,j of
failure a probability of detection
ϕ(det(si ) · n
A,(x)
i,j )
I State transitions s → s0
are decided by a
discrete dynamical system s0
= F(s, a)
I Exact dynamics (F, cA
, nA
, wA
, det(·), ϕ(·)),
are unknown to us!
N0, ~
S0
N1, ~
S1
N2, ~
S2
N3, ~
S3
N4, ~
S4
N5, ~
S5 N6, ~
S6
N7, ~
S7
~
cA
2,3, ~
nA
2,3, ~
wA
2,3
~
cA
3,2, ~
nA
3,2, ~
wA
3,2
~
cA
2,1, ~
nA
2,1, ~
wA
2,1
~
cA
1,2, ~
nA
1,2, ~
wA
1,2
~
c
A
0,1
,
~
n
A
0,1
,
~
w
A
0,1
~
cA
0,2, ~
nA
0,2, ~
wA
0,2
~
cA
3,6, ~
nA
3,6, ~
wA
3,6
~
cA
6,7, ~
nA
6,7, ~
wA
6,7
~
cA
7,4, ~
nA
7,4, ~
wA
7,4
~
c
A
4
,6
,
~
n
A
4
,6
,
~
w
A
4
,6
~
c
A
4
,5
,
~
n
A
4
,5
,
~
w
A
4
,5
~
c
A
5
,4
,
~
n
A
5
,4
,
~
w
A
5
,4
~
cA
1,5, ~
nA
1,5, ~
wA
1,5
~
c
A
5,2
,
~
n
A
5,2
,
~
w
A
5,2
~
c
A
2,
4
,
~
n
A
2,
4
,
~
w
A
2,
4
~
c A
1,4 , ~
n A
1,4 , ~
w A
1,4
Si ∈ Rm, ~
cA
i,j ∈ Rk
+
~
nA
i,j ∈ Rk
+, ~
wA
i,j ∈ [0, 1]k
~
cD ∈ Rl
+, ~
wD ∈ [0, 1]l
Graphical Model.
43. 23/30
Learning to Detect Network Intrusions: Target
Infrastructure
I Topology:
I 6 Servers, 1 IDS (Snort), 3 Clients
I Services
I 3 SSH, 2 HTTP, 1 DNS, 1 Telnet, 1 FTP, 1 MongoDB, 2
SMTP, 1 Tomcat, 1 Teamspeak3, 1 SNMP, 1 IRC, 1 Postgres,
1 NTP
I Vulnerabilities
I 1 CVE-2010-0426, 3 Brute-force vulnerabilities
I Operating Systems
I 4 Ubuntu-20, 1 Ubuntu-14, 1 Kali
I Traffic
I FTP, SSH, IRC, SNMP, HTTP, Telnet, IRC, Postgres,
MongoDB,
I curl, ping, tracerotue, nmap..
Attacker Client 1 Client 2 Client 3
Defender
R1
Evaluation infrastructure.
44. 24/30
Learning to Detect Network Intrusions: System Model
(1/3)
I An admin should manage the
infrastructure for T time-periods.
I The admin can monitor the
infrastructure to get a belief about it’s
state bt
I b1, . . . , bT−1 can be assumed to be
generated from some unknown
distribution ϕ.
I If the admin suspects that the
infrastructure is being intruded based on
bt, he can suspend the suspicious
user/traffic.
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
45. 24/30
Learning to Detect Network Intrusions: System Model
(1/3)
I An admin should manage the
infrastructure for T time-periods.
I The admin can monitor the
infrastructure to get a belief about it’s
state bt
I b1, . . . , bT−1 can be assumed to be
generated from some unknown
distribution ϕ.
I If the admin suspects that the
infrastructure is being intruded based on
bt, he can suspend the suspicious
user/traffic.
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
46. 25/30
Learning to Detect Network Intrusions: System Model
(2/3)
I Suspending traffic from a true intrusion
yields a reward r (salary bonus)
I Not suspending traffic of a true
intrusion, incurs a cost c (admin is fired)
I Suspending traffic of a false intrusion,
incurs a cost of o (breaking the SLA)
I The objective is to to decide an optimal
response for suspending network traffic.
What strategy achieves this end?
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
47. 25/30
Learning to Detect Network Intrusions: System Model
(2/3)
I Suspending traffic from a true intrusion
yields a reward r (salary bonus)
I Not suspending traffic of a true
intrusion, incurs a cost c (admin is fired)
I Suspending traffic of a false intrusion,
incurs a cost of o (breaking the SLA)
I The objective is to to decide an optimal
response for suspending network traffic.
What strategy achieves this end?
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
48. 26/30
Learning to Detect Network Intrusions: System Model
(3/3)
I Optimal Stopping Problem
I Action space A = {STOP, CONTINUE}
I Belief state space B ∈ R8+10·m
I A belief state b ∈ B contains relevant
metrics to detect intrusions
I Alerts from IDS, Entries in
/var/log/auth, logged in users, TCP
connections, processes, ...
I Reward function R
I r(bt, STOP, st) = 1intrusion
β
ti
I β is a positive constant and ti is the
number of nodes compromised by the
attacker
I =⇒ incentive to detect intrusion early.
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
49. 27/30
Structural Properties of the Optimal Policy
I Assumptions: Always an intrusion before T, f (bt): probability of
intrusion given bt, bt and p are Markov, f (bt) is non-decreasing in t.
I Claim: Optimal policy is a threshold based policy
I Necessary condition for optimality (Bellman):
ut(bt) = sup
a
rt(bt, a) +
X
b0∈B
pt(b0
|bt, a)ut+1(b0
, a)
#
(1)
I Thus I have that it is optimal to stop at state bt iff
f (bt) ·
β
ti
≥
X
b0∈B
ϕ(b0
)ut+1(b0
) (2)
I Stopping threshold αt:
αt ,
ti
β
X
b0∈B
ϕ(b0
)ut+1(b0
) (3)
50. 27/30
Structural Properties of the Optimal Policy
I Assumptions: Always an intrusion before T, f (bt): probability of
intrusion given bt, bt and p are Markov, f (bt) is non-decreasing in t.
I Claim: Optimal policy is a threshold based policy
I Necessary condition for optimality (Bellman):
ut(bt) = sup
a
rt(bt, a) +
X
b0∈B
pt(b0
|bt, a)ut+1(b0
, a)
#
(4)
= max
f (bt) ·
β
ti
,
X
b0∈B
ϕ(b0
)ut+1(b0
)
#
(5)
I Thus I have that it is optimal to stop at state bt iff
f (bt) ·
β
ti
≥
X
b0∈B
ϕ(b0
)ut+1(b0
) (6)
I Stopping threshold αt:
αt ,
ti
β
X
b0∈B
ϕ(b0
)ut+1(b0
) (7)
51. 27/30
Structural Properties of the Optimal Policy
I Assumptions: Always an intrusion before T, f (bt): probability of
intrusion given bt, bt and p are Markov, f (bt) is non-decreasing in t.
I Claim: Optimal policy is a threshold based policy
I Necessary condition for optimality (Bellman):
ut(bt) = sup
a
rt(bt, a) +
X
b0∈B
pt(b0
|bt, a)ut+1(b0
, a)
#
(8)
= max
f (bt) ·
β
ti
,
X
b0∈B
ϕ(b0
)ut+1(b0
)
#
(9)
I Thus I have that it is optimal to stop at state bt iff
f (bt) ·
β
ti
≥
X
b0∈B
ϕ(b0
)ut+1(b0
) (10)
I Stopping threshold αt:
αt ,
ti
β
X
b0∈B
ϕ(b0
)ut+1(b0
) (11)
52. 27/30
Structural Properties of the Optimal Policy
I Assumptions: Always an intrusion before T, f (bt): probability of
intrusion given bt, bt and p are Markov, f (bt) is non-decreasing in t.
I Claim: Optimal policy is a threshold based policy
I Necessary condition for optimality (Bellman):
ut(bt) = sup
a
rt(bt, a) +
X
b0∈B
pt(b0
|bt, a)ut+1(b0
, a)
#
(12)
= max
f (bt) ·
β
ti
,
X
b0∈B
ϕ(b0
)ut+1(b0
)
#
(13)
I Thus I have that it is optimal to stop at state bt iff
f (bt) ·
β
ti
≥
X
b0∈B
ϕ(b0
)ut+1(b0
) (14)
I Stopping threshold αt:
αt ,
ti
β
X
b0∈B
ϕ(b0
)ut+1(b0
) (15)
55. 30/30
Conclusions Future Work
I Conclusions:
I We develop a method to find effective strategies for intrusion
prevention
I (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement
learning and (5) domain randomization and generalization.
I We show that self-learning can be successfully applied to
network infrastructures.
I Self-play reinforcement learning in Markov security game
I Key challenges: stable convergence, sample efficiency,
complexity of emulations, large state and action spaces
I Our research plans:
I Improving the system identification algorithm generalization
I Evaluation on real world infrastructures