This presentation explores why it is so hard to build a security monitoring (or, as we might call it, security intelligence) approach that finds sophisticated attackers in all the data we collect. It asks how to visualize a billion events. To answer that question, the presentation dives deeply into heatmaps, that is, matrices, as an example of a simple type of visualization. Although heatmaps are very simple, they are versatile and help us think about the problem of security visualization. They illustrate how data mining and user experience design help us get a handle on the challenges of security visualization, so that we can gain deep insight into a number of security use cases.
4. Security. Analytics. Insight.4
The (New) Threat Landscape
APT 1: Unit 61398 (61398部队)
Attacks have changed:
• Targeted
• Objectives beyond monetization
• Low and Slow
• Multiple access vectors
• Remotely controlled
Motivations have changed:
• Nation state sponsored
• Political, economic, and military advantage
• Monetization / Crimeware
• Religion
• Hacktivism
Security approaches failed due to:
• Reliance on past knowledge / signatures
• Systems are too rigid (e.g., schema)
• Poor scalability
• Limited knowledge exchange
5. Security. Analytics. Insight.5
How Compromises Are Detected
Mandiant M-Trends 2014 Threat Report
Attackers in networks before detection
27 days
229 days
Average time to resolve a cyber attack
Successful attacks per company per week
1.4
Average cost per company per year
$7.2M
8. Security. Analytics. Insight.8
Visualize 1TB of Data - What Graph?
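[Figure: several candidate graphs of the same data. Category labels: drop, reject, NONE, ctl, accept. Event-type labels: DNS Update Failed, Log In, IP Fragments, Max Flows Initiated, Packet Flood, UDP Flood, Aggressive Aging, Bootp, Renew, Log Out, Release, NACK, Conflict, DNS Update Successful, DNS record not deleted, DNS Update Request, Port Flood. Value axis ticks: 1, 10000, 100000000.]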
How much information does each of the graphs convey?
9. Security. Analytics. Insight.9
The Heatmap
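Matrix A, where the entries a_ij are integer values mapped to a color scale.
[Figure: heatmap grid of rows and columns with an example cell value of 42. Color scale for a_ij: 1, 10, 20, 30, 40, 50, 60, 70, 80, >90.]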
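The mapping from cell values to colors can be sketched as follows. This is a minimal illustration, assuming that the legend values (1, 10, 20, ..., 80, 90) are the lower bounds of the color bins; the slide does not state this. The function name `color_bin` and the sample matrix are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the color bins, read off the slide's legend.
# Treating them as lower bounds is an assumption.
THRESHOLDS = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90]

def color_bin(value):
    """Return the index of the color bin for a cell value (0 means an empty cell)."""
    return bisect_right(THRESHOLDS, value)

# Hypothetical 2 x 3 matrix A; rows and columns are whatever the heatmap shows.
A = [[0, 5, 42],
     [10, 91, 79]]

# One bin index per cell; a renderer would translate each index into a color.
bins = [[color_bin(a_ij) for a_ij in row] for row in A]
print(bins)
```

The bin index, not the raw value, is what gets translated into a color, so the choice of thresholds decides which differences remain visible.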
13. Security. Analytics. Insight.11
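Mapping Log Records to Heatmaps
May 5 23:57:50 pixl-ram sudo: pam_unix(sudo:session): session opened for user root by ram(uid=0)
[Figure: heatmap with one row per user (root, ram, peg, sue) and one column per time bin ∆t. Each log record that falls into a cell applies f() = +1 to that cell.]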
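The mapping from a log record to a heatmap cell might look like the sketch below. It is not the presenter's implementation. It assumes hourly time bins, that the row is keyed on the invoking user ("by ram") rather than the target user, and that the year is supplied separately because syslog timestamps omit it. The names `add_record`, `LINE_RE`, and `BIN_SECONDS` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern for the sudo session line shown on the slide.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ sudo: .*"
    r"session opened for user (?P<target>\w+) by (?P<user>\w+)\("
)

BIN_SECONDS = 3600          # width of one time bin (delta t); assumed value
EPOCH = datetime(1970, 1, 1)

def add_record(cells, line, year=2014):
    """Apply f() = +1 to the (user, time bin) cell that the log line falls into."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # not a line this sketch knows how to map
    # Syslog timestamps carry no year, so one is supplied (assumption).
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    time_bin = int((ts - EPOCH).total_seconds()) // BIN_SECONDS
    key = (m["user"], time_bin)
    cells[key] += 1
    return key

cells = Counter()
line = ("May 5 23:57:50 pixl-ram sudo: pam_unix(sudo:session): "
        "session opened for user root by ram(uid=0)")
key = add_record(cells, line)
print(key, cells[key])
```

Keying the row on the target user instead only requires swapping `m["user"]` for `m["target"]`.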
15. Security. Analytics. Insight.12
Why Heatmaps?
• Scales well to a lot of data (can aggregate ad infinitum)
• Shows more information than a bar chart
• Flexible ‘measure’ mapping
  • frequency count
  • sum(variable) [avg(), stddev(), …]
  • distinct count(variable)
• BUT information content is limited!
  • Aggregates too heavily along the time dimension and potentially the value dimension
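The flexible ‘measure’ mapping listed on this slide can be illustrated with a small sketch: the same grouping into cells is reused, and only the reducing function changes. The record layout and the function name `aggregate` are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical pre-binned records: (row key, time bin, variable value).
records = [
    ("ram", 0, 120), ("ram", 0, 80), ("ram", 1, 80),
    ("sue", 0, 40), ("sue", 0, 40),
]

def aggregate(records, measure):
    """Group values per (row, time bin) cell and reduce each group with `measure`."""
    groups = defaultdict(list)
    for row, time_bin, value in records:
        groups[(row, time_bin)].append(value)
    return {cell: measure(values) for cell, values in groups.items()}

frequency = aggregate(records, len)                    # frequency count
total = aggregate(records, sum)                        # sum(variable)
average = aggregate(records, statistics.mean)          # avg(variable)
distinct = aggregate(records, lambda v: len(set(v)))   # distinct count(variable)

print(frequency, total, average, distinct, sep="\n")
```

Each resulting dictionary is one heatmap over the same rows and time bins, which also shows the limitation noted above: every cell is reduced to a single number.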
17. Security. Analytics. Insight.14
Data Visualization Workflow - Overview
Heatmap
• Can pack millions of records (although highly aggregated)
• Allows for zoom-in to expose detail
• By itself exposes patterns
• Great ‘navigation’ tool to drill into different, ‘non-scalable’ visualizations
• No other visualization possesses these properties
19. Security. Analytics. Insight.16
HeatMap Challenges - Display
2. Mouse-Over
• What information to show?
  • Position - x/y coordinates
  • Original records
• Query backend for each position?
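One possible answer to the mouse-over questions is sketched below: translate the x/y position into a cell, then look up the original records in an index built during aggregation instead of querying the backend for each position. The cell size, the index structure, and the names `pixel_to_cell` and `tooltip` are assumptions for illustration.

```python
from collections import defaultdict

CELL_WIDTH, CELL_HEIGHT = 10, 5  # size of one heatmap cell in pixels (hypothetical)

def pixel_to_cell(x, y):
    """Translate mouse coordinates into (row index, time bin index)."""
    return y // CELL_HEIGHT, x // CELL_WIDTH

# Index built while aggregating: cell -> original records that fell into it.
records_by_cell = defaultdict(list)
records_by_cell[(1, 2)].append(
    "May 5 23:57:50 pixl-ram sudo: pam_unix(sudo:session): "
    "session opened for user root by ram(uid=0)"
)

def tooltip(x, y, max_records=5):
    """Collect what a mouse-over could show: position, count, and a few records."""
    cell = pixel_to_cell(x, y)
    originals = records_by_cell.get(cell, [])
    return {"cell": cell, "count": len(originals), "records": originals[:max_records]}

print(tooltip(25, 7))
```

Keeping such an index in memory only works while the number of original records is small; for larger data the lookup would have to go to the backend after all.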
20. Security. Analytics. Insight.17
HeatMap Challenges - Display
3. Sorting
• Random
• Alphabetically
• Based on values
• Similarity
  • What algorithm?
  • What distance metric?
• Leverage third data field / context?
[Figure: the same heatmap of users, once with random row order and once with rows clustered]
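As one simple answer to "What algorithm?" and "What distance metric?", the sketch below orders rows greedily by Euclidean distance so that similar rows end up next to each other. This is a stand-in chosen for brevity, not the clustering used in the presentation; the user rows and the function names are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two rows of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def order_by_similarity(rows):
    """Greedy ordering: start with the first row, then always append the closest remaining row."""
    names = list(rows)
    ordered = [names.pop(0)]
    while names:
        last = rows[ordered[-1]]
        nearest = min(names, key=lambda name: distance(rows[name], last))
        names.remove(nearest)
        ordered.append(nearest)
    return ordered

# Hypothetical activity counts per user and time bin.
rows = {
    "root": [9, 0, 8, 0],
    "peg":  [0, 7, 0, 6],
    "ram":  [8, 1, 9, 0],
    "sue":  [1, 6, 0, 7],
}
print(order_by_similarity(rows))
```

The greedy pass is quadratic in the number of rows and depends on the starting row, so it is only a starting point; hierarchical clustering or another metric may suit real data better.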
21. Security. Analytics. Insight.18
HeatMap Challenges - Display
4. Overplotting
• How to summarize multiple rows in one pixel?
  • Sum?
• Overplot x and y axes?
• Undo overplot on zoom?
[Figure: 1 row -> 1 pixel; n rows -> 1 pixel (combined with ∑); 1 row -> m pixels]
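The "n rows -> 1 pixel" case can be sketched as below, with the summary function left open so that sum and max can be compared. The sketch assumes there are at least as many data rows as pixel rows; the "1 row -> m pixels" case is not handled. The function name `downsample_rows` is hypothetical.

```python
def downsample_rows(matrix, pixel_rows, combine=sum):
    """Collapse the rows of `matrix` into `pixel_rows` rows, combining values column by column."""
    n = len(matrix)
    if not 0 < pixel_rows <= n:
        raise ValueError("this sketch needs 0 < pixel_rows <= number of rows")
    out = []
    for p in range(pixel_rows):
        # Contiguous slice of data rows that share pixel row p.
        group = matrix[p * n // pixel_rows:(p + 1) * n // pixel_rows]
        out.append([combine(column) for column in zip(*group)])
    return out

# Hypothetical heatmap with 4 rows and 2 time bins, drawn on 2 pixel rows.
matrix = [[1, 0], [2, 1], [0, 5], [3, 3]]
summed = downsample_rows(matrix, 2)
peaks = downsample_rows(matrix, 2, combine=max)
print(summed, peaks)
```

Summing preserves totals but can hide a single busy row among quiet ones; taking the maximum does the opposite. Undoing the overplot on zoom amounts to calling the function again with a larger `pixel_rows` for the visible slice.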
22. Security. Analytics. Insight.19
HeatMap Challenges - Interaction
1. Time Selection
• Take screen resolution into account (you have 1000 pixels and you query 1005 seconds?)
• Choose start AND end time?
• Communicate to user what data is available?
[Figure: time selector with start time and end time]
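The screen-resolution question can be made concrete with a small calculation. The sketch assumes that each time bin covers a whole number of seconds and that the end time may be pushed out to the next bin boundary; the function name `plan_time_bins` is hypothetical.

```python
import math

def plan_time_bins(start, end, pixels):
    """Return (seconds per bin, number of bins, aligned end time) for a query window."""
    span = end - start
    if span <= 0 or pixels <= 0:
        raise ValueError("need end > start and at least one pixel")
    bin_seconds = math.ceil(span / pixels)   # whole seconds per pixel column
    bins = math.ceil(span / bin_seconds)     # columns actually used
    aligned_end = start + bins * bin_seconds
    return bin_seconds, bins, aligned_end

print(plan_time_bins(0, 1005, 1000))
print(plan_time_bins(0, 1000, 1000))
```

For the slide's example, 1005 seconds on 1000 pixels yields 2-second bins and only 503 columns, so almost half of the screen width goes unused. Snapping the selection to 1000 seconds, or letting the interface pick the end time, avoids that.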
28. Security. Analytics. Insight.24
HeatMap Challenges - Backend
Different backend technologies (big data)
• Key-value store
• Search engine
• GraphDB
• RDBMS
• Columnar - can answer analytical questions
• Hadoop (MapReduce)
  • good for operations on ALL data
Other things to consider:
• Caching
• Joins
29. Security. Analytics. Insight.25
What’s the HeatMap Not Good At
• Showing relationships
  -> link graphs
• Showing multiple dimensions and their inter-relatedness
  -> parallel coordinates
31. Security. Analytics. Insight.27
Leverage Data Mining to Summarize Data
Overview -> Zoom / Filter -> Details on Demand
Overview
• Leverage data mining (clustering) to create an overview
• Summarizing dozens of dimensions into a two-dimensional overview
32. Security. Analytics. Insight.28
Self Organizing Maps
• Clustering based on a single data dimension
  • for example “attackers”
• It’s hard to
  • engineer the right features
  • avoid over-learning
  • interpret the clusters
[Figure: self-organizing map with 3 clusters, numbered 1 to 3]
34. Security. Analytics. Insight.30
[Figure: heatmap of activity per user over time; the row for user ‘vincent’ is labeled]
This heatmap shows behavior over time.

In this case, we see activity per user. We can see that ‘vincent’ is visually different from all of the other users. He shows up very lightly over the entire time period. This seems to be something to look into.

We were able to find this purely visually, without understanding the data.
46. Security. Analytics. Insight.44
Firewall Data
• Millions of rows
• High-cardinality fields
• Where to start analysis?
  • Formulate some hypotheses
  • Informs visualization process and data preparation
• Our hypotheses and assumptions
  • Machines whose traffic is both passed and blocked might be of interest
  • Low-frequency sources are not interesting

firewall data       data type    cardinality   distribution
source ip           ipv4         10-10^6       depends
dest ip             ipv4         10-10^6       depends
source port         int          65535         depends
dest port           int          65535         highly skewed
bytes in/out        int          -             skewed
action              bool / int   3             -
direction / iface   bool / str   small         -
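The two hypotheses on this slide translate directly into a data preparation step. The sketch below keeps only sources that were both passed and blocked and that appear often enough. The record layout, the threshold, and the function name `interesting_sources` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical firewall records: (source ip, action).
records = [
    ("10.0.0.1", "pass"), ("10.0.0.1", "block"), ("10.0.0.1", "pass"),
    ("10.0.0.2", "pass"), ("10.0.0.2", "pass"), ("10.0.0.2", "pass"),
    ("10.0.0.3", "block"),
    ("10.0.0.4", "block"), ("10.0.0.4", "pass"),
]

def interesting_sources(records, min_count):
    """Sources seen at least `min_count` times with both passed and blocked traffic."""
    counts = Counter(source for source, _ in records)
    actions = defaultdict(set)
    for source, action in records:
        actions[source].add(action)
    return sorted(
        source for source in counts
        if counts[source] >= min_count and {"pass", "block"} <= actions[source]
    )

print(interesting_sources(records, min_count=3))
```

Filtering before visualizing reduces the high-cardinality source field to the rows worth drawing, at the price of hiding anything the hypotheses did not anticipate.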
47. Security. Analytics. Insight.45
Visual Mapping
[Figure: heatmap with one row per source (10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.4) and one column per time bin ∆t (aggregation). Color mapping: block, pass, block & pass.]
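The visual mapping on this slide can be sketched as a function that assigns each (source, time bin) cell one of the three color classes. The record layout, the bin width, and the function name `classify_cells` are hypothetical; actions other than block and pass are ignored in this sketch.

```python
from collections import defaultdict

BIN_SECONDS = 60  # width of one time bin (delta t); hypothetical value

def classify_cells(records):
    """Map each (source, time bin) cell to 'block', 'pass', or 'block & pass'."""
    seen = defaultdict(set)
    for ts, source, action in records:
        seen[(source, ts // BIN_SECONDS)].add(action)
    labels = {}
    for cell, actions in seen.items():
        if {"block", "pass"} <= actions:
            labels[cell] = "block & pass"
        elif "block" in actions:
            labels[cell] = "block"
        elif "pass" in actions:
            labels[cell] = "pass"
    return labels

# Hypothetical records: (seconds since start of capture, source ip, action).
records = [
    (5, "10.0.0.1", "pass"),
    (42, "10.0.0.1", "block"),
    (70, "10.0.0.1", "pass"),
    (10, "10.0.0.2", "block"),
]
labels = classify_cells(records)
print(labels)
```

Unlike a count-based heatmap, this mapping is categorical: the color says which actions occurred in a cell, not how often.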
51. Security. Analytics. Insight.49
High Frequency Traffic Split Up
[Figure: two heatmaps, inbound and outbound, with one row per address: 192.168.0.201, 195.141.69.42, 195.141.69.43, 195.141.69.44, 195.141.69.45, 195.141.69.46, 212.254.110.100, 212.254.110.101, 212.254.110.107, 212.254.110.108, 212.254.110.109, 212.254.110.110, 212.254.110.98, 212.254.110.99, 62.245.245.139]
52. Security. Analytics. Insight.50
Outbound Traffic - Some Questions To Ask
• What happened mid-way through?
• Why is anything outbound blocked?
• What are the top and bottom machines doing?
• Did we get a new machine into the network?
• Did some machines go away?
[Figure: outbound traffic heatmap; the row for 195.141.69.42 is labeled]
57. Security. Analytics. Insight.55
Recap
• Attackers are very successful
• Data could reveal adversaries
• We have a big data analytics problem
• We need the right analytics and visualizations
• Security visualization is hard
• Data visualization workflow is a promising approach
• Heatmaps are great for overviews
• We need a set of heuristics and workflows