The Office 365 intrusion detection team uses graphs to link alerts and incorporate low-fidelity observations without overwhelming our analysts. In this talk, we describe how we represent alerts in the graph, how we use the structure of the graph to determine which alerts should be reviewed by our analysts, and how we rank subgraphs to ensure that the most important activity is reviewed first. We also discuss approaches we are investigating next to get even more value out of our alert graph.
2. The challenge
Windows server intrusion detection in Office 365
Security event logs from hundreds of thousands of servers
Logs contain system activity such as deployment, upgrades, and engineer troubleshooting
Analysis and response performed by security engineering team
Graphs help us succeed at scale and in detail
Review alerts in context, not in isolation
Prioritize investigation according to risk
Incorporate low-fidelity signals without overwhelming analysts
3. Detection pipeline
Detection inputs
Process and user behavior from built-in Windows audit events
Per-process network activity, DNS lookups
Windows internal subsystem activity via ETW monitoring
Detection results
Stored in a flexible-schema columnar database (Azure Data Explorer)
Column values are normalized to enforce common semantics across results
Classified according to the fidelity of the detection
4.
5. Building the graph
Three steps
Extract entities that represent “pivots” between detection results
Link each result to the entities it contains and insert these into the graph
If an entity already exists from a prior step, use it
Forms a hypergraph that links related results together
Resulting graph is sparsely connected and easy to visualize
Algorithm is O(n) and trivial to implement in JavaScript, C#, etc.
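The three steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the team's implementation: the dictionary layout, the `PIVOT_FIELDS` column names, and the `build_graph` helper are hypothetical, and entities are keyed by value so that the same value reached through different columns reuses one node. The sample values are the ones shown on the example slides in this deck.

```python
from collections import defaultdict

# Hypothetical normalized columns that act as pivots between results.
PIVOT_FIELDS = ("process", "user", "hostname", "remote_host")


def build_graph(results):
    """Link each detection result to the entities it contains.

    Returns two adjacency maps (result -> entities, entity -> results).
    Each result is visited once, so the work grows linearly with the
    number of results.
    """
    result_entities = defaultdict(set)
    entity_results = defaultdict(set)
    for result in results:
        for field in PIVOT_FIELDS:
            value = result.get(field)
            if value is None:
                continue
            # An entity seen in an earlier result is reused, not duplicated.
            result_entities[result["id"]].add(value)
            entity_results[value].add(result["id"])
    return result_entities, entity_results


results = [
    {"id": "r1", "type": "anomalousdll", "process": "rundll32.exe",
     "user": "svc_sql11", "hostname": "CFE110095"},
    {"id": "r2", "type": "procupload", "process": "rundll32.exe",
     "remote_host": "40.114.40.133", "hostname": "CFE110095"},
    {"id": "r3", "type": "largetransfer", "process": "sqlagent.exe",
     "remote_host": "40.114.40.133", "hostname": "SQL11006"},
]

result_entities, entity_results = build_graph(results)
for entity, linked in sorted(entity_results.items()):
    print(entity, sorted(linked))
```

In this sketch an entity linked to more than one result (here `CFE110095`, `rundll32.exe`, and `40.114.40.133`) behaves like a hyperedge that ties those results together.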
6. Building the graph
Anomalous DLL rundll32.exe launched as svc_sql11 on CFE110095
New process uploading rundll32.exe to 40.114.40.133 on CFE110095
Large transfer 50MB to 40.114.40.133 from sqlagent.exe on SQL11006
7. Building the graph
Anomalous DLL rundll32.exe launched as svc_sql11 on CFE110095
Entities: detection type anomalousdll, process rundll32.exe, user svc_sql11, hostname CFE110095
New process uploading rundll32.exe to 40.114.40.133 on CFE110095
Entities: detection type procupload, process rundll32.exe, hostname 40.114.40.133, hostname CFE110095
Large transfer 50MB to 40.114.40.133 from sqlagent.exe on SQL11006
Entities: detection type largetransfer, hostname 40.114.40.133, process sqlagent.exe, hostname SQL11006
Graph nodes: anomalousdll, procupload, largetransfer, svc_sql11, CFE110095, rundll32.exe, 40.114.40.133, sqlagent.exe, SQL11006
8.
9. Graph clustering
Each cluster represents an “incident”
Detection results with entities in common that tell a story
Analysts view and triage all results in the cluster together
View cluster results in tabular form for increased density and detail
Identical clusters are merged together
Define similarity by the types of detection results each cluster contains
Collapses the long tail of small clusters caused by environment-wide changes
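One way to picture the clustering and merging described above is the sketch below. The deck does not name the clustering algorithm, so this assumes a cluster is a connected component of the result/entity graph and that two clusters count as identical when they contain the same set of detection types. The helper names and the `newservice`, `HOSTA`, and `HOSTB` values are hypothetical.

```python
from collections import defaultdict


def find_clusters(result_entities):
    """Group results that share entities, directly or transitively."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for rid, entities in result_entities.items():
        find(("result", rid))  # register results that have no entities
        for entity in entities:
            # Union the result node with each entity node it links to.
            parent[find(("result", rid))] = find(("entity", entity))

    clusters = defaultdict(set)
    for kind, name in list(parent):
        if kind == "result":
            clusters[find((kind, name))].add(name)
    return list(clusters.values())


def merge_identical(clusters, result_types):
    """Merge clusters whose sets of detection types are the same."""
    merged = defaultdict(set)
    for cluster in clusters:
        signature = frozenset(result_types[rid] for rid in cluster)
        merged[signature] |= cluster
    return dict(merged)


result_entities = {
    "r1": {"rundll32.exe", "svc_sql11", "CFE110095"},
    "r2": {"rundll32.exe", "40.114.40.133", "CFE110095"},
    "r3": {"sqlagent.exe", "40.114.40.133", "SQL11006"},
    "r4": {"HOSTA"},  # hypothetical environment-wide change, server A
    "r5": {"HOSTB"},  # same change, server B
}
result_types = {"r1": "anomalousdll", "r2": "procupload",
                "r3": "largetransfer", "r4": "newservice", "r5": "newservice"}

clusters = find_clusters(result_entities)
merged = merge_identical(clusters, result_types)
```

Here `r4` and `r5` share no entities, so they start as two single-result clusters, and the merge step folds them into one because their detection-type sets match.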
10. Cluster scoring
Clusters must meet a criterion to be eligible for triage
One result classified as alert or atomic
Two unique detection types classified as behavioral
Score based on detection and entity uniqueness
Points assigned to each distinct detection type in the cluster
Divided by number of distinct machines emitting that detection type
Multiplied together to generate an overall cluster score
Down-votes systemic behavior and up-votes clusters with many unique detections
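The eligibility check and scoring rule above might look something like this sketch. It reads the two eligibility conditions as alternatives, and it takes the per-type machine counts as an input because the slide does not say whether machines are counted within the cluster or across the environment. The point values, machine counts, and fidelity classes assigned to the example detections are made up for illustration.

```python
from math import prod


def is_eligible(classified):
    """classified: (detection_type, fidelity_class) pairs for one cluster."""
    if any(cls in ("alert", "atomic") for _, cls in classified):
        return True
    behavioral = {dtype for dtype, cls in classified if cls == "behavioral"}
    return len(behavioral) >= 2


def score_cluster(detection_types, points, machine_counts):
    """Multiply, over distinct detection types, points / distinct machines."""
    return prod(points[t] / machine_counts[t] for t in set(detection_types))


# Hypothetical points per detection type and machines emitting each type.
points = {"anomalousdll": 10, "procupload": 5, "largetransfer": 4, "newservice": 10}
machine_counts = {"anomalousdll": 1, "procupload": 1, "largetransfer": 2,
                  "newservice": 500}

incident = ["anomalousdll", "procupload", "largetransfer"]
systemic = ["newservice"]

incident_score = score_cluster(incident, points, machine_counts)
systemic_score = score_cluster(systemic, points, machine_counts)
print(incident_score, systemic_score)
```

With these made-up numbers the three-detection cluster scores far higher than the single detection that fires on 500 machines, which is the down-vote and up-vote behavior the slide describes.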
11. Cluster-based actions
Alerting for high-scoring clusters
In-memory graph ingests new detection results and triage decisions
Scores each cluster, persists cluster snapshot as JSON, exposes REST API
Emits a high-fidelity alert when cluster score reaches a threshold
Automated triage for environment-wide behavior
“Time-travel triage” identifies activity that occurs across many servers
Adds a rule to suppress future alerts and a detection result to inform analysts
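The deck does not describe how automated triage is implemented, so the following is only a guess at its general shape. It groups past results by a hypothetical behavior key (detection type plus process), and when that behavior appears on at least `min_hosts` distinct servers it emits a suppression rule plus an informational result for analysts. The rule format, the threshold, and the `newservice` and `setup.exe` values are all assumptions.

```python
from collections import defaultdict


def time_travel_triage(results, min_hosts=50):
    """Return (suppression_rules, informational_results) for widespread behavior."""
    hosts_by_behavior = defaultdict(set)
    for r in results:
        hosts_by_behavior[(r["type"], r["process"])].add(r["hostname"])

    rules, notices = [], []
    for (dtype, process), hosts in hosts_by_behavior.items():
        if len(hosts) >= min_hosts:
            # Suppress future alerts for this behavior...
            rules.append({"type": dtype, "process": process, "action": "suppress"})
            # ...and leave a result in the graph so analysts can see why.
            notices.append({"type": "autotriage",
                            "detail": f"{dtype}/{process} seen on {len(hosts)} hosts"})
    return rules, notices


# Hypothetical data: one behavior on 60 servers, one isolated detection.
results = [{"type": "newservice", "process": "setup.exe", "hostname": f"HOST{i:03d}"}
           for i in range(60)]
results.append({"type": "anomalousdll", "process": "rundll32.exe",
                "hostname": "CFE110095"})

rules, notices = time_travel_triage(results, min_hosts=50)
```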
12. Opportunities
Time-series analysis
Updated cluster snapshots are written every 5 minutes
Can we visualize progression over time or score based on rate of change?
Improved cluster scoring
Can we use statistics to boost influence of detections that rarely fire?
Can we categorize detections by kill chain stage and look for in-time-order traversal?
Can we use ML to identify detection types that typically fire together?
13. Bonus
Same technique can be applied to customer audit logs
Are privileged operations being performed across many resources?
Are specific IP addresses responsible for a high number of access attempts?
Are sensitive documents being accessed in bulk by a single user?
Example using O365 audit logs and Power BI: aka.ms/auditgraph
Graph-based exploratory data analysis on user behavior
Great opportunity to help customers get more value out of their audit logs
Would love to see someone make this a point-and-click integration with O365
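As a rough illustration of the audit-log questions above, the sketch below treats each audit record as a link between an entity (a user or an IP address) and a target (a document), then counts how many distinct targets each entity touches. The field names are hypothetical and are not the actual O365 audit log schema. The records are synthetic and the threshold is arbitrary.

```python
from collections import defaultdict


def entity_degrees(records, entity_field, target_field):
    """Count distinct targets linked to each entity (its degree in the graph)."""
    targets = defaultdict(set)
    for rec in records:
        targets[rec[entity_field]].add(rec[target_field])
    return {entity: len(linked) for entity, linked in targets.items()}


# Synthetic records: one user reads 40 documents, another reads one.
audit = [{"user": "alice", "client_ip": "203.0.113.7", "document": f"doc{i}"}
         for i in range(40)]
audit.append({"user": "bob", "client_ip": "198.51.100.2", "document": "doc1"})

docs_per_user = entity_degrees(audit, "user", "document")
docs_per_ip = entity_degrees(audit, "client_ip", "document")
bulk_readers = [user for user, n in docs_per_user.items() if n >= 25]
print(bulk_readers)
```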
Matt is a Principal Engineering Manager in the OneDrive and SharePoint team at Microsoft. He drove the security development process for SharePoint 2010 and 2013, then built a team focused on cloud security for SharePoint Online. Matt is passionate about intrusion detection, incident response and catching adversaries. When he’s not catching bad guys, you can find him at home with his kids or hiking in Washington's beautiful Cascades.