CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache Spark

Roy Levin, Microsoft

Session goals
• Present an easy-to-use framework that detects cyber-security anomalies
• Explain how recommendation systems are used to find anomalous resource access
• Show how we evaluated the framework to demonstrate its usefulness

Agenda
• Motivation
• Formulation & Models
• Scalability for Large Datasets
• Evaluation
• Summary

Centralized cloud-native Security Information & Event Management (SIEM) system
Build Your Own ML (BYOML)
1. Log data from cloud resources
2. Process logs from an Azure Databricks cluster
3. Author custom security analytics

General Anomaly Detector
[Figure: a dataset is fed to a general anomaly detector, which surfaces anomalies of several kinds: fault detection, system health monitoring, security incidents, …]
We would like to capture only security-related anomalies

Anomalous access
• Train and apply on a simple-to-construct dataset
– Avoid writing and maintaining complex rules and logic
– Avoid the need to analyze multiple complex datasets such as:
§ Org-charts
§ RBAC tables
§ Cloud architectures

• Given a user & resource pair (u, r)
• Provide an anomaly score of user u accessing resource r
• If the anomaly score is above some threshold, surface the event
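The formulation above can be sketched in a few lines, assuming the anomaly scores have already been computed. The function name `surface_anomalies`, the sample events, and the threshold value are hypothetical and only illustrate the thresholding step.

```python
from typing import Iterable, List, Tuple

ScoredEvent = Tuple[str, str, float]  # (user, resource, anomaly score)


def surface_anomalies(events: Iterable[ScoredEvent], threshold: float) -> List[ScoredEvent]:
    """Keep events whose score is strictly above the threshold, most anomalous first."""
    flagged = [event for event in events if event[2] > threshold]
    return sorted(flagged, key=lambda event: event[2], reverse=True)


# Hypothetical scored events and threshold, for illustration only.
events = [("u1", "r1", -0.3), ("u2", "r7", 7.8), ("u3", "r2", 1.1)]
flagged = surface_anomalies(events, threshold=3.0)
print(flagged)
```

Choosing the threshold is a trade-off between missed anomalies and analyst workload, so in practice it would be tuned on scored historical data.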

The straightforward approach
[Figure not recoverable]
But users access new resources quite often, so this is just not good enough

Create a profile per user and resource, and check whether an access deviates from that profile

Intuition:
• Take a recommendation system and use it for anti-recommendations

Model Training Phase
Movie Recommendations
(- = not rated)
Movie                    Roy (1)   Inbal (2)  Hasan (3)  Lior (4)   Anat (5)   Arnon (6)
1. The God Father        -         4          -          5          -          -
2. The Dark Knight       3         -          -          -          2          5
3. Pulp Fiction          5         3          5          4          4          5
4. 40 Year Old Virgin    2         4          -          -          3          3
5. Analyze That          3         5          4          -          4          -
6. Anger Management      3         5          -          -          -          5
7. Black Hawk Down       5         -          -          4          -          -

[Figure: the ratings matrix, with unknown ratings shown as "?", is factorized into two matrices. X has one row per movie (x1, x2, …, xm) and one column per latent feature (f1, f2, f3; e.g. Romance, Action, Comedy). Θ has one row per latent feature and one column per user (θ1, θ2, θ3, …). All entries of X and Θ are unknown ("?") before training.]

Model Apply Phase
[Figure: the learned movie-feature matrix X (rows x1, x2, …, xm; columns f1, f2, f3, e.g. Romance, Action, Comedy) and user-feature matrix Θ (columns θ1, θ2, θ3, …) are multiplied to fill in the unknown ratings.]
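The training and apply phases can be illustrated with a small matrix factorization on the ratings from the training slide. This is a minimal sketch using plain stochastic gradient descent with three latent features; the hyperparameters (`epochs`, `lr`, `reg`, `seed`) and helper names are assumptions chosen for illustration, not the algorithm used in the talk.

```python
import random

# Known ratings from the training slide: (movie index, user index) -> rating.
# Movies 0..6 are The God Father .. Black Hawk Down; users 0..5 are Roy .. Arnon.
ratings = {
    (0, 1): 4, (0, 3): 5,
    (1, 0): 3, (1, 4): 2, (1, 5): 5,
    (2, 0): 5, (2, 1): 3, (2, 2): 5, (2, 3): 4, (2, 4): 4, (2, 5): 5,
    (3, 0): 2, (3, 1): 4, (3, 4): 3, (3, 5): 3,
    (4, 0): 3, (4, 1): 5, (4, 2): 4, (4, 4): 4,
    (5, 0): 3, (5, 1): 5, (5, 5): 5,
    (6, 0): 5, (6, 3): 4,
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def factorize(ratings, n_movies, n_users, k=3, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Training phase: learn movie vectors X and user vectors theta from known ratings."""
    rng = random.Random(seed)
    X = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_movies)]
    theta = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    for _ in range(epochs):
        for (m, u), r in ratings.items():
            err = r - dot(X[m], theta[u])
            for f in range(k):
                x_mf, t_uf = X[m][f], theta[u][f]
                X[m][f] += lr * (err * t_uf - reg * x_mf)
                theta[u][f] += lr * (err * x_mf - reg * t_uf)
    return X, theta


X, theta = factorize(ratings, n_movies=7, n_users=6)

# Training error on the known ratings.
rmse = (sum((r - dot(X[m], theta[u])) ** 2 for (m, u), r in ratings.items()) / len(ratings)) ** 0.5

# Apply phase: predict an unknown rating, here The God Father (movie 0) for Roy (user 0).
predicted = dot(X[0], theta[0])
print(f"training RMSE: {rmse:.3f}, predicted rating: {predicted:.2f}")
```

A low predicted rating for an unseen pair is what later gets read as an "anti-recommendation".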

Back to Anomalous Resource Access

• Let us re-examine our data:
– User-resource pairs with the number of times accessed
• The standard collaborative filtering (CF) model assumes explicit item ratings, which raises some problems:
– A rating is not really what we have in the input
• Although the more often a user accesses a resource, the more likely it is that they should be allowed access
– We do not really have negative rating indications either, i.e., there is no explicit indicator saying that a user should not have access to some resource
• What we do have is missing access

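Linear Scaling
[Tables with columns user1 … user6; cells with no access are blank on the slide, so each row below lists only its non-empty values, in column order.]
Raw access counts:
resource1: 1200, 1500
resource2: 900, 301, 1
resource3: 1500, 599, 1, 902, 1205, 1500
resource4: 299, 1200, 895, 901
resource5: 601, 1500, 1200, 1203
resource6: 603, 1499, 1495
resource7: 1499, 1200
After linear scaling:
resource1: 9, 10
resource2: 8, 6, 5
resource3: 10, 7, 5, 8, 9, 10
resource4: 6, 9, 8, 8
resource5: 7, 10, 9, 9
resource6: 7, 10, 10
resource7: 10, 9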
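The scaled values on the slide are consistent with min-max scaling of the raw access counts into the range [5, 10] followed by rounding; that target range is inferred from the numbers shown, not stated in the talk. A minimal sketch under that assumption (the function name `scale_counts` is hypothetical):

```python
def scale_counts(counts, low=5.0, high=10.0):
    """Linearly map raw access counts onto [low, high] (min-max scaling)."""
    min_c, max_c = min(counts), max(counts)
    span = max_c - min_c
    if span == 0:
        # All counts are equal; map everything to the top of the range.
        return [high for _ in counts]
    return [low + (high - low) * (c - min_c) / span for c in counts]


# A few raw counts taken from the slide.
raw = [1, 299, 301, 599, 900, 1200, 1500]
scaled = [round(v) for v in scale_counts(raw)]
print(scaled)
```

Keeping the lowest observed rating well above the bottom of the scale leaves room below it for the negative samples introduced next.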

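Random Negative Samples
[Scaled table after adding random negative samples (rating 1) for user-resource pairs with no observed access; each row lists only its non-empty values, in column order.]
resource1: 1, 9, 10
resource2: 8, 1, 6, 5
resource3: 10, 7, 5, 8, 9, 10
resource4: 6, 9, 1, 8, 8
resource5: 7, 10, 9, 9
resource6: 7, 10, 10
resource7: 10, 9, 1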
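Random negative sampling can be sketched as follows: pick user-resource pairs with no observed access uniformly at random and give them the lowest rating (1, as on the slide). The function name, the sample data, and the number of samples are hypothetical.

```python
import random


def add_negative_samples(ratings, users, resources, n_samples, negative_rating=1.0, seed=0):
    """Return a copy of ratings with n_samples random unobserved pairs rated negative_rating."""
    rng = random.Random(seed)
    # Enumerating all unobserved pairs is fine for a toy example,
    # but is quadratic and would need a sampling scheme at scale.
    candidates = [(u, r) for u in users for r in resources if (u, r) not in ratings]
    chosen = rng.sample(candidates, min(n_samples, len(candidates)))
    augmented = dict(ratings)
    for pair in chosen:
        augmented[pair] = negative_rating
    return augmented


# Hypothetical scaled ratings for observed accesses.
ratings = {("u1", "r1"): 9.0, ("u2", "r2"): 10.0}
augmented = add_negative_samples(ratings, ["u1", "u2", "u3"], ["r1", "r2"], n_samples=2)
print(augmented)
```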

Adjusting for user & resource bias to create an anomaly score
[Figure: the scaled matrix with negative samples, followed by a subtraction ("−") step]
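The slide does not spell out the formula. One possible reading, shown here purely as an assumption and not as the toolkit's implementation, is to subtract the user and resource averages from each predicted rating, standardize the residuals, and flip the sign so that an unexpectedly low predicted rating yields a high anomaly score.

```python
from statistics import mean, pstdev


def anomaly_scores(predicted):
    """predicted: {(user, resource): predicted rating} -> {(user, resource): anomaly score}.

    Assumed scheme: double-center the predictions (remove user and resource bias),
    standardize the residuals, and negate them.
    """
    global_mean = mean(predicted.values())
    by_user, by_res = {}, {}
    for (u, r), p in predicted.items():
        by_user.setdefault(u, []).append(p)
        by_res.setdefault(r, []).append(p)
    user_mean = {u: mean(v) for u, v in by_user.items()}
    res_mean = {r: mean(v) for r, v in by_res.items()}
    residual = {
        (u, r): p - user_mean[u] - res_mean[r] + global_mean
        for (u, r), p in predicted.items()
    }
    sigma = pstdev(residual.values())
    if sigma == 0:
        return {pair: 0.0 for pair in residual}
    return {pair: -res / sigma for pair, res in residual.items()}


# Hypothetical predictions: every pair looks normal (9) except (u3, r3).
predicted = {(u, r): 9.0 for u in ("u1", "u2", "u3") for r in ("r1", "r2", "r3")}
predicted[("u3", "r3")] = 1.0
scores = anomaly_scores(predicted)
top_pair = max(scores, key=scores.get)
print(top_pair, round(scores[top_pair], 2))
```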

• In practice, we are given a (tenant-id, user, resource) triplet (tid, u, r)
• Provide an anomaly score of user u accessing resource r, per tenant
• Note: access within each tenant is isolated
• Goals:
– Process tenants in parallel
– Cope with data from large tenants

• Create a Pandas UDF (PUDF) that uses the Surprise* Python library to run the CF algorithm locally on each worker node
• The provided PUDF works on Pandas DataFrames that are created per group when apply is called
• The method is applied as follows:
– df.groupBy(tid_colname).apply(my_pudf)
* Surprise (Simple Python RecommendatIon System Engine): http://surpriselib.com/

• Problem: the data from some tenants may be too large to fit into the memory of a single worker node
• Solution: before applying, count the number of entries per tenant
– If the entries fit in memory, apply the PUDF method
– If not, apply Spark CF, per tenant, one by one
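The routing decision above can be sketched in plain Python: count the entries per tenant and split tenants into those handled locally by the PUDF and those handled by Spark CF. The function name `route_tenants`, the sample rows, and the `max_local_rows` limit are hypothetical; in the real pipeline the count would be a Spark aggregation.

```python
from collections import Counter


def route_tenants(rows, max_local_rows):
    """Split tenant ids by entry count: small tenants run locally, large ones distributed."""
    counts = Counter(tid for tid, _user, _resource in rows)
    local = sorted(t for t, c in counts.items() if c <= max_local_rows)
    distributed = sorted(t for t, c in counts.items() if c > max_local_rows)
    return local, distributed


# Hypothetical (tenant-id, user, resource) triplets.
rows = [("A", "u1", "r1"), ("A", "u2", "r1")] + [("B", f"u{i}", "r9") for i in range(5)]
local, distributed = route_tenants(rows, max_local_rows=3)
print(local, distributed)
```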

• Training produces a model, which is basically
– A dataframe mapping (tenant-id, user) and (tenant-id, resource) pairs to their corresponding latent feature vectors
• Applying the model requires:
– Joining with the respective user/resource to retrieve the vectors
– Applying a dot-product
* Note: the model can be applied with Structured Streaming
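The apply step can be sketched with dictionaries standing in for the model dataframes: look up the user and resource vectors for each event (the join) and take their dot product. The vector values and the choice to drop events with unseen users or resources (inner-join semantics) are assumptions for illustration.

```python
def apply_model(events, user_vectors, resource_vectors):
    """Join each (tid, user, resource) event with its latent vectors and take the dot product."""
    scored = []
    for tid, user, resource in events:
        u_vec = user_vectors.get((tid, user))
        r_vec = resource_vectors.get((tid, resource))
        if u_vec is None or r_vec is None:
            continue  # assumed inner-join behavior: unseen users/resources are dropped
        scored.append((tid, user, resource, sum(a * b for a, b in zip(u_vec, r_vec))))
    return scored


# Hypothetical latent vectors keyed by (tenant-id, user) and (tenant-id, resource).
user_vectors = {("t1", "u1"): [1.0, 2.0]}
resource_vectors = {("t1", "r1"): [3.0, 0.5]}
events = [("t1", "u1", "r1"), ("t1", "u2", "r1")]
scored = apply_model(events, user_vectors, resource_vectors)
print(scored)
```

Because the apply step is only a join and a dot product, it needs no retraining per event, which fits the note about Structured Streaming.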

Experiments for Azure Sentinel AI
1. Synthetic dataset
2. Actual file share data from a large customer
• Users accessing shared network files

Synthetic dataset
[Figure: cross-group accesses are added to the data for testing]

Results
100%, i.e., all 100 cross-group accesses receive the top-100 anomaly scores!

Actual Attack Description
[Figure: a file share SMB server whose shares are accessed from Machine 1, Machine 2, …, Machine n]
58% of companies have over 100,000 folders open to everyone within the network
(source: Varonis cybersecurity data security and analytics)

Algorithm Training
[Figure: accesses from Machine 1, Machine 2, …, Machine n to the file shares, used as training data]

Test set (2 days after training)
[Figure: accesses from Machine 1, Machine 2, …, Machine n to the file shares, used as test data]

Results
Anomaly scores per dataset:
Dataset               Mean    Stddev  Min     Max    Count
Entire test set       0.05    1.16    -19.21  8.07   3.8M
Unseen valid access   -0.28   0.38    -1.2    1.18   410
Restricted access     7.81    0.11    7.44    8.07   400

from sentinel_ai.peer_anomaly.spark_collaborative_filtering import AccessAnomaly

# AccessAnomaly is just an estimator: fit on training data, then transform test data
access_anomaly = AccessAnomaly(
    tenant_colname,
    user_colname,
    res_colname,
    score_colname
)
anom_model = access_anomaly.fit(training_dataset_scored_triplets)
scored_test_dataset_triplets = anom_model.transform(test_dataset_triplets)
scored_test_dataset_triplets.show()

https://github.com/Azure/Azure-Sentinel-BYOML

• Introduced an Access Anomaly Detection framework for cyber security and showed how it fits into the BYOML pillar of Azure Sentinel
– An anti-recommendation is an access anomaly
– The code has been open sourced
• The framework provides a simple-to-use API that allows security analysts to surface access anomalies
• Call to action: experiment with the framework, continue this line of research, and suggest and add more algorithms
