Cybercrime is one of the greatest threats to every company in the world today and a major problem for mankind in general. The damage due to cybercrime is estimated to reach around $6 trillion by 2021. Security professionals are struggling to cope with the threat, so powerful, easy-to-use tools are needed to aid in this battle. For this purpose we created a security-focused anomaly detection framework that identifies anomalous access patterns. It is built on top of Apache Spark and can be applied in parallel over multiple tenants, which allows the model to be trained on the data of thousands of customers on a Databricks cluster in less than an hour. The model leverages proven technologies from recommendation engines to produce high-quality anomalies. We thoroughly evaluated the model’s ability to identify actual anomalies, both with synthetically generated data and by staging an actual attack and showing that the model clearly identifies it as anomalous behavior. We plan to open source this library as part of a cyber-ML toolkit we will be offering.
3. Session goals
• Present an easy-to-use framework that produces cyber-security anomalies
• Explain how recommendation systems are used to find anomalous resource access
• Show how we evaluated the framework to demonstrate its usefulness
5. Centralized cloud-native Security Information & Event Management (SIEM) system
Build Your Own ML (BYOML)
1. Log data from cloud resources
2. Process logs from an Azure Databricks cluster
3. Author custom security analytics
8. anomalous access
• Train and apply on a simple-to-construct dataset
– Avoid writing and maintaining complex rules and logic
– Avoid the need to analyze multiple complex datasets such as:
§ Org-charts
§ RBAC tables
§ Cloud architectures
11. • Given user & resource pair (u, r)
• Provide an anomaly score of user u accessing resource r
• If anomaly score is above some threshold then surface the event
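The thresholding step above can be sketched in a few lines of plain Python. This is only an illustration: the function name `surface_anomalies`, the example users, resources, scores, and the threshold value are all hypothetical and do not come from the framework.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (user, resource)


def surface_anomalies(scores: Dict[Pair, float],
                      threshold: float) -> List[Tuple[Pair, float]]:
    """Return (pair, score) entries whose score exceeds the threshold, highest first."""
    flagged = [(pair, score) for pair, score in scores.items() if score > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical anomaly scores, for illustration only.
example_scores = {
    ("alice", "payroll-db"): 0.2,
    ("bob", "payroll-db"): 3.1,
    ("bob", "build-server"): -0.4,
}

flagged = surface_anomalies(example_scores, threshold=1.0)
```

Choosing the threshold is a separate tuning decision; the sketch only shows the filtering itself.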
12. The straightforward approach
But users access new resources quite often, so this is just not good enough
13. Create a profile per user and resource and see if access deviates from that profile
14. Intuition:
• Take a recommendation system and use it for anti-recommendations
21. • Let us re-examine our data:
– User-resource pairs with number of times accessed
• The standard collaborative filtering (CF) model assumes explicit item ratings, which raises some problems:
– A rating is not really what we have in the input
• Although more accesses by a user to a resource likely mean they should be allowed access
– We do not really have negative rating indications either, i.e., there is no explicit indicator saying that a user should not have access to some resource
• What we do have is missing access
27. • Actually: we are given a tenant-id, user, resource triplet (tid, u, r)
• Provide anomaly score of user u accessing resource r per-tenant
• Note: access within each tenant is isolated
• Goals:
– Process tenants in parallel
– Cope with data from large tenants
28. • Create a Pandas UDF (PUDF) which uses the Surprise Python library to run the CF algorithm locally on each worker node
• The provided PUDF works on Pandas DataFrames that are created per group when apply is called
• The method is applied as follows:
– df.groupBy(tid_colname).apply(my_pudf)
* SurPRISE: Simple Python RecommendatIon System Engine http://surpriselib.com/
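To make the per-group idea concrete without requiring a Spark cluster, here is a plain-Python stand-in. It is not the framework's PUDF and does not call Surprise: `group_apply` is a hypothetical helper that mimics what a grouped apply does (split rows by tenant, run a local function on each group), and `train_tenant_cf` is a small hand-written SGD matrix factorization used in place of the library call. All names, hyperparameters, and data are assumptions for illustration.

```python
import random
from collections import defaultdict


def train_tenant_cf(rows, rank=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit latent vectors on one tenant's (user, resource, value) rows with plain SGD.

    Stand-in for the CF library call made inside the per-group function.
    """
    rng = random.Random(seed)
    users = {u: [rng.uniform(-0.1, 0.1) for _ in range(rank)] for u, _, _ in rows}
    resources = {r: [rng.uniform(-0.1, 0.1) for _ in range(rank)] for _, r, _ in rows}
    for _ in range(epochs):
        for u, r, value in rows:
            pu, qr = users[u], resources[r]
            err = value - sum(a * b for a, b in zip(pu, qr))
            for k in range(rank):
                pu_k, qr_k = pu[k], qr[k]  # keep old values for a simultaneous update
                pu[k] += lr * (err * qr_k - reg * pu_k)
                qr[k] += lr * (err * pu_k - reg * qr_k)
    return users, resources


def group_apply(triplets, fn):
    """Mimic a grouped apply: split rows by tenant id, then run fn on each group."""
    groups = defaultdict(list)
    for tid, user, res, value in triplets:
        groups[tid].append((user, res, value))
    return {tid: fn(rows) for tid, rows in groups.items()}


# Hypothetical access data: (tenant, user, resource, value).
data = [
    ("t1", "alice", "db", 1.0),
    ("t1", "bob", "db", 1.0),
    ("t1", "bob", "vm", 1.0),
    ("t2", "carol", "wiki", 1.0),
]

models = group_apply(data, train_tenant_cf)
```

Because each group is handled by an independent function call, tenants never see each other's users or resources, which matches the isolation noted on the previous slide.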
29. • Problem: the data from some tenants may be too large to fit into
the memory of a single worker node
• Solution: before applying, count number of entries per-tenant
– If number of entries can fit in-memory then apply PUDF method
– If not, then apply Spark CF, per tenant, one-by-one
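The routing decision above amounts to counting rows per tenant and comparing against a memory budget. A minimal sketch, assuming a hypothetical row limit `max_rows_in_memory` (the slides do not say how the limit is chosen) and a hypothetical helper name:

```python
from collections import Counter


def split_tenants_by_size(triplets, max_rows_in_memory):
    """Return (small, large) tenant-id lists based on per-tenant row counts.

    small: tenants whose data is assumed to fit on one worker (PUDF path)
    large: tenants to be trained one by one with distributed CF
    """
    counts = Counter(tid for tid, _, _ in triplets)
    small = sorted(t for t, n in counts.items() if n <= max_rows_in_memory)
    large = sorted(t for t, n in counts.items() if n > max_rows_in_memory)
    return small, large


# Hypothetical (tenant, user, resource) rows.
rows = [
    ("t1", "alice", "db"),
    ("t1", "bob", "db"),
    ("t1", "bob", "vm"),
    ("t2", "carol", "wiki"),
]

small_tenants, large_tenants = split_tenants_by_size(rows, max_rows_in_memory=2)
```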
30. • Training produces a model which is basically
– A dataframe mapping (tenant-id, user) and (tenant-id, resource) pairs to
their corresponding latent feature vectors
• Applying the model requires:
– Joining with respective user/resource to retrieve vectors
– Applying a dot-product
* Note: model can be applied with Structured Streaming
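The two application steps (join, then dot-product) can be illustrated with dictionaries standing in for the model dataframes. Two assumptions are made here and are not taken from the slides: events whose user or resource has no vector are dropped (inner-join behavior), and the anomaly score is the negated dot product, so a low predicted affinity yields a high score. The vectors are made up.

```python
def score_accesses(events, user_vecs, res_vecs):
    """Look up latent vectors for each (tenant, user, resource) event and score it."""
    scored = []
    for tid, user, res in events:
        u = user_vecs.get((tid, user))
        r = res_vecs.get((tid, res))
        if u is None or r is None:
            continue  # assumed inner-join semantics: unseen users/resources are dropped
        affinity = sum(a * b for a, b in zip(u, r))
        scored.append((tid, user, res, -affinity))  # assumption: anomaly = -affinity
    return scored


# Hypothetical latent vectors keyed by (tenant-id, user) and (tenant-id, resource).
user_vecs = {("t1", "alice"): [1.0, 0.0], ("t1", "bob"): [-0.5, 1.0]}
res_vecs = {("t1", "db"): [1.0, 0.0]}

events = [("t1", "alice", "db"), ("t1", "bob", "db"), ("t1", "eve", "db")]
scored = score_accesses(events, user_vecs, res_vecs)
```

Here bob's access to `db` scores higher than alice's because his vector points away from the resource's vector.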
35. Results
Add cross-group access
100%, i.e., all 100 cross-group accesses receive top-100 anomaly scores!
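A check of this kind (do the injected accesses land among the top-scoring events?) can be sketched as a top-k hit rate. This is not the evaluation code used for the talk; the function name, pairs, and scores below are hypothetical.

```python
def top_k_hit_rate(scored, injected, k):
    """Fraction of injected (user, resource) pairs found among the k highest scores."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    top_pairs = {pair for pair, _ in ranked[:k]}
    return len(top_pairs & injected) / len(injected)


# Hypothetical scored events: ((user, resource), anomaly_score).
scored_events = [
    (("a", "r1"), 0.1),
    (("b", "r2"), 2.5),
    (("c", "r3"), 1.9),
    (("d", "r4"), -0.3),
]
injected_pairs = {("b", "r2"), ("c", "r3")}  # synthetic cross-group accesses

rate = top_k_hit_rate(scored_events, injected_pairs, k=2)
```

With k equal to the number of injected events, a rate of 1.0 corresponds to the "all injected accesses receive the top scores" outcome described above.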
36. Actual Attack Description
[Diagram labels: File Share SMB server; shares; Machine 1, Machine 2, ..., Machine n]
58% of companies have over 100,000 folders open to everyone within the network
(source: Varonis cybersecurity data security and analytics)
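41.
from sentinel_ai.peer_anomaly.spark_collaborative_filtering import AccessAnomaly

# AccessAnomaly is just an estimator, configured with the input column names
access_anomaly = AccessAnomaly(
    tenant_colname,
    user_colname,
    res_colname,
    score_colname
)

# fit() trains on the scored (tenant, user, resource) triplets and returns a model
anom_model = access_anomaly.fit(training_dataset_scored_triplets)

# transform() scores the test triplets with the trained model
scored_test_dataset_triplets = anom_model.transform(test_dataset_triplets)
scored_test_dataset_triplets.show()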
https://github.com/Azure/Azure-Sentinel-BYOML
42. • Introduced an Access Anomaly Detection framework for cyber security and showed how it fits into the BYOML pillar of Azure Sentinel
– An anti-recommendation is an access anomaly
– Code has been open sourced
• The framework provides a simple-to-use API allowing security analysts to surface access anomalies
• Call to action: experiment with the framework, continue this line of research, and suggest and add more algorithms
43. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT