Preventing Abuse Using Unsupervised
Learning
Spark + AI Summit
2020-06-22 to 2020-06-26
Grace Tang
Senior Staff Machine
Learning Engineer
gtang@linkedin.com
James Verbus
Staff Machine Learning
Engineer
jverbus@linkedin.com
Anti-Abuse AI
Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
• Labels: few “ground truth” labels for model training or evaluation
• Signal: limited signal for individual fake accounts
• Adversarial: attackers are very quick to adapt and evolve
Unsupervised Outlier
Detection:
Isolation Forests
Isolation Forest
• Proposed by Liu et al. in 2008
• Ensemble of randomly created binary trees
• Tree structure captures multi-dimensional feature distribution
• Future instances can be scored
[Figure: outliers are easier to isolate; inliers are harder to isolate]
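For reference, the scoring rule from the original isolation forest paper (Liu et al.) can be written as below. The notation is the paper's, not the slides': h(x) is the path length of instance x in one tree, E[h(x)] its average over the ensemble, n the number of instances sampled per tree, and H(i) the harmonic number.

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}},
\qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n},
\qquad
H(i) \approx \ln i + 0.5772
```

Scores close to 1 correspond to short average paths (likely outliers), while scores well below 0.5 correspond to long paths (likely inliers).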
Isolation Forest Advantages
• Performant: a top performer in recent benchmarks
• Scalable: low computation and memory complexity
• Fewer assumptions: no assumptions about data distributions or distance metrics
• Widely used: active academic research and industry use
Normal Day
[Figure: isolation forest score vs. number of user actions of interest. Annotation: a cluster of real members using automation tools with similar behavior]
Behavior of two example real accounts using automation tools
[Figure: automated user actions over time, one panel per account. Annotation: ~30 user actions performed at a constant rate]
Fake Account Attack Day
[Figure: isolation forest score vs. number of user actions of interest. Annotation: a major fake account attack using abusive automation]
Behavior of two example accounts from the attack
[Figure: automated user actions over time, one panel per account]
Potential Isolation Forest Use Cases
• Account takeover
• Payment fraud
• Automation detection
• Insider threats / intrusion detection
• Alerting on time-series data
• Data center monitoring
• Advanced persistent threats
• ML health assurance
Open Source isolation-forest
Library @ LinkedIn
• Scala/Spark
• Developed by LinkedIn Anti-Abuse AI
• Distributed training and scoring
• Compatible with spark.ml
• Open sourced and available on GitHub
• Artifacts in JCenter and Maven Central
https://github.com/linkedin/isolation-forest
https://engineering.linkedin.com/blog/2019/isolation-forest
Isolation Forest Acknowledgements
• Thank you to Frank Astier and Ahmed Metwally for providing advice and performing code
reviews.
• Thank you to Will Chan, Jason Chang, Milinda Lakkam, and Xin Wang for providing useful
feedback as the first users of this library.
Graph Clustering
Variations of Clustering
1. Group By Clustering
• Refers to grouping accounts by shared signals.
• Example: accounts created on the same IP
address on the same day.
• Computationally cheap.
• Effective at catching basic bulk fake accounts but
brittle against more sophisticated bad actors.
2. Graph Clustering
• Refers to grouping accounts by similar signals.
• Example: accounts liking similar content (fake
engagement).
• Computationally expensive.
• Effective at catching more sophisticated bad
actors.
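A minimal sketch of the cheaper variant, group-by clustering, assuming each account is a plain dictionary. The field names (`id`, `ip`, `signup_date`) and the `min_size` cutoff are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict


def group_by_clusters(accounts, keys, min_size=2):
    """Group account ids that share exactly the same values for `keys`."""
    groups = defaultdict(list)
    for account in accounts:
        signature = tuple(account[key] for key in keys)
        groups[signature].append(account["id"])
    # Keep only signatures shared by at least `min_size` accounts.
    return {sig: ids for sig, ids in groups.items() if len(ids) >= min_size}


if __name__ == "__main__":
    accounts = [
        {"id": 1, "ip": "203.0.113.7", "signup_date": "2020-06-01"},
        {"id": 2, "ip": "203.0.113.7", "signup_date": "2020-06-01"},
        {"id": 3, "ip": "198.51.100.2", "signup_date": "2020-06-01"},
    ]
    print(group_by_clusters(accounts, keys=("ip", "signup_date")))
```

Because the match is exact, an attacker only needs to vary one grouped signal (for example, rotate IP addresses) to escape the cluster, which is the brittleness noted above. Graph clustering relaxes the exact match to a similarity measure.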
Behavioral Graph Clustering
1. Find pairs of members that have high content engagement similarity
• Example: one member likes {article 1, article 3, article 6}; another likes {article 2, article 3, article 6}
• Use optimizations to avoid exhaustive pairwise comparisons
• Similarity defined using Jaccard (Ruzicka): $\mathrm{sim}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$
• Apply a threshold
2. Cluster the similarity graph using Jarvis-Patrick algorithm
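Step 1 can be sketched as follows, assuming each member's engagement is a dictionary mapping a content id to a non-negative weight (1 for a like). The function names are hypothetical, and this version compares every pair by brute force; it does not include the comparison-avoiding optimizations the slides mention.

```python
from itertools import combinations


def ruzicka_similarity(x, y):
    """Weighted Jaccard: sum of element-wise minima over sum of maxima."""
    keys = set(x) | set(y)
    numerator = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    denominator = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 0.0


def similar_pairs(engagement, threshold):
    """Return {(member_a, member_b): similarity} for pairs at or above threshold."""
    edges = {}
    for a, b in combinations(sorted(engagement), 2):
        similarity = ruzicka_similarity(engagement[a], engagement[b])
        if similarity >= threshold:
            edges[(a, b)] = similarity
    return edges


if __name__ == "__main__":
    engagement = {
        "member_x": {"article 1": 1, "article 3": 1, "article 6": 1},
        "member_y": {"article 2": 1, "article 3": 1, "article 6": 1},
    }
    print(similar_pairs(engagement, threshold=0.5))
```

For the two example members, two articles are shared out of four distinct articles, so the similarity is 2 / 4 = 0.5. The surviving pairs form the edges of the similarity graph that is clustered in step 2.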
Jarvis-Patrick
Clustering
• Fast (Parallelizable & Non-Iterative)
• Deterministic
• No need to predefine # of clusters
• Clusters do not overlap
• Accommodates clusters of variable size
Behavioral Graph Clustering: Fake Accounts
[Figure: similarity graph of fake account clusters; color scale shows similarity from 0.70 to 1.00]
Behavioral Graph Clustering: Real Accounts
[Figure: similarity graph of real account clusters labeled Book Club, Cryptocurrency Interest Group (label appears three times), and Employees of a Company (label appears twice); color scale shows similarity from 0.70 to 1.00]
Unsupervised + Supervised
• Isolation Forest and Graph Clustering: accounts with similar behavior
• Supervised model or heuristics: fake account clusters, hacked account clusters, & real account clusters
• Actions:
  • Take down fake account clusters
  • Secure and reinstate hacked account clusters
  • No action or educate real account clusters
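The hand-off from the supervised stage to enforcement can be sketched as a simple lookup. Everything here is illustrative: the label strings, the `classify` callable (a stand-in for the supervised model or heuristics), and the function name are assumptions, not the production design.

```python
# Action per cluster label, following the three outcomes named on the slide.
ACTIONS = {
    "fake": "take down",
    "hacked": "secure and reinstate",
    "real": "no action or educate",
}


def route_clusters(clusters, classify):
    """Label each cluster with `classify` and attach the matching action."""
    decisions = []
    for cluster in clusters:
        label = classify(cluster)
        if label not in ACTIONS:
            raise ValueError(f"unknown cluster label: {label!r}")
        decisions.append((cluster, label, ACTIONS[label]))
    return decisions


if __name__ == "__main__":
    # Hypothetical labels standing in for a trained model's predictions.
    labels = {("a1", "a2"): "fake", ("b1", "b2"): "real"}
    for decision in route_clusters(list(labels), classify=labels.get):
        print(decision)
```

Keeping the label-to-action mapping separate from the classifier means the unsupervised and supervised stages can change without touching the enforcement policy.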
Conclusions
Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
• Labels: few “ground truth” labels for model training or evaluation
  Solution: Unsupervised learning is a natural fit for challenges with few labels
• Signal: limited signal for individual fake accounts
  Solution: Our techniques catch groups of camouflaged bad accounts due to shared behavioral patterns
• Adversarial: attackers are very quick to adapt and evolve
  Solution: As long as attacker behavior is different than normal user behavior, it can be detected
Behavioral Graph Clustering Acknowledgements
• The behavioral graph clustering library was created by Ahmed Metwally.
• Thank you to Frank Astier and Ted Hwa for providing advice and performing code reviews.
• Thank you to Adam Jacoby for creating the first production application using the behavioral
graph clustering library.
Thank you
Extra slides
Isolation Tree
• Randomly sample data for isolation in a binary tree
• Randomly choose split feature and split value
• Repeat until all instances are isolated
• Score is derived from path length from root to leaf node
• New instances can be scored by tracing their path through the trained tree
[Figure: outliers are easier to isolate; inliers are harder to isolate]
Isolation Forest
Ensemble of isolation trees
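The isolation tree steps and the ensemble can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the LinkedIn isolation-forest library: the function names, dictionary-based tree nodes, and default parameters are hypothetical, and the depth cap and score normalization follow the original Liu et al. paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n) from the paper: expected depth of an unsuccessful BST search."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Split on a random feature and value until isolated or max depth."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    if low == high:  # simplification: stop rather than try another feature
        return {"size": len(points)}
    split = rng.uniform(low, high)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point, depth=0):
    """Trace a point from root to leaf; adjust for unsplit leaf contents."""
    if "feature" not in tree:
        return depth + average_path_length(tree["size"])
    side = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[side], point, depth + 1)


def fit_forest(points, num_trees=100, sample_size=256, seed=0):
    """Train an ensemble of isolation trees on random subsamples."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(num_trees)
    ]
    return trees, sample_size


def anomaly_score(forest, point):
    """Score in (0, 1); higher means easier to isolate (more outlier-like)."""
    trees, sample_size = forest
    mean_depth = sum(path_length(t, point) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / average_path_length(sample_size))


if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    forest = fit_forest(data + [(8.0, 8.0)])
    print(anomaly_score(forest, (8.0, 8.0)), anomaly_score(forest, (0.0, 0.0)))
```

A point far from the bulk of the data is separated after only a few random splits, so its average path is short and its score is higher than that of a point in the dense region.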
Jarvis-Patrick Variations
Vanilla
• Two nodes can only be clustered if they are in
each other’s J nearest neighbors
Modified
• Two nodes can only be clustered if their nearest
neighbors have a minimum similarity
• J = the number of nearest neighbors to consider for each node
• K = the minimum number of nearest neighbors that must be in common for two nodes to be clustered together
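A minimal sketch of the classic Jarvis-Patrick rule using J and K as defined above. It assumes the input is a dictionary of thresholded similarity edges `{(node_a, node_b): similarity}`; the function name and input format are hypothetical, and the behavioral graph clustering library described in the talk may differ (for example, in how ties or shared-neighbor counts are handled).

```python
def jarvis_patrick(edges, j, k):
    """Cluster nodes that are mutual J-nearest neighbors sharing >= K neighbors."""
    # Collect each node's candidate neighbors with their similarities.
    candidates = {}
    for (a, b), similarity in edges.items():
        candidates.setdefault(a, []).append((similarity, b))
        candidates.setdefault(b, []).append((similarity, a))

    # Keep the J most similar neighbors; ties are broken by node id so the
    # result is deterministic.
    nearest = {
        node: {other for _, other in sorted(c, key=lambda t: (-t[0], t[1]))[:j]}
        for node, c in candidates.items()
    }

    # Union-find over nodes; a single pass over the edges, no iteration
    # to convergence.
    parent = {node: node for node in nearest}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        mutual = a in nearest[b] and b in nearest[a]
        shared = len(nearest[a] & nearest[b])
        if mutual and shared >= k:
            parent[find(a)] = find(b)

    clusters = {}
    for node in nearest:
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    edges = {
        ("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9,
        ("c", "d"): 0.7, ("e", "f"): 0.8,
    }
    print(jarvis_patrick(edges, j=2, k=1))
```

The number of clusters falls out of the neighbor rule rather than being chosen up front, and each node ends in exactly one cluster, which matches the non-overlapping property listed for the algorithm.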
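[Figure: nodes x and y with neighbors a, b, c, d, drawn once for each variation; x and y are each represented as a binary vector over a, b, c, d with three entries set to 1 (column positions lost in extraction)]
$\mathrm{sim}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$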

More Related Content

What's hot

Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA
 

What's hot (20)

Tag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and RangerTag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and Ranger
 
MySQL Performance Schema in Action: the Complete Tutorial
MySQL Performance Schema in Action: the Complete TutorialMySQL Performance Schema in Action: the Complete Tutorial
MySQL Performance Schema in Action: the Complete Tutorial
 
Access Security - Privileged Identity Management
Access Security - Privileged Identity ManagementAccess Security - Privileged Identity Management
Access Security - Privileged Identity Management
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
MySQL Security
MySQL SecurityMySQL Security
MySQL Security
 
Disaster Recovery using AWS -Architecture blueprints
Disaster Recovery using AWS -Architecture blueprintsDisaster Recovery using AWS -Architecture blueprints
Disaster Recovery using AWS -Architecture blueprints
 
[오픈소스컨설팅] EFK Stack 소개와 설치 방법
[오픈소스컨설팅] EFK Stack 소개와 설치 방법[오픈소스컨설팅] EFK Stack 소개와 설치 방법
[오픈소스컨설팅] EFK Stack 소개와 설치 방법
 
Understand AWS Pricing
Understand AWS PricingUnderstand AWS Pricing
Understand AWS Pricing
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
AWS Best Practices Version 2
AWS Best Practices Version 2AWS Best Practices Version 2
AWS Best Practices Version 2
 
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
Data Con LA 2022 - Pre- Recorded - OpenSearch: Everything You Need to Know Ab...
 
An Active Case Study on Insider Threat Detection in your Applications
An Active Case Study on Insider Threat Detection in your ApplicationsAn Active Case Study on Insider Threat Detection in your Applications
An Active Case Study on Insider Threat Detection in your Applications
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
How LEGO.com Accelerates With Serverless
How LEGO.com Accelerates With ServerlessHow LEGO.com Accelerates With Serverless
How LEGO.com Accelerates With Serverless
 

Similar to Preventing Abuse Using Unsupervised Learning

لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
Egyptian Engineers Association
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousing
Vaishnavi
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
anjanishah774
 

Similar to Preventing Abuse Using Unsupervised Learning (20)

Towards Responsible AI - KC.pptx
Towards Responsible AI - KC.pptxTowards Responsible AI - KC.pptx
Towards Responsible AI - KC.pptx
 
Building responsible AI models in Azure Machine Learning.pptx
Building responsible AI models in Azure Machine Learning.pptxBuilding responsible AI models in Azure Machine Learning.pptx
Building responsible AI models in Azure Machine Learning.pptx
 
التقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتباتالتقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتبات
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised Learning
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 
Towards Responsible AI - NY.pptx
Towards Responsible AI - NY.pptxTowards Responsible AI - NY.pptx
Towards Responsible AI - NY.pptx
 
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
 
AI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptxAI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptx
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerce
 
Cluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptxCluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptx
 
Cluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptxCluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptx
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousing
 
CrowdInG_learning_from_crowds.pptx
CrowdInG_learning_from_crowds.pptxCrowdInG_learning_from_crowds.pptx
CrowdInG_learning_from_crowds.pptx
 
High time to add machine learning to your information security stack
High time to add machine learning to your information security stackHigh time to add machine learning to your information security stack
High time to add machine learning to your information security stack
 
Supervised Machine Learning
Supervised Machine LearningSupervised Machine Learning
Supervised Machine Learning
 
Summit EU Machine Learning
Summit EU Machine LearningSummit EU Machine Learning
Summit EU Machine Learning
 
Artificial Intelligence Primer
Artificial Intelligence PrimerArtificial Intelligence Primer
Artificial Intelligence Primer
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
 
lect1.ppt
lect1.pptlect1.ppt
lect1.ppt
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

Preventing Abuse Using Unsupervised Learning

  • 1. Preventing Abuse Using Unsupervised Learning Spark + AI Summit 2020-06-22 to 2020-06-26 Grace Tang Senior Staff Machine Learning Engineer gtang@linkedin.com James Verbus Staff Machine Learning Engineer jverbus@linkedin.com Anti-Abuse AI
  • 2. Unique Challenges in Anti-Abuse Unsupervised learning is a solution Labels Few “ground truth” labels for model training or evaluation 2
  • 3. Unique Challenges in Anti-Abuse Unsupervised learning is a solution Labels Few “ground truth” labels for model training or evaluation Signal Limited signal for individual fake accounts 3
  • 4. Unique Challenges in Anti-Abuse Unsupervised learning is a solution Labels Few “ground truth” labels for model training or evaluation Signal Limited signal for individual fake accounts Adversarial Attackers are very quick to adapt and evolve 4
  • 6. Isolation Forest: proposed by Liu et al. in 2008; an ensemble of randomly created binary trees; the tree structure captures the multi-dimensional feature distribution; future instances can be scored. Outliers are easier to isolate; inliers are harder to isolate.
  • 25. Isolation Forest Advantages. Performant: a top performer in recent benchmarks. Scalable: low computation and memory complexity. Fewer assumptions: no assumptions about data distributions or distance metrics. Widely used: active academic research and industry use.
  • 26. Normal Day. (Figure axes: number of user actions of interest; isolation forest score.)
  • 27. Normal Day. A cluster of real members using automation tools with similar behavior. (Figure axes: number of user actions of interest; isolation forest score.)
  • 28. Behavior of two example real accounts using automation tools: ~30 user actions performed at a constant rate. (Figure axes: time; automated user actions.)
  • 29. Normal Day. (Figure axes: number of user actions of interest; isolation forest score.)
  • 30. Fake Account Attack Day. (Figure axes: number of user actions of interest; isolation forest score.)
  • 31. Fake Account Attack Day. A major fake account attack using abusive automation. (Figure axes: number of user actions of interest; isolation forest score.)
  • 32. Behavior of two example accounts from the attack. (Figure axes: time; automated user actions.)
  • 33. Potential Isolation Forest Use Cases: account takeover; alerting on time-series data; payment fraud; data center monitoring; automation detection; advanced persistent threats; insider threats / intrusion detection; ML health assurance.
  • 34. Open Source isolation-forest Library @ LinkedIn: written in Scala/Spark; developed by LinkedIn Anti-Abuse AI; distributed training and scoring; compatible with spark.ml; open sourced and available on GitHub; artifacts in JCenter and Maven Central. https://github.com/linkedin/isolation-forest https://engineering.linkedin.com/blog/2019/isolation-forest
  • 35. Isolation Forest Acknowledgements. Thank you to Frank Astier and Ahmed Metwally for providing advice and performing code reviews. Thank you to Will Chan, Jason Chang, Milinda Lakkam, and Xin Wang for providing useful feedback as the first users of this library.
  • 37. Variations of Clustering. 1. Group By Clustering: groups accounts by shared signals (example: accounts created on the same IP address on the same day); computationally cheap; effective at catching basic bulk fake accounts but brittle against more sophisticated bad actors.
  • 38. Variations of Clustering. 1. Group By Clustering: groups accounts by shared signals (example: accounts created on the same IP address on the same day); computationally cheap; effective at catching basic bulk fake accounts but brittle against more sophisticated bad actors. 2. Graph Clustering: groups accounts by similar signals (example: accounts liking similar content, i.e. fake engagement); computationally expensive; effective at catching more sophisticated bad actors.
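
A minimal sketch of group-by clustering, using the slide's example of accounts created on the same IP address on the same day. The record fields (`id`, `ip`, `signup_date`), the `min_size` cutoff, and the sample data are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from collections import defaultdict


def group_by_clusters(accounts, min_size=3):
    """Group accounts that share the same (ip, signup_date) key.

    min_size is an assumed cutoff: smaller groups are dropped.
    """
    groups = defaultdict(list)
    for account in accounts:
        key = (account["ip"], account["signup_date"])
        groups[key].append(account["id"])
    # Keep only groups large enough to look like bulk account creation.
    return {key: ids for key, ids in groups.items() if len(ids) >= min_size}


# Hypothetical sample data (documentation-range IP addresses).
accounts = [
    {"id": "a1", "ip": "203.0.113.7", "signup_date": "2020-06-22"},
    {"id": "a2", "ip": "203.0.113.7", "signup_date": "2020-06-22"},
    {"id": "a3", "ip": "203.0.113.7", "signup_date": "2020-06-22"},
    {"id": "b1", "ip": "198.51.100.4", "signup_date": "2020-06-22"},
    {"id": "a4", "ip": "203.0.113.7", "signup_date": "2020-06-23"},
]
clusters = group_by_clusters(accounts)
```

The exact-match key is what makes this approach cheap, and also what makes it brittle: an attacker who rotates IP addresses or spreads signups across days falls out of every group.
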
  • 39. Behavioral Graph Clustering. 1. Find pairs of members that have high content engagement similarity. Example: one member likes {article 1, article 3, article 6} and another likes {article 2, article 3, article 6}.
  • 40. Behavioral Graph Clustering. 1. Find pairs of members that have high content engagement similarity. Use optimizations to avoid comparisons.
  • 41. Behavioral Graph Clustering. 1. Find pairs of members that have high content engagement similarity. Similarity is defined using the Jaccard (Ruzicka) index: $\mathrm{sim}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$
  • 42. Behavioral Graph Clustering. 1. Find pairs of members that have high content engagement similarity. Apply a threshold.
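
Step 1 can be sketched as follows: compute the Jaccard (Ruzicka) similarity, the sum of element-wise minima divided by the sum of element-wise maxima, for each pair of members, then keep the pairs at or above a threshold. The engagement vectors reuse the article sets from slide 39; the function names, the threshold of 0.5, and `member_3` are assumptions made for illustration. This sketch compares every pair by brute force and does not include the optimizations the talk mentions for avoiding comparisons.

```python
from itertools import combinations


def ruzicka_similarity(x, y):
    """Weighted Jaccard: sum of minima over sum of maxima across all items."""
    items = set(x) | set(y)
    numerator = sum(min(x.get(i, 0), y.get(i, 0)) for i in items)
    denominator = sum(max(x.get(i, 0), y.get(i, 0)) for i in items)
    return numerator / denominator if denominator else 0.0


def similarity_edges(engagement, threshold):
    """Return (member_a, member_b, similarity) for pairs at or above threshold."""
    edges = []
    for a, b in combinations(sorted(engagement), 2):
        sim = ruzicka_similarity(engagement[a], engagement[b])
        if sim >= threshold:
            edges.append((a, b, sim))
    return edges


# Engagement counts per member; a like is encoded as 1.
engagement = {
    "member_1": {"article 1": 1, "article 3": 1, "article 6": 1},
    "member_2": {"article 2": 1, "article 3": 1, "article 6": 1},
    "member_3": {"article 9": 1},  # hypothetical unrelated member
}
edges = similarity_edges(engagement, threshold=0.5)
```

For the two members from slide 39, two articles are shared out of four distinct articles, so the similarity is 2/4 = 0.5 and the pair survives a 0.5 threshold.
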
  • 43. Behavioral Graph Clustering. 1. Find pairs of members that have high content engagement similarity. 2. Cluster the similarity graph using the Jarvis-Patrick algorithm.
  • 44. Jarvis-Patrick Clustering: fast (parallelizable and non-iterative); deterministic; no need to predefine the number of clusters; clusters do not overlap; accommodates clusters of variable size.
  • 45. Behavioral Graph Clustering: Fake Accounts. (Figure: similarity graph with a similarity scale from 0.70 to 1.00.)
  • 46. Behavioral Graph Clustering: Real Accounts. (Figure: similarity graph with a similarity scale from 0.70 to 1.00; clusters labeled Book Club, Cryptocurrency Interest Group (three clusters), and Employees of a Company (two clusters).)
  • 47. Unsupervised + Supervised. Isolation Forest and Graph Clustering → accounts with similar behavior.
  • 48. Unsupervised + Supervised. Isolation Forest and Graph Clustering → accounts with similar behavior → supervised model or heuristics → fake account clusters, hacked account clusters, and real account clusters.
  • 49. Unsupervised + Supervised. Isolation Forest and Graph Clustering → accounts with similar behavior → supervised model or heuristics → fake account clusters, hacked account clusters, and real account clusters. Actions: take down fake account clusters; secure and reinstate hacked account clusters; no action or educate real account clusters.
  • 51. Unique Challenges in Anti-Abuse (unsupervised learning is a solution). Labels: few “ground truth” labels for model training or evaluation. Solution: unsupervised learning is a natural fit for challenges with few labels.
  • 52. Unique Challenges in Anti-Abuse (unsupervised learning is a solution). Labels: few “ground truth” labels for model training or evaluation. Solution: unsupervised learning is a natural fit for challenges with few labels. Signal: limited signal for individual fake accounts. Solution: our techniques catch groups of camouflaged bad accounts due to shared behavioral patterns.
  • 53. Unique Challenges in Anti-Abuse (unsupervised learning is a solution). Labels: few “ground truth” labels for model training or evaluation. Solution: unsupervised learning is a natural fit for challenges with few labels. Signal: limited signal for individual fake accounts. Solution: our techniques catch groups of camouflaged bad accounts due to shared behavioral patterns. Adversarial: attackers are very quick to adapt and evolve. Solution: as long as attacker behavior differs from normal user behavior, it can be detected.
  • 54. Behavioral Graph Clustering Acknowledgements. The behavioral graph clustering library was created by Ahmed Metwally. Thank you to Frank Astier and Ted Hwa for providing advice and performing code reviews. Thank you to Adam Jacoby for creating the first production application using the behavioral graph clustering library.
  • 55. Thank you. Grace Tang, Senior Staff Machine Learning Engineer (gtang@linkedin.com); James Verbus, Staff Machine Learning Engineer (jverbus@linkedin.com). Anti-Abuse AI.
  • 57. Isolation Tree: randomly sample data for isolation in a binary tree; randomly choose a split feature and split value; repeat until all instances are isolated. Outliers are easier to isolate; inliers are harder to isolate.
  • 58. Isolation Tree: the score is derived from the path length from the root to the leaf node; new instances can be scored by tracing their path through the trained tree. Outliers are easier to isolate; inliers are harder to isolate.
  • 59. Isolation Forest: an ensemble of isolation trees.
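
The isolation tree and forest steps above can be sketched in plain Python. This is an illustrative toy, not the API of the LinkedIn isolation-forest library (which is Scala/Spark). The depth cap, the leaf-size adjustment, and the score formula 2^(-E[h]/c(n)) follow the original Liu et al. (2008) paper rather than anything stated on the slides; all function names and the synthetic data are assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Split on a random feature and random value until isolated or depth-capped."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    low = min(row[feature] for row in rows)
    high = max(row[feature] for row in rows)
    if low == high:  # simplification: stop if the chosen feature cannot split
        return {"size": len(rows)}
    split = rng.uniform(low, high)
    left = [row for row in rows if row[feature] < split]
    right = [row for row in rows if row[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, row, depth=0):
    """Trace a row from root to leaf; unsplit leaves add the c(size) adjustment."""
    if "size" in tree:
        return depth + average_path_length(tree["size"])
    child = tree["left"] if row[tree["feature"]] < tree["split"] else tree["right"]
    return path_length(child, row, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Train an ensemble of isolation trees, each on a random subsample."""
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(size)) if size > 1 else 0
    trees = [build_tree(rng.sample(rows, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size


def anomaly_score(trees, size, row):
    """Score in (0, 1); shorter average paths give scores closer to 1."""
    mean_path = sum(path_length(tree, row) for tree in trees) / len(trees)
    return 2.0 ** (-mean_path / average_path_length(size))


# Synthetic data: 200 inliers around the origin plus one distant outlier.
data_rng = random.Random(1)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
rows.append((10.0, 10.0))
trees, size = fit_forest(rows)
outlier_score = anomaly_score(trees, size, (10.0, 10.0))
inlier_score = anomaly_score(trees, size, (0.0, 0.0))
```

Because the distant point is separated from the rest after very few random splits, its average path is short and its score is higher than that of a point in the dense region.
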
  • 60. Jarvis-Patrick Variations. Vanilla: two nodes can only be clustered if they are in each other’s J nearest neighbors. Modified: two nodes can only be clustered if their nearest neighbors have a minimum similarity. J = the number of nearest neighbors to consider for each node. K = the minimum number of nearest neighbors that must be in common for two nodes to be clustered together. (Figure: nodes x and y with neighbors a, b, c, d, shown as a graph and as neighbor vectors.) $\mathrm{sim}(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$
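
A minimal sketch of the vanilla Jarvis-Patrick rule with the J and K parameters defined above: two nodes are merged when each is in the other's J nearest neighbors and they share at least K of those neighbors. The union-find bookkeeping, the function name, and the example similarity values are assumptions for illustration; this is not the behavioral graph clustering library described in the talk, and it does not implement the modified variant.

```python
def jarvis_patrick(similarity, j, k):
    """Cluster nodes given similarity as {node: {neighbor: similarity}}."""
    # J nearest neighbors per node, ranked by descending similarity
    # (ties broken by name so the result is deterministic).
    nearest = {
        node: set(sorted(nbrs, key=lambda n: (-nbrs[n], n))[:j])
        for node, nbrs in similarity.items()
    }
    parent = {node: node for node in similarity}

    def find(node):
        # Union-find root lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a in similarity:
        for b in nearest[a]:
            b_nearest = nearest.get(b, set())
            mutual = a in b_nearest
            shared = len(nearest[a] & b_nearest)
            if mutual and shared >= k:
                parent[find(a)] = find(b)

    clusters = {}
    for node in similarity:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical thresholded similarity graph: x and y share neighbors a, b, c.
similarity = {
    "x": {"y": 0.9, "a": 0.8, "b": 0.8, "c": 0.7},
    "y": {"x": 0.9, "a": 0.8, "b": 0.7, "c": 0.8},
    "a": {"x": 0.8, "y": 0.8},
    "b": {"x": 0.8, "y": 0.7},
    "c": {"x": 0.7, "y": 0.8},
    "d": {},
}
clusters = jarvis_patrick(similarity, j=4, k=2)
```

Here x and y are mutual nearest neighbors with three shared neighbors, so they merge. Each of a, b, and c shares only one neighbor with x or y, which is below K = 2, so they remain singletons. The procedure is a single pass with no iteration to convergence and needs no preset number of clusters.
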