Detection of abusive activity on a large social network is an adversarial challenge with quickly evolving behavior patterns and imperfect ground truth labels.
1. Preventing Abuse Using Unsupervised
Learning
Spark + AI Summit
2020-06-22 to 2020-06-26
Grace Tang
Senior Staff Machine
Learning Engineer
gtang@linkedin.com
James Verbus
Staff Machine Learning
Engineer
jverbus@linkedin.com
Anti-Abuse AI
2. Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
Labels
Few “ground truth” labels for
model training or evaluation
2
3. Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
Labels
Few “ground truth” labels for
model training or evaluation
Signal
Limited signal for individual
fake accounts
3
4. Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
Labels
Few “ground truth” labels for
model training or evaluation
Signal
Limited signal for individual
fake accounts
Adversarial
Attackers are very quick to
adapt and evolve
4
6. Isolation Forest
• Proposed by Liu et al. in 2008
• Ensemble of randomly created
binary trees
• Tree structure captures multi-
dimensional feature distribution
• Future instances can be scored
Outliers are
easier to isolate
Inliers are harder
to isolate
6
25. Isolation Forest Advantages
Performant
A top performer in
recent
benchmarks
Scalable
Low computation
and memory
complexity
Fewer assumptions
No assumptions
about data
distributions or
distance metrics
Widely used
Active academic
research and
industry use
25
27. Normal Day
A cluster of real members
using automation tools with
similar behavior
Number of user actions of interest
Isolationforestscore
27
28. Behavior of two example real accounts using automation tools
Time
AutomateduseractionsAutomateduseractions
~30 user actions performed
at a constant rate
28
31. Fake Account
Attack Day
A major fake account attack
using abusive automation
Number of user actions of interest
Isolationforestscore
31
32. Behavior of two example accounts from the attack
Time
AutomateduseractionsAutomateduseractions
32
33. Potential Isolation Forest Use Cases
Account takeover Alerting on time-series
data
Payment fraud Data center monitoring
Automation detection Advanced persistent
threats
Insider threats /
intrusion detection
ML health assurance
33
34. Open Source isolation-forest
Library @ LinkedIn
• Scala/Spark
• Developed by LinkedIn Anti-Abuse AI
• Distributed training and scoring
• Compatible with spark.ml
• Open sourced and available on GitHub
• Artifacts in JCenter and Maven Central
https://github.com/linkedin/isolation-forest
https://engineering.linkedin.com/blog/2019/isolation-forest
34
35. Isolation Forest Acknowledgements
• Thank you to Frank Astier and Ahmed Metwally for providing advice and performing code
reviews.
• Thank you to Will Chan, Jason Chang, Milinda Lakkam, and Xin Wang for providing useful
feedback as the first users of this library.
35
37. Variations of Clustering
1. Group By Clustering
• Refers to grouping accounts by shared signals.
• Example: accounts created on the same IP
address on the same day.
• Computationally cheap.
• Effective at catching basic bulk fake accounts but
brittle against more sophisticated bad actors.
37
38. Variations of Clustering
1. Group By Clustering
• Refers to grouping accounts by shared signals.
• Example: accounts created on the same IP
address on the same day.
• Computationally cheap.
• Effective at catching basic bulk fake accounts but
brittle against more sophisticated bad actors.
2. Graph Clustering
• Refers to grouping accounts by similar signals.
• Example: accounts liking similar content (fake
engagement).
• Computationally expensive.
• Effective at catching more sophisticated bad
actors.
38
39. Behavioral Graph Clustering
likes {article 1, article 3, article 6}
likes {article 2, article 3, article 6}
1. Find pairs of members that have high content engagement similarity
39
40. Behavioral Graph Clustering
40
1. Find pairs of members that have high content engagement similarity
• Use optimizations to avoid comparisons
41. Behavioral Graph Clustering
41
1. Find pairs of members that have high content engagement similarity
• Similarity defined using Jaccard (Ruzicka) = ⁄∑! min(𝑥!, 𝑦!) ∑! max(𝑥!, 𝑦!)
42. Behavioral Graph Clustering
1. Find pairs of members that have high content engagement similarity
• Apply a threshold
42
43. Behavioral Graph Clustering
1. Find pairs of members that have high content engagement similarity
2. Cluster the similarity graph using Jarvis-Patrick algorithm
43
46. Behavioral Graph Clustering: Real Accounts
Book Club
Cryptocurrency Interest Group Cryptocurrency Interest Group Cryptocurrency Interest Group
Employees of a Company Employees of a Company
- 1.00
- 0.95
- 0.90
- 0.85
- 0.80
- 0.75
- 0.70
similarity
46
48. Unsupervised + Supervised
Accounts with
similar behavior
Fake account clusters,
Hacked account clusters,
& Real account clusters
Isolation Forest
Graph Clustering
Supervised model
or heuristics
Isolation Forest
Graph Clustering
48
49. Unsupervised + Supervised
Secure and reinstate
hacked account clusters
No action or educate
real account clusters
Take down fake account
clusters
49
Accounts with
similar behavior
Fake account clusters,
Hacked account clusters,
& Real account clusters
Isolation Forest
Graph Clustering
Supervised model
or heuristics
Isolation Forest
Graph Clustering
51. Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
Labels
Few “ground truth” labels for
model training or evaluation
Solution: Unsupervised
learning is a natural fit for
challenges with few labels
51
52. Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
Labels
Few “ground truth” labels for
model training or evaluation
Limited signal for individual
fake accounts
Solution: Unsupervised
learning is a natural fit for
challenges with few labels
Solution: Our techniques
catch groups of camouflaged
bad accounts due to shared
behavioral patterns 52
Signal
53. Solution: As long as attacker
behavior is different than
normal user behavior, it can
be detected
Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
Labels
Few “ground truth” labels for
model training or evaluation
Limited signal for individual
fake accounts
Attackers are very quick to
adapt and evolve
Solution: Unsupervised
learning is a natural fit for
challenges with few labels
Solution: Our techniques
catch groups of camouflaged
bad accounts due to shared
behavioral patterns 53
Signal Adversarial
54. Behavioral Graph Clustering Acknowledgements
• The behavioral graph clustering library was created by Ahmed Metwally.
• Thank you to Frank Astier and Ted Hwa for providing advice and performing code reviews.
• Thank you to Adam Jacoby for creating the first production application using the behavioral
graph clustering library.
54
55. Thank you
Grace Tang
Senior Staff Machine
Learning Engineer
gtang@linkedin.com
James Verbus
Staff Machine Learning
Engineer
jverbus@linkedin.com
Anti-Abuse AI 55
57. Isolation Tree
• Randomly sample data for
isolation in a binary tree
• Randomly choose split feature
and split value
• Repeat until all instances are
isolated
Outliers are
easier to isolate
Inliers are harder
to isolate
57
58. Isolation Tree
Outliers are
easier to isolate
Inliers are harder
to isolate
• Score is derived from path
length from root to leaf node
• New instances can be scored
by tracing their path through
the trained tree
58
60. Jarvis-Patrick Variations
Vanilla
• Two nodes can only be clustered if they are in
each other’s J nearest neighbors
Modified
• Two nodes can only be clustered if their nearest
neighbors have a minimum similarity
• J = the number of nearest neighbors to consider for each node
• K = the minimum number of nearest neighbors that must be in common for two nodes to be clustered together
x y
a
b
c
d
x y
a
b
c
d
a b c d
x 1 1 1
y 1 1 1
𝑠𝑖𝑚 𝑥, 𝑦 =
∑! min(𝑥!, 𝑦!)
∑! max(𝑥!, 𝑦!)
60