Preventing Abuse Using Unsupervised Learning

Preventing Abuse Using Unsupervised
Learning
Spark + AI Summit
2020-06-22 to 2020-06-26
Grace Tang
Senior Staff Machine
Learning Engineer
gtang@linkedin.com
James Verbus
Staff Machine Learning
Engineer
jverbus@linkedin.com
Anti-Abuse AI

Unique Challenges in Anti-Abuse
Unsupervised learning is a solution
Labels
Few “ground truth” labels for
model training or evaluation
2

Labels
Signal
Limited signal for individual
fake accounts
3

Labels
Signal
fake accounts
Adversarial
Attackers are very quick to
adapt and evolve
4

Unsupervised Outlier
Detection:
Isolation Forests
Anti-Abuse AI 5

Isolation Forest
• Proposed by Liu et al. in 2008
• Ensemble of randomly created
binary trees
• Tree structure captures multi-
dimensional feature distribution
• Future instances can be scored
Outliers are
easier to isolate
Inliers are harder
to isolate
6

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
7

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
8

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
9

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
10

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
11

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
12

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
13

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
14

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
15

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
16

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
17

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
18

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
19

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
20

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
21

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
22

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
23

Isolation Forest
Outliers are
easier to isolate
Inliers are harder
to isolate
24

Isolation Forest Advantages
Performant
A top performer in
recent
benchmarks
Scalable
Low computation
and memory
complexity
Fewer assumptions
No assumptions
about data
distributions or
distance metrics
Widely used
Active academic
research and
industry use
25

Normal Day
Number of user actions of interest
Isolationforestscore
26

Normal Day
A cluster of real members
using automation tools with
similar behavior
27

Behavior of two example real accounts using automation tools
Time
AutomateduseractionsAutomateduseractions
~30 user actions performed
at a constant rate
28

Normal Day
29

Fake Account
Attack Day
30

Fake Account
Attack Day
A major fake account attack
using abusive automation
31

Behavior of two example accounts from the attack
Time
AutomateduseractionsAutomateduseractions
32

Potential Isolation Forest Use Cases
Account takeover Alerting on time-series
data
Payment fraud Data center monitoring
Automation detection Advanced persistent
threats
Insider threats /
intrusion detection
ML health assurance
33

Open Source isolation-forest
Library @ LinkedIn
• Scala/Spark
• Developed by LinkedIn Anti-Abuse AI
• Distributed training and scoring
• Compatible with spark.ml
• Open sourced and available on GitHub
• Artifacts in JCenter and Maven Central
https://github.com/linkedin/isolation-forest
https://engineering.linkedin.com/blog/2019/isolation-forest
34

Isolation Forest Acknowledgements
• Thank you to Frank Astier and Ahmed Metwally for providing advice and performing code
reviews.
• Thank you to Will Chan, Jason Chang, Milinda Lakkam, and Xin Wang for providing useful
feedback as the first users of this library.
35

Graph Clustering
Anti-Abuse AI 36

Variations of Clustering
1. Group By Clustering
• Refers to grouping accounts by shared signals.
• Example: accounts created on the same IP
address on the same day.
• Computationally cheap.
• Effective at catching basic bulk fake accounts but
brittle against more sophisticated bad actors.
37

Variations of Clustering
1. Group By Clustering
• Refers to grouping accounts by shared signals.
• Example: accounts created on the same IP
address on the same day.
• Computationally cheap.
• Effective at catching basic bulk fake accounts but
brittle against more sophisticated bad actors.
2. Graph Clustering
• Refers to grouping accounts by similar signals.
• Example: accounts liking similar content (fake
engagement).
• Computationally expensive.
• Effective at catching more sophisticated bad
actors.
38

Behavioral Graph Clustering
likes {article 1, article 3, article 6}
likes {article 2, article 3, article 6}
1. Find pairs of members that have high content engagement similarity
39

40
• Use optimizations to avoid comparisons

41
• Similarity defined using Jaccard (Ruzicka) = ⁄∑! min(𝑥!, 𝑦!) ∑! max(𝑥!, 𝑦!)

• Apply a threshold
42

2. Cluster the similarity graph using Jarvis-Patrick algorithm
43

Jarvis-Patrick
Clustering
• Fast (Parallelizable & Non-Iterative)
• Deterministic
• No need to predefine # of clusters
• Clusters do not overlap
• Accommodates clusters of variable size
44

Behavioral Graph Clustering: Fake Accounts
- 1.00
- 0.95
- 0.90
- 0.85
- 0.80
- 0.75
- 0.70
similarity
45

Behavioral Graph Clustering: Real Accounts
Book Club
Cryptocurrency Interest Group Cryptocurrency Interest Group Cryptocurrency Interest Group
Employees of a Company Employees of a Company
- 1.00
- 0.95
- 0.90
- 0.85
- 0.80
- 0.75
- 0.70
similarity
46

Unsupervised + Supervised
47
Accounts with
similar behavior
Isolation Forest
Graph Clustering

Accounts with
similar behavior
Fake account clusters,
Hacked account clusters,
& Real account clusters
Isolation Forest
Graph Clustering
Supervised model
or heuristics
Isolation Forest
Graph Clustering
48

Secure and reinstate
hacked account clusters
No action or educate
real account clusters
Take down fake account
clusters
49
Accounts with
similar behavior
Fake account clusters,
Hacked account clusters,
& Real account clusters
Isolation Forest
Graph Clustering
Supervised model
or heuristics
Isolation Forest
Graph Clustering

Labels
Solution: Unsupervised
learning is a natural fit for
challenges with few labels
51

Labels
fake accounts
Solution: Our techniques
catch groups of camouflaged
bad accounts due to shared
behavioral patterns 52
Signal

Solution: As long as attacker
behavior is different than
normal user behavior, it can
be detected
Labels
fake accounts
Attackers are very quick to
adapt and evolve
Solution: Our techniques
catch groups of camouflaged
bad accounts due to shared
behavioral patterns 53
Signal Adversarial

Behavioral Graph Clustering Acknowledgements
• The behavioral graph clustering library was created by Ahmed Metwally.
• Thank you to Frank Astier and Ted Hwa for providing advice and performing code reviews.
• Thank you to Adam Jacoby for creating the first production application using the behavioral
graph clustering library.
54

Thank you
Grace Tang
Senior Staff Machine
Learning Engineer
gtang@linkedin.com
James Verbus
Staff Machine Learning
Engineer
jverbus@linkedin.com
Anti-Abuse AI 55

Isolation Tree
• Randomly sample data for
isolation in a binary tree
• Randomly choose split feature
and split value
• Repeat until all instances are
isolated
Outliers are
easier to isolate
Inliers are harder
to isolate
57

Isolation Tree
Outliers are
easier to isolate
Inliers are harder
to isolate
• Score is derived from path
length from root to leaf node
• New instances can be scored
by tracing their path through
the trained tree
58

Isolation Forest
Ensemble of isolation trees
59

Jarvis-Patrick Variations
Vanilla
• Two nodes can only be clustered if they are in
each other’s J nearest neighbors
Modified
• Two nodes can only be clustered if their nearest
neighbors have a minimum similarity
• J = the number of nearest neighbors to consider for each node
• K = the minimum number of nearest neighbors that must be in common for two nodes to be clustered together
x y
a
b
c
d
x y
a
b
c
d
a b c d
x 1 1 1
y 1 1 1
𝑠𝑖𝑚 𝑥, 𝑦 =
∑! min(𝑥!, 𝑦!)
∑! max(𝑥!, 𝑦!)
60

Preventing Abuse Using Unsupervised Learning

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Preventing Abuse Using Unsupervised Learning

Similar to Preventing Abuse Using Unsupervised Learning (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Preventing Abuse Using Unsupervised Learning