2. Intro: classification versus clustering
Data points colored by “ground truth” labels (red and blue).
Classification: build a model that assigns a data point its best label.
From scikit-learn.org
3. Intro: classification versus clustering
Clustering: build a model that groups data into clusters.
Note: there is no ground truth / ground truth is not known a priori. The algorithm chooses a partitioning that “makes sense”.
From scikit-learn.org
4. Intro: why clustering?
• Big, distributed data sets: store everything, know nothing
• Detecting behaviors for exploratory analysis: let’s start somewhere smaller (in parallel)
• Unsupervised learning of patterns-of-life: wait, don’t we do “machine learning”?
[Diagram: a distributed dataset held in logical database partitions 1–6, containing data from all users and employees; clustering yields groups of users and groups of employees]
5. Network Defense program
Detecting network infiltration via distributed computation that identifies anomalous behavior.
Rule-based signatures → Adaptive behavior detection
Stateless single IP analyses → Context-based decisions
Manual analysis → Guided automation: automated response to known threats, suspicious periods flagged
Visual inspection → Visual inspection aided by distributed analytics
Netflow and log data, e.g.:
Jun 30 18:57:01 172.28.215.239 IPSEC: An outbound LAN-to-LAN SA (SPI= 0xA75FC985) between xxx.xxx.xxx.xxx and yyy.yyy.yyy.yyy created.
[Diagram: netflow and log data collected from 100s of office locations, the agency HQ, and teleworkers/homeworkers connecting via VPN client software, wireless VPN clients, and VPN routers (VPN = Virtual Private Network)]
6. “Small data” clustering
Hierarchical clustering works great for small data. These algorithms are O(N^2), requiring computation of N^2 distances; they will not scale well.
Most (not all) algorithms that do dimensionality reduction, produce dendrograms, learn manifolds, or construct 2-D projections are also O(N^2)
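As a concrete illustration of the quadratic cost, here is a minimal sketch using SciPy (the slides name no library, so this is an assumed stand-in): the pdist call materializes all N(N-1)/2 pairwise distances before any merging happens.

```python
# Sketch only: hierarchical (agglomerative) clustering on small data.
# The pdist call computes every pairwise distance, which is the O(N^2)
# step that keeps this family of algorithms from scaling.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))      # "small data": N = 500 is no problem

D = pdist(X)                       # 500 * 499 / 2 = 124,750 distances
Z = linkage(D, method="average")   # merge tree (dendrogram) from distances
labels = fcluster(Z, t=5, criterion="maxclust")  # cut into 5 flat clusters
```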
7. Scalable clustering in a MapReduce world
E-M approaches (like k-means) are embarrassingly parallel per iteration, but they only produce local optima. Other O(N*K) mixture models estimated with sampling procedures tend to require lots of O(N*K) iterations (aka MR jobs/tasks) to converge.
The k-means loop:
1. Initialize centroids
2. Assign each point to its nearest centroid; O(N*K)
3. Group data into clusters, recompute centroids
4. Repeat until convergence
But… what if the features are distributed or high dimensional?
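A minimal NumPy sketch of one such iteration (function names and the sharding are hypothetical, not the program’s actual code): the map step runs independently per data shard, and the reduce step is a cheap aggregation, which is why each iteration parallelizes so well.

```python
# Sketch of one k-means (E-M) iteration in map/reduce form.
import numpy as np

def assign_and_sum(shard, centroids):
    """Map step: O(n_shard * K) distance computations on one shard."""
    d = ((shard[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)              # nearest centroid per point
    k = centroids.shape[0]
    sums, counts = np.zeros_like(centroids), np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = shard[mask].sum(axis=0)
        counts[j] = mask.sum()
    return sums, counts

def kmeans_iteration(shards, centroids):
    """Reduce step: combine per-shard partial sums into new centroids."""
    parts = [assign_and_sum(s, centroids) for s in shards]  # parallelizable
    total = sum(p[0] for p in parts)
    n = sum(p[1] for p in parts)
    return total / np.maximum(n, 1)[:, None]  # guard empty clusters
```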
8. 60 iterations of 3 clustering algorithms consolidated via a “voting procedure”
Approach: identify IP addresses that behave differently from others (detecting an infiltrator conducting reconnaissance).
Data:
• 5.2 billion communications
• 750 million communications between IP address pairs
• 530 thousand summarized connections
Raw data: source IP address, destination IP address, bytes, packets, port, protocol.
Algorithms: 60 iterations of 3 clustering algorithms consolidated via a “voting procedure”, producing a ranked list of outliers.
Outcome: investigated two extreme outliers out of 4.6 billion IP addresses; identified a potentially compromised IP address and an IP address conducting recon on 9683 IP addresses from inside the network.
[Plots: bytes per hour over time; reachability distance by IP address with outliers flagged]
9. Ensemble Clustering: Motivation
Scalability and Robustness
• Problem: “accurate” clustering algorithms are > O(N^2)
• Typical solutions: E-M or O(K*N) clustering at scale
• However: many fast clustering algorithms give local minima
• Problem: data sets are high dimensional or distributed
• Typical solution: repeated sampling or subsetting before clustering
• However: unlike an ensemble of classifiers, cluster labels don’t align
References
• Strehl, A., & Ghosh, J. (2003). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3, 583–617.
• Fred, A. L., & Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835–850.
• Topchy, A., Jain, A. K., & Punch, W. (2005). Clustering ensembles: models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1866–1881.
10. Ensemble Clustering: One-Slide Overview
Cluster to generate initial partitions → Align clusters into metaclusters → Vote for final clusters
11. Generating an ensemble of clustering partitions
First stage clusters: the set of points tagged with the same color is called a “hyperedge”.
1. Cluster to generate initial partitions
12. First-stage clusters (hyperedges) don’t align
Problem: cluster labels (hyperedges) differ across iterations.
Solution: cluster the hyperedges by the set of data points they have in common. This solves a smaller O(n^2) problem, where n is the number of hyperedges rather than the number of data points.
2. Compute the (smaller than N) hyperedge similarity matrix
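A sketch of that smaller problem, assuming Jaccard overlap as the similarity (the slides only say “points in common”): with T iterations of K clusters there are just T*K hyperedges, so the matrix is (T*K) x (T*K) rather than N x N.

```python
# Sketch: hyperedge-by-hyperedge similarity from the first-stage partitions.
import numpy as np

def hyperedge_similarity(partitions):
    """partitions: list of length-N label arrays, one per iteration."""
    hyperedges = [np.flatnonzero(p == c)   # member points of each
                  for p in partitions      # (iteration, label) hyperedge
                  for c in np.unique(p)]
    sets = [set(h) for h in hyperedges]
    m = len(hyperedges)                    # m = T*K, far smaller than N
    S = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):
            inter = len(sets[i] & sets[j])
            union = len(sets[i] | sets[j])
            S[i, j] = S[j, i] = inter / union   # Jaccard (an assumption)
    return S, hyperedges
```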
13. Second-stage clustering to produce metaclusters
3. Cluster hyperedges into “metaclusters”
4. Each point chooses a final metacluster by voting (e.g. a point assigned to one metacluster in ¾ of the iterations and another in ¼ goes with the majority)
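A sketch of the voting step (hypothetical helper names, building on hyperedge_similarity above; meta_labels maps each hyperedge to the metacluster found in stage two): each point casts one vote per iteration, namely the metacluster of the hyperedge it fell into, and takes the majority.

```python
# Sketch: each point inherits one metacluster vote per iteration.
import numpy as np

def vote(hyperedges, meta_labels, n_points):
    n_meta = meta_labels.max() + 1
    votes = np.zeros((n_points, n_meta))
    for h, members in enumerate(hyperedges):
        votes[members, meta_labels[h]] += 1   # one vote per iteration
    return votes.argmax(axis=1)               # e.g. 3/4 vs 1/4 -> majority
```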
14. Decision points: first- and second-stage clustering
Given a dataset with columns (id, feature1, feature2, feature3, feature4, …), there are several ways to generate the first-stage partitions; a sketch of the bootstrap variant follows the list.
Distribute by feature:
• Cluster the entire dataset by different subsets of features
Overlapping samples:
• Cluster a random partition of the dataset
• Each point needs to be in multiple partitions
Bootstrap samples:
• Cluster a random partition of the dataset; output “predictive models”
• Assign all points to predicted cluster labels
The algorithm chosen in the generation of clustering assignments matters.
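A sketch of the bootstrap-sample strategy using scikit-learn’s KMeans as the “predictive model” (an assumed stand-in; any clusterer that can label unseen points works): fit on a random sample, then predict labels for all points so every partition covers the full dataset.

```python
# Sketch: bootstrap-sample first-stage clustering with predictive models.
import numpy as np
from sklearn.cluster import KMeans

def bootstrap_partitions(X, n_iter=80, k=20, sample_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(n_iter):
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=True)
        model = KMeans(n_clusters=k, n_init=1).fit(X[idx])  # "predictive model"
        partitions.append(model.predict(X))                  # label all points
    return partitions
```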
15. Walkthrough: ensemble clustering on a 20k-point smiley face
Workflow:
1. Draw a random multivariate normal vector r ~ N(0, I)
2. Project the data: y = x*r
3. Run k-means on y (k=20)
4. Repeat many times (80)
The output is the first-stage cluster assignments: (node; iteration; label).
[Figure: random-projection k-means first-stage clusters (last four iterations)]
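The generation loop of this walkthrough, sketched in Python with scikit-learn (assumed libraries; the step numbers match the workflow above):

```python
# Sketch: random-projection k-means first-stage partitions.
import numpy as np
from sklearn.cluster import KMeans

def random_projection_partitions(X, n_iter=80, k=20, seed=0):
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(n_iter):
        r = rng.standard_normal(X.shape[1])          # 1. r ~ N(0, I)
        y = X @ r                                    # 2. project: y = x*r
        labels = KMeans(n_clusters=k, n_init=1).fit_predict(y.reshape(-1, 1))  # 3.
        partitions.append(labels)                    # (node; iteration; label)
    return partitions                                # 4. repeated n_iter times
```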
16. Metaclustering
First-stage clusters: 20 clusters x 80 iterations, colored by number of points in common.
Spectral clustering (k=6) groups them into metaclusters.
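A sketch of this second stage, reusing hyperedge_similarity from the earlier sketch and scikit-learn’s SpectralClustering with a precomputed affinity (an assumed implementation of the slide’s spectral step):

```python
# Sketch: spectral clustering of the 1600 x 1600 (20 clusters x 80
# iterations) hyperedge similarity matrix into k=6 metaclusters.
from sklearn.cluster import SpectralClustering

S, hyperedges = hyperedge_similarity(partitions)   # from the earlier sketch
meta_labels = SpectralClustering(
    n_clusters=6, affinity="precomputed"
).fit_predict(S)
```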
17. Voting
Shown for K=4 and K=6: voting takes you from convex first-stage clusters to nonconvex metaclusters.
18. Extensions: entropy and density
Entropy: to what extent do your first-stage clusters agree? (e.g. a ¾ / ¼ vote split)
Density: on average, how many other points are in your bin? High density: easy to cluster. Low density: hard to cluster.
19. Extension: ensemble topology visualizations
First-stage clusters and metaclusters are embedded with classical multidimensional scaling (after spectral clustering); a point’s coordinate = the average coordinate of its clusters.
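A sketch of that layout rule, using scikit-learn’s MDS (which implements metric MDS via SMACOF, here standing in for classical scaling) on the hyperedge dissimilarities from the earlier sketches:

```python
# Sketch: embed hyperedges in 2-D, then place each point at the average
# coordinate of the hyperedges it belongs to. Uses S, hyperedges, and
# partitions from the earlier sketches (assumptions, not slide code).
import numpy as np
from sklearn.manifold import MDS

coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(1.0 - S)

n_points = len(partitions[0])
point_xy = np.zeros((n_points, 2))
for h, members in enumerate(hyperedges):
    point_xy[members] += coords[h]
point_xy /= len(partitions)   # one hyperedge per iteration per point
```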
20. Workflow: categorizing 237k song lyrics
[Diagram: ensemble clustering pipeline over blocks of text (“Abc”) and numeric (“#”) features]
Data: the musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at http://labrosa.ee.columbia.edu/millionsong/musixmatch
1m song database: trackid, artist, song, album, year
237k song DTM (document-term matrix): trackid, stemmed word (top 5000), count
Hypothetical questions:
• Can we learn “genres” from examining song lyrics alone?
• Can we identify when genres emerge and peak over time?
• Can we quickly visualize the landscape of song lyrics?
• Can we track how artists evolve over time?
Approach (sketched below):
• 50 iterations of spherical k-means (cosine similarity) with k=20
• In [R], using just slam, Matrix and doParallel
• Ensembled together with spectral clustering, k=12
• With ensemble visualization
• Joined/aligned with metadata on year and artist
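The slides did this in R with slam, Matrix and doParallel; as a rough Python equivalent (an assumption, not the original code), spherical k-means can be approximated by L2-normalizing the document-term rows so that Euclidean k-means assignment tracks cosine similarity:

```python
# Sketch: approximate spherical k-means on the lyrics DTM.
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def lyric_partitions(dtm, n_iter=50, k=20, seed=0):
    X = normalize(dtm)   # unit-length rows: Euclidean distance ~ cosine
    return [KMeans(n_clusters=k, n_init=1, random_state=seed + i).fit_predict(X)
            for i in range(n_iter)]   # 50 first-stage partitions, ensembled
                                      # afterwards with spectral clustering (k=12)
```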