Ensemble clustering methods
Shrayes Ramesh
Intro: classification versus clustering

Data points colored by “ground truth” labels red and blue.
Classification: build a model that assigns a data point its best label.
From scikit-learn.org
Intro: classification versus clustering

Clustering: build a model that groups data into clusters.
Note: there is no ground truth / ground truth is not known a priori.
The algorithm chooses a partitioning that “makes sense”.
From scikit-learn.org
Intro: why clustering?

•  Big, distributed data sets: store everything, know nothing
•  Detecting behaviors for exploratory analysis: let’s start somewhere smaller (in parallel)
•  Unsupervised learning of patterns-of-life: wait, don’t we do “machine learning?”

[Diagram: a distributed dataset spread across logical database partitions; data from all users and employees, grouped into groups of users and groups of employees]
Network Defense program

Detecting network infiltration via distributed computation that identifies anomalous behavior:
•  Rule-based signatures → Adaptive behavior detection
•  Stateless single IP analyses → Context-based decisions
•  Manual analysis → Guided automation; automated response to known threats, suspicious periods flagged
•  Visual inspection → Visual inspection aided by distributed analytics

Netflow and log data, for example:
Jun 30 18:57:01 172.28.215.239 IPSEC: An outbound LAN-to-LAN SA (SPI= 0xA75FC985) between xxx.xxx.xxx.xxx and yyy.yyy.yyy.yyy created.

[Diagram: Agency HQ and 100s of office locations; teleworkers with VPN client software, wireless VPN clients, and homeworkers with VPN routers or VPN client software. VPN = Virtual Private Network]

Approved for Public Release, Distribution Unlimited
“Small data” clustering

Hierarchical clustering works great for small data.
These algorithms are O(N^2), requiring computation of N^2 distances.
They will not scale well.
Most (not all) algorithms that do dimensionality reduction, produce dendrograms, learn manifolds, or construct 2d projections are also O(N^2).
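
As a rough illustration (a minimal sketch with SciPy, not from the slides), the quadratic cost is visible directly: agglomerative clustering needs all N*(N-1)/2 pairwise distances before it can build a dendrogram.

```python
# Sketch: hierarchical clustering's O(N^2) cost comes from the pairwise distances.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

N = 2000
X = np.random.randn(N, 2)

d = pdist(X)                      # condensed distance vector: N*(N-1)/2 = 1,999,000 entries
Z = linkage(d, method="average")  # agglomerative clustering on those distances
labels = fcluster(Z, t=5, criterion="maxclust")
```

Doubling N quadruples both the memory for d and the work to build Z, which is why these methods stay in “small data” territory.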
  
Scalable clustering in a MapReduce world

E-M approaches (like k-means) are embarrassingly parallel per iteration.
But they only produce local optima.
Other O(N*K) mixture models estimated with sampling procedures tend to require lots of O(N*K) iterations (aka MR jobs/tasks) to converge.

Initialize centroids → assign each point to nearest centroid (O(N*K)) → group data into clusters, recompute centroids → repeat until convergence.

But… what if the features are distributed or high dimensional?
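
A minimal sketch of why each iteration parallelizes (an assumed illustration, not the program’s actual implementation): the assignment step is a pure map over data partitions, and the centroid update is a sum-and-count reduce.

```python
# Sketch: one k-means iteration as map (assign points) + reduce (recompute centroids).
import numpy as np

def assign_and_sum(chunk, centroids):
    """Map step: O(n*K) distance computations for this chunk only."""
    d = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    for k in range(len(centroids)):
        members = chunk[labels == k]
        sums[k] = members.sum(axis=0)
        counts[k] = len(members)
    return sums, counts

def kmeans_iteration(chunks, centroids):
    """Reduce step: combine partial sums/counts from every chunk, then divide."""
    total_sums = np.zeros_like(centroids)
    total_counts = np.zeros(len(centroids))
    for chunk in chunks:                         # in MapReduce these maps run in parallel
        s, c = assign_and_sum(chunk, centroids)
        total_sums += s
        total_counts += c
    new = centroids.copy()
    nonempty = total_counts > 0
    new[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new

X = np.random.randn(10000, 2)
chunks = np.array_split(X, 8)                    # stand-in for distributed partitions
centroids = X[np.random.choice(len(X), 5, replace=False)]
for _ in range(10):                              # in practice: repeat until convergence
    centroids = kmeans_iteration(chunks, centroids)
```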
  
Detecting infiltrator conducting reconnaissance

Approach: identify IP addresses that behave differently from others.

Data:
•  5.2 billion communications
•  750 million communications between IP address pairs
•  530 thousand summarized connections
•  Raw data: source IP address, destination IP address, bytes, packets, port, protocol

Algorithms:
•  60 iterations of 3 clustering algorithms consolidated via “voting procedure”
•  Ranked list of outliers (reachability distance by IP address)

Outcome:
•  Investigated two extreme outliers out of 4.6 billion IP addresses
•  Identified potentially compromised IP address
•  IP address conducts recon on 9683 IP addresses from inside network (bytes per hour over time)

Approved for Public Release, Distribution Unlimited
Ensemble Clustering: Motivation

Scalability and Robustness
•  Problem: “accurate” clustering algorithms are > O(N^2)
•  Typical solutions: E-M or O(K*N) clustering at scale
•  However: many fast clustering algorithms give local minima
•  Problem: data sets are high dimensional or distributed
•  Typical solution: repeated sampling or subsetting before clustering
•  However: unlike an ensemble of classifiers, cluster labels don’t align

References
•  Strehl, A., & Ghosh, J. (2003). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3, 583-617.
•  Fred, A. L., & Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(6), 835-850.
•  Topchy, A., Jain, A. K., & Punch, W. (2005). Clustering ensembles: models of consensus and weak partitions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(12), 1866-1881.
Ensemble Clustering: One Slide Overview

1.  Cluster to generate initial partitions
2.  Align clusters into metaclusters
3.  Vote for final clusters
Generating an ensemble of clustering partitions

1.  Cluster to generate initial partitions.
First stage clusters: each set of points tagged with the same color is called a “hyperedge”.
First-stage clusters (hyperedges) don’t align

[Figure: the same clusters receive different labels 1-4 across two iterations]

Problem: cluster labels (hyperedges) differ across iterations.
Solution: cluster the hyperedges by the set of data points they have in common; this solves a smaller O(n^2) problem.

2.  Compute smaller (than N) hyperedge similarity matrix.
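
A sketch of that similarity matrix under an assumed representation (each hyperedge stored as the set of point ids it contains); Jaccard overlap is used here, but any measure of points-in-common works.

```python
# Sketch: build the (much smaller than N) hyperedge similarity matrix.
import numpy as np

def hyperedges_from_assignments(assignments):
    """assignments: list of per-iteration label vectors -> list of point-id sets."""
    edges = []
    for labels in assignments:
        for k in np.unique(labels):
            edges.append(set(np.where(labels == k)[0]))
    return edges

def hyperedge_similarity(edges):
    """Jaccard similarity between every pair of hyperedges: an H x H matrix, H << N."""
    H = len(edges)
    S = np.zeros((H, H))
    for i in range(H):
        for j in range(i, H):
            union = len(edges[i] | edges[j])
            S[i, j] = S[j, i] = len(edges[i] & edges[j]) / union if union else 0.0
    return S
```

With, say, 80 iterations of k = 20 this is only a 1600 x 1600 matrix, regardless of how large N is.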
Second-stage clustering to produce metaclusters

3.  Cluster hyperedges into “metaclusters”.
4.  Each point chooses a final metacluster by voting (e.g. ¾ of its hyperedges vs. ¼).
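
A sketch of steps 3-4 with the same assumed representation: once hyperedges have metacluster labels, each point votes with all the hyperedges it belongs to and takes the majority.

```python
# Sketch: each point picks the metacluster that most of its hyperedges landed in.
import numpy as np

def vote_final_labels(edges, metacluster_of_edge, n_points, n_meta):
    """edges[h] is a set of point ids; metacluster_of_edge[h] is hyperedge h's metacluster."""
    votes = np.zeros((n_points, n_meta))
    for h, edge in enumerate(edges):
        for p in edge:
            votes[p, metacluster_of_edge[h]] += 1    # one vote per hyperedge membership
    return votes.argmax(axis=1)                       # majority wins, e.g. 3/4 beats 1/4
```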
Decision points: first and second stage clustering

Dataset: id, feature1, feature2, feature3, feature4, …

Distribute by feature:
•  Cluster the entire dataset by different subsets of features

Overlapping samples:
•  Cluster a random partition of the dataset
•  Each point needs to be in multiple partitions

Bootstrap samples:
•  Cluster a random partition of the dataset; output “predictive models”
•  Assign all points to predicted cluster labels

The algorithm chosen in the generation of clustering assignments matters.
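
Two of these generation strategies, sketched with scikit-learn under assumed parameters (the slides do not tie the choice to any particular library):

```python
# Sketch: generating first-stage partitions by feature subsets and by bootstrap samples.
import numpy as np
from sklearn.cluster import KMeans

def feature_subset_partitions(X, n_iter=20, k=10, frac=0.5, seed=0):
    """Distribute by feature: cluster the whole dataset on random subsets of columns."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_iter):
        cols = rng.choice(X.shape[1], max(1, int(frac * X.shape[1])), replace=False)
        runs.append(KMeans(n_clusters=k, n_init=1).fit_predict(X[:, cols]))
    return runs

def bootstrap_partitions(X, n_iter=20, k=10, seed=0):
    """Bootstrap samples: fit on a sample, then assign every point a predicted label."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_iter):
        idx = rng.choice(len(X), size=len(X) // 2, replace=True)
        model = KMeans(n_clusters=k, n_init=1).fit(X[idx])
        runs.append(model.predict(X))     # the "predictive model" applied to all points
    return runs
```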
  
Walkthrough: ensemble clustering on 20k smiley face

Workflow:
1.  Draw a random multivariate normal vector r ~ N(0,I)
2.  Project data: y = x*r
3.  k-means on y (k=20)
4.  Repeat many times (80)

Output is first stage cluster assignments: (node; iteration; label)

[Figure: random projection k-means first stage clusters (last four iterations)]
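
A minimal sketch of this walkthrough loop (an assumed implementation; x is the 20k x 2 smiley-face point cloud):

```python
# Sketch: 80 rounds of random-projection k-means to generate first-stage clusters.
import numpy as np
from sklearn.cluster import KMeans

def first_stage(x, n_iter=80, k=20, seed=0):
    rng = np.random.default_rng(seed)
    assignments = []                               # one label vector per iteration
    for it in range(n_iter):
        r = rng.standard_normal(x.shape[1])        # 1. draw r ~ N(0, I)
        y = x @ r                                  # 2. project: y = x*r
        labels = KMeans(n_clusters=k, n_init=1).fit_predict(y.reshape(-1, 1))  # 3. k-means on y
        assignments.append(labels)                 # 4. repeat many times
    return assignments                             # yields the (node; iteration; label) assignments
```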
Metaclustering

First stage clusters: 20 clusters x 80 iterations.
Similarity matrix colored by number of points in common (20 clusters per iteration, 80 iterations).
Spectral clustering with k = 6 yields the metaclusters.
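
Continuing the sketch: the 1600 x 1600 hyperedge similarity matrix (20 clusters x 80 iterations) is grouped into k = 6 metaclusters. This assumes the Jaccard matrix built earlier; scikit-learn’s SpectralClustering accepts a precomputed affinity matrix.

```python
# Sketch: second-stage (spectral) clustering of the H x H hyperedge similarity matrix S.
from sklearn.cluster import SpectralClustering

def metacluster(S, n_meta=6):
    model = SpectralClustering(n_clusters=n_meta, affinity="precomputed", random_state=0)
    return model.fit_predict(S)     # one metacluster label per hyperedge
```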
Voting

From convex first stage clusters to nonconvex metaclusters (shown for K=4 and K=6).
Extensions: entropy and density

Entropy: to what extent do your first stage clusters agree? (e.g. a ¾ / ¼ split)
Density: on average, how many other points are in your bin?
High density: easy to cluster. Low density: hard to cluster.
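
Both diagnostics can be read off the ensemble itself; a sketch under the same assumptions as the earlier voting sketch:

```python
# Sketch: per-point entropy of metacluster votes (do my first-stage clusters agree?)
# and per-point density (how many other points share my bins, on average).
import numpy as np

def vote_entropy(votes):
    """votes: (n_points, n_meta) counts; 0 = all hyperedges agree, higher = ambiguous."""
    p = votes / votes.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p), 0.0)
    return -terms.sum(axis=1)

def mean_bin_size(edges, n_points):
    """Average number of other points in the hyperedges ('bins') each point falls into."""
    totals = np.zeros(n_points)
    counts = np.zeros(n_points)
    for edge in edges:
        for p in edge:
            totals[p] += len(edge) - 1
            counts[p] += 1
    return totals / np.maximum(counts, 1)
```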
Extension: ensemble topology visualizations

Spectral clustering with classical multidimensional scaling: first stage clusters → metaclusters.
A point’s coordinate = average coordinate of its clusters.
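
A sketch of the layout idea, assuming the hyperedge similarity matrix and point sets from the earlier sketches: classical MDS places each hyperedge in 2D, and each point is drawn at the average coordinate of the hyperedges (clusters) it belongs to.

```python
# Sketch: classical MDS on hyperedge dissimilarities, then average coordinates per point.
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS from an H x H dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered squared dissimilarities
    w, v = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:k]
    return v[:, top] * np.sqrt(np.maximum(w[top], 0))

def ensemble_layout(S, edges, n_points):
    coords = classical_mds(1.0 - S)               # similarity -> dissimilarity (assumed)
    xy = np.zeros((n_points, 2))
    counts = np.zeros(n_points)
    for h, edge in enumerate(edges):
        for p in edge:
            xy[p] += coords[h]
            counts[p] += 1
    return xy / np.maximum(counts, 1)[:, None]    # point = mean of its clusters' coordinates
```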
Workflow: categorizing 237k song lyrics

Ensemble clustering of the musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/musixmatch

1m song database:
•  trackid
•  artist
•  song
•  album
•  year

237k song DTM (document-term matrix):
•  trackid
•  stemmed word (top 5000)
•  count

Hypothetical questions
•  Can we learn “genres” from examining song lyrics alone?
•  Can we identify when genres emerge and peak over time?
•  Can we quickly visualize the landscape of song lyrics?
•  Can we track how artists evolve over time?

Approach
•  50 iterations of spherical k-means (cosine similarity) with k=20
•  In [R], using just slam, Matrix and doParallel
•  Ensembled together with spectral clustering, k=12
•  With ensemble visualization
•  Joined/aligned with metadata on year, artist
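
The slides do this in R with slam, Matrix and doParallel; as a rough Python analogue (an assumed sketch, not the original code), L2-normalizing the rows of the sparse document-term matrix makes ordinary k-means behave approximately like cosine-similarity (spherical) k-means.

```python
# Sketch (assumed Python analogue of the R workflow): spherical k-means on the lyrics DTM,
# approximated by k-means on L2-normalized rows of the sparse count matrix.
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def spherical_kmeans_runs(dtm, n_iter=50, k=20, seed=0):
    """dtm: sparse (237k x 5000) track-by-stemmed-word count matrix."""
    X = normalize(sparse.csr_matrix(dtm))          # unit-length rows -> cosine geometry
    runs = []
    for it in range(n_iter):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed + it)
        runs.append(km.fit_predict(X))
    return runs                                    # 50 first-stage partitions for the ensemble
```

The 50 x 20 first-stage clusters are then ensembled with spectral clustering (k = 12), exactly as in the smiley-face walkthrough.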
  
Application: clustering 237k song lyrics

Cluster % over time; ensemble layout (colored by clusters).

Top words by cluster:
•  "he" "his" "him" "the" "was"
•  "hey" "gonna" "wanna" "you" "i"
•  "we" "ich" "our" "und" "are"
•  "i" "you" "am" "the" "not"
•  "che" "e" "di" "non" "il"
•  "n****" "f***" "s***" "i" "ya"
•  "na" "o" "eu" "e" "não"
•  "je" "de" "et" "les" "le"
•  "love" "babi" "you" "i" "me"
•  "she" "her" "i" "the" "girl"
•  "of" "the" "death" "blood" "their"
•  "que" "y" "de" "la" "el"
Thank you!

Shrayes Ramesh
shrayes.ramesh@gmail.com
www.github.com/shrayesramesh