Clusterix
A visual analytics approach to clustering
Eamonn Maguire¹, Ilias Koutsakis¹, Gilles Louppe²
¹ CERN, Geneva, Switzerland
² NYU, New York, USA
Visual Data Science Workshop, Baltimore, 2016
Overview
Here we present some work in progress…
I will present:
1. an initial background to clustering;
2. clustering problems;
3. how we’ve tried to solve these problems with a VA approach;
4. some case studies; and
5. ideas for what we can do in the future.
We appreciate suggestions, critiques, further use cases, and comments.
Clustering
Why is it still a problem?
Clustering is a form of unsupervised learning: no classification labels are available for the data to be grouped.
A clustering algorithm such as k-means must therefore try to learn the groupings by minimising some distance between elements.
Figure (k-means): data, features, projection, distance to centroids, resulting clusters, final result.
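As a rough illustration of the distance-minimising idea, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is not Clusterix code: the function names, the toy points, and the choice to seed the centroids from the first k points are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; centroids start at the first k points (a simplification)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical toy data: two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Even on this toy data the result depends on k and on the distance used, which is the point the following slides pick up.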
Hierarchical clustering is another example of unsupervised learning where no classification labels are available; here the result depends on the linkage function and on where the tree is cut.
Figure (hierarchical clustering): data, features, projection, linkage functions (average linkage, complete linkage, single linkage), cut location, final result.
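The three linkage functions named in the figure differ only in how they turn pairwise point distances into a cluster-to-cluster distance. The sketch below is a naive pure-Python version written for illustration, not Clusterix code; `linkage_distance`, `agglomerate`, the `cut` threshold and the 1-D toy points are all assumed names and values.

```python
from itertools import product
from math import dist

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under single, complete or average linkage."""
    pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)                  # closest pair
    if method == "complete":
        return max(pairwise)                  # farthest pair
    if method == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, method, cut):
    """Merge the two closest clusters until the smallest distance exceeds `cut`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        candidates = [
            (linkage_distance(clusters[i], clusters[j], method), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        distance, i, j = min(candidates)
        if distance > cut:
            break  # this is the "cut location" in the tree
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

points = [(0.0,), (1.0,), (3.0,), (5.0,)]  # hypothetical 1-D data
for method in ("single", "complete", "average"):
    print(method, agglomerate(points, method, cut=2.5))
```

With the same cut, single linkage chains all four points into one cluster, while complete and average linkage stop at two. That sensitivity is what the following slides are about.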
There are a number of parameters that are not always known beforehand… and not always clear afterwards.
1. Which features should be used, and how does the inclusion/exclusion of features change results?
2. How do other distance functions change the clustering result? e.g. Euclidean vs Manhattan distance.
3. How many actual clusters are there in the data? e.g. value of K, where to cut the tree in hierarchical clustering.
4. How can we evaluate our results to ensure the clustering is ‘good’, and refine/re-evaluate?
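To make the distance-function question concrete, the following small sketch shows a case where Euclidean and Manhattan distance disagree about which point is closest. The helper names and the three points are hypothetical and chosen only to expose the difference.

```python
def euclidean(p, q):
    """Straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(query, candidates, metric):
    """Return the candidate closest to `query` under the given metric."""
    return min(candidates, key=lambda c: metric(query, c))

query = (0.0, 0.0)
candidates = [(3.0, 3.0), (0.0, 4.5)]

print(nearest(query, candidates, euclidean))  # (3.0, 3.0): about 4.24 vs 4.5
print(nearest(query, candidates, manhattan))  # (0.0, 4.5): 6.0 vs 4.5
```

Since centroid assignment and linkage both rest on such nearest-neighbour decisions, swapping the metric can move points between clusters.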
For text data, the picture is even more muddled…
Features are not so clear, and how we pre-process text data can drastically affect the clustering output.
Stemming, e.g. visualise, visualize, visualization… all reduce to the stem "visual"
Removing stop words, e.g. an, and, the…
Feature extraction, e.g. TF-IDF
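The three pre-processing steps can be sketched in a few lines of plain Python. This is an illustration under loud assumptions, not how Clusterix does it: the stop-word list is tiny, the suffix rule is a toy stand-in for a real stemmer, and the TF-IDF weighting used here (term frequency times log(N / document frequency)) is only one of several common variants.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "to"}  # tiny illustrative list
SUFFIXES = ("ization", "isation", "ize", "ise")     # toy rule, not a real stemmer

def stem(word):
    """Strip one known suffix, e.g. visualise / visualization -> visual."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, split on whitespace, drop stop words, then stem."""
    return [stem(t) for t in text.lower().split() if t not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenised = [preprocess(d) for d in documents]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [  # hypothetical documents
    "the visualization of clusters",
    "visualise the data",
    "an analysis of data and clusters",
]
weights = tf_idf(docs)
print(weights[2])
```

Changing any one of the three steps changes the feature vectors, and therefore the distances the clustering algorithm sees.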
Related Work
What have others been doing?
XCluSim
Focused on cluster stability across different algorithms.
L'Yi, Sehi, et al. "XCluSim: a visual analytics tool for interactively comparing multiple clustering results of bioinformatics data." BMC Bioinformatics 16.11 (2015): 1.
Matchmaker
Compare clustering results and stability.
Lex, Alexander, et al. "Comparative analysis of multidimensional, quantitative data." IEEE Transactions on Visualization and Computer Graphics 16.6 (2010): 1027-1035.
Clustrophile
A tool created this year which supports visual analytics and clustering.
Demiralp, Çagatay. "Clustrophile: A Tool for Visual Clustering Analysis." (2016).
It adds lots of nice statistical tools (forward projection etc.) and multiple projections (PCA, MDS, CMDS, LLE, and t-SNE).
We also support text clustering and our visual representations are a little different, but it is very much along the same lines as we see this work going.
Methodology
Our steps to help solve this problem
We wished to provide a simple interface to enable users to:
1. Load their data
2. Choose features of their data to consider
3. Choose distance measures, vectorizers, processing steps
4. Visualize the results
5. Repeat steps 2-4
Features
File Input: data preview, field selection, field scaling
Vectorisers: Count Vectorizer, Tf-Idf Vectorizer, Hashing Vectorizer
Algorithms: K-Means, Hierarchical Clustering
Visualizations: dimension visualizations, TF-IDF visualization, clustering projections
Full text search for nodes
Brushing and zoom for targeted inspection
Figure (interface panels): file input, algorithm definition (e.g. { a: 2, b: 4, c: 1 }), feature distribution, scatter plot of SVD projection, history of projections.
Examples
To demonstrate the utility of Clusterix on a variety of data, we will look at the following data sets: Wine Quality, Titanic Survivors, and high energy physics (HEP) data.
Wine Quality
From UCI Dataset archive at https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Titanic Survivors
From Kaggle challenge at kaggle.com/c/titanic
HEP data
From Kaggle LHC flavours of physics challenge at kaggle.com/c/flavours-of-physics
Upcoming Work…
Better hierarchical clustering support
Figure (mock-up): dendrogram with search, a cut level control (e.g. cut level 0.3, 4 clusters), and a "view as treemap" option.
Automating parameter finding…
Surely we can automate parameter selection using a minimisation function.
Use parallel coordinates to visualize the data and features, define clusters, and find the parameters that work for a training and test set.
Figure: an expected clustering is compared with the actual clusterings using a purity score (e.g. NMI, RI, F5); select the clustering parameters that yield the best score.
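A minimal sketch of the score-and-select idea, assuming the Rand index (RI) as the agreement score: the user-defined expected clustering is compared with each actual clustering, and the parameter setting with the highest score wins. The candidate labelings and their names ("k=1", "k=2", "k=3") are made up for illustration; in practice each would come from running a clustering algorithm with those parameters.

```python
from itertools import combinations

def rand_index(expected, actual):
    """Fraction of point pairs on which two labelings agree (same vs different cluster)."""
    pairs = list(combinations(range(len(expected)), 2))
    agreements = sum(
        (expected[i] == expected[j]) == (actual[i] == actual[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Expected clustering, e.g. defined interactively by the user.
expected = [0, 0, 0, 1, 1, 1]

# Hypothetical outputs of the same algorithm under different parameters.
candidates = {
    "k=1": [0, 0, 0, 0, 0, 0],
    "k=2": [1, 1, 1, 0, 0, 0],  # same grouping, different label names
    "k=3": [0, 0, 1, 2, 2, 2],
}

scores = {name: rand_index(expected, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The Rand index ignores how clusters are named, so "k=2" scores a perfect 1.0 even though its labels are swapped relative to the expected clustering.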
Finally
Improvements
More projections.
Addition of clustering algorithms, e.g. HDBSCAN, XDBSCAN, etc.
Visualization plugin environment
Define rules to detect content type, and visualise each dimension with an appropriate tool, e.g. maps for geographic data.
Figure (examples): country distribution; word distributions (UK, University, Italia, Universita, England, Fisica).
Scalability
How can we improve performance on the visualization side? How can we display hundreds of thousands of points, or even millions of points?
Maybe we can develop techniques to subsample data better?
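One possible baseline for the subsampling question, offered purely as an assumption and not something the talk proposes, is to sample within each cluster rather than uniformly, so that small clusters do not vanish from the plot. The function name, the `fraction` parameter and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_subsample(points, labels, fraction, seed=0):
    """Keep roughly `fraction` of each cluster, but at least one point per cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for point, label in zip(points, labels):
        by_cluster[label].append(point)
    sample = []
    for label, members in by_cluster.items():
        keep = max(1, round(len(members) * fraction))  # never drop a cluster entirely
        sample.extend((p, label) for p in rng.sample(members, keep))
    return sample

# Hypothetical data: one large cluster and one very small one.
points = [(i, i) for i in range(1000)]
labels = [0] * 990 + [1] * 10
sample = stratified_subsample(points, labels, fraction=0.01)
print(len(sample))
```

A uniform 1% sample of these 1000 points would often miss the 10-point cluster altogether; the per-cluster floor guarantees it stays visible.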
“Visualization can surprise you, but doesn't scale well.
Modelling scales well, but can't surprise you.”
Hadley Wickham
Challenges
Many of the challenges in visual data science will lie in how we merge the complementary powers of statistical techniques and data visualization.
Acknowledgements
Ilias Koutsakis (Univ. of Amsterdam), who did much of the programming work as part of his Bachelor's thesis whilst at CERN.
Gilles Louppe (NYU & CERN), who co-supervised Ilias with me.
Research & Computing Services @ CERN
and the INSPIREHEP.net team.
And those in
github.com/Lilykos/clusterix
Thanks for listening!
Questions? Suggestions?

Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 

Recently uploaded (20)

❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 

Clusterix at VDS 2016

  • 1. Clusterix: A visual analytics approach to clustering. Eamonn Maguire (1), Ilias Koutsakis (1), Gilles Louppe (2). (1) CERN, Geneva, Switzerland; (2) NYU, New York, USA. Visual Data Science Workshop, Baltimore, 2016
  • 2. Overview Here we present some work in progress… I will present: 1. an initial background to clustering; 2. clustering problems; 3. how we’ve tried to solve these problems with a VA approach; 4. some case studies; and 5. ideas for what we can do in the future. We appreciate suggestions, critiques, further use cases, and comments. Clusterix
  • 3. Clusterix Clustering Why is it still a problem? Clustering is a form of unsupervised learning, whereby no classification labels are available in data to be classified, e.g. k-means. Therefore, a clustering algorithm must try to learn the groupings through minimising some distance between elements. Data Features Projection Distance to Centroids Resulting Clusters Final Result
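The k-means pipeline on this slide (distance to centroids, then resulting clusters) can be sketched in a few lines. This is a minimal illustration of the standard Lloyd iteration, not Clusterix's own code; the function name `kmeans`, the explicit `initial_centroids` argument, and the toy points are assumptions made for the example.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd iteration; starting centroids are passed in (an assumption)."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

Passing the starting centroids explicitly keeps the sketch deterministic; real implementations usually pick them randomly or with a seeding heuristic, which is one more parameter that affects the result.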
  • 4. Clusterix Clustering Why is it still a problem? Clustering is a form of unsupervised learning, whereby no classification labels are available in data to be classified, e.g. hierarchical clustering. Data Features Projection Linkage functions (Average Linkage, Complete Linkage, Single Linkage) Cut location Final Result
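To make the effect of the linkage function concrete, here is a small sketch of agglomerative clustering in which the cut is expressed as a target number of clusters. It is a naive illustration under stated assumptions (the names `agglomerate` and `LINKAGES` and the one-dimensional toy data are hypothetical), not the routine used in Clusterix.

```python
import math
from itertools import combinations

# Each linkage reduces the list of point-to-point distances between two clusters.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # clusters hold point indices

    def cluster_distance(pair):
        a, b = pair
        return combine(
            [math.dist(points[i], points[j]) for i in clusters[a] for j in clusters[b]]
        )

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2), key=cluster_distance)
        clusters[a] += clusters[b]  # a < b, so deleting b leaves a in place
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 1-D data where the middle point is claimed by different sides.
points = [(0.0,), (1.0,), (2.6,), (4.5,), (5.0,)]
print(agglomerate(points, 2, "single"))
print(agglomerate(points, 2, "complete"))
```

On this data single linkage attaches the middle point to the left group, while complete linkage attaches it to the right group, so the same cut yields different clusters.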
  • 5. Clusterix Clustering Why is it still a problem? There are a number of parameters that are not always known beforehand… and not always clear afterwards. 1. Which features should be used? 2. How do other distance functions change the clustering result? e.g. how does euclidean vs manhattan distance & how do the inclusion/exclusion of features change results 3. How many actual clusters are there in the data? e.g. value of K, where to cut the tree in hierarchical clustering 4. How can we evaluate our results to ensure the clustering is ‘good’? and refine/re-evaluate
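As a small illustration of question 2, the sketch below shows that the nearest neighbour of a point can change when Euclidean distance is swapped for Manhattan distance. The helper names and the two candidate points are hypothetical and chosen only to make the difference visible.

```python
import math


def euclidean(p, q):
    # Straight-line distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    # Sum of absolute per-feature differences.
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest(query, candidates, distance):
    """Return the name of the candidate closest to query under distance."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


# Hypothetical points: "a" lies on the diagonal, "b" on an axis.
candidates = {"a": (3.0, 3.0), "b": (0.0, 5.0)}
query = (0.0, 0.0)
print(nearest(query, candidates, euclidean))
print(nearest(query, candidates, manhattan))
```

Since clustering algorithms are built on such nearest-neighbour decisions, changing the metric alone can move points between clusters.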
  • 6. Clustering Why is it still a problem? For text data, the picture is even more muddled… Features are not so clear, and how we pre-process text data can drastically affect the clustering output. Stemming, e.g. Visual + ise / ize / ization … Removing stop words, e.g. an, and, the… Feature extraction, e.g. TF-IDF
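The pre-processing steps named on this slide can be sketched as follows. This is a simplified illustration: the stop-word set is a tiny hypothetical list, stemming is omitted, and the weighting uses one common textbook form (term frequency times log of inverse document frequency); libraries differ in the exact formula.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "to"}  # illustrative, not a standard list


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical mini-corpus.
docs = ["the clustering of text", "text and the visualization", "an analysis of clustering"]
vectors = tf_idf(docs)
print(vectors[1])
```

Terms that occur in fewer documents receive higher weights, so changing the stop-word list or adding a stemmer changes the feature vectors that the clustering step sees.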
  • 7. Related Work What have others been doing? XCluSim: Focused on cluster stability across different algorithms. Matchmaker: Compare clustering results and stability. References: Lex, Alexander, et al. "Comparative analysis of multidimensional, quantitative data." IEEE Transactions on Visualization and Computer Graphics 16.6 (2010): 1027-1035. L'Yi, Sehi, et al. "XCluSim: a visual analytics tool for interactively comparing multiple clustering results of bioinformatics data." BMC bioinformatics 16.11 (2015): 1.
  • 8. Related Work What have others been doing? Clustrophile: A tool created this year which supports visual analytics and clustering. Demiralp, Çagatay. "Clustrophile: A Tool for Visual Clustering Analysis." (2016). We also support text clustering, and our visual representations are a little different, but it is very much operating along the same lines we see this work going. Lots of nice statistical tools for forward projection etc. added. Multiple projections etc. added (PCA, MDS, CMDS, LLE, and t-SNE)
  • 9. Clusterix Our steps to help solve this problem?
  • 10. Methodology Clusterix We wished to provide a simple interface to enable users to: 1. Load their data 2. Choose features of their data to consider 3. Choose distance measures, vectorizers, processing steps 4. Visualize the results 5. Repeat steps 2-4
  • 11. Features Clusterix File Input; Data Preview; Field selection; Field Scaling; Vectorisers (Count Vectorizer, Tf-Idf Vectorizer, Hashing Vectorizer); Algorithms (K-Means, Hierarchical Clustering); Full text search for nodes; Brushing and zoom for targeted inspection; Visualizations (Dimension Visualizations, TF-IDF Visualization, Clustering projections); { a: 2, b: 4, c:1 }
  • 13. Examples Clusterix To demonstrate the utility of Clusterix on a variety of data, we will look at the following data sets: Wine Quality Data Set, Titanic Survivors, High energy physics
  • 14. Examples Clusterix Wine Quality From UCI Dataset archive at https://archive.ics.uci.edu/ml/datasets/Wine+Quality
  • 15. Clusterix Examples Wine Quality From UCI Dataset archive at https://archive.ics.uci.edu/ml/datasets/Wine+Quality
  • 17. HEP data Examples Clusterix From Kaggle LHC flavours of physics challenge at kaggle.com/c/flavours-of-physics
  • 18. HEP data Examples Clusterix From Kaggle LHC flavours of physics challenge at kaggle.com/c/flavours-of-physics
  • 20. Clusterix Better hierarchical clustering support (interface labels: Cut Level 0.3; 4 Clusters; View as treemap; Dendrogram; Search)
  • 21. Clusterix Surely we can automate parameter selection using a minimisation function. Use parallel coordinates to visualize the data and features, define clusters, and find the parameters that work for a training and test set. Automating parameter finding…
  • 22. Clusterix Expected clustering Purity score e.g. NMI, RI, F5 Actual clusterings … Select clustering parameters that yield the best score Automating parameter finding…
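The selection step on this slide (score each actual clustering against the expected one, then keep the parameters with the best score) can be sketched with the Rand index, one of the scores listed. The candidate labelings and their parameter names below are hypothetical stand-ins for real clustering runs.

```python
from itertools import combinations


def rand_index(expected, actual):
    """Fraction of point pairs on which the two labelings agree.

    A pair agrees when both labelings put the two points together,
    or both put them apart. Label values themselves do not matter.
    """
    pairs = list(combinations(range(len(expected)), 2))
    agreements = sum(
        (expected[i] == expected[j]) == (actual[i] == actual[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical expected clustering and results from three parameter settings.
expected = [0, 0, 0, 1, 1, 1]
candidates = {
    "k=1": [0, 0, 0, 0, 0, 0],
    "k=2": [1, 1, 1, 0, 0, 0],
    "k=3": [0, 0, 1, 2, 2, 2],
}

scores = {name: rand_index(expected, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In a real search the candidate labelings would come from running the clustering with each parameter combination, and the expected clustering would be the user-defined grouping mentioned on the previous slide.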
  • 23. Finally Clusterix Improvements: More projections. Addition of clustering algorithms, e.g. HDBSCAN, XDBSCAN, etc. Visualization plugin environment: Define rules to detect content type, and visualise each dimension with an appropriate tool, e.g. maps for geographic data (figure labels: Country Distribution; Word Distributions; UK, University, Italia, Universita, England, Fisica). Scalability: How can we improve performance on the visualization side? How can we display 100s of 1000s of points, or even millions of points? Maybe we can develop techniques to subsample data better?
  • 24. Challenges “Visualization can surprise you, but doesn't scale well. Modelling scales well, but can't surprise you.” Hadley Wickham. Many of the challenges in visual data science will lie in how we merge the complementary powers of statistical techniques and data visualization.
  • 25. Clusterix Acknowledgements Ilias Koutsakis (Univ. of Amsterdam), who did much of the programming work as part of his Bachelor's thesis whilst at CERN. Gilles Louppe (NYU & CERN), who co-supervised Ilias with me. Research & Computing Services @ CERN and the INSPIREHEP.net team. And those in github.com/Lilykos/clusterix