Clusterix at VDS 2016

Clusterix
A visual analytics approach to clustering
Eamonn Maguire (1), Ilias Koutsakis (1), Gilles Louppe (2)
(1) CERN, Geneva, Switzerland
(2) NYU, New York, USA
Visual Data Science Workshop, Baltimore, 2016

Overview
Here we present some work in progress…
I will present:
1. an initial background to clustering;
2. clustering problems;
3. how we’ve tried to solve these problems with a visual analytics (VA) approach;
4. some case studies; and
5. ideas for what we can do in the future.
We appreciate suggestions, critiques, further use cases, and
comments.

Clustering
Why is it still a problem?
Clustering is a form of unsupervised learning: no classification labels are available for the data to be grouped. k-means is one example.
A clustering algorithm must therefore try to learn the groupings by minimising some distance between elements.
[Figure: k-means pipeline: Data → Features → Projection → Distance to Centroids → Resulting Clusters → Final Result]
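The k-means idea above (assign each element to its nearest centroid, then move each centroid to the mean of its members) can be sketched in a few lines. This is a minimal, illustrative version of Lloyd's algorithm in plain Python, not the Clusterix implementation; the function name `kmeans`, the toy points, the random initialisation, and the fixed seed are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initialise centroids by sampling k distinct input points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Toy data: two well-separated groups in 2D.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
```

Note that `k` has to be supplied up front, which is one of the parameters that is not always known beforehand.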

Clustering
Clustering is a form of unsupervised learning: no classification labels are available for the data to be grouped. Hierarchical clustering is another example.
[Figure: hierarchical clustering pipeline: Data → Features → Projection → Linkage functions (Average Linkage, Complete Linkage, Single Linkage) → Cut location → Final Result]
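To make the role of the linkage function and the cut location concrete, here is a small sketch of agglomerative clustering in plain Python. It is a naive illustration under assumed names (`linkage_distance`, `agglomerate`, and the `cut` parameter are hypothetical), not the code used in Clusterix. Merging stops once the closest pair of clusters is farther apart than the cut, which for these three linkages is equivalent to cutting the tree at that height.

```python
import math


def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points) under a linkage."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # closest pair of points
        return min(pairwise)
    if method == "complete":    # farthest pair of points
        return max(pairwise)
    if method == "average":     # mean over all pairs of points
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {method}")


def agglomerate(points, method, cut):
    """Merge the two closest clusters until the closest pair is farther than `cut`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        if linkage_distance(clusters[i], clusters[j], method) > cut:
            break  # every remaining merge lies above the cut
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# A chain of points spaced 1 apart, plus one isolated point.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
single = agglomerate(points, "single", cut=1.5)
complete = agglomerate(points, "complete", cut=1.5)
```

With the same cut, single linkage chains the first four points into one cluster, while complete linkage splits them into two pairs, so the choice of linkage alone changes the number of clusters.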

Clustering
There are a number of parameters that are not always known beforehand, and not always clear afterwards.
1. Which features should be used? How does including or excluding features change the results?
2. How do other distance functions change the clustering result, e.g. Euclidean vs Manhattan distance?
3. How many actual clusters are there in the data? e.g. the value of K, or where to cut the tree in hierarchical clustering.
4. How can we evaluate our results to ensure the clustering is ‘good’, and then refine and re-evaluate?
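As a small illustration of the distance-function question, the sketch below shows a point whose nearest centroid changes depending on whether Euclidean or Manhattan distance is used. The helper names and the coordinates are made up for the example.

```python
def euclidean(p, q):
    """Straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest(point, centroids, distance):
    """Index of the centroid closest to `point` under `distance`."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


point = (0.0, 0.0)
centroids = [(3.0, 3.0), (0.0, 5.0)]

by_euclidean = nearest(point, centroids, euclidean)  # sqrt(18) vs 5
by_manhattan = nearest(point, centroids, manhattan)  # 6 vs 5
```

Here the point is closer to the first centroid under Euclidean distance and to the second under Manhattan distance, so the same data can end up in different clusters.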

Clustering
For text data, the picture is even more muddled…
Features are not so clear, and how we pre-process text data can drastically affect the clustering output.
Stemming
e.g. Visualise, Visualize, Visualization, … share the stem "Visual"
Removing stop words
e.g. an, and, the…
Feature extraction
e.g. TF-IDF
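The three pre-processing steps named above can be sketched together as follows. Everything here is a deliberately simplified assumption for illustration: the stop-word list is tiny, `stem` is a crude suffix stripper rather than a real stemming algorithm, and `tf_idf` uses the bare formula tf × log(N / df).

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of"}  # assumption: tiny illustrative list
SUFFIXES = ("ization", "ise", "ize")          # assumption: crude suffix stripping only


def stem(word):
    """Strip one known suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [preprocess(d) for d in documents]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the visualization of clusters",
    "visualise the data and the clusters",
    "an analysis of the data",
]
weights = tf_idf(docs)
```

Changing any one step (a different stop-word list, no stemming) changes the feature vectors and therefore the clusters. Library vectorizers such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation by default, so their weights will differ from this bare formula.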

Related Work
What have others been doing?
XCluSim
Focused on cluster stability across different algorithms.
L'Yi, Sehi, et al. "XCluSim: a visual analytics tool for interactively comparing multiple clustering results of bioinformatics data." BMC Bioinformatics 16.11 (2015): 1.
Matchmaker
Compares clustering results and stability.
Lex, Alexander, et al. "Comparative analysis of multidimensional, quantitative data." IEEE Transactions on Visualization and Computer Graphics 16.6 (2010): 1027-1035.

Related Work
What have others been doing?
Clustrophile
A tool created this year which supports
visual analytics and clustering.
Demiralp, Çagatay. "Clustrophile: A Tool for Visual
Clustering Analysis." (2016).
We also support text clustering, and our visual representations are a little different, but it’s very much operating along the same lines we see this work going.
Lots of nice statistical tools have been added, e.g. for forward projection.
Multiple projections have been added (PCA, MDS, CMDS, LLE, and t-SNE).

Our steps to help solve this problem?

Methodology
We wished to provide a simple interface to enable users to:
1. Load their data
2. Choose features of their data to consider
3. Choose distance measures, vectorizers, processing steps
4. Visualize the results
5. Repeat steps 2-4

Features
File Input: Data Preview, Field selection, Field Scaling
Vectorizers: Count Vectorizer, Tf-Idf Vectorizer, Hashing Vectorizer
Algorithms: K-Means, Hierarchical Clustering
Visualizations: Full text search for nodes, Brushing and zoom for targeted inspection, Dimension Visualizations, TF-IDF Visualization, Clustering projections
[Figure: interface panels: File Input; Algorithm Definition; Feature Distribution; Scatter plot of SVD projection; History of projections]

Examples
To demonstrate the utility of Clusterix on a variety of data, we will look at the following data sets.
Wine Quality Data Set
Titanic Survivors
High energy physics

Examples
Wine Quality
From the UCI dataset archive at https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Examples
Titanic Survivors
From the Kaggle challenge at kaggle.com/c/titanic

Examples
HEP data
From the Kaggle LHC flavours of physics challenge at kaggle.com/c/flavours-of-physics

Better hierarchical clustering support
[Figure: mock-up with controls: Cut Level 0.3 (4 Clusters); View as treemap; Dendrogram; Search]

Automating parameter finding…
Surely we can automate parameter selection using a minimisation function.
Use parallel coordinates to visualize the data and features, define clusters, and find the parameters that work for a training and test set.

Automating parameter finding…
[Figure: the expected clustering is compared with each of the actual clusterings … using a purity score, e.g. NMI, RI, F5]
Select the clustering parameters that yield the best score.
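The selection loop sketched on this slide (score each actual clustering against the expected one, keep the parameters with the best score) might look like the following. The Rand index (RI) is one of the scores named above; `threshold_clusters` is a hypothetical stand-in for a real clustering algorithm, and the values, expected labels, and candidate parameters are invented for the example.

```python
from itertools import combinations


def rand_index(expected, actual):
    """Fraction of point pairs on which two labelings agree (same vs different cluster)."""
    pairs = list(combinations(range(len(expected)), 2))
    agree = sum(
        (expected[i] == expected[j]) == (actual[i] == actual[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def threshold_clusters(values, gap):
    """Hypothetical clusterer: start a new cluster when sorted 1D values jump by more than `gap`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, idx in zip(order, order[1:]):
        if values[idx] - values[prev] > gap:
            current += 1
        labels[idx] = current
    return labels


values = [0.0, 0.4, 0.9, 5.0, 5.2, 9.0]
expected = [0, 0, 0, 1, 1, 2]   # clusters a user might define by hand

candidates = [0.3, 1.0, 5.0]    # parameter values to try
scores = {gap: rand_index(expected, threshold_clusters(values, gap)) for gap in candidates}
best_gap = max(scores, key=scores.get)
```

A fuller version would score the chosen parameters on a held-out test set as well, in line with the training and test split mentioned earlier.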

Finally
Improvements
More projections.
Addition of clustering algorithms, e.g. HDBSCAN, XDBSCAN, etc.
Visualization plugin environment
Define rules to detect content type, and visualise each dimension with an appropriate tool, e.g. maps for geographic data.
[Figure: example plugins: Country Distribution (UK, Italia, England) and Word Distributions (University, Universita, Fisica)]
Scalability
How can we improve performance on the visualization side? How can we display hundreds of thousands of points, or even millions of points?
Maybe we can develop techniques to subsample data better?
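As a baseline for the subsampling question raised under Scalability, uniform reservoir sampling keeps a fixed-size random sample from a data set of any length in a single pass. This is only a simple starting point offered as a sketch, not a technique proposed in the talk; the function name and sizes are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from an iterable of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# 100,000 synthetic "points" reduced to a sample small enough to plot.
sample = reservoir_sample(range(100_000), k=1000)
```

Uniform sampling can thin out or miss very small clusters, which is one reason better subsampling techniques may be needed.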

Challenges
“Visualization can surprise you, but doesn't scale well.
Modelling scales well, but can't surprise you.”
Hadley Wickham
Many of the challenges in visual data science will lie in how we merge the complementary powers of statistical techniques and data visualization.

Acknowledgements
Ilias Koutsakis (Univ. of Amsterdam), who did much of the programming work as part of his Bachelor's thesis whilst at CERN.
Gilles Louppe (NYU & CERN), who co-supervised Ilias with me.
Research & Computing Services @ CERN and the INSPIREHEP.net team.
And those in github.com/Lilykos/clusterix

github.com/Lilykos/clusterix
Thanks for listening!
Questions? Suggestions?
