1. Clusterix
A visual analytics approach to clustering
Eamonn Maguire (1), Ilias Koutsakis (1), Gilles Louppe (2)
(1) CERN, Geneva, Switzerland
(2) NYU, New York, USA
Visual Data Science Workshop, Baltimore, 2016
2. Overview
Here we present some work in progress…
I will present:
1. an initial background to clustering;
2. clustering problems;
3. how we’ve tried to solve these problems with a VA approach;
4. some case studies; and
5. ideas for what we can do in the future.
We appreciate suggestions, critiques, further use cases, and
comments.
3. Clusterix
Clustering
Why is it still a problem?
Clustering is a form of unsupervised learning: no classification labels are available for the data to be grouped. One example is k-means.
A clustering algorithm must therefore try to learn the groupings by minimising some distance between elements.
[Figure: k-means steps: Data, Features, Projection, Distance to Centroids, Resulting Clusters, Final Result]
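As a rough illustration of the k-means steps named above, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) in plain Python. It is not Clusterix's own code; the function name `kmeans`, the fixed iteration count, and the toy points are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Toy data (made up): two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Note that K has to be supplied up front, which is exactly the kind of parameter that is not always known beforehand.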
4. Clusterix
Clustering
Why is it still a problem?
Clustering is a form of unsupervised learning: no classification labels are available for the data to be grouped. Another example is hierarchical clustering.
[Figure: hierarchical clustering steps: Data, Features, Projection, Linkage functions (Average Linkage, Complete Linkage, Single Linkage), Cut location, Final Result]
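To make the three linkage functions concrete, the sketch below computes the distance between two small clusters under each of them. The helper names and the toy clusters are assumptions made for the example; the definitions themselves are the standard ones (single = smallest pairwise distance, complete = largest, average = mean).

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between a point in a and a point in b."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    # Distance between the two closest members.
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    # Distance between the two farthest members.
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    # Mean distance over all cross-cluster pairs.
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Toy clusters (made up) on a line, so the numbers are easy to check.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 5.0
print(average_linkage(left, right))   # 3.5
```

The same two clusters are 2.0, 3.5 or 5.0 apart depending on the linkage, so the order in which clusters merge, and therefore where the tree can sensibly be cut, depends on this choice.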
5. Clusterix
Clustering
Why is it still a problem?
There are a number of parameters that are not always known beforehand (e.g. the value of K, or where to cut the tree in hierarchical clustering), and not always clear afterwards.
1. Which features should be used?
2. How do other distance functions change the clustering result? (e.g. Euclidean vs Manhattan distance, and how does the inclusion/exclusion of features change the results?)
3. How many actual clusters are there in the data?
4. How can we evaluate our results to ensure the clustering is ‘good’, and then refine and re-evaluate?
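A small worked example of the distance-function question: with the made-up points below, the nearest neighbour of the same query point changes when we switch from Euclidean to Manhattan distance. The point names `p` and `q` and the helper `nearest` are assumptions made for the example.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(query, candidates, distance):
    """Name of the candidate closest to the query under `distance`."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


query = (0.0, 0.0)
candidates = {"p": (3.0, 3.0), "q": (0.0, 5.0)}

# Euclidean: p is about 4.24 away, q is 5.0 away.
print(nearest(query, candidates, euclidean))  # p
# Manhattan: p is 6.0 away, q is 5.0 away.
print(nearest(query, candidates, manhattan))  # q
```

If nearest neighbours can flip like this, cluster assignments can too, which is why we want to compare distance functions side by side.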
6. Clustering
Why is it still a problem?
For text data, the picture is even more muddled…
Features are not so clear, and how we pre-process text data can drastically affect the clustering output.
Stemming
e.g. visualise, visualize, visualization, … all reduce to the stem "visual"
Removing stop words
e.g. an, and, the…
Feature extraction
e.g. TF-IDF
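The three pre-processing steps above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the stop-word list is tiny, the suffix-stripping `stem` is a toy stand-in for a real stemmer, and the TF-IDF variant is the textbook one (term frequency times log(N / document frequency)); libraries such as scikit-learn use smoothed variants that give different numbers.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "is"}  # tiny illustrative list
SUFFIXES = ("ization", "isation", "ize", "ise")     # toy stemming rules


def stem(word):
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


def tf_idf(documents):
    """One {term: weight} dict per document."""
    tokenised = [preprocess(doc) for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the visualization of clusters", "clusters and the data"]
print(preprocess(docs[0]))
print(tf_idf(docs))
```

A term that appears in every document ("clusters" here) gets weight zero, while a term unique to one document keeps a positive weight. Changing the stop-word list or the stemming rules changes the vectors, and with them the clustering.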
7. Related Work
What have others been doing?
XCluSim
Focused on cluster stability across different algorithms.
L'Yi, Sehi, et al. "XCluSim: a visual analytics tool for interactively comparing multiple clustering results of bioinformatics data." BMC Bioinformatics 16.11 (2015): 1.
Matchmaker
Compares clustering results and stability.
Lex, Alexander, et al. "Comparative analysis of multidimensional, quantitative data." IEEE Transactions on Visualization and Computer Graphics 16.6 (2010): 1027-1035.
8. Related Work
What have others been doing?
Clustrophile
A tool created this year which supports visual analytics and clustering.
Demiralp, Çagatay. "Clustrophile: A Tool for Visual Clustering Analysis." (2016).
We also support text clustering, and our visual representations are a little different, but it is very much operating along the same lines as where we see this work going.
Lots of nice statistical tools added, e.g. forward projection.
Multiple projections added (PCA, MDS, CMDS, LLE, and t-SNE).
10. Methodology
Clusterix
We wished to provide a simple interface to enable users to:
1. Load their data
2. Choose features of their data to consider
3. Choose distance measures, vectorizers, processing steps
4. Visualize the results
5. Repeat steps 2-4
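The "repeat steps 2-4" loop can be sketched as enumerating configurations and recomputing the result for each one. This is only an illustration, not how Clusterix is implemented: the records, field names, and the `closest_pair` summary (a stand-in for the full clustering and visualization step) are all assumptions made for the example.

```python
from itertools import combinations, product

# Step 1: load the data (a made-up in-memory table here).
records = [
    {"x": 0.0, "y": 0.0, "z": 9.0},
    {"x": 1.0, "y": 0.0, "z": 0.0},
    {"x": 0.0, "y": 3.0, "z": 8.0},
]

DISTANCES = {
    "euclidean": lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5,
    "manhattan": lambda a, b: sum(abs(p - q) for p, q in zip(a, b)),
}


def closest_pair(records, fields, distance):
    """Indices of the two records that are closest under this configuration."""
    rows = [tuple(record[field] for field in fields) for record in records]
    return min(
        combinations(range(len(rows)), 2),
        key=lambda pair: distance(rows[pair[0]], rows[pair[1]]),
    )


# Steps 2-4, repeated: pick features, pick a distance, inspect the result.
feature_sets = [("x", "y"), ("x", "y", "z")]
results = {}
for fields, name in product(feature_sets, DISTANCES):
    results[(fields, name)] = closest_pair(records, fields, DISTANCES[name])
    print(fields, name, results[(fields, name)])
```

With only `x` and `y`, records 0 and 1 are the closest pair; once `z` is included, records 0 and 2 are. An interactive interface makes this kind of comparison cheap to repeat.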
11. Features
Clusterix
File Input
Data Preview
Field selection
Field Scaling
Vectorisers: Count Vectorizer, Tf-Idf Vectorizer, Hashing Vectorizer
Algorithms: K-Means, Hierarchical Clustering
Full text search for nodes
Brushing and zoom for targeted inspection
Visualizations: Dimension Visualizations, TF-IDF Visualization, Clustering projections
{
a: 2, b: 4, c:1
}
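A term-count mapping like the one above is the kind of output a count vectoriser produces. The sketch below contrasts it with a hashing vectoriser, which skips the vocabulary and maps tokens straight into a fixed number of buckets. Both functions are simplified stand-ins written for this example, not the scikit-learn implementations; using `zlib.crc32` as the hash and 8 buckets are assumptions.

```python
import zlib
from collections import Counter


def count_vector(tokens):
    """Count vectoriser: map each distinct token to its number of occurrences."""
    return dict(Counter(tokens))


def hashing_vector(tokens, n_buckets=8):
    """Hashing vectoriser: fixed-length vector, no vocabulary is stored."""
    vector = [0] * n_buckets
    for token in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[bucket] += 1
    return vector


tokens = ["b", "a", "b", "c", "b", "a", "b"]
print(count_vector(tokens))
print(hashing_vector(tokens))
```

The count vector grows with the vocabulary but stays readable; the hashed vector has a fixed size, at the cost of possible collisions and of no longer knowing which token a bucket stands for.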
21. Clusterix
Automating parameter finding…
Surely we can automate parameter selection using a minimisation function.
Use parallel coordinates to visualize the data and features, define clusters, and find the parameters that work for a training and test set.
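One way this idea could look in code, as a sketch only: treat the user-defined clusters as labels on a training set, try each combination of feature subset and distance function, and keep the combination that minimises a cost. The cost used here (fraction of samples whose nearest cluster centroid is not their own) is an assumption chosen for simplicity, as are the field names `signal` and `noise` and all the data.

```python
from itertools import product


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def fit_centroids(samples, fields):
    """Mean of each user-defined cluster, restricted to the chosen fields."""
    groups = {}
    for record, label in samples:
        groups.setdefault(label, []).append(tuple(record[f] for f in fields))
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in groups.items()
    }


def misassigned(samples, fields, distance, centroids):
    """Cost: fraction of samples whose nearest centroid is not their own."""
    wrong = 0
    for record, label in samples:
        row = tuple(record[f] for f in fields)
        nearest = min(centroids, key=lambda c: distance(row, centroids[c]))
        wrong += nearest != label
    return wrong / len(samples)


def search(train, feature_sets, distances):
    """Exhaustive search for the (cost, fields, distance name) with lowest cost."""
    best = None
    for fields, (name, distance) in product(feature_sets, distances.items()):
        cost = misassigned(train, fields, distance, fit_centroids(train, fields))
        if best is None or cost < best[0]:
            best = (cost, fields, name)
    return best


# Made-up data: "signal" separates the clusters, "noise" does not.
train = [
    ({"signal": 0.0, "noise": 60.0}, "A"),
    ({"signal": 1.0, "noise": -40.0}, "A"),
    ({"signal": 10.0, "noise": -60.0}, "B"),
    ({"signal": 11.0, "noise": 40.0}, "B"),
]
test = [
    ({"signal": 0.5, "noise": -30.0}, "A"),
    ({"signal": 10.5, "noise": 30.0}, "B"),
]
feature_sets = [("signal", "noise"), ("noise",), ("signal",)]
distances = {"euclidean": euclidean, "manhattan": manhattan}

best_cost, best_fields, best_distance = search(train, feature_sets, distances)
test_cost = misassigned(
    test, best_fields, distances[best_distance], fit_centroids(train, best_fields)
)
print(best_fields, best_distance, best_cost, test_cost)
```

Exhaustive search only works for a handful of candidates; a real minimisation function over many parameters would need a smarter search strategy.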
23. Finally
Clusterix
Improvements
More projections.
Addition of clustering algorithms, e.g. HDBSCAN, XDBSCAN, etc.
Visualization plugin environment
Define rules to detect content type, and visualise each dimension with an appropriate tool, e.g. maps for geographic data.
Scalability
How can we improve performance on the visualization side? How can we display hundreds of thousands, or even millions, of points?
Maybe we can develop techniques to subsample data better?
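As one possible direction for the subsampling question (an illustration, not something we have implemented): if cluster labels from a previous run are available, sampling per cluster rather than uniformly keeps small clusters visible in the plot. The function name, the `min_per_cluster` parameter, and the toy labels are assumptions made for the example.

```python
import random


def stratified_subsample(labels, fraction, min_per_cluster=1, seed=0):
    """Indices of a subsample that keeps every cluster represented."""
    rng = random.Random(seed)
    by_cluster = {}
    for index, label in enumerate(labels):
        by_cluster.setdefault(label, []).append(index)
    keep = []
    for indices in by_cluster.values():
        # Sample the requested fraction, but never fewer than the minimum.
        quota = max(min_per_cluster, round(len(indices) * fraction))
        keep.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(keep)


# Made-up labels: one large cluster and one very small one.
labels = ["big"] * 1000 + ["small"] * 5
kept = stratified_subsample(labels, fraction=0.01)
print(len(kept))
```

A uniform 1% sample of these 1005 points would often miss the small cluster entirely; the per-cluster minimum guarantees at least one of its points is drawn.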
[Figure: Country Distribution (UK, Italia, England) and Word Distributions (University, Universita, Fisica)]
24. Challenges
“Visualization can surprise you, but doesn't scale well. Modelling scales well, but can't surprise you.”
Hadley Wickham
Many of the challenges in visual data science will lie in how we merge the complementary powers of statistical techniques and data visualization.
25. Clusterix
Acknowledgements
Ilias Koutsakis Univ. of Amsterdam
Who did much of the programming work as part of his Bachelor's thesis whilst at CERN.
Gilles Louppe NYU & CERN
Who co-supervised Ilias with me.
Research & Computing Services @ CERN
and the INSPIREHEP.net team.
And those in
github.com/Lilykos/clusterix