Network Flow Pattern Extraction by Clustering
Eugine Kang, Seojoon Jin
Korea University
Abstract
A network is a system that allows users to connect to the internet. Although
the internet seems ever-growing and boundless, the infrastructure that
provides it has physical limitations. Network flow analysis tools are
developed for efficient network management. This paper discusses a statistical
learning method for efficient network management. Network flow data is first
transformed by time block, followed by calculating the Within Sum of Squares
(WSS) and the Calinski-Harabasz index (CH) to find the optimal number of
clusters. The clustering algorithm used in this paper is k-means clustering.
Principal Component Analysis (PCA) allows dimension reduction of the
data set to visually confirm the clusters by the relative position of
observations. The stability of the clusters is measured by the mean
Jaccard coefficient over 10,000 bootstrap resampled data sets.
Keywords: Calinski-Harabasz index, principal component analysis, Jaccard
coefficient, bootstrap, k-means clustering
Introduction
Data is collected from all fields and the amount of data collected
is exponentially growing. Richer datasets allow researchers and entities
to extract meaningful knowledge with the use of machine learning
algorithms. In many business and engineering applications, raw data
collected from systems needs to be processed into various forms of
structured and semantically rich records before data analysis tasks can
produce accurate, useful, and understandable results. However, difficulty
arises when the raw data is hard to understand and manipulate. This paper
walks through the network flow data collected at Korea University Sejong
Campus and transforms the data to extract meaningful knowledge. This project
first started with the goal of increasing network flow efficiency within the
campus. Each flow is labeled with the type of application, and minimizing or
detecting the use of non-academic flows was essential. Torrent is a
peer-to-peer (P2P) file-sharing service commonly used for downloading audio
and video files. The campus labels torrent as a non-academic flow, and a
deeper analysis of the torrent flow within the campus is the main purpose of
this paper.
Recent work on network flow analysis has been conducted on a
time basis but fails to produce meaningful patterns from the users.
References to previous and similar studies, with explanations, will be
added to this section at a later date.
The rest of the paper is organized as follows. We first give an
overview of the data collected from the network. We then discuss the
data transformation steps, which include constructing time-based patterns and
parsing each pattern into statistical features. We report the
experiments of using clustering techniques on the network flow data. We
then describe how to utilize the knowledge extracted from experiments.
Collected Data from Campus Network
Figure 1: Diagram of network flow data collection
Korea University Sejong Campus data center collects each IP's
network flow information on a minute-by-minute basis. There are about
5,000 IPs designated for desktop connections and a separate data center
for collecting mobile device connections through Wi-Fi. The types of
information collected can be seen in Table 1. The service and protocol level
features tell us the type of service being used on the local IP address. The
type of service varies from web browsing, video streaming, and e-mail to
torrent, which is the type of service we are interested in for this paper.
The other features in Table 1 collect information on the connection to the
local IP. The network flow information is stored in a table-like structure,
e.g., a relational database. However, the size of the entire network flow
exceeds 3 TB and requires a different type of database to store and extract
data. The Hadoop Distributed File System (HDFS) was used to store this
enormous amount of data, and applications were used accordingly to extract it.
Table 1: Network connection features

Feature Name | Feature Meaning                      | Feature Name | Feature Meaning
flow_I       | flow index                           | o_octet      | outbound byte count
con_I        | continue index (flow's parent index) | i_pkt        | inbound pkt count
sip          | local ip                             | i_octet      | inbound byte count
sport        | local port                           | flag         | flag information
prot         | transmission protocol                | sr_lvl1      | service level 1
dip          | remote ip                            | sr_lvl2      | service level 2
dport        | remote port                          | sr_lvl3      | service level 3
time         | time of flow start                   | pt_lvl1      | protocol level 1
o_pkt        | outbound pkt count                   | pt_lvl2      | protocol level 2
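As a rough sketch of the extraction step described above, the snippet below assumes the flow records sit on HDFS as delimited text files with the column names of Table 1 and that a Spark installation is available; the HDFS paths and the service-level value used to mark torrent flows are illustrative assumptions, not the campus configuration.

# Sketch: pulling torrent-labeled flow records out of HDFS with PySpark.
# Paths and the torrent label value are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("torrent-flow-extract").getOrCreate()

flows = (spark.read
         .option("header", True)
         .csv("hdfs:///netflow/*.csv"))                  # hypothetical location

torrent = (flows
           .where(F.col("sr_lvl1") == "torrent")         # assumed label value
           .select("sip", "time", "o_octet", "i_octet"))  # columns from Table 1

torrent.write.mode("overwrite").parquet("hdfs:///netflow/torrent_flows")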
Data Transformation
The bandwidth of the campus network is limited at any given time. We are more
interested in torrent usage according to time blocks. Detecting torrent usage
patterns by time will allow the network administrator to prepare for heavy
loads of network use at peak times. The data transformation step parses the
data by weekday and hour into per-flow incoming and outgoing data amounts.
Figure 2: Heatmap visualization of data transformation
Figure 2 shows one local IP's torrent flow usage over time. The horizontal
axis represents the weekday and the vertical axis the hour of day. The red
regions indicate relatively heavy usage compared to the green regions.
Through this data transformation process we are able to detect patterns in
local IP torrent usage by time block. This information will be helpful for
the network administrator in preventing heavy bandwidth loading caused by
torrent usage.
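A minimal sketch of this transformation, assuming the extracted torrent flows are loaded into a pandas DataFrame with the sip, time, o_octet, and i_octet columns of Table 1 (the file name and variable names are illustrative): each local IP is summarized into a 7 x 24 = 168-dimensional vector of bytes transferred per weekday-hour block.

# Sketch: aggregate per-flow records into a weekday x hour usage matrix per local IP.
import pandas as pd

flows = pd.read_parquet("torrent_flows.parquet")            # hypothetical file
flows["time"] = pd.to_datetime(flows["time"])
flows["weekday"] = flows["time"].dt.dayofweek               # 0 = Monday, ..., 6 = Sunday
flows["hour"] = flows["time"].dt.hour
flows["octets"] = flows["o_octet"].astype(float) + flows["i_octet"].astype(float)

# One row per local IP, one column per (weekday, hour) block: 7 x 24 = 168 features.
usage = flows.pivot_table(index="sip",
                          columns=["weekday", "hour"],
                          values="octets",
                          aggfunc="sum",
                          fill_value=0)
print(usage.shape)                                          # (number of local IPs, 168)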
Experiment Technique
k-means clustering
k-means clustering is a method of vector quantization that is popular for
cluster analysis in data mining. The goal of the algorithm is to partition n
observations into k clusters in which each observation belongs to the cluster
with the nearest mean. In this paper each observation is a local IP, and we
observe how the local IPs partition.
The algorithm uses an iterative refinement technique. Given a randomly
initialized set of k means m_1^{(1)}, ..., m_k^{(1)}, the algorithm proceeds
by iterating two steps, an assignment step and an update step. The assignment
step assigns each observation to the cluster with the nearest centroid:
equation 1 assigns each observation x_p to exactly one S_i^{(t)}. Once every
observation is assigned to a cluster, the update step recalculates the
centroid of each cluster from the newly assigned observations: equation 2
averages the features of the observations in each cluster to generate a new
centroid.

$$ S_i^{(t)} = \left\{ x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \ \forall j,\ 1 \le j \le k \right\} \qquad (1) $$

$$ m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j \qquad (2) $$
Given the number of clusters to be partitioned, the objective of the
algorithm is to find the assignment in equation 3, i.e., the cluster
configuration which minimizes the within sum of squares (WSS). WSS is also
used to determine the appropriate number of clusters, but it is inappropriate
on its own because the WSS value naturally decreases as the number of
clusters increases.

$$ \underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - m_i \rVert^2 \qquad (3) $$
Calinski-Harabasz Index
k-means clustering is a fast and efficient clustering technique. However,
one drawback is the pre-determination of k before the algorithm proceeds. To
optimize the clustering results, WSS is a common score used to determine
k. As mentioned above, the nature of the decreasing WSS value as k increases
makes it difficult to optimize. The Calinski-Harabasz index is well suited to
finding the optimal number of clusters for k-means clustering solutions.
The Calinski-Harabasz index is also called the variance ratio criterion.
Equation 4 calculates the variance between clusters and equation 5 the
variance between observations within a cluster. Well-defined clusters have a
large between-cluster variance and a small within-cluster variance. The
larger the Calinski-Harabasz index, the better the data partition. To
determine the optimal number of clusters, the best practice is to maximize
equation 6 with respect to k, where n_i is the number of observations in
cluster i, m is the overall mean, and N is the total number of observations.

$$ SS_B = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2 \qquad \text{(overall between-cluster variance)} \qquad (4) $$

$$ SS_W = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - m_i \rVert^2 \qquad \text{(overall within-cluster variance)} \qquad (5) $$

$$ CH = \frac{SS_B}{SS_W} \cdot \frac{N - k}{k - 1} \qquad (6) $$
In theory, determining the k which maximizes the Calinski-Harabasz index
results in the best-partitioned clustering. However, in practice this number
of clusters tends to be lower than the number needed for distinct pattern
analysis.
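The Calinski-Harabasz index of equations 4-6 can be computed directly from a fitted partition. The sketch below does so by hand and checks the value against scikit-learn's calinski_harabasz_score; the usage matrix and labels are assumed from the earlier sketches.

# Sketch: Calinski-Harabasz index from equations 4-6.
import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.asarray(usage, dtype=float)
N = X.shape[0]
k = int(labels.max()) + 1
overall_mean = X.mean(axis=0)

ss_b = 0.0                                    # equation 4: between-cluster variance
ss_w = 0.0                                    # equation 5: within-cluster variance
for i in range(k):
    members = X[labels == i]
    centroid = members.mean(axis=0)
    ss_b += len(members) * np.sum((centroid - overall_mean) ** 2)
    ss_w += np.sum((members - centroid) ** 2)

ch = (ss_b / ss_w) * (N - k) / (k - 1)        # equation 6
print(ch, calinski_harabasz_score(X, labels))  # the two values should agree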
Principal Component Analysis
Determining k for clustering data may not always lead to the desired
results. The WSS and CH scores are methods to determine the optimal number of
clusters for k-means clustering. In practice, a visual determination of the
optimal number of clusters helps with the decision process. Principal
component analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called
principal components. The number of principal components allowed is
determined by the number of features in the data set. By selecting the first
two principal components, the data undergoes dimension reduction to be
represented in a lower-dimensional space.
Figure 3: Data reduction on data set by PCA
Figure 3 represents 4,111 local IPs with 168 features in a two-dimensional
space. Because of the orthogonal transformation performed by PCA, the values
on the first and second principal components do not carry a direct
statistical interpretation. However, the dimension-reduced visualization
shows the relative nearness of one observation to another. We infer that
observations closer to one another share similar feature values and hence
belong in the same partition for cluster analysis. PCA is used as a visual
tool to determine the fitness of our cluster analysis.
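A sketch of this visual check, assuming the usage matrix and the k-means labels from the earlier sketches: the 168 features are projected onto the first two principal components and each point (a local IP) is colored by its cluster.

# Sketch: project the 168-dimensional usage vectors onto two principal components.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
coords = pca.fit_transform(usage)            # shape: (number of local IPs, 2)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab10")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("Local IPs projected by PCA, colored by cluster")
plt.show()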
Cluster Stability Assessment
Even the most sophisticated clustering algorithms tend to generate
partitions even for fairly homogeneous data sets. Cluster validation is
therefore very important in cluster analysis. An important aspect of cluster
validity is stability. Stability means that a meaningful valid cluster should
not disappear if the data set is altered in a non-typical way. There can be
several conceptions of what a "non-typical alteration" of the data set is.
For this paper, a data set extracted from the underlying database should give
rise to more or less the same clustering.
The stability of the clusters formed is tested through the methods
provided by Hennig (2007). Equation 7 calculates the Jaccard coefficient
between two clusters as a measure of similarity. The Jaccard coefficient does
not require the two clusters to have an identical number of observations for
the measurement to be valid. This is an advantage for cluster analysis
because clusters with varying observation counts may form.

$$ \gamma(S_a, S_b) = \frac{\lvert S_a \cap S_b \rvert}{\lvert S_a \cup S_b \rvert} \qquad (7) $$
Resampling methods are a natural way to assess cluster stability. The
bootstrap is used to resample new data sets from the original data set. The
desired clustering algorithm is applied and, for every original cluster, the
similarity value to the most similar cluster in each resampled clustering is
recorded. The stability of each cluster is then the mean similarity taken
over the resampled data sets. Equation 8 calculates the mean similarity of a
cluster over the bootstrapped data sets, where B* is the number of bootstrap
replications.

$$ \bar{\gamma}_S = \frac{1}{B^*} \sum_{i=1}^{B^*} \gamma_{S,i} \qquad (8) $$
Experiment Results
The experiment to extract torrent usage patterns per local IP begins with
transforming the data to represent weekday-by-hour usage. The reason for this
transformation is the natural limitation of bandwidth within the campus
network system. Identifying heavy loads of non-academic network usage by time
is necessary for efficient management of the network system.
One drawback of k-means clustering is the pre-determination of k.
This paper first finds candidates for a suitable k using WSS and CH. Figure 4
shows the WSS and CH scores with respect to increasing k. WSS naturally
decreases as the number of clusters increases, but the pattern we are looking
for is a significant drop in WSS from one k value to the next. Looking at WSS
alone, a significant drop can be observed at k = 3, 4, 7. Although the drop
from 6 clusters to 7 is not as steep as the others, we are only searching for
appropriate candidates at this point. CH measures the ratio of the variance
between clusters to the variance between observations within clusters. In the
literature, the maximum value of CH with respect to k leads to the best
partition. k = 3 stands out as the best candidate, but restricting to 3
clusters may lead to a lack of distinct torrent usage patterns. Reducing the
torrent usage to 3 pattern types may not be as helpful to network
administrators and may lead to unnecessary restrictions on network usage for
local IPs. The next candidate we look for is a bump in the CH value compared
to the overall trend. k = 7 shows a small bump compared to the trend given by
the preceding and following CH values. This narrows our k candidates to 3
and 7.
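A sketch of the candidate search just described, scanning k and recording the WSS (the KMeans inertia) and the CH score for each value; the scanned range is an assumption.

# Sketch: scan k to look for a WSS "elbow" and a CH peak or bump.
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

for k in range(2, 11):                                     # assumed search range
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(usage)
    ch = calinski_harabasz_score(usage, km.labels_)
    print(f"k={k:2d}  WSS={km.inertia_:14.1f}  CH={ch:10.1f}")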
Figure 4: WSS vs CH to determine optimal k
PCA is used as a dimension reduction technique to visualize the relative
positions of observations. When the number of candidates for k cannot be
reduced to one, PCA can help settle the tie through visualization. Figure 5
shows the clusters when k = 3. The relative positioning of the observations
shows how closely each cluster is formed; cluster 2 shows observations spread
across the entire reduced-dimensional space, which makes it difficult to
conclude distinct patterns within that cluster. Compared to cluster 2, the
observations of the other clusters are located relatively close to one
another. This leads to distinct patterns within those clusters, while
cluster 2 does not show one.
Figure 5: PCA visualization for 3 clusters
Figure 6 shows a different formation of clusters compared to figure 5.
The 7 clusters formed in figure 6 show relative closeness compared to the
case of 3 clusters. Cluster 3 remains the same while the other clusters in
figure 5 are each partitioned into three. The clusters formed in figure 6
show closely partitioned observations. Although k = 3 may have a higher CH
score than k = 7, the PCA visualization tells a different story. Given the
objective of this paper, extracting more patterns for the network
administrator leads to more useful knowledge for determining heavy torrent
usage by time. The spread of the observations in figure 6 also shows the
relative closeness between observations within each cluster. When applying
k-means clustering, we narrowed the possible candidates for k by comparing
the WSS and CH scores with respect to k. When there are multiple possible
candidates, or the purpose of the experiment requires more clusters to be
formed, PCA is able to perform dimension reduction to visualize the cluster
formation and compare observation closeness. For this data set and
experiment, the number of clusters used for further analysis is k = 7.
Figure 6: PCA visualization for 7 clusters
Table 2: Cluster stability
Cluster Stability Observations
1 0.842 76
2 0.957 86
3 0.808 91
4 0.773 273
5 0.652 101
6 0.985 3314
7 0.692 170
Non-typical alteration of the data may lead to different clusters
forming. Testing stability is therefore very important in cluster analysis.
When we are detecting network patterns from the cluster formation, the
stability and validity of these patterns occurring with a different data set
must be tested. We use the Jaccard coefficient as a measure of similarity and
generate multiple data sets by the bootstrap resampling method. We generated
7 clusters with the k-means clustering algorithm and produced a stability
value as the mean of the Jaccard coefficients over 10,000 bootstrap resampled
data sets. The stability value varies between 0 and 1. Values over 0.75 are
considered stable, meaning the same cluster forms under non-typical
alteration.
Five out of the 7 clusters are stable enough to derive meaningful partitions.
The highest stability comes from cluster 6, with a value close to 1. The
number of local IPs belonging to cluster 6 is also the highest, containing
about 80% of the total local IPs. Clusters 5 and 7 fail to reach a stability
of 0.75 or higher. The patterns of the stable clusters will be visually
examined.
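The per-cluster heatmaps in figures 7-11 can be approximated from the cluster centroids. A sketch, assuming the fitted km model from the earlier k-means sketch and that the 168 features are ordered weekday-major as in the transformation sketch:

# Sketch: one cluster's mean weekday x hour torrent usage drawn as a heatmap.
import matplotlib.pyplot as plt

cluster_id = 0                                              # cluster 1 in the paper's numbering
pattern = km.cluster_centers_[cluster_id].reshape(7, 24).T  # rows: hours, columns: weekdays

plt.imshow(pattern, aspect="auto", cmap="RdYlGn_r")         # red = heavy, green = light usage
plt.xticks(range(7), ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
plt.ylabel("Hour of day")
plt.colorbar(label="Mean bytes per time block")
plt.title("Torrent usage pattern of one cluster")
plt.show()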
Figure 7: Torrent usage pattern for cluster 1
Cluster 1 shows regular usage of torrent during working hours. The
campus network is used by students, researchers, and professors, but it is
also used by office staff. The torrent usage seems to follow the users'
computer usage time, and indicates that usage is most concentrated at the
start of the week. This could be explained by users downloading shows over
the weekend.
Figure 8: Torrent usage pattern for cluster 2
Cluster 2 takes up most of the torrent usage over all observations. The
86 local IPs belonging to cluster 2 are responsible for loading the network
system with heavy torrent usage, and figure 8 makes the reason quite obvious.
Unlike cluster 1 with its specific usage times, cluster 2 shows torrent usage
at virtually all hours. Managing and curbing the non-academic network usage
of these 86 local IPs is the key to freeing up network bandwidth.
Cluster 3 shows a pattern similar to cluster 1 but with heavier intensity.
The difference between clusters 1 and 3 is the moment of peak usage. Cluster
1 shows a gradual decrease in usage as the week goes on, but cluster 3 shows
peaks on Monday, Thursday, and Friday. The maximum flow visible in the legend
is 2-3 times the maximum of cluster 1. The usage patterns of clusters 1 and 3
share a common relationship between working hours and torrent usage, but
differ in the timing of peak intensity and the maximum amount.
Figure 9: Torrent usage pattern for cluster 3
Clusters 4 and 6 share a common pattern showing a lack of torrent
activity. This is an odd discovery to come out of this experiment. The
current network system is overloaded with torrent activity, but the cluster
patterns show the contrary when looking at the number of local IPs actively
using the service. Clusters 4 and 6 account for more than 80% of the local
IPs, yet their torrent usage is nearly zero. This discovery gives heavier
emphasis to the pattern shown in cluster 2: a small number of users are
responsible for the majority of the network load within this campus. Cluster
4 shows a peak on Saturday morning, but this is not an important feature for
a network administrator to worry about because the overall network usage
during that time is low anyway. The peak exists within cluster 4, but
compared to the other clusters its intensity is significantly lower. The
patterns from clusters 4 and 6 allow us to conclude that most local IPs do
not require special administration to curb non-academic network usage such as
torrent.
Figure 10: Torrent usage pattern for cluster 4
Figure 11: Torrent usage pattern for cluster 6
Clusters 5 and 7 fail to achieve stability in the mean Jaccard
coefficient over bootstrap resampled data sets. The patterns they show cannot
be assumed stable and hence may change with non-typical alteration of the
extracted data set.
Conclusion
The network system within an entity requires management for efficiency.
This paper experimented with a method to detect patterns from a massive data
set collected from network flows. The steps we took were first to transform
the data by time blocks and extract the network flow data labeled as torrent.
Torrent is a non-academic network service which places a large burden on the
network system. WSS and CH were calculated on the transformed data set to
narrow down the possible number of clusters, and the clusters were visually
reexamined through PCA. The dimension reduction technique allows us to
observe the relative positioning of one observation to another. The stability
of the clusters formed by k-means clustering was validated through bootstrap
resampling, taking the mean of the Jaccard coefficient of the most similar
clusters formed over 10,000 iterations.
The patterns detected in each cluster are visualized by heatmaps to
easily detect peaks in usage over time. The main discovery of this cluster
analysis is that a small number of users are responsible for the majority of
the network load on the entire system. 80% or more of the users barely load
the network with non-academic usage. As a network administrator, the patterns
detected can be used for special administration of users with a history of
heavy usage. Because these users are few, the workload of the network
administrator can be lessened. Further analysis of the clusters shows their
relative peaks, and this information can also prepare a network administrator
for the specific time slots where network usage will increase. The proposed
method can also be applied to other network services.
References
Hartigan, John A., and Manchek A. Wong. "Algorithm AS 136: A k-means clustering
algorithm." Applied Statistics (1979): 100-108.
Likas, Aristidis, Nikos Vlassis, and Jakob J. Verbeek. "The global k-means
clustering algorithm." Pattern Recognition 36.2 (2003): 451-461.
Ding, Chris, and Xiaofeng He. "K-means clustering via principal component
analysis." Proceedings of the Twenty-First International Conference on Machine
Learning. ACM, 2004.
Caliński, Tadeusz, and Jerzy Harabasz. "A dendrite method for cluster
analysis." Communications in Statistics - Theory and Methods 3.1 (1974): 1-27.
Wold, Svante, Kim Esbensen, and Paul Geladi. "Principal component
analysis." Chemometrics and Intelligent Laboratory Systems 2.1 (1987): 37-52.
Abdi, Hervé, and Lynne J. Williams. "Principal component analysis." Wiley
Interdisciplinary Reviews: Computational Statistics 2.4 (2010): 433-459.
Becker, R. A., J. M. Chambers, and A. R. Wilks. The New S Language. Pacific
Grove, CA: Wadsworth & Brooks/Cole, 1988.
Mardia, Kantilal Varichand, John T. Kent, and John M. Bibby. Multivariate Analysis.
Academic Press, 1979.
Venables, William N., and Brian D. Ripley. Modern Applied Statistics with S. Springer
Science & Business Media, 2002.
Hennig, C. "Cluster-wise assessment of cluster stability." Computational
Statistics and Data Analysis 52 (2007): 258-271.
Hennig, C. "Dissolution point and isolation robustness: robustness criteria for
general cluster analysis methods." Journal of Multivariate Analysis 99 (2008): 1154-1176.