www.ijmer.com

International Journal of Modern Engineering Research (IJMER)
Vol. 3, Issue. 5, Sep - Oct. 2013 pp-2823-2826
ISSN: 2249-6645

A Novel Clustering Method for Similarity Measuring in
Text Documents
Preethi Priyanka Thella1, G. Sridevi2
1 M.Tech, Nimra College of Engineering & Technology, Vijayawada, A.P., India.
2 Assoc. Professor, Dept. of CSE, Nimra College of Engineering & Technology, Vijayawada, A.P., India.

ABSTRACT: Clustering is the process of grouping data into subsets such that similar instances are
collected together, while dissimilar instances belong to different groups. The instances are thereby organized into an
efficient representation that characterizes the population being sampled. A common approach to clustering is to treat it
as an optimization process: the best partition is found by optimizing a particular function of similarity, or distance, among
the data. There is an implicit assumption that the true inherent structure of the data can be correctly described by the
similarity formula defined and fixed in the clustering criterion. In this paper, we introduce clustering with multiple
viewpoints based on different similarity measures. The multi-viewpoint approach to learning is one in which we have ‘views’ of
the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the
difficulty of a learning problem of interest.

Keywords: Clustering, Text mining, Similarity measure, View point.
I. INTRODUCTION

Clustering [1], or cluster analysis, is the task of grouping a set of objects in such a way that objects in the same group
(called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a
main task of exploratory data mining and a common technique for statistical data analysis, used in many fields,
including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis
itself is not one specific algorithm or procedure, but the general task to be solved. It can be achieved by various
algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular
notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals,
or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization
process.
The appropriate clustering algorithm and parameter settings, including values such as the distance function to use, a
density threshold or the number of expected clusters, depend on the individual data set and the intended use of the results.
Clustering as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective
optimization that involves trial and error. It will often be necessary to modify parameters and preprocessing until the result
achieves the desired properties. Cluster analysis can be considered the most important unsupervised learning problem; like
every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of
clustering could be “the process of organizing objects into groups whose members are similar in some way”. A
cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects
belonging to other clusters. Figure 1 shows the clustering process.

Figure 1: Clustering Process
In this case we easily identify the four clusters into which the data can be divided; the similarity criterion is
distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case
geometrical distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering:
two or more objects belong to the same cluster if the cluster defines a concept common to all those objects. In other words,
objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures. The
multi-viewpoint approach to learning is one in which we have ‘views’ of the data (sometimes in a rather abstract sense) and the
goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest.

II. RELATED WORK

Text clustering is required in real-world applications such as web search engines. It falls under text mining
and groups text documents into clusters, which such applications then use. A text document is treated as an object,
and a word in the document is referred to as a term.
A vector is built to represent each text document. The total number of terms in the text document is denoted by m. A
weighting scheme such as Term Frequency-Inverse Document Frequency (TF-IDF) is used to build the document
vectors. There are many approaches to text document clustering. They include probabilistic methods [2], nonnegative
matrix factorization [3] and information-theoretic co-clustering [4]. These approaches do not rely on a particular measure
of similarity among text documents. In this paper, we use a multi-viewpoint similarity measure for finding the
similarity. As found in the literature, a measure widely used in text document clustering is the Euclidean distance (ED).
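The document-vector construction described above can be sketched as follows. This is a minimal illustration, not the weighting used in the paper: the function name `tfidf_vectors`, the `tf * log(n / df)` variant of TF-IDF, and the final scaling to unit length are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, unit-length TF-IDF vectors) for tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(n / df).
        vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        vectors.append([w / norm for w in vec] if norm > 0 else vec)
    return vocab, vectors

docs = [
    "clustering groups similar documents".split(),
    "search engines rank documents".split(),
    "clustering helps search engines".split(),
]
vocab, vectors = tfidf_vectors(docs)
print(len(vocab), [round(w, 3) for w in vectors[0]])
```

Each vector has one component per vocabulary term, so its length plays the role of m in the text. Scaling to unit length is convenient because the dot product of two such vectors is then their cosine similarity.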
The K-Means algorithm is the most widely used clustering algorithm due to its simplicity and ease of use. Euclidean
distance is the measure used in K-Means to measure the distance between objects when grouping them into clusters. In
this case the cluster centroid is computed as follows:
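In the textbook formulation of K-Means, the centroid of a cluster is the mean of its members, and the algorithm minimizes the summed squared Euclidean distance of every document to its centroid. The symbols below ($C_r$ for the centroid, $S_r$ for cluster $r$, $n_r$ for its size, $k$ for the number of clusters) are assumptions chosen to match the notation used later in the paper; this is the standard form, not necessarily the exact expression the authors printed.

```latex
C_r = \frac{1}{n_r} \sum_{d_i \in S_r} d_i,
\qquad
\min \sum_{r=1}^{k} \sum_{d_i \in S_r} \lVert d_i - C_r \rVert^{2}
```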

Another similarity measure used for text document mining is cosine similarity. It is most useful for
high-dimensional documents [5]. This measure is also used in Spherical K-Means, which is a variant of the K-Means algorithm.
The difference between the two flavors of K-Means that use the cosine similarity measure and the ED measure
respectively is that the former focuses on vector directions while the latter focuses on vector magnitudes. Graph partitioning
is yet another popular approach. It treats the text document corpus as a graph and uses the min-max cut
algorithm, which represents the centroid as follows:

There is a software package called CLUTO [6] which is meant for document clustering. It makes use of the graph
partitioning approach: it builds a nearest-neighbor graph and clusters the text documents based on it. It is based on the
Jaccard coefficient, which is computed as follows:

Jaccard coefficients use both magnitude and direction, which is not the case with Euclidean distance and cosine
similarity. However, the Jaccard coefficient is similar to cosine similarity when the documents are represented as unit
vectors. In [7] there is a comparison between two techniques, namely Jaccard and Pearson correlation, which concludes that
both of them are well suited to clustering web documents. For text document clustering, other approaches can be used which
are phrase based and concept based. A phrase-based approach has been proposed, while in [8] a tree-similarity-based approach
is found. The common procedure used by both of them is “Hierarchical Agglomerative Clustering”. The drawback of these
approaches is that their computational cost is too high. There are also measures for clustering XML documents. One such
measure is called “Structural Similarity”, which differs from text document clustering. This paper focuses on a new
multi-viewpoint based similarity measure for text clustering.
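The three measures discussed in this section can be compared on a small example. The sketch below is illustrative only: the function names are hypothetical, and the extended (Tanimoto) form of the Jaccard coefficient for real-valued vectors is assumed, since the formula itself is not reproduced in the text above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def extended_jaccard(u, v):
    # Assumed form: u.v / (|u|^2 + |v|^2 - u.v)
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice the magnitude

print("euclidean:", euclidean_distance(u, v))
print("cosine:   ", cosine_similarity(u, v))
print("jaccard:  ", extended_jaccard(u, v))
```

Because `v` points in the same direction as `u`, cosine similarity treats the two as identical, while Euclidean distance and the extended Jaccard coefficient both register the difference in magnitude. For unit vectors with cosine `c`, the extended Jaccard coefficient reduces to `c / (2 - c)`, which increases with `c`, so the two measures rank pairs in the same order.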

III. PROPOSED WORK

In the proposed work, our approach to finding similarity between documents or objects while performing clustering is
multi-viewpoint based similarity. It makes use of more than one point of reference, as opposed to existing algorithms used
for text document clustering. In our approach the similarity between two documents is calculated as follows:

\[
\operatorname{sim}(d_i, d_j)_{d_i, d_j \in S_r}
  = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \operatorname{sim}(d_i - d_h,\; d_j - d_h)
\]

Consider two points “di” and “dj” in the cluster Sr. The similarity between those two points is viewed from a point
“dh” which is outside the cluster. This similarity is equal to the cosine of the angle between the two points as seen from
“dh”, multiplied by the Euclidean distances from “dh” to each of the two points. The definition rests on the assumption that
“dh” is not in the same cluster as “di” and “dj”. When these distances are very small, the chances are higher that “dh” is
in fact in the same cluster. Although using many viewpoints helps increase the accuracy of the similarity measure, some of
them may give a negative result. However, this drawback can be ignored provided there are plenty of documents to be
clustered.
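The definition above can be evaluated directly for a single pair of documents. The sketch below assumes, following the description in the preceding paragraph, that the inner similarity is the dot product of the two difference vectors (cosine of the angle at “dh” times the two distances). The function names and the toy vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def mvs_similarity(d_i, d_j, viewpoints):
    """Average of (d_i - d_h) . (d_j - d_h) over all viewpoints d_h."""
    total = sum(dot(sub(d_i, d_h), sub(d_j, d_h)) for d_h in viewpoints)
    return total / len(viewpoints)

# Two unit-length documents in one cluster, and two viewpoints outside it.
d_i = [1.0, 0.0]
d_j = [0.6, 0.8]
outside = [[0.0, 1.0], [-1.0, 0.0]]

print(mvs_similarity(d_i, d_j, outside))
```

The list `outside` plays the role of S \ Sr, so `len(viewpoints)` corresponds to n - nr in the formula.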
Next we carry out a validity test for the cosine similarity (CS) and the multi-viewpoint based similarity as follows. For
each type of similarity measure, a similarity matrix $A = \{a_{ij}\}_{n \times n}$ is created. For CS, this is very simple, as $a_{ij} = d_i^t d_j$.
The algorithm for building the Multi-Viewpoint Similarity (MVS) matrix is described in Algorithm 1.

ALGORITHM 1: BUILDMVSMATRIX(A)
Step 1: for r ← 1 : c do
Step 2:     $D_{S \setminus S_r} \leftarrow \sum_{d_i \notin S_r} d_i$
Step 3:     $n_{S \setminus S_r} \leftarrow |S \setminus S_r|$
Step 4: end for
Step 5: for i ← 1 : n do
Step 6:     r ← class of di
Step 7:     for j ← 1 : n do
Step 8:         if $d_j \in S_r$ then
Step 9:
Step 10:        else
Step 11:
Step 12:        end if
Step 13:    end for
Step 14: end for
Step 15: return A = {aij}n×n
First, the outer composite with respect to each class is determined. Then, for each row ai of “A”, i = 1, . . . , n, if the pair of
text documents di and dj, j = 1, . . . , n, are in the same class, aij is calculated as in line 9. Otherwise, dj is assumed to be in
di's class, and aij is calculated as shown in line 11.
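One possible reading of this procedure is sketched below. The expressions for the two branches are not taken from the algorithm listing; they are derived here from the similarity definition under the assumption that all document vectors have unit length, in which case the average of (di - dh)·(dj - dh) over the viewpoints expands to di·dj - di·D/m - dj·D/m + 1, with D the composite of the m viewpoints. The function name, the label encoding, and the fallback for an empty viewpoint set are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_mvs_matrix(docs, labels):
    """MVS matrix for unit-length docs with known class labels (a sketch)."""
    n, dim = len(docs), len(docs[0])
    # Outer composite and its size for every class r: all documents not in r.
    outer, n_outer = {}, {}
    for r in set(labels):
        others = [d for d, lab in zip(docs, labels) if lab != r]
        outer[r] = [sum(d[k] for d in others) for k in range(dim)]
        n_outer[r] = len(others)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r = labels[i]
        for j in range(n):
            if labels[j] == r:
                D, m = outer[r], n_outer[r]
            else:
                # d_j is treated as a member of d_i's class, so it is
                # removed from the set of viewpoints.
                D = [a - b for a, b in zip(outer[r], docs[j])]
                m = n_outer[r] - 1
            if m == 0:
                # No viewpoint left: fall back to the plain dot product
                # (an assumption of this sketch).
                A[i][j] = dot(docs[i], docs[j])
                continue
            A[i][j] = (dot(docs[i], docs[j])
                       - dot(docs[i], D) / m
                       - dot(docs[j], D) / m + 1.0)
    return A

docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
labels = [0, 0, 1, 1]
A = build_mvs_matrix(docs, labels)
for row in A:
    print([round(x, 3) for x in row])
```

Precomputing one composite vector per class keeps each matrix entry at a constant number of dot products instead of a loop over every viewpoint. The resulting matrix is in general not symmetric, because row i is always computed from the viewpoints of di's class.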
After matrix “A” is formed, the code in Algorithm 2 is used to get its validity score:
ALGORITHM 2: GETVALIDITY(validity, A, percentage)
Step 1: for r ← 1 : c do
Step 2:     qr ← floor(percentage × nr)
Step 3:     if qr = 0 then
Step 4:         qr ← 1
Step 5:     end if
Step 6: end for
Step 7: for i ← 1 : n do
Step 8:     {aiv[1], . . . , aiv[n]} ← Sort {ai1, . . . , ain}
Step 9:     s.t. aiv[1] ≥ aiv[2] ≥ . . . ≥ aiv[n], {v[1], . . . , v[n]} ← permute {1, . . . , n}
Step 10:    r ← class of di
Step 11:
Step 12: end for
Step 13:
Step 14: return validity
For each document “di” corresponding to row “ai” of matrix A, we select the “qr” documents closest to “di”. The
value of “qr” is chosen as a percentage of the size of the class r that contains “di”, where percentage ∈ (0, 1].
Then, the validity with respect to “di” is calculated as the fraction of these “qr” documents having the same class label as
“di”, as shown in line 11. The final validity is determined by averaging over all the rows of matrix A, as shown in line
13. It is clear that the validity score is bounded between 0 and 1. The higher the validity score of a similarity measure, the
more suitable it should be for the clustering process.
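The validity computation described above can be sketched as follows. This is an illustration based on the prose description, with hypothetical names; in particular it sorts the whole row, so the diagonal entry for “di” itself takes part in the ranking, which is an assumption about how line 8 is meant.

```python
import math

def get_validity(A, labels, percentage):
    """Mean fraction of each row's top-q_r entries that share the row's class."""
    n = len(A)
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    # q_r = floor(percentage * n_r), raised to 1 if it would be 0.
    q = {r: max(1, math.floor(percentage * n_r)) for r, n_r in sizes.items()}
    total = 0.0
    for i in range(n):
        r = labels[i]
        # Column indices ordered by decreasing similarity to d_i.
        order = sorted(range(n), key=lambda j: A[i][j], reverse=True)
        top = order[:q[r]]
        total += sum(1 for j in top if labels[j] == r) / q[r]
    return total / n

labels = [0, 0, 1, 1]
A_good = [[1.0, 0.9, 0.1, 0.2],
          [0.9, 1.0, 0.2, 0.1],
          [0.1, 0.2, 1.0, 0.8],
          [0.2, 0.1, 0.8, 1.0]]
A_bad = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
print(get_validity(A_good, labels, 1.0), get_validity(A_bad, labels, 1.0))
```

The two toy matrices mark the extremes: in `A_good` every document's nearest neighbours share its class, while in `A_bad` they all belong to the other class.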

IV. INCREMENTAL CLUSTERING ALGORITHM

The main goal of this algorithm is to perform text document clustering by optimizing $I_R$ and $I_V$ as shown below:


With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is
shown in Algorithm 3 and Algorithm 4.
ALGORITHM 3: INITIALIZATION
Step 1: Select k seeds s1, . . . , sk randomly
Step 2:
Step 3:
Step 4: end
ALGORITHM 4: REFINEMENT
Step 1: repeat
Step 2:     {v[1 : n]} ← random permutation of {1, . . ., n}
Step 3:     for j ← 1 : n do
Step 4:         i ← v[j]
Step 5:         p ← cluster[di]
Step 6:
Step 7:
Step 8:
Step 9:         if then
Step 10:            Move di to cluster q: cluster[di] ← q
Step 11:            Update Dp, np, Dq, nq
Step 12:        end if
Step 13:    end for
Step 14: until No move for all n documents
Step 15: end
At Initialization, “k” arbitrary documents are selected to be the seeds from which initial partitions are formed.
Refinement is a process that consists of a number of iterations. During each iteration, the “n” text documents are visited one
by one in random order. Each text document is checked to see whether moving it to another cluster improves the
objective function. If so, the text document is moved to the cluster that leads to the highest improvement. If no cluster
is better than the current one, the text document is not moved. The clustering process terminates when an iteration
completes without any text document being moved to a new cluster.
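The Initialization and Refinement steps described above can be sketched as follows. The criterion functions of the paper are not reproduced in this section, so the sketch substitutes the spherical k-means criterion (the sum of the norms of the cluster composite vectors) as a stand-in objective to be maximized; it is not the authors' criterion. The function names, the rule that a cluster may not be emptied, and the small improvement threshold are likewise assumptions of this example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(composites):
    # Stand-in criterion (spherical k-means): sum of composite-vector norms.
    return sum(math.sqrt(dot(D, D)) for D in composites)

def incremental_clustering(docs, k, seed=0):
    rng = random.Random(seed)
    n, dim = len(docs), len(docs[0])
    # Initialization: pick k random seeds; each document joins its most similar seed.
    seeds = rng.sample(range(n), k)
    cluster = [max(range(k), key=lambda r: dot(d, docs[seeds[r]])) for d in docs]
    comp = [[0.0] * dim for _ in range(k)]
    size = [0] * k
    for d, r in zip(docs, cluster):
        comp[r] = [a + b for a, b in zip(comp[r], d)]
        size[r] += 1
    # Refinement: visit documents in random order and apply the best improving move.
    moved = True
    while moved:
        moved = False
        for i in rng.sample(range(n), n):
            p = cluster[i]
            if size[p] == 1:
                continue  # keep every cluster non-empty (a choice of this sketch)
            base = objective(comp)
            best_q, best_gain = p, 1e-12
            for q in range(k):
                if q == p:
                    continue
                trial = [list(D) for D in comp]
                trial[p] = [a - b for a, b in zip(comp[p], docs[i])]
                trial[q] = [a + b for a, b in zip(comp[q], docs[i])]
                gain = objective(trial) - base
                if gain > best_gain:
                    best_q, best_gain = q, gain
            if best_q != p:
                # Move d_i to cluster q and update D_p, n_p, D_q, n_q.
                comp[p] = [a - b for a, b in zip(comp[p], docs[i])]
                comp[best_q] = [a + b for a, b in zip(comp[best_q], docs[i])]
                size[p] -= 1
                size[best_q] += 1
                cluster[i] = best_q
                moved = True
    return cluster

docs = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]]
assignment = incremental_clustering(docs, k=2)
print(assignment)
```

Because a document is moved only when the objective strictly improves, and the objective is bounded above for unit-length vectors, the refinement loop stops after a finite number of passes. Swapping in a different criterion only requires replacing `objective`.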

V. CONCLUSION

From the viewpoint of data engineering, a cluster is a group of objects of similar nature, and the grouping mechanism is
called clustering. Similar text documents are grouped together in a cluster if their cosine similarity
exceeds a specified threshold. In this paper we focus on viewpoints and introduce a novel multi-viewpoint
based similarity measure for text mining. The nature of the similarity measure plays a very important role in the success or
failure of a clustering method. From the proposed similarity measure, we then formulate new clustering criterion functions
and introduce their respective clustering algorithms, which are fast and scalable like the k-means algorithm, but are also
capable of providing high-quality and consistent performance.

REFERENCES
[1] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?” NIPS'09 Workshop on Clustering Theory, 2009.
[2] Leo Wanner (2004). “Introduction to Clustering Techniques”. Available online at:
http://www.iula.upf.edu/materials/040701wanner.pdf [viewed: 16 August 2012]
[3] D. Ienco, R. G. Pensa, and R. Meo, “Context-based distance learning for categorical data clustering,” in Proc. of the 8th Int. Symp.
IDA, 2009, pp. 83–94.
[4] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?” NIPS'09 Workshop on Clustering Theory, 2009.
[5] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge University Press, 2009.
[6] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M.
Steinbach, D. J. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, 2007.
[7] W. Xu, X. Liu, and Y. Gong, “Document clustering based on nonnegative matrix factorization,” in SIGIR, 2003, pp. 267–273.
[8] S. Zhong, “Efficient online spherical K-means clustering,” in IEEE IJCNN, 2005, pp. 3180–3185.

www.ijmer.com

2826 | Page

Contenu connexe

Tendances

Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006IJARTES
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGA SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesianAhmad Amri
 
B colouring
B colouringB colouring
B colouringxs76250
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...IJECEIAES
 
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...IJERA Editor
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrievalBasma Gamal
 
Semi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data StructureSemi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data Structureiosrjce
 
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...IJCSIS Research Publications
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...ijtsrd
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...IRJET Journal
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataIOSR Journals
 

Tendances (17)

Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
 
A0310112
A0310112A0310112
A0310112
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGA SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
 
Du35687693
Du35687693Du35687693
Du35687693
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesian
 
B colouring
B colouringB colouring
B colouring
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
 
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...A Combined Approach for Feature Subset Selection and Size Reduction for High ...
A Combined Approach for Feature Subset Selection and Size Reduction for High ...
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
Semi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data StructureSemi-Supervised Discriminant Analysis Based On Data Structure
Semi-Supervised Discriminant Analysis Based On Data Structure
 
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 

En vedette

Digital Citizenship Webquest
Digital Citizenship WebquestDigital Citizenship Webquest
Digital Citizenship Webquestmdonel
 
Significance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-GoaSignificance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-GoaIJMER
 
Virtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud ComputingVirtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud ComputingIJMER
 
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...IJMER
 
Finite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB CageFinite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB CageIJMER
 
E04011 03 3339
E04011 03 3339E04011 03 3339
E04011 03 3339IJMER
 
Ρωμιοσύνη
ΡωμιοσύνηΡωμιοσύνη
ΡωμιοσύνηPopi Kaza
 
Education set for collecting and visualizing data using sensor system based ...
Education set for collecting and visualizing data using sensor  system based ...Education set for collecting and visualizing data using sensor  system based ...
Education set for collecting and visualizing data using sensor system based ...IJMER
 
The creative arts –
The creative arts –The creative arts –
The creative arts –Ylenia Vella
 
Monitor the Unmeasurable
Monitor the UnmeasurableMonitor the Unmeasurable
Monitor the UnmeasurableJennifer Davis
 
Bu32888890
Bu32888890Bu32888890
Bu32888890IJMER
 
I0502 01 4856
I0502 01 4856I0502 01 4856
I0502 01 4856IJMER
 
Cb31324330
Cb31324330Cb31324330
Cb31324330IJMER
 
Bw31297301
Bw31297301Bw31297301
Bw31297301IJMER
 
Bs32869883
Bs32869883Bs32869883
Bs32869883IJMER
 
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSGTracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSGIJMER
 
On Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological SpacesOn Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological SpacesIJMER
 
Fisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.inFisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.inojasgujnicin
 
Www youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfeWww youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfeRancyna James
 

En vedette (20)

Digital Citizenship Webquest
Digital Citizenship WebquestDigital Citizenship Webquest
Digital Citizenship Webquest
 
Significance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-GoaSignificance Assessment of Architectural Heritage Monuments in Old-Goa
Significance Assessment of Architectural Heritage Monuments in Old-Goa
 
Virtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud ComputingVirtualization Technology using Virtual Machines for Cloud Computing
Virtualization Technology using Virtual Machines for Cloud Computing
 
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
Application of Parabolic Trough Collectorfor Reduction of Pressure Drop in Oi...
 
Finite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB CageFinite Element Analysis of Human RIB Cage
Finite Element Analysis of Human RIB Cage
 
E04011 03 3339
E04011 03 3339E04011 03 3339
E04011 03 3339
 
Ρωμιοσύνη
ΡωμιοσύνηΡωμιοσύνη
Ρωμιοσύνη
 
Education set for collecting and visualizing data using sensor system based ...
Education set for collecting and visualizing data using sensor  system based ...Education set for collecting and visualizing data using sensor  system based ...
Education set for collecting and visualizing data using sensor system based ...
 
The creative arts –
The creative arts –The creative arts –
The creative arts –
 
Monitor the Unmeasurable
Monitor the UnmeasurableMonitor the Unmeasurable
Monitor the Unmeasurable
 
Bu32888890
Bu32888890Bu32888890
Bu32888890
 
I0502 01 4856
I0502 01 4856I0502 01 4856
I0502 01 4856
 
Reno x tkja
Reno x tkjaReno x tkja
Reno x tkja
 
Cb31324330
Cb31324330Cb31324330
Cb31324330
 
Bw31297301
Bw31297301Bw31297301
Bw31297301
 
Bs32869883
Bs32869883Bs32869883
Bs32869883
 
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSGTracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
Tracking of Maximum Power from Wind Using Fuzzy Logic Controller Based On PMSG
 
On Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological SpacesOn Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
On Characterizations of NANO RGB-Closed Sets in NANO Topological Spaces
 
Fisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.inFisheries Enumerator - Ojas.guj.nic.in
Fisheries Enumerator - Ojas.guj.nic.in
 
Www youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfeWww youtube com_watch_v_p_n_huv50kbfe
Www youtube com_watch_v_p_n_huv50kbfe
 

Similaire à A Novel Clustering Method for Similarity Measuring in Text Documents

IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Bs31267274
Bs31267274Bs31267274
Bs31267274IJMER
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
A Novel Clustering Method for Similarity Measuring in Text Documents

  • 1. www.ijmer.com, International Journal of Modern Engineering Research (IJMER), Vol. 3, Issue 5, Sep-Oct 2013, pp. 2823-2826, ISSN: 2249-6645

A Novel Clustering Method for Similarity Measuring in Text Documents

Preethi Priyanka Thella (1), G. Sridevi (2)
(1) M.Tech, Nimra College of Engineering & Technology, Vijayawada, A.P., India.
(2) Assoc. Professor, Dept. of CSE, Nimra College of Engineering & Technology, Vijayawada, A.P., India.

ABSTRACT: Clustering is the process of grouping data into subsets such that similar instances are collected together, while dissimilar instances belong to different groups. The instances are thereby arranged into an efficient representation that characterizes the population being sampled. A common approach to clustering is to treat it as an optimization problem: the best partition is found by optimizing a particular function of similarity, or distance, among the data. Underlying this is the assumption that the true inherent structure of the data can be correctly described by the similarity formula defined and fixed in the clustering criterion. In this paper, we introduce clustering with multiple viewpoints based on different similarity measures. The multi-viewpoint approach to learning is one in which we have 'views' of the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest.

Keywords: Clustering, Text mining, Similarity measure, Viewpoint.

I. INTRODUCTION

Clustering [1], or cluster analysis, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm or procedure, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings, including the distance function to use, a density threshold, or the number of expected clusters, depend on the individual data set and the intended use of the results. Clustering as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify parameters and preprocessing until the result achieves the desired properties.

Cluster analysis can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering is “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Figure 1 shows the clustering process.
Figure 1: Clustering Process

In this case we can easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case, geometric distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

The multi-viewpoint approach to learning is one in which we have 'views' of the data (sometimes in a rather abstract sense) and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest.
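The distance-based clustering described in this section can be illustrated with a minimal sketch. It assumes a plain k-means style loop over 2-D points with geometric (Euclidean) distance; the function names, the toy points, and the starting centers are hypothetical and are not taken from the paper.

```python
import math

def euclidean(p, q):
    """Geometric (Euclidean) distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centers):
    """For each point, return the index of the closest center."""
    return [min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points]

def mean_point(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(group) for coord in zip(*group))

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points and recomputing centers."""
    for _ in range(iterations):
        labels = assign_to_nearest(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            group = [p for p, lab in zip(points, labels) if lab == k]
            # Keep the old center if no point was assigned to it.
            new_centers.append(mean_point(group) if group else old)
        centers = new_centers
    return assign_to_nearest(points, centers), centers

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```

Objects end up in the same cluster only because they are "close" under the chosen distance, which is the property that the later sections revisit with other similarity measures.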
  • 2. II. RELATED WORK

Text clustering is required in real-world applications such as web search engines. It is part of the text mining process and groups text documents into clusters, which are then used by applications such as search engines. A text document is treated as an object, and a word in the document is referred to as a term. A vector is built to represent each text document. The total number of terms in the text document is represented by m. A weighting scheme such as Term Frequency-Inverse Document Frequency (TF-IDF) is used to form the document vectors.

There are many approaches to text document clustering. They include probabilistic methods [2], non-negative matrix factorization [3], and information-theoretic co-clustering [4]. These approaches do not use a particular measure for finding the similarity among text documents. In this paper, we use a multi-viewpoint similarity measure for this purpose. As found in the literature, a measure widely used in text document clustering is Euclidean distance (ED). K-Means is the most widely used clustering algorithm due to its simplicity and ease of use. It uses Euclidean distance to measure the distance between objects when forming clusters. In this case the cluster centroid is computed as follows:

[equation not preserved in the source]

Another similarity measure used for text document mining is cosine similarity. It is most useful for high-dimensional documents [5]. This measure is also used in Spherical K-Means, a variant of the K-Means algorithm. The difference between the two flavors of K-Means, which use the cosine similarity measure and the ED measure respectively, is that the former focuses on vector directions while the latter focuses on vector magnitudes.
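The contrast between direction and magnitude can be made concrete with a small sketch. It assumes one common TF-IDF variant (raw term count times log(n/df)); the paper does not state which weighting formula it uses, and the helper names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One TF-IDF vector per tokenised document: raw tf * log(n / df)."""
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def euclidean(u, v):
    """Euclidean distance (ED) between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical corpus: the second document repeats the first one twice.
docs = [
    "cluster text documents".split(),
    "cluster text documents cluster text documents".split(),
    "search engine ranking".split(),
]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), euclidean(vecs[0], vecs[1]))
print(cosine(vecs[0], vecs[2]))
```

The first two documents point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is non-zero, while the third document shares no terms with the first and has cosine similarity 0.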
Graph partitioning is yet another popular approach. It treats the text document corpus as a graph and uses the min-max cut algorithm, which represents the centroid as follows:

[equation not preserved in the source]

There is a software package called CLUTO [6] for document clustering, which uses the graph partitioning approach. It builds a nearest-neighbor graph and clusters the text documents based on it. It is based on the Jaccard coefficient, which is computed as follows:

[equation not preserved in the source]

The Jaccard coefficient uses both magnitude and direction, which is not the case with Euclidean distance and cosine similarity. However, it is similar to cosine similarity when the documents are represented as unit vectors. In [7] the Jaccard and Pearson correlation techniques are compared, with the conclusion that both are well suited to clustering web documents.

For text document clustering, phrase-based and concept-based approaches can also be used. A phrase-based approach is found in the literature, while a tree-similarity-based approach is found in [8]. Both use “Hierarchical Agglomerative Clustering” as their common procedure. The drawback of these approaches is that their computational cost is too high. There are also measures for clustering XML documents. One such measure is called “Structural Similarity”, which differs from the measures used in text document clustering. This paper focuses on a new multi-viewpoint-based similarity measure for text clustering.

III. PROPOSED WORK

In the proposed work, our approach to finding the similarity between documents or objects during clustering is multi-viewpoint-based similarity. It uses more than one point of reference, as opposed to existing algorithms for text document clustering. In our approach the similarity between two documents is calculated as follows:

$$\mathrm{sim}(d_i, d_j)_{d_i, d_j \in S_r} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{sim}(d_i - d_h,\ d_j - d_h)$$

Consider two points “di” and “dj” in the cluster Sr.
The similarity between those two points is viewed from a point “dh” which is outside the cluster. This similarity is equal to the product of the cosine of the angle between the two points as seen from “dh” and the Euclidean distances from “dh” to each of the two points. This definition is based on the assumption that “dh” is not in the same cluster as “di” and “dj”. The smaller these distances are, the higher the chance that “dh” is in fact in the same cluster. Although multiple viewpoints are useful in increasing the accuracy of the similarity measure, there is a possibility that some of them give a negative result. However, this drawback can be ignored provided there are plenty of documents to be clustered.

Now we carry out the validity test for cosine similarity (CS) and multi-viewpoint based similarity as follows. For each type of similarity measure, a similarity matrix $A = \{a_{ij}\}_{n \times n}$ is created. For CS, this is very simple, as $a_{ij} = d_i^t d_j$. The algorithm for building the multi-viewpoint similarity (MVS) matrix is described in Algorithm 1.
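As a minimal sketch of the viewpoint-based definition above, the following Python code averages the dot product of the difference vectors over all viewpoints outside the cluster. It assumes that sim is the plain dot product, which matches the description of a cosine multiplied by two Euclidean distances. The function names and the toy vectors are hypothetical. The second function is only an algebraic expansion of the same sum, included as a cross-check, and is not claimed to be the formula used in the paper's algorithm.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def mvs_similarity(d_i, d_j, viewpoints):
    """Average of (d_i - d_h) . (d_j - d_h) over the viewpoints d_h outside the cluster."""
    total = sum(dot(sub(d_i, d_h), sub(d_j, d_h)) for d_h in viewpoints)
    return total / len(viewpoints)

def mvs_expanded(d_i, d_j, viewpoints):
    """Same quantity after expanding the dot product term by term."""
    n_out = len(viewpoints)
    composite = [sum(col) for col in zip(*viewpoints)]  # sum of all viewpoint vectors
    mean_sq_norm = sum(dot(d_h, d_h) for d_h in viewpoints) / n_out
    return (dot(d_i, d_j)
            - dot(d_i, composite) / n_out
            - dot(d_j, composite) / n_out
            + mean_sq_norm)

# Hypothetical 2-D unit-length documents
cluster = [[1.0, 0.0], [0.8, 0.6]]     # two documents inside S_r
outside = [[0.0, 1.0], [-0.6, 0.8]]    # viewpoints taken from S \ S_r

direct = mvs_similarity(cluster[0], cluster[1], outside)
expanded = mvs_expanded(cluster[0], cluster[1], outside)
```

The expanded form needs only the sum of the outside vectors and their mean squared norm, so these can be computed once per cluster instead of looping over every viewpoint for every pair.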
  • 3. ALGORITHM 1: BUILDMVSMATRIX(A)
Step 1: for r ← 1 : c do
Step 2:     $D_{S \setminus S_r} \leftarrow \sum_{d_i \notin S_r} d_i$
Step 3:     $n_{S \setminus S_r} \leftarrow |S \setminus S_r|$
Step 4: end for
Step 5: for i ← 1 : n do
Step 6:     r ← class of di
Step 7:     for j ← 1 : n do
Step 8:         if $d_j \in S_r$ then
Step 9:
Step 10:        else
Step 11:
Step 12:        end if
Step 13:    end for
Step 14: end for
Step 15: return $A = \{a_{ij}\}_{n \times n}$

First, the outer composite with respect to each class is determined. Then, for each row ai of A, i = 1, . . . , n, if the pair of text documents di and dj, j = 1, . . . , n, are in the same class, aij is calculated as in Step 9. Otherwise, dj is assumed to be in di's class, and aij is calculated as shown in Step 11. After matrix A is formed, the code in Algorithm 2 is used to get its validity score:

ALGORITHM 2: GETVALIDITY(validity, A, percentage)
Step 1: for r ← 1 : c do
Step 2:     qr ← floor(percentage × nr)
Step 3:     if qr = 0 then
Step 4:         qr ← 1
Step 5:     end if
Step 6: end for
Step 7: for i ← 1 : n do
Step 8:     {aiv[1], . . . , aiv[n]} ← Sort {ai1, . . . , ain}
Step 9:     s.t. aiv[1] ≥ aiv[2] ≥ . . . ≥ aiv[n], {v[1], . . . , v[n]} ← permute {1, . . . , n}
Step 10:    r ← class of di
Step 11:    $\mathrm{validity}(d_i) \leftarrow \frac{|\{d_{v[1]}, \ldots, d_{v[q_r]}\} \cap S_r|}{q_r}$
Step 12: end for
Step 13: $\mathrm{validity} \leftarrow \frac{1}{n} \sum_{i=1}^{n} \mathrm{validity}(d_i)$
Step 14: return validity

For each document di corresponding to row ai of matrix A, we select the qr documents closest to di. The value of qr is chosen as a percentage of the size of the class r that contains di, where percentage ∈ (0, 1]. Then, the validity with respect to di is calculated as the fraction of these qr documents having the same class label as di, as shown in Step 11. The final validity is determined by averaging over all the rows of matrix A, as shown in Step 13. It is clear that the validity score is bounded between 0 and 1.
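The validity computation described above can be sketched in Python as follows. This is a minimal illustration that follows the prose description of Algorithm 2. The function name, the list-of-lists matrix layout, and the toy similarity matrix are assumptions made for the example. As in the algorithm, each full row is sorted, so a document's similarity to itself takes part in the ranking.

```python
import math

def validity_score(A, labels, percentage):
    """Average fraction of each document's q_r most similar documents sharing its class."""
    n = len(A)
    class_size = {}
    for label in labels:
        class_size[label] = class_size.get(label, 0) + 1
    # q_r = floor(percentage * n_r), raised to 1 when it would be 0
    q = {r: max(1, math.floor(percentage * n_r)) for r, n_r in class_size.items()}

    total = 0.0
    for i in range(n):
        r = labels[i]
        # column indices of row i ordered by decreasing similarity
        order = sorted(range(n), key=lambda j: A[i][j], reverse=True)
        top = order[:q[r]]
        same_class = sum(1 for j in top if labels[j] == r)
        total += same_class / q[r]
    return total / n

# Hypothetical similarity matrix for four documents in two classes
labels = [0, 0, 1, 1]
A = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.95],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.95, 0.4, 1.0],
]

score_full = validity_score(A, labels, 1.0)   # q_r = 2 for both classes
score_half = validity_score(A, labels, 0.5)   # q_r = 1 for both classes
```

With percentage = 1.0, documents 1 and 3 each rank a document of the other class among their two closest, so the score drops below 1. With percentage = 0.5 only the single closest document is inspected for each row.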
The higher the validity score of a similarity measure, the more suitable it should be for the clustering process.

IV. INCREMENTAL CLUSTERING ALGORITHM

The main goal of this algorithm is to perform text document clustering by optimizing the criterion functions $I_R$ and $I_V$.
  • 4. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is shown in Algorithm 3 and Algorithm 4.

ALGORITHM 3: INITIALIZATION
Step 1: Select k seeds s1, . . . , sk randomly
Step 2:
Step 3:
Step 4: end

ALGORITHM 4: REFINEMENT
Step 1: repeat
Step 2:     {v[1 : n]} ← random permutation of {1, . . . , n}
Step 3:     for j ← 1 : n do
Step 4:         i ← v[j]
Step 5:         p ← cluster[di]
Step 6:
Step 7:
Step 8:
Step 9:         if then
Step 10:            Move di to cluster q: cluster[di] ← q
Step 11:            Update Dp, np, Dq, nq
Step 12:        end if
Step 13:    end for
Step 14: until No move for all n documents
Step 15: end

At Initialization, k arbitrary documents are selected to be the seeds from which the initial partitions are formed. Refinement is a process that consists of a number of iterations. During each iteration, the n text documents are visited one by one in a totally random order. Each text document is checked to see whether moving it to another cluster improves the objective function. If so, the text document is moved to the cluster that leads to the highest improvement. If no cluster is better than the current cluster, the text document is not moved. The clustering process terminates when an iteration completes without any text document being moved to a new cluster.

V. CONCLUSION

From the viewpoint of data engineering, a cluster is a group of objects of a similar nature. The grouping mechanism is called the clustering process. Similar text documents are grouped together in a cluster if their cosine similarity exceeds a specified threshold. In this paper we mainly focus on viewpoints and introduce a novel multi-viewpoint based similarity measure for text mining. The nature of the similarity measure plays a very important role in the success or failure of the clustering method.
From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like the k-means algorithm, but are also capable of providing high-quality and consistent performance.

REFERENCES

[1] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?”, NIPS'09 Workshop on Clustering Theory, 2009.
[2] Leo Wanner, “Introduction to Clustering Techniques”, 2004. Available online at: http://www.iula.upf.edu/materials/040701wanner.pdf [viewed: 16 August 2012]
[3] D. Ienco, R. G. Pensa, and R. Meo, “Context-based distance learning for categorical data clustering,” in Proc. of the 8th Int. Symp. IDA, 2009, pp. 83–94.
[4] I. Guyon, U. von Luxburg, and R. C. Williamson, “Clustering: Science or Art?”, NIPS'09 Workshop on Clustering Theory, 2009.
[5] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge U. Press, 2009.
[6] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, “Top 10 algorithms in data mining,” Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, 2007.
[7] W. Xu, X. Liu, and Y. Gong, “Document clustering based on nonnegative matrix factorization,” in SIGIR, 2003, pp. 267–273.
[8] S. Zhong, “Efficient online spherical K-means clustering,” in IEEE IJCNN, 2005, pp. 3180–3185.