SlideShare a Scribd company logo
1 of 13
Download to read offline
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
DOI:10.5121/ijitca.2012.2401 1
XCLS++: A new algorithm to improve XCLS+ for
clustering XML documents
Ahmad khodayar1
and Hassan naderi2
1
Department of Computer Engineering, Shabestar Islamic Azad University of Science
and Technology, Shabestar, Iran
ahmad_kho2000@yahoo.com
2
Department of Computer Engineering, Iran University of Science and Technology,
Resalat st., Tehran, Iran
naderi@iust.ac.ir
Abstract
The purpose of this paper is to offer a method for clustering of XML documents. Many different ways of
clustering were discussed for clustering documents which can be divided to structure, content and
combination of structure and content. One method that has been proposed to solve this problem is XCLS
which be improved with XCLS+ later. XCLS+ is efficient algorithm for clustering XML documents and
Exposure into structure based on clustering category. The conducted survey showed which XCLS+ method
has problem that makes away from its optimal value too. This paper presents a method with name XCLS++
which has not related problem XCLS+ and its efficiency are diagnosed more than XCLS+.
Keywords
Clustering, XML documents, XCLS+ algorithm, Content similarity of document, Structure similarity of
document
1. Introduction
The XML documents are currently devoted for exchanging the largest volume of textual
information on the internet. Interesting characteristics of XML, including ability to describe the
information (which makes information store in any computer), be its textual (been processed with
any operating system and software) and have been understood with machines and humans, have
caused particular popularity of this format in the IT community. On the other hand, because this
technology is relatively new, high-performance systems that enable to process it, are still being
developed. XML documents easy management and access to content with giving structure to
content. So because XML document is composed of content and structure therefore in processing
of XML documents should are considered these two parts.
With increased spread of information, for its storage, retrieval and transmission through internet,
we need to manage that information basically. Clustering is very useful in above processes and
for summarizing and reducing size of documents. Clustering XML documents also not an
exception and as many applications are faced with problem of clustering. For example, with
representative of each cluster obtained with an efficient clustering algorithm can very large
volume set of documents to store more briefly. Formula for calculating is the first appliances
required for clustering set of documents.
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
2
Methods to determine similarity two XML documents are generally divided into three categories:
content [10], structure [2] [5] [11] and combination of mentioned methods [1] [3] [4] [8] [9].
Clustering XML documents base on content is very similar to clustering textual documents which
a lot of works have been performed in this field. On the other words in this kind of clustering,
document tags omitted or considered as content, and clustered with planned algorithms easily. In
this way, criterion for clustering is content and if documents have content similarity then has been
laid in same cluster. Content based on clustering used for searching textual documents with
Google and AltaVista engines. Clustering base on structure tries to find documents which have
structure similarity and puts them in similar cluster. Combined clustering uses advantages two
types of content and structural clustering.
One of the methods of structural clustering XML document that has good efficiency is XCLS [5].
Performance of this method is to find similar tag name of nodes in levels of tree correspondence
to document and then calculating similar number. But fundamental problem XCLS is neglect
relationship FATHER-NODE that makes order of nodes is not preserved. XCLS+ method are
created for improving and solving the problem mentioned XCLS method [11]. XCLS+ method
has tried by adding a factor as relationship FATHER-NODE in XCLS related formula and
solved the problem and improved previous method. But with careful study have been observed
which the XCLS+ method because not considering duplicate nodes has problem and is way from
optimality. Therefore this article tries to improve similar formula calculation XCLS+ and a new
method called XCLS++ is established. In section2 tree model in XML documents will be
reviewed. Section3 discusses previous works and then in section4 detail analysis of the XCLS+
method will be expressed. In section5 the relative bug of XCLS+ method will be offered. The
proposed method will be offered in section6 for solving the bug XCLS+ method. Then in next
section, results of proposed method are presented and compared with the XCLS+ method and will
be fond that optimality of the proposed method is good in comparison with XCLS+ method. In
final section summary and conclusions will be brought.
2. Tree model of XML documents
As the XML documents are the tree format, so documents can to model in tree case. In this case,
problem of clustering the XML documents will decrease into clustering trees. The structural
methods benefit of these trees. The XCLS+ method that uses of tree model is an example which is
successful in clustering the XML documents. The XCLS+ method optimized the XCLS method.
Despite improvements with XCLS+ on XCLS, but XCLS+ has bug too. The main reason this
method which is not optimal (from now on we talk only about XCLS+), neglecting duplicate
nodes in the tree levels which in this article have been tried to fix it.
3. Previous works
Clustering criteria is to find similar documents. Similarity of documents should be found with a
certain similarity and clustering is done base on similarity. As mentioned above three methods for
finding similarities has been suggested and are: 1- content [10] 2- structure [2] [5] [11] 3- content
with structure [1] [3] [4] [8] [9]. As we know, each XML document can be changed to tree and
clustering operations can be done with these trees. The structural methods considered only the
tree structure and do not work with content. XCLS+ is the structural approach which in this
article has focused on this type clustering. As previous mentioned, this method have been created
to optimize the XCLS method. Experiments view that the XCLS+ method despite good
performance compared to the XCLS method has a basic problem that decrease efficiencies in
some circumstances. with more studies was concluded that main reason for low efficiency of this
method is to neglect duplicate similar nodes in trees which in this paper tried to present solution
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
3
could solve bug and improve the XCLS+ algorithm. More details in this article will be brought in
the next. In this paper, first XCLS+ have been studied then proposal algorithm offered. Finally
the results compared with proposed method and have been seen which optimality of the proposed
method is better than the XCLS+ method.
4. XCLS+ method
This method uses a similar structure and acts in more detail base on tag similarity between two
levels of tree nodes related to document tags. In this way the incoming XML document compared
with clustered documents and if there is acceptable similar number between those, new XML
document is placed beside relevant clustered document. Acceptable new incoming document
combined with document in cluster and form a new tree. In other words, in each cluster there is a
single tree. If similarity value is not acceptable in this case a new cluster has created and
incoming document is placed in it. This practice will continue until entrance final document. The
similarity calculation is done based on Formula1. This formula is created for improving the
XCLS method. In this way operations taken when document arrival and how execute it include:
first root node of incoming document has compared with root node of clustered documents then if
there are similar, factors the Formula1 is calculated and comparing levels nodes in each document
have been continued in a level down for similarity calculation. In the absence of similarity, the
next comparison is done between node in the clustered document in a level down and the same
node of the incoming document. This compares has been continued to the leaves levels of
document. As previously noted, similarity value have been calculated with:
sim base on XCLS൅ ൌ
0.5 ‫כ‬ ∑ ሺCNଵ
୧
൅ CPଵ
୧୪ିଵ
୧ୀ଴ ሻ ‫כ‬ r୪ି୧ିଵ
൅ 0.5 ‫כ‬ ∑ ሺCNଵ
୨
൅ CPଵ
୨୪ିଵ
୨ୀ଴ ሻ ‫כ‬ r୪ି୨ିଵ
ሺ∑ N୩୪ିଵ
୩ୀ଴ ‫כ‬ ሺrሻ୪ି୩ିଵሻ ‫כ‬ z ൅ 0.5 ‫כ‬ ൫∑ CPଵ
୧
‫כ‬ r୪ି୧ିଵ ൅ ∑ CPଵ
୨୪ିଵ
୨ୀ଴
୪ିଵ
୧ୀ଴ ‫כ‬ r୪ି୨ିଵ൯
Formula1. The XCLS+ method based on formula
Numeric Value Formula1 is variable between zero and one. This value will change relatively on
similar.
The variables in the above formula are:
1- Z cluster size or in other words, the number of documents within cluster.
2- CNi
is the total of similar nodes between level i of the new incoming document and level j of
the clustered documents.
3- CNj
is the total of similar nodes between level j of the clustered documents and level i of the
new incoming document.
4- CPi
is the total of similar nodes between level i of the new incoming document and level j of
the clustered documents as have the same father.
5- CPj
is the total of similar nodes between in level j of the clustered documents and in level i of
the new incoming document as have the same father.
6- N in Nk
is the number of elements in level k of the incoming tree.
7- l is high of tree in the each document.
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
4
8- i, j are desired number of level.
9- r is the incremental factor, which is considered number 2.
10- k equal with 2.
The above formula is active in every jump between levels document and the similarity Value
calculated is sum of obtained factors. This practice continues until Level one of documents ended.
Overall algorithm the XCLS+ method is given in Section 4.1.
4.1. XCLS+ method algorithm
1- Start to look same in node two tree node. If a node is found then does the calculation of
formula then move to step2, otherwise go to Step3.
2- If depth of trees are move toward the lower level in the both trees. Search the same node as
step1. If there is same node, calculate the formula and repeat step2, otherwise go to step3.
3- If depth of tree is (usually in clustered document), move down level in the clustered
document and stay in the same level of the new incoming document. Search again the same
nodes. If there is the same node, calculate the formula And then repeat step2, otherwise repeat
step3.
4.2. Example of XCLS+ method
For understanding the algorithm an example is given in this section. Goal is to find similarity
between tree1 and tree2 base on the XCLS+ method. It should be noted that the tree1 referred to
the incoming document and the tree2 referred to the clustered documents. Arrows of up to down
are indicative the order of execution of algorithm. In this example in the Figure1 for facility the
variable values are calculated and placed on cut arrows
Figure1. An example for showing how operation of the XCLS+ method
After obtaining above factors, the similarity value base on the XCLS+method is:
sim base on XCLS൅ൌ
଴.ହ‫כ‬ቀሺ଴ା଴ሻ‫כ‬ଶమାሺଵା଴ሻ‫כ‬ଶభାሺଶାଶሻ‫כ‬ଶబቁା଴.ହ‫כ‬ቀሺ଴ା଴ሻ‫כ‬ଶమାሺଵା଴ሻ‫כ‬ଶభାሺଶାଶሻ‫כ‬ଶబቁ
ሺଵ‫כ‬ଶభାଷ‫כ‬ଶబሻା଴.ହ‫כ‬ሺ଴‫כ‬ଶమା଴‫כ‬ଶభାଶ‫כ‬ଶబሻ
ൌ0.85
1
2 43
32
1
0
cn=1 cp=0
cn=2 cp=2
L= 0
L= 1
L= 2
tree1
cn=0 cp=0
tree2
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
5
5. Problem for XCLS+ method
With more detail study of algorithms and testing various examples observed that the XCLS+
formula has problem because neglecting repeated nodes. For more clarify presented an example
in the Figure2, 3.
Figure2. Same nodes of two trees are equal in hierarchies
After calculating above factors similar number is:
sim base on XCLS൅ൌ
଴.ହ‫∑כ‬ ሺେ୒భ
౟ ାେ୔భ
౟రషభ
౟సబ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ
ౠ
ାେ୔భ
ౠరషభ
౟సబ ሻ‫כ‬ଶరషౠషభ
∑ ୒ౡరషభ
ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ େ୔భ
౟ ‫כ‬୰రష౟షభା∑ େ୔భ
ౠరషభ
ౠసబ
రషభ
౟సబ ‫כ‬୰రషౠషభቁ
ൌ0.85
For another example in the Figure3 similar number calculated based on the XCLS+ method is:
sim base on XCLS൅ൌ
଴.ହ‫∑כ‬ ሺ஼ேభ
೔ ା஼௉భ
೔రషభ
೔సబ ሻ‫כ‬ଶరష೔షభା଴.ହ‫∑כ‬ ሺ஼ேభ
ೕ
ା஼௉భ
ೕరషభ
೔సబ ሻ‫כ‬ଶరషೕషభ
∑ ேೖరషభ
ೖసబ ‫כ‬ଶరషೖషభା଴.ହ‫כ‬ቀ∑ ஼௉భ
೔‫כ‬௥రష೔షభା∑ ஼௉భ
ೕరషభ
ೕసబ
రషభ
೔సబ ‫כ‬௥రషೕషభቁ
ൌ0.85
As a result calculated for the Figure2 also for trees in the Figure3 similar numbers are equal and
how calculation similar number for trees in the Figure3 is same similar number for trees in the
Figure2. Factors Figure3 for calculating similar number on arrows in Figure3 written. As see all
factors are same with factors in Figure2. Despite significant difference between In the Figure3, 2
have been seen which variable values are equal. So calculated similarity numbers with the
XCLS+ method is equal in Figure2, 3. By comparing two above examples, have been concluded
that similar numbers of first instance should be greater than second example. The results have
obtained with XCLS+ are away from logic and should be different. The primary cause of this
problem is ignoring repeated nodes in original formula with the XCLS+ method. Considering cp
factor regardless of repeated nodes causes order of modified nodes in levels changed and for
different trees, same similarity number be earned. So should method have been proposed which
repeated cases are also included. This paper proposed a method called XCLS++ which tries with
adding a new factor into the formula1 and solving problem the XCLS+ method in finding similar
number be more effective. In continuation proposed method is mentioned in this article.
1
2
4 5
6 54
23 2
01
0
cn=0 cp=0
cn=1 cp=1
cn=1 cp=0
cn=2 cp=2
L= 0
L= 3
L= 2
L= 1
tree2tree1
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
6
Figure3. Same nodes of two trees are not equal in hierarchies
6. XCLS++: proposed method
As mentioned above the XCLS+ method has fundamental problem in calculating similarity
number. This method calculates equal similarity number for trees in the Figure2, 3. So should a
method have been proposed as their calculated similarity numbers be different. In this article tried
a method proposed for solving mentioned problem and gaining optimized results compared with
the XCLS+ method. In this article tried the problem will solve with two steps. Reason of unreal
calculation for same trees in Figure 2, 3 is duplicate nodes. This reason cases knocked hierarchy
of tree nodes. Therefore must factor added to formula which preserves order of nodes. In this
paper, in first step an incremental change are added to the original formula and then in second
step replacement change are apply. Finally the proposed formula will be optimized formula in this
paper. As will see final formula obtained is more reasonable and more realistic and have good
optimality in compare with the XCLS+ method. Steps are mentioned bring in continue.
6-1. step1
In step1 in New mentioned method for solving problem factor FATHER_CHILD_NODE added
in the formua1. This factor in proposed method formula is cc and proposed formula mentioned in
formula2 is:
sim base on XCLS ൅ ൅
ൌ
0.5 ‫כ‬ ∑ ሺCNଵ
୧
൅ CPଵ
୧୪ିଵ
୧ୀ଴ ൅ CCଵ
୧
ሻ ‫כ‬ r୪ି୧ିଵ
൅ 0.5 ‫כ‬ ∑ ሺCNଵ
୨
൅ CPଵ
୨୪ିଵ
୨ୀ଴ ൅ CCଵ
୨
ሻ ‫כ‬ r୪ି୨ିଵ
ሺ∑ ܰ௞௟ିଵ
௞ୀ଴ ‫כ‬ ሺ‫ݎ‬ሻ௟ି௞ିଵሻ ‫כ‬ ‫ݖ‬ ൅ 0.5 ‫כ‬ ൫∑ ሺ‫ܲܥ‬ଵ
௜
൅ ‫ܥܥ‬ଵ
௜
ሻ ‫כ‬ ‫ݎ‬௟ି௜ିଵ ൅ ∑ ሺ‫ܲܥ‬ଵ
௝
൅ ‫ܥܥ‬ଵ
௝
ሻ௟ିଵ
௝ୀ଴
௟ିଵ
௜ୀ଴ ‫כ‬ ‫ݎ‬௟ି௝ିଵ൯
formula2. Step1 stage formula of XCLS++ proposed method
In the above formula, all variables equal with variables in Formula1 and only cc is new factor.
New Factor represents number of same nodes that have same father and same children. In
neglecting this factor in ways that is XCLS+ causes to grow tree toward duplicate child node and
create unreasonable results. Considering above mentioned factor causes problem solved. Because
in this case growing tree of repeated node was determined and proportionate value of cc is placed.
For proving base on optimality of proposed method, similarity numbers for examples in
Figure2, 3 obtained again with proposed formula provided in formula2 and shown that proposed
method for two above mentioned examples which have significant difference, give better results
than the XCLS+ method. The similarity of trees Figure2 and Figure3 is calculated again
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
7
sequentially in Figures4, 5. Result are seen after related Figures. Note which factors without ()
belong for both trees.
Figure4.calculating similarity number for trees base on the step1 optimization
sim base on XCLS ൅ ൅ൌ
଴.ହ‫∑כ‬ ሺେ୒భ
౟ ାେ୔భ
౟రషభ
౟సబ ାେେభ
౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ
ౠ
ାେ୔భ
ౠరషభ
౟సబ ାେେభ
ౠ
ሻ‫כ‬ଶరషౠషభ
∑ ୒ౡరషభ
ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ
౟ ାେେభ
౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ
ౠ
ାେେభ
ౠ
ሻరషభ
ౠసబ
రషభ
౟సబ ‫כ‬୰రషౠషభቁ
ൌ0.85
Figure5.calculating similarity number for trees base on the step1 optimization
sim base on XCLS ൅ ൅ൌ
଴.ହ‫∑כ‬ ሺେ୒భ
౟ ାେ୔భ
౟రషభ
౟సబ ାେେభ
౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ
ౠ
ାେ୔భ
ౠరషభ
౟సబ ାେେభ
ౠ
ሻ‫כ‬ଶరషౠషభ
∑ ୒ౡరషభ
ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ
౟ ାେେభ
౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ
ౠ
ାେେభ
ౠ
ሻరషభ
ౠసబ
రషభ
౟సబ ‫כ‬୰రషౠషభቁ
ൌ0.83
With comparing the two observed numbers contrary the XCLS+ method, the proposed method
able to distinct difference between above mentioned cases. So factor cc able to solve problem
XCLS+. But with next examples have been seen which this factor has problem too and were not
solved problem completely. In continue mentioned examples will be brought.
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
8
6-2. problem for step1
Also step1 able to solve problem of xcls+ but for some examples are problem too. For example
for trees in figure6, 7 similarity numbers calculated with step1 are 0.96 , 0.95 sequentially. Result
are seen after related Figures.
Figure6.calculating similarity number for trees base on step1 optimization
sim base on XCLS ൅ ൅ൌ
଴.ହ‫∑כ‬ ሺେ୒భ
౟ ାେ୔భ
౟రషభ
౟సబ ାେେభ
౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ
ౠ
ାେ୔భ
ౠరషభ
౟సబ ାେେభ
ౠ
ሻ‫כ‬ଶరషౠషభ
∑ ୒ౡరషభ
ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ
౟ ାେେభ
౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ
ౠ
ାେେభ
ౠ
ሻరషభ
ౠసబ
రషభ
౟సబ ‫כ‬୰రషౠషభቁ
ൌ0.96
Figure7.calculating similarity number for others trees base on the step1 optimization
sim base on XCLS ൅ ൅ൌ
଴.ହ‫∑כ‬ ሺେ୒భ
౟ ାେ୔భ
౟రషభ
౟సబ ାେେభ
౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ
ౠ
ାେ୔భ
ౠరషభ
౟సబ ାେେభ
ౠ
ሻ‫כ‬ଶరషౠషభ
∑ ୒ౡరషభ
ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ
౟ ାେେభ
౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ
ౠ
ାେେభ
ౠ
ሻరషభ
ౠసబ
రషభ
౟సబ ‫כ‬୰రషౠషభቁ
ൌ0.95
As see two up examples show step1 only do not ables to solve xcls+ problem completely and for
tow group’s trees calculates unreasonable similarity numbers. So must changed another factor in
formula for earning optimal similarity numbers. This work will be brought in step2.
0
2 3
1
32
0
cn=1 cp=1 cc=0
cn=2 cp=2 cc=2
L= 0
L= 1
L= 2
tree1 tree2
11
cn(tree1)=2 cn(tree2)=1 cp(tree1)=2 cp(tree2)=1
cc=1
0
2 3
1
32
1
0
cn=1 cp=1 cc=0
cn(tree1)=2 cn(tree2)=1 cp(tree1)=2
cp(tree2)=1 cc=0
cn=2 cp=2 cc=2
L= 0
L= 1
L= 2
tree1 tree2
1
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
9
6-3. step2
For latest optimizing proposed method must a factor changed too. Factor which cases unreason
result is same father. In this step rather than it, factor same brother replaced. Latest optimized
formula is:
sim base on XCLS ൅ ൅
ൌ
0.5 ‫כ‬ ∑ ሺCNଵ
୧
൅ CBଵ
୧ସିଵ
୧ୀ଴ ൅ CCଵ
୧
ሻ ‫כ‬ 2ସି୧ିଵ
൅ 0.5 ‫כ‬ ∑ ሺCNଵ
୨
൅ CBଵ
୨ସିଵ
୧ୀ଴ ൅ CCଵ
୨
ሻ ‫כ‬ 2ସି୨ିଵ
∑ N୩ସିଵ
୩ୀ଴ ‫כ‬ 2ସି୩ିଵ ൅ 0.5 ‫כ‬ ൫∑ ሺCBଵ
୧
൅ CCଵ
୧
ሻ ‫כ‬ rସି୧ିଵ ൅ ∑ ሺCBଵ
୨
൅ CCଵ
୨
ሻସିଵ
୨ୀ଴
ସିଵ
୧ୀ଴ ‫כ‬ rସି୨ିଵ൯
formula3. Latest formula of XCLS++ proposed method
Now for trees in Figure6, 7 similarity numbers again calculated with formula3 in Figures8, 9.
Results of calculated are after related Figures:
Figure8.calculating similarity number for Figure6 trees base on step2 optimization
sim base on XCLS ൅ ൅ൌ
଴.ହ‫∑כ‬ ሺେ୒భ
౟ ାେ୆భ
౟రషభ
౟సబ ାେେభ
౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ
ౠ
ାେ୆భ
ౠరషభ
౟సబ ାେେభ
ౠ
ሻ‫כ‬ଶరషౠషభ
∑ ୒ౡరషభ
ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୆భ
౟ ାେେభ
౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୆భ
ౠ
ାେେభ
ౠ
ሻరషభ
ౠసబ
రషభ
౟సబ ‫כ‬୰రషౠషభቁ
ൌ0.86
Figure9.calculating similarity number for Figure7 trees base on step2 optimization
0
2 3
1
32
0
cn=1 cb=0 cc=0
cn=2 cb=2 cc=2
L= 0
L= 1
tree1 tree2
11
cn(tree1)=2 cn(tree2)=1 cb(tree1)=2
cb(tree2)=0 cc=1
0
2 3
1
32
1
0
cn=1 cb=0 cc=0
cn(tree1)=2 cn(tree2)=1 cb(tree1)=2
cb(tree2)=0 cc=0
L= 0
L= 1
L=2
tree1 tree2
1
cn(tree1)=2 cn(tree2)=2 cb(tree1)=0
cb(tree2)=2 cc=2
L= 2
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
10
sim base on XCLS ൅ ൅ൌ
଴.ହ‫∑כ‬ ሺେ୒భ
౟ ାେ୆భ
౟రషభ
౟సబ ାେେభ
౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ
ౠ
ାେ୆భ
ౠరషభ
౟సబ ାେେభ
ౠ
ሻ‫כ‬ଶరషౠషభ
∑ ୒ౡరషభ
ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୆భ
౟ ାେେభ
౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୆భ
ౠ
ାେେభ
ౠ
ሻరషభ
ౠసబ
రషభ
౟సబ ‫כ‬୰రషౠషభቁ
ൌ0.94
As see results calculated with latest optimized formula are reasonable. For clarifying
effectiveness the above factors the XCLS++ and the XCLS+ algorithms implemented and have
been seen that how much this factor plays a fundamental role in applied environment. This work
will be taken in next sections.
7. Evaluating algorithms and comparing methods
As was shown in above examples the XCLS++ approach in comparison with the XCLS+ method
is good. For proving above sentence in this section the XCLS+ and XCLS++ algorithms have
been implemented. Both of them were implemented with C language in DOS environment on a
machine with 2.4 GHZ Intel Celeron CPU and 512 MB of RAM. The evolution criteria were
implemented too in same conditions for evaluating xml files in different types. The results of
experiments like above examples, confirm optimality of the proposed algorithm and efficiency of
the XCLS++ algorithm is higher than the XCLS+ method.
For evaluating similarity diagnostic algorithms and clustering XML documents, must first set of
the incoming files have been specified and after clustering accuracy of algorithms will be
calculated with existent criteria.
7.1. Set data
For evaluating, files divided in two categories for best analysis and both algorithms executed on
same files in same conditions for efficiency comparing. Two mentioned categories are
homogeneous files (from one type DTD) [6] and heterogeneous file (multi type of DTD) [7]. The
results for both categories will be shown separately.
7.2. Evaluation criteria
There are three items for calculating accuracy clustering algorithms: 1-entropy 2-purity 3-fscore
7.2.1 Entropy
Entropy is sum documents which located in the cluster i which are of the class r. The entropy
formula is:
‫ݕ݌݋ݎݐ݊ܧ‬ ൌ ෍
݊௜
N
௞
௜ୀଵ
‫ܧ‬ሺ‫ܥ‬௜ሻ
ܽ‫ݏ‬
‫ܧ‬ሺ‫ܥ‬௜ሻ ൌ
ଵ
௟௢௚௞
∑
୬౟
౨
௡೔
݈‫݃݋‬௞
௜ୀଵ
୬౟
౨
௡೔
In above formulas,‫ܥ‬௜, N, k, ݊௜ and n୧
୰
are respectively ith cluster, total number of incoming
documents, number of clusters, number clustered documents in cluster i and number clustered
documents in cluster i of class r. The entropy value if be closer to zero is better and has good
efficiency.
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
11
7.2.2 Purity
Purity is sum maximum documents which located in the cluster i which are of the class r. The
purity formula is:
ܲ‫ݕݐ݅ݎݑ‬ ൌ ෍
݊௜
N
௞
௜ୀଵ
ܲሺ‫ܥ‬௜ሻ
ܽ‫ݏ‬
ܲሺ‫ܥ‬௜ሻ ൌ
ଵ
௡೔
maxሺn୧
୰
ሻ
The entropy value if be closer to one is better and has good efficiency.
7.2.3 Fscore
Fscore is another item created by combination of above two items and is:
‫݁ݎ݋ܿܵܨ‬ ൌ
∑ ݊௥‫ܨ‬ሺܼ௥, ‫ܥ‬௜ሻ௞
௥ୀଵ
ܰ
As
‫ܨ‬ሺܼ௥, ‫ܥ‬௜ሻ ൌ
ܲሺܼ௥, ‫ܥ‬௜ሻ ‫כ‬ ‫ݎ‬ሺܼ௥, ‫ܥ‬௜ሻ
ܲሺܼ௥, ‫ܥ‬௜ሻ ൅ ‫ݎ‬ሺܼ௥, ‫ܥ‬௜ሻ
ൌ
2 ‫כ‬ n୧
୰
݊௜ ൅ ݊௥
,
‫ݎ‬ሺܼ௥, ‫ܥ‬௜ሻ ൌ
n୧
୰
݊௥
,
ܲሺܼ௥, ‫ܥ‬௜ሻ ൌ
n୧
୰
݊௜
The fscore value if be closer to one is better and has good efficiency.
In this paper, after implementation algorithms and above criteria for both categories, results are
calculated and compared for analyzing. For testing implemented program, XML files consists of
100 different classes, such as medical files, colleges, shops, cars, insurance, etc... have been
considered. As above mentioned, input files divided into two parts and with the XCLS+ and
XCLS++ algorithms were evaluated separately. The results of algorithm on the homogeneous
files are in Table 1 and include:
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
12
Table1. Results of algorithms executed on homogeneous files
The results of algorithm on the heterogeneous files are in Table 2 and include:
Table2. Results of algorithms executed on heterogeneous files
As previous section results, the results obtained in this section shows which the XCLS++ method
has higher efficiency than the XCLS+ method. Also results show which proposed method has
good efficiency for homogeneous files in comparison with heterogeneous files. Because probably
for existence repeated node in homogeneous file is up.
8. Result and conclusion
The purpose of this paper is classification XML documents in order to expedite search and others
benefits that classification has them. Criteria for classifying is structural or content and structure
with content. The XCLS+ method is a method of classification methods which criteria for
classification is done base on structure. Despite good performance for the XCLS+ method in
compared with the XCLS method, ignoring repeated nodes in some documents causes it be
inefficient. Therefore, this paper proposed FATHER-NODE-CHAILD and BROTHER-NODE
factors to the XCLS+ formula for achieving good efficiency. The results proposed idea as
entropy, purity and Fscore show which the proposed method works better than the previous
method. In future weight of levels will be changed for obtaining similarity number actually and
better than proposed method too. Also the new method will be exam on much more documents in
future and with comparing those, will be able to obtain better results.
9. References
International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012
13
[1] Ilwan Choi, Bongki Moon, Hyoung-Joo Kin, (2006) Data & Knowledge Engineering, A clustering
method based on path similarities of XML data.
[2] Andrewdn, jag, (2007 (Information systems engineering, Evaluting Structural Similarity in XML
DocumentWISE’07 Proceedings of the 8th international conference on Web Information.
[3] Tien Tran, Richi, Peter, (2008 (Data Mining, Combining Structure and Content Similarities for
XML Document Clustering, Conference 27-28 November, Glenelg, South Australia.
[4] WOOSAENG KIM, (2008( Computer Engineering and Applications, XML document similarity
measure in terms of the structure and contents, CEA'08 Proceedings of the 2nd WSEAS
International Conference.
[5] G.R.Nayak, (2008) “Fast and effective clustering of XML data using structural information,“
knowl. Inf. Syst.
[6] The XML data repository. Accessed from: http://www.infomatik.uni-trire.de
[7] The XML data repository. Accessd from: http://www.cs.washington.edu
[8] Waraporn Viyanon, Sanjay K.Madria, Sourav S.Bhowmick, (2008) Management of Data,
XML Data Integration Based on Content and Structure Similarity Using Keys.
[9] aptarshi Ghosh and Pabitra Mitra, (2008) Pattern Recognition, ICPR Combining Content and
Similarity for XML Document Classification using Composite SVM Kernels, , 19th
International Conference.
[10] jing PengDong Qing Yang Shi Wei Tang et al, (2008) similarity in chinese text processing, A New
Similarity competing method based on concept, series F: Information science, 51(9): P1212-1230, 9.
[11] Mohamad Alishahi, Mohmoud Naghibzadeh and Baharak Shakeri Aski, (2010) International
Journal of Computer and Electrical Engineering ,Tag Name Structure-based Clustering of XML
Documents, VOL. 2, NO. 1, February.
10. Authors
Ahmad Khodayar received master science of computer engineering in 2010 from Islamic
Azad University of Shabestar, Iran. His current research area is data mining and specially
clustering; now he is working on a project about new way in clustering.
Hassan Naderi received his PhD degree in 2006 from INSA-LYON university of France.
His current research areas are text mining, search engine and massive data processing.

More Related Content

What's hot

Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnIOSR Journals
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
 
Paper Explained: Understanding the wiring evolution in differentiable neural ...
Paper Explained: Understanding the wiring evolution in differentiable neural ...Paper Explained: Understanding the wiring evolution in differentiable neural ...
Paper Explained: Understanding the wiring evolution in differentiable neural ...Devansh16
 
Efficiency of TreeMatch Algorithm in XML Tree Pattern Matching
Efficiency of TreeMatch Algorithm in XML Tree Pattern  MatchingEfficiency of TreeMatch Algorithm in XML Tree Pattern  Matching
Efficiency of TreeMatch Algorithm in XML Tree Pattern MatchingIOSR Journals
 
Duplicate Detection in Hierarchical Data Using XPath
Duplicate Detection in Hierarchical Data Using XPathDuplicate Detection in Hierarchical Data Using XPath
Duplicate Detection in Hierarchical Data Using XPathiosrjce
 
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...ijcsitcejournal
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
 
Information retrieval 15 alternative algebraic models
Information retrieval 15 alternative algebraic modelsInformation retrieval 15 alternative algebraic models
Information retrieval 15 alternative algebraic modelsVaibhav Khanna
 
Semantic Annotation of Documents
Semantic Annotation of DocumentsSemantic Annotation of Documents
Semantic Annotation of Documentssubash chandra
 
Language independent document
Language independent documentLanguage independent document
Language independent documentijcsit
 
PAGOdA paper
PAGOdA paperPAGOdA paper
PAGOdA paperDBOnto
 
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...ijaia
 
16. Algo analysis & Design - Data Structures using C++ by Varsha Patil
16. Algo analysis & Design - Data Structures using C++ by Varsha Patil16. Algo analysis & Design - Data Structures using C++ by Varsha Patil
16. Algo analysis & Design - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Data Structure
Data StructureData Structure
Data Structuresheraz1
 
Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine LearningIRJET Journal
 
Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase IJECEIAES
 
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...widespreadpromotion
 

What's hot (19)

Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
 
Paper Explained: Understanding the wiring evolution in differentiable neural ...
Paper Explained: Understanding the wiring evolution in differentiable neural ...Paper Explained: Understanding the wiring evolution in differentiable neural ...
Paper Explained: Understanding the wiring evolution in differentiable neural ...
 
Efficiency of TreeMatch Algorithm in XML Tree Pattern Matching
Efficiency of TreeMatch Algorithm in XML Tree Pattern  MatchingEfficiency of TreeMatch Algorithm in XML Tree Pattern  Matching
Efficiency of TreeMatch Algorithm in XML Tree Pattern Matching
 
Duplicate Detection in Hierarchical Data Using XPath
Duplicate Detection in Hierarchical Data Using XPathDuplicate Detection in Hierarchical Data Using XPath
Duplicate Detection in Hierarchical Data Using XPath
 
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
Information retrieval 15 alternative algebraic models
Information retrieval 15 alternative algebraic modelsInformation retrieval 15 alternative algebraic models
Information retrieval 15 alternative algebraic models
 
Semantic Annotation of Documents
Semantic Annotation of DocumentsSemantic Annotation of Documents
Semantic Annotation of Documents
 
Language independent document
Language independent documentLanguage independent document
Language independent document
 
PAGOdA paper
PAGOdA paperPAGOdA paper
PAGOdA paper
 
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
 
16. Algo analysis & Design - Data Structures using C++ by Varsha Patil
16. Algo analysis & Design - Data Structures using C++ by Varsha Patil16. Algo analysis & Design - Data Structures using C++ by Varsha Patil
16. Algo analysis & Design - Data Structures using C++ by Varsha Patil
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Data Structure
Data StructureData Structure
Data Structure
 
mlss
mlssmlss
mlss
 
Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine Learning
 
Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase
 
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
 

Similar to XCLS++: A new algorithm to improve XCLS+ for clustering XML documents

Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015ijcsbi
 
XML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGS
XML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGSXML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGS
XML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGSijdms
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
INVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTS
INVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTSINVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTS
INVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTScsandit
 
Xml data clustering an overview
Xml data clustering an overviewXml data clustering an overview
Xml data clustering an overviewunyil96
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmIRJET Journal
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseEditor IJCATR
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESijwscjournal
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESijwscjournal
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 
Xml based data exchange in the
Xml based data exchange in theXml based data exchange in the
Xml based data exchange in theIJwest
 
Multi-GranularityUser Friendly List Locking Protocol for XML Repetitive Data
Multi-GranularityUser Friendly List Locking Protocol for XML Repetitive DataMulti-GranularityUser Friendly List Locking Protocol for XML Repetitive Data
Multi-GranularityUser Friendly List Locking Protocol for XML Repetitive DataWaqas Tariq
 
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
 

Similar to XCLS++: A new algorithm to improve XCLS+ for clustering XML documents (20)

Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015
 
Ijert semi 1
Ijert semi 1Ijert semi 1
Ijert semi 1
 
XML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGS
XML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGSXML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGS
XML COMPACTION IMPROVEMENTS BASED ON BINARY STRING ENCODINGS
 
J017616976
J017616976J017616976
J017616976
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
INVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTS
INVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTSINVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTS
INVESTIGATING BINARY STRING ENCODING FOR COMPACT REPRESENTATION OF XML DOCUMENTS
 
Xml data clustering an overview
Xml data clustering an overviewXml data clustering an overview
Xml data clustering an overview
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch Algorithm
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
Expression of Query in XML object-oriented database
Expression of Query in XML object-oriented databaseExpression of Query in XML object-oriented database
Expression of Query in XML object-oriented database
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULES
 
RELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULESRELATIONAL STORAGE FOR XML RULES
RELATIONAL STORAGE FOR XML RULES
 
A survey of xml tree patterns
A survey of xml tree patternsA survey of xml tree patterns
A survey of xml tree patterns
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
A survey of xml tree patterns
A survey of xml tree patternsA survey of xml tree patterns
A survey of xml tree patterns
 
Xml based data exchange in the
Xml based data exchange in theXml based data exchange in the
Xml based data exchange in the
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
Multi-GranularityUser Friendly List Locking Protocol for XML Repetitive Data
Multi-GranularityUser Friendly List Locking Protocol for XML Repetitive DataMulti-GranularityUser Friendly List Locking Protocol for XML Repetitive Data
Multi-GranularityUser Friendly List Locking Protocol for XML Repetitive Data
 
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
 

More from IJITCA Journal

The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)IJITCA Journal
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)IJITCA Journal
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)IJITCA Journal
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)IJITCA Journal
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...IJITCA Journal
 

More from IJITCA Journal (20)

The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)
 
International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)International Journal of Information Technology, Control and Automation (IJITCA)
International Journal of Information Technology, Control and Automation (IJITCA)
 
The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...The International Journal of Information Technology, Control and Automation (...
The International Journal of Information Technology, Control and Automation (...
 

Recently uploaded

Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdfsahilsajad201
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfChristianCDAM
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfNainaShrivastava14
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfBalamuruganV28
 

Recently uploaded (20)

Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdf
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdf
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdf
 

XCLS++: A new algorithm to improve XCLS+ for clustering XML documents

  • 1. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 DOI:10.5121/ijitca.2012.2401 1 XCLS++: A new algorithm to improve XCLS+ for clustering XML documents Ahmad khodayar1 and Hassan naderi2 1 Department of Computer Engineering, Shabestar Islamic Azad University of Science and Technology, Shabestar, Iran ahmad_kho2000@yahoo.com 2 Department of Computer Engineering, Iran University of Science and Technology, Resalat st., Tehran, Iran naderi@iust.ac.ir Abstract The purpose of this paper is to offer a method for clustering of XML documents. Many different ways of clustering were discussed for clustering documents which can be divided to structure, content and combination of structure and content. One method that has been proposed to solve this problem is XCLS which be improved with XCLS+ later. XCLS+ is efficient algorithm for clustering XML documents and Exposure into structure based on clustering category. The conducted survey showed which XCLS+ method has problem that makes away from its optimal value too. This paper presents a method with name XCLS++ which has not related problem XCLS+ and its efficiency are diagnosed more than XCLS+. Keywords Clustering, XML documents, XCLS+ algorithm, Content similarity of document, Structure similarity of document 1. Introduction The XML documents are currently devoted for exchanging the largest volume of textual information on the internet. Interesting characteristics of XML, including ability to describe the information (which makes information store in any computer), be its textual (been processed with any operating system and software) and have been understood with machines and humans, have caused particular popularity of this format in the IT community. On the other hand, because this technology is relatively new, high-performance systems that enable to process it, are still being developed. XML documents easy management and access to content with giving structure to content. So because XML document is composed of content and structure therefore in processing of XML documents should are considered these two parts. With increased spread of information, for its storage, retrieval and transmission through internet, we need to manage that information basically. Clustering is very useful in above processes and for summarizing and reducing size of documents. Clustering XML documents also not an exception and as many applications are faced with problem of clustering. For example, with representative of each cluster obtained with an efficient clustering algorithm can very large volume set of documents to store more briefly. Formula for calculating is the first appliances required for clustering set of documents.
  • 2. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 2 Methods to determine similarity two XML documents are generally divided into three categories: content [10], structure [2] [5] [11] and combination of mentioned methods [1] [3] [4] [8] [9]. Clustering XML documents base on content is very similar to clustering textual documents which a lot of works have been performed in this field. On the other words in this kind of clustering, document tags omitted or considered as content, and clustered with planned algorithms easily. In this way, criterion for clustering is content and if documents have content similarity then has been laid in same cluster. Content based on clustering used for searching textual documents with Google and AltaVista engines. Clustering base on structure tries to find documents which have structure similarity and puts them in similar cluster. Combined clustering uses advantages two types of content and structural clustering. One of the methods of structural clustering XML document that has good efficiency is XCLS [5]. Performance of this method is to find similar tag name of nodes in levels of tree correspondence to document and then calculating similar number. But fundamental problem XCLS is neglect relationship FATHER-NODE that makes order of nodes is not preserved. XCLS+ method are created for improving and solving the problem mentioned XCLS method [11]. XCLS+ method has tried by adding a factor as relationship FATHER-NODE in XCLS related formula and solved the problem and improved previous method. But with careful study have been observed which the XCLS+ method because not considering duplicate nodes has problem and is way from optimality. Therefore this article tries to improve similar formula calculation XCLS+ and a new method called XCLS++ is established. In section2 tree model in XML documents will be reviewed. Section3 discusses previous works and then in section4 detail analysis of the XCLS+ method will be expressed. In section5 the relative bug of XCLS+ method will be offered. The proposed method will be offered in section6 for solving the bug XCLS+ method. Then in next section, results of proposed method are presented and compared with the XCLS+ method and will be fond that optimality of the proposed method is good in comparison with XCLS+ method. In final section summary and conclusions will be brought. 2. Tree model of XML documents As the XML documents are the tree format, so documents can to model in tree case. In this case, problem of clustering the XML documents will decrease into clustering trees. The structural methods benefit of these trees. The XCLS+ method that uses of tree model is an example which is successful in clustering the XML documents. The XCLS+ method optimized the XCLS method. Despite improvements with XCLS+ on XCLS, but XCLS+ has bug too. The main reason this method which is not optimal (from now on we talk only about XCLS+), neglecting duplicate nodes in the tree levels which in this article have been tried to fix it. 3. Previous works Clustering criteria is to find similar documents. Similarity of documents should be found with a certain similarity and clustering is done base on similarity. As mentioned above three methods for finding similarities has been suggested and are: 1- content [10] 2- structure [2] [5] [11] 3- content with structure [1] [3] [4] [8] [9]. As we know, each XML document can be changed to tree and clustering operations can be done with these trees. The structural methods considered only the tree structure and do not work with content. XCLS+ is the structural approach which in this article has focused on this type clustering. As previous mentioned, this method have been created to optimize the XCLS method. Experiments view that the XCLS+ method despite good performance compared to the XCLS method has a basic problem that decrease efficiencies in some circumstances. with more studies was concluded that main reason for low efficiency of this method is to neglect duplicate similar nodes in trees which in this paper tried to present solution
  • 3. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 3 could solve bug and improve the XCLS+ algorithm. More details in this article will be brought in the next. In this paper, first XCLS+ have been studied then proposal algorithm offered. Finally the results compared with proposed method and have been seen which optimality of the proposed method is better than the XCLS+ method. 4. XCLS+ method This method uses a similar structure and acts in more detail base on tag similarity between two levels of tree nodes related to document tags. In this way the incoming XML document compared with clustered documents and if there is acceptable similar number between those, new XML document is placed beside relevant clustered document. Acceptable new incoming document combined with document in cluster and form a new tree. In other words, in each cluster there is a single tree. If similarity value is not acceptable in this case a new cluster has created and incoming document is placed in it. This practice will continue until entrance final document. The similarity calculation is done based on Formula1. This formula is created for improving the XCLS method. In this way operations taken when document arrival and how execute it include: first root node of incoming document has compared with root node of clustered documents then if there are similar, factors the Formula1 is calculated and comparing levels nodes in each document have been continued in a level down for similarity calculation. In the absence of similarity, the next comparison is done between node in the clustered document in a level down and the same node of the incoming document. This compares has been continued to the leaves levels of document. As previously noted, similarity value have been calculated with: sim base on XCLS൅ ൌ 0.5 ‫כ‬ ∑ ሺCNଵ ୧ ൅ CPଵ ୧୪ିଵ ୧ୀ଴ ሻ ‫כ‬ r୪ି୧ିଵ ൅ 0.5 ‫כ‬ ∑ ሺCNଵ ୨ ൅ CPଵ ୨୪ିଵ ୨ୀ଴ ሻ ‫כ‬ r୪ି୨ିଵ ሺ∑ N୩୪ିଵ ୩ୀ଴ ‫כ‬ ሺrሻ୪ି୩ିଵሻ ‫כ‬ z ൅ 0.5 ‫כ‬ ൫∑ CPଵ ୧ ‫כ‬ r୪ି୧ିଵ ൅ ∑ CPଵ ୨୪ିଵ ୨ୀ଴ ୪ିଵ ୧ୀ଴ ‫כ‬ r୪ି୨ିଵ൯ Formula1. The XCLS+ method based on formula Numeric Value Formula1 is variable between zero and one. This value will change relatively on similar. The variables in the above formula are: 1- Z cluster size or in other words, the number of documents within cluster. 2- CNi is the total of similar nodes between level i of the new incoming document and level j of the clustered documents. 3- CNj is the total of similar nodes between level j of the clustered documents and level i of the new incoming document. 4- CPi is the total of similar nodes between level i of the new incoming document and level j of the clustered documents as have the same father. 5- CPj is the total of similar nodes between in level j of the clustered documents and in level i of the new incoming document as have the same father. 6- N in Nk is the number of elements in level k of the incoming tree. 7- l is high of tree in the each document.
  • 4. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 4 8- i, j are desired number of level. 9- r is the incremental factor, which is considered number 2. 10- k equal with 2. The above formula is active in every jump between levels document and the similarity Value calculated is sum of obtained factors. This practice continues until Level one of documents ended. Overall algorithm the XCLS+ method is given in Section 4.1. 4.1. XCLS+ method algorithm 1- Start to look same in node two tree node. If a node is found then does the calculation of formula then move to step2, otherwise go to Step3. 2- If depth of trees are move toward the lower level in the both trees. Search the same node as step1. If there is same node, calculate the formula and repeat step2, otherwise go to step3. 3- If depth of tree is (usually in clustered document), move down level in the clustered document and stay in the same level of the new incoming document. Search again the same nodes. If there is the same node, calculate the formula And then repeat step2, otherwise repeat step3. 4.2. Example of XCLS+ method For understanding the algorithm an example is given in this section. Goal is to find similarity between tree1 and tree2 base on the XCLS+ method. It should be noted that the tree1 referred to the incoming document and the tree2 referred to the clustered documents. Arrows of up to down are indicative the order of execution of algorithm. In this example in the Figure1 for facility the variable values are calculated and placed on cut arrows Figure1. An example for showing how operation of the XCLS+ method After obtaining above factors, the similarity value base on the XCLS+method is: sim base on XCLS൅ൌ ଴.ହ‫כ‬ቀሺ଴ା଴ሻ‫כ‬ଶమାሺଵା଴ሻ‫כ‬ଶభାሺଶାଶሻ‫כ‬ଶబቁା଴.ହ‫כ‬ቀሺ଴ା଴ሻ‫כ‬ଶమାሺଵା଴ሻ‫כ‬ଶభାሺଶାଶሻ‫כ‬ଶబቁ ሺଵ‫כ‬ଶభାଷ‫כ‬ଶబሻା଴.ହ‫כ‬ሺ଴‫כ‬ଶమା଴‫כ‬ଶభାଶ‫כ‬ଶబሻ ൌ0.85 1 2 43 32 1 0 cn=1 cp=0 cn=2 cp=2 L= 0 L= 1 L= 2 tree1 cn=0 cp=0 tree2
  • 5. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 5 5. Problem for XCLS+ method With more detail study of algorithms and testing various examples observed that the XCLS+ formula has problem because neglecting repeated nodes. For more clarify presented an example in the Figure2, 3. Figure2. Same nodes of two trees are equal in hierarchies After calculating above factors similar number is: sim base on XCLS൅ൌ ଴.ହ‫∑כ‬ ሺେ୒భ ౟ ାେ୔భ ౟రషభ ౟సబ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ ౠ ାେ୔భ ౠరషభ ౟సబ ሻ‫כ‬ଶరషౠషభ ∑ ୒ౡరషభ ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ େ୔భ ౟ ‫כ‬୰రష౟షభା∑ େ୔భ ౠరషభ ౠసబ రషభ ౟సబ ‫כ‬୰రషౠషభቁ ൌ0.85 For another example in the Figure3 similar number calculated based on the XCLS+ method is: sim base on XCLS൅ൌ ଴.ହ‫∑כ‬ ሺ஼ேభ ೔ ା஼௉భ ೔రషభ ೔సబ ሻ‫כ‬ଶరష೔షభା଴.ହ‫∑כ‬ ሺ஼ேభ ೕ ା஼௉భ ೕరషభ ೔సబ ሻ‫כ‬ଶరషೕషభ ∑ ேೖరషభ ೖసబ ‫כ‬ଶరషೖషభା଴.ହ‫כ‬ቀ∑ ஼௉భ ೔‫כ‬௥రష೔షభା∑ ஼௉భ ೕరషభ ೕసబ రషభ ೔సబ ‫כ‬௥రషೕషభቁ ൌ0.85 As a result calculated for the Figure2 also for trees in the Figure3 similar numbers are equal and how calculation similar number for trees in the Figure3 is same similar number for trees in the Figure2. Factors Figure3 for calculating similar number on arrows in Figure3 written. As see all factors are same with factors in Figure2. Despite significant difference between In the Figure3, 2 have been seen which variable values are equal. So calculated similarity numbers with the XCLS+ method is equal in Figure2, 3. By comparing two above examples, have been concluded that similar numbers of first instance should be greater than second example. The results have obtained with XCLS+ are away from logic and should be different. The primary cause of this problem is ignoring repeated nodes in original formula with the XCLS+ method. Considering cp factor regardless of repeated nodes causes order of modified nodes in levels changed and for different trees, same similarity number be earned. So should method have been proposed which repeated cases are also included. This paper proposed a method called XCLS++ which tries with adding a new factor into the formula1 and solving problem the XCLS+ method in finding similar number be more effective. In continuation proposed method is mentioned in this article. 1 2 4 5 6 54 23 2 01 0 cn=0 cp=0 cn=1 cp=1 cn=1 cp=0 cn=2 cp=2 L= 0 L= 3 L= 2 L= 1 tree2tree1
  • 6. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 6 Figure3. Same nodes of two trees are not equal in hierarchies 6. XCLS++: proposed method As mentioned above the XCLS+ method has fundamental problem in calculating similarity number. This method calculates equal similarity number for trees in the Figure2, 3. So should a method have been proposed as their calculated similarity numbers be different. In this article tried a method proposed for solving mentioned problem and gaining optimized results compared with the XCLS+ method. In this article tried the problem will solve with two steps. Reason of unreal calculation for same trees in Figure 2, 3 is duplicate nodes. This reason cases knocked hierarchy of tree nodes. Therefore must factor added to formula which preserves order of nodes. In this paper, in first step an incremental change are added to the original formula and then in second step replacement change are apply. Finally the proposed formula will be optimized formula in this paper. As will see final formula obtained is more reasonable and more realistic and have good optimality in compare with the XCLS+ method. Steps are mentioned bring in continue. 6-1. step1 In step1 in New mentioned method for solving problem factor FATHER_CHILD_NODE added in the formua1. This factor in proposed method formula is cc and proposed formula mentioned in formula2 is: sim base on XCLS ൅ ൅ ൌ 0.5 ‫כ‬ ∑ ሺCNଵ ୧ ൅ CPଵ ୧୪ିଵ ୧ୀ଴ ൅ CCଵ ୧ ሻ ‫כ‬ r୪ି୧ିଵ ൅ 0.5 ‫כ‬ ∑ ሺCNଵ ୨ ൅ CPଵ ୨୪ିଵ ୨ୀ଴ ൅ CCଵ ୨ ሻ ‫כ‬ r୪ି୨ିଵ ሺ∑ ܰ௞௟ିଵ ௞ୀ଴ ‫כ‬ ሺ‫ݎ‬ሻ௟ି௞ିଵሻ ‫כ‬ ‫ݖ‬ ൅ 0.5 ‫כ‬ ൫∑ ሺ‫ܲܥ‬ଵ ௜ ൅ ‫ܥܥ‬ଵ ௜ ሻ ‫כ‬ ‫ݎ‬௟ି௜ିଵ ൅ ∑ ሺ‫ܲܥ‬ଵ ௝ ൅ ‫ܥܥ‬ଵ ௝ ሻ௟ିଵ ௝ୀ଴ ௟ିଵ ௜ୀ଴ ‫כ‬ ‫ݎ‬௟ି௝ିଵ൯ formula2. Step1 stage formula of XCLS++ proposed method In the above formula, all variables equal with variables in Formula1 and only cc is new factor. New Factor represents number of same nodes that have same father and same children. In neglecting this factor in ways that is XCLS+ causes to grow tree toward duplicate child node and create unreasonable results. Considering above mentioned factor causes problem solved. Because in this case growing tree of repeated node was determined and proportionate value of cc is placed. For proving base on optimality of proposed method, similarity numbers for examples in Figure2, 3 obtained again with proposed formula provided in formula2 and shown that proposed method for two above mentioned examples which have significant difference, give better results than the XCLS+ method. The similarity of trees Figure2 and Figure3 is calculated again
  • 7. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 7 sequentially in Figures4, 5. Result are seen after related Figures. Note which factors without () belong for both trees. Figure4.calculating similarity number for trees base on the step1 optimization sim base on XCLS ൅ ൅ൌ ଴.ହ‫∑כ‬ ሺେ୒భ ౟ ାେ୔భ ౟రషభ ౟సబ ାେେభ ౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ ౠ ାେ୔భ ౠరషభ ౟సబ ାେେభ ౠ ሻ‫כ‬ଶరషౠషభ ∑ ୒ౡరషభ ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ ౟ ାେେభ ౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ ౠ ାେେభ ౠ ሻరషభ ౠసబ రషభ ౟సబ ‫כ‬୰రషౠషభቁ ൌ0.85 Figure5.calculating similarity number for trees base on the step1 optimization sim base on XCLS ൅ ൅ൌ ଴.ହ‫∑כ‬ ሺେ୒భ ౟ ାେ୔భ ౟రషభ ౟సబ ାେେభ ౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ ౠ ାେ୔భ ౠరషభ ౟సబ ାେେభ ౠ ሻ‫כ‬ଶరషౠషభ ∑ ୒ౡరషభ ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ ౟ ାେେభ ౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ ౠ ାେେభ ౠ ሻరషభ ౠసబ రషభ ౟సబ ‫כ‬୰రషౠషభቁ ൌ0.83 With comparing the two observed numbers contrary the XCLS+ method, the proposed method able to distinct difference between above mentioned cases. So factor cc able to solve problem XCLS+. But with next examples have been seen which this factor has problem too and were not solved problem completely. In continue mentioned examples will be brought.
  • 8. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 8 6-2. problem for step1 Also step1 able to solve problem of xcls+ but for some examples are problem too. For example for trees in figure6, 7 similarity numbers calculated with step1 are 0.96 , 0.95 sequentially. Result are seen after related Figures. Figure6.calculating similarity number for trees base on step1 optimization sim base on XCLS ൅ ൅ൌ ଴.ହ‫∑כ‬ ሺେ୒భ ౟ ାେ୔భ ౟రషభ ౟సబ ାେେభ ౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ ౠ ାେ୔భ ౠరషభ ౟సబ ାେେభ ౠ ሻ‫כ‬ଶరషౠషభ ∑ ୒ౡరషభ ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ ౟ ାେେభ ౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ ౠ ାେେభ ౠ ሻరషభ ౠసబ రషభ ౟సబ ‫כ‬୰రషౠషభቁ ൌ0.96 Figure7.calculating similarity number for others trees base on the step1 optimization sim base on XCLS ൅ ൅ൌ ଴.ହ‫∑כ‬ ሺେ୒భ ౟ ାେ୔భ ౟రషభ ౟సబ ାେେభ ౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ ౠ ାେ୔భ ౠరషభ ౟సబ ାେେభ ౠ ሻ‫כ‬ଶరషౠషభ ∑ ୒ౡరషభ ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୔భ ౟ ାେେభ ౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୔భ ౠ ାେେభ ౠ ሻరషభ ౠసబ రషభ ౟సబ ‫כ‬୰రషౠషభቁ ൌ0.95 As see two up examples show step1 only do not ables to solve xcls+ problem completely and for tow group’s trees calculates unreasonable similarity numbers. So must changed another factor in formula for earning optimal similarity numbers. This work will be brought in step2. 0 2 3 1 32 0 cn=1 cp=1 cc=0 cn=2 cp=2 cc=2 L= 0 L= 1 L= 2 tree1 tree2 11 cn(tree1)=2 cn(tree2)=1 cp(tree1)=2 cp(tree2)=1 cc=1 0 2 3 1 32 1 0 cn=1 cp=1 cc=0 cn(tree1)=2 cn(tree2)=1 cp(tree1)=2 cp(tree2)=1 cc=0 cn=2 cp=2 cc=2 L= 0 L= 1 L= 2 tree1 tree2 1
  • 9. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 9 6-3. step2 For latest optimizing proposed method must a factor changed too. Factor which cases unreason result is same father. In this step rather than it, factor same brother replaced. Latest optimized formula is: sim base on XCLS ൅ ൅ ൌ 0.5 ‫כ‬ ∑ ሺCNଵ ୧ ൅ CBଵ ୧ସିଵ ୧ୀ଴ ൅ CCଵ ୧ ሻ ‫כ‬ 2ସି୧ିଵ ൅ 0.5 ‫כ‬ ∑ ሺCNଵ ୨ ൅ CBଵ ୨ସିଵ ୧ୀ଴ ൅ CCଵ ୨ ሻ ‫כ‬ 2ସି୨ିଵ ∑ N୩ସିଵ ୩ୀ଴ ‫כ‬ 2ସି୩ିଵ ൅ 0.5 ‫כ‬ ൫∑ ሺCBଵ ୧ ൅ CCଵ ୧ ሻ ‫כ‬ rସି୧ିଵ ൅ ∑ ሺCBଵ ୨ ൅ CCଵ ୨ ሻସିଵ ୨ୀ଴ ସିଵ ୧ୀ଴ ‫כ‬ rସି୨ିଵ൯ formula3. Latest formula of XCLS++ proposed method Now for trees in Figure6, 7 similarity numbers again calculated with formula3 in Figures8, 9. Results of calculated are after related Figures: Figure8.calculating similarity number for Figure6 trees base on step2 optimization sim base on XCLS ൅ ൅ൌ ଴.ହ‫∑כ‬ ሺେ୒భ ౟ ାେ୆భ ౟రషభ ౟సబ ାେେభ ౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ ౠ ାେ୆భ ౠరషభ ౟సబ ାେେభ ౠ ሻ‫כ‬ଶరషౠషభ ∑ ୒ౡరషభ ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୆భ ౟ ାେେభ ౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୆భ ౠ ାେେభ ౠ ሻరషభ ౠసబ రషభ ౟సబ ‫כ‬୰రషౠషభቁ ൌ0.86 Figure9.calculating similarity number for Figure7 trees base on step2 optimization 0 2 3 1 32 0 cn=1 cb=0 cc=0 cn=2 cb=2 cc=2 L= 0 L= 1 tree1 tree2 11 cn(tree1)=2 cn(tree2)=1 cb(tree1)=2 cb(tree2)=0 cc=1 0 2 3 1 32 1 0 cn=1 cb=0 cc=0 cn(tree1)=2 cn(tree2)=1 cb(tree1)=2 cb(tree2)=0 cc=0 L= 0 L= 1 L=2 tree1 tree2 1 cn(tree1)=2 cn(tree2)=2 cb(tree1)=0 cb(tree2)=2 cc=2 L= 2
  • 10. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 10 sim base on XCLS ൅ ൅ൌ ଴.ହ‫∑כ‬ ሺେ୒భ ౟ ାେ୆భ ౟రషభ ౟సబ ାେେభ ౟ ሻ‫כ‬ଶరష౟షభା଴.ହ‫∑כ‬ ሺେ୒భ ౠ ାେ୆భ ౠరషభ ౟సబ ାେେభ ౠ ሻ‫כ‬ଶరషౠషభ ∑ ୒ౡరషభ ౡసబ ‫כ‬ଶరషౡషభା଴.ହ‫כ‬ቀ∑ ሺେ୆భ ౟ ାେେభ ౟ ሻ‫כ‬୰రష౟షభା∑ ሺେ୆భ ౠ ାେେభ ౠ ሻరషభ ౠసబ రషభ ౟సబ ‫כ‬୰రషౠషభቁ ൌ0.94 As see results calculated with latest optimized formula are reasonable. For clarifying effectiveness the above factors the XCLS++ and the XCLS+ algorithms implemented and have been seen that how much this factor plays a fundamental role in applied environment. This work will be taken in next sections. 7. Evaluating algorithms and comparing methods As was shown in above examples the XCLS++ approach in comparison with the XCLS+ method is good. For proving above sentence in this section the XCLS+ and XCLS++ algorithms have been implemented. Both of them were implemented with C language in DOS environment on a machine with 2.4 GHZ Intel Celeron CPU and 512 MB of RAM. The evolution criteria were implemented too in same conditions for evaluating xml files in different types. The results of experiments like above examples, confirm optimality of the proposed algorithm and efficiency of the XCLS++ algorithm is higher than the XCLS+ method. For evaluating similarity diagnostic algorithms and clustering XML documents, must first set of the incoming files have been specified and after clustering accuracy of algorithms will be calculated with existent criteria. 7.1. Set data For evaluating, files divided in two categories for best analysis and both algorithms executed on same files in same conditions for efficiency comparing. Two mentioned categories are homogeneous files (from one type DTD) [6] and heterogeneous file (multi type of DTD) [7]. The results for both categories will be shown separately. 7.2. Evaluation criteria There are three items for calculating accuracy clustering algorithms: 1-entropy 2-purity 3-fscore 7.2.1 Entropy Entropy is sum documents which located in the cluster i which are of the class r. The entropy formula is: ‫ݕ݌݋ݎݐ݊ܧ‬ ൌ ෍ ݊௜ N ௞ ௜ୀଵ ‫ܧ‬ሺ‫ܥ‬௜ሻ ܽ‫ݏ‬ ‫ܧ‬ሺ‫ܥ‬௜ሻ ൌ ଵ ௟௢௚௞ ∑ ୬౟ ౨ ௡೔ ݈‫݃݋‬௞ ௜ୀଵ ୬౟ ౨ ௡೔ In above formulas,‫ܥ‬௜, N, k, ݊௜ and n୧ ୰ are respectively ith cluster, total number of incoming documents, number of clusters, number clustered documents in cluster i and number clustered documents in cluster i of class r. The entropy value if be closer to zero is better and has good efficiency.
  • 11. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 11 7.2.2 Purity Purity is sum maximum documents which located in the cluster i which are of the class r. The purity formula is: ܲ‫ݕݐ݅ݎݑ‬ ൌ ෍ ݊௜ N ௞ ௜ୀଵ ܲሺ‫ܥ‬௜ሻ ܽ‫ݏ‬ ܲሺ‫ܥ‬௜ሻ ൌ ଵ ௡೔ maxሺn୧ ୰ ሻ The entropy value if be closer to one is better and has good efficiency. 7.2.3 Fscore Fscore is another item created by combination of above two items and is: ‫݁ݎ݋ܿܵܨ‬ ൌ ∑ ݊௥‫ܨ‬ሺܼ௥, ‫ܥ‬௜ሻ௞ ௥ୀଵ ܰ As ‫ܨ‬ሺܼ௥, ‫ܥ‬௜ሻ ൌ ܲሺܼ௥, ‫ܥ‬௜ሻ ‫כ‬ ‫ݎ‬ሺܼ௥, ‫ܥ‬௜ሻ ܲሺܼ௥, ‫ܥ‬௜ሻ ൅ ‫ݎ‬ሺܼ௥, ‫ܥ‬௜ሻ ൌ 2 ‫כ‬ n୧ ୰ ݊௜ ൅ ݊௥ , ‫ݎ‬ሺܼ௥, ‫ܥ‬௜ሻ ൌ n୧ ୰ ݊௥ , ܲሺܼ௥, ‫ܥ‬௜ሻ ൌ n୧ ୰ ݊௜ The fscore value if be closer to one is better and has good efficiency. In this paper, after implementation algorithms and above criteria for both categories, results are calculated and compared for analyzing. For testing implemented program, XML files consists of 100 different classes, such as medical files, colleges, shops, cars, insurance, etc... have been considered. As above mentioned, input files divided into two parts and with the XCLS+ and XCLS++ algorithms were evaluated separately. The results of algorithm on the homogeneous files are in Table 1 and include:
  • 12. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 12 Table1. Results of algorithms executed on homogeneous files The results of algorithm on the heterogeneous files are in Table 2 and include: Table2. Results of algorithms executed on heterogeneous files As previous section results, the results obtained in this section shows which the XCLS++ method has higher efficiency than the XCLS+ method. Also results show which proposed method has good efficiency for homogeneous files in comparison with heterogeneous files. Because probably for existence repeated node in homogeneous file is up. 8. Result and conclusion The purpose of this paper is classification XML documents in order to expedite search and others benefits that classification has them. Criteria for classifying is structural or content and structure with content. The XCLS+ method is a method of classification methods which criteria for classification is done base on structure. Despite good performance for the XCLS+ method in compared with the XCLS method, ignoring repeated nodes in some documents causes it be inefficient. Therefore, this paper proposed FATHER-NODE-CHAILD and BROTHER-NODE factors to the XCLS+ formula for achieving good efficiency. The results proposed idea as entropy, purity and Fscore show which the proposed method works better than the previous method. In future weight of levels will be changed for obtaining similarity number actually and better than proposed method too. Also the new method will be exam on much more documents in future and with comparing those, will be able to obtain better results. 9. References
  • 13. International Journal of Information Technology, Control and Automation (IJITCA) Vol.2, No.4, October 2012 13 [1] Ilwan Choi, Bongki Moon, Hyoung-Joo Kin, (2006) Data & Knowledge Engineering, A clustering method based on path similarities of XML data. [2] Andrewdn, jag, (2007 (Information systems engineering, Evaluting Structural Similarity in XML DocumentWISE’07 Proceedings of the 8th international conference on Web Information. [3] Tien Tran, Richi, Peter, (2008 (Data Mining, Combining Structure and Content Similarities for XML Document Clustering, Conference 27-28 November, Glenelg, South Australia. [4] WOOSAENG KIM, (2008( Computer Engineering and Applications, XML document similarity measure in terms of the structure and contents, CEA'08 Proceedings of the 2nd WSEAS International Conference. [5] G.R.Nayak, (2008) “Fast and effective clustering of XML data using structural information,“ knowl. Inf. Syst. [6] The XML data repository. Accessed from: http://www.infomatik.uni-trire.de [7] The XML data repository. Accessd from: http://www.cs.washington.edu [8] Waraporn Viyanon, Sanjay K.Madria, Sourav S.Bhowmick, (2008) Management of Data, XML Data Integration Based on Content and Structure Similarity Using Keys. [9] aptarshi Ghosh and Pabitra Mitra, (2008) Pattern Recognition, ICPR Combining Content and Similarity for XML Document Classification using Composite SVM Kernels, , 19th International Conference. [10] jing PengDong Qing Yang Shi Wei Tang et al, (2008) similarity in chinese text processing, A New Similarity competing method based on concept, series F: Information science, 51(9): P1212-1230, 9. [11] Mohamad Alishahi, Mohmoud Naghibzadeh and Baharak Shakeri Aski, (2010) International Journal of Computer and Electrical Engineering ,Tag Name Structure-based Clustering of XML Documents, VOL. 2, NO. 1, February. 10. Authors Ahmad Khodayar received master science of computer engineering in 2010 from Islamic Azad University of Shabestar, Iran. His current research area is data mining and specially clustering; now he is working on a project about new way in clustering. Hassan Naderi received his PhD degree in 2006 from INSA-LYON university of France. His current research areas are text mining, search engine and massive data processing.