Semantic Web mining using nature inspired optimization methods
Diana Andreea Gorea, Lucian Bentea
Faculty of Computer Science, “A.I. Cuza” University, Iași, Romania
Abstract. In this paper, nature inspired methods are proposed for solv-
ing problems in the field of Semantic Web mining, namely the clustering
of Web resources based on their metadata, as well as the automatic clas-
sification of Web pages.
1 Introduction
This paper proposes the use of nature inspired methods when solving the problem
of RDF clustering, as well as that of the automatic classification of Web pages.
The most promising methods that the authors found are those belonging to the
Ant Colony Optimization (ACO) framework. While this paper does not aim
to give an introduction to ACO, the interested reader can refer to [3] for further
information.
The paper is organized as follows. Section 2 describes efficient heuristics in
two different cases - when the number of clusters is predetermined, or when it is
unknown and is part of the solution. By clustering Semantic Web resources, it is
possible to find representatives for a set of similar resources and thus be able to
reduce the size of large ontologies. This would also bring insight into the main
concepts that an ontology contains. Section 3 summarizes the paper [6] and also
brings further insight into how ACO heuristics can be used to find classification
rules for Web pages. Section 4 draws the conclusions and suggests subjects for
further research.
2 Clustering of Semantic Web data
The data clustering problem refers to grouping a set of data into several nonempty
subsets whose members are considered similar, with respect to some similarity
measure. In the context of Semantic Web data, which can be represented through
RDF graphs, the clustering problem becomes that of grouping individuals in the
graph. An individual, also called an instance in [5], is a single resource node
together with some of its neighbouring nodes, forming a subgraph that is rel-
evant to that resource node. Several instance extraction methods are proposed
in [5]: Immediate Properties, Concise Bounded Description (CBD, see
http://www.w3.org/Submission/CBD/), or Depth Limited Crawling. The optimal
method to use depends on the type of data to be processed, e.g. RDF data
converted from a relational database, FOAF
documents, etc., and the structure of its associated RDF graph. The same crite-
rion holds when choosing the optimal similarity measure; the authors of [5] also
propose three distance measures, one based on feature vectors (denoted simFV),
one based on conceptual graphs, inspired by the similarity measure of concep-
tual graphs introduced in [10], and another being an ontology based measure
(denoted simOnt).
2.1 Predetermined number of clusters (the ACOC algorithm)
Assuming a set $\Omega := \{X_1, X_2, \ldots, X_m\}$ of individuals is extracted from an RDF
graph $G$ and without giving an explicit formula for the above similarity measures,
the RDF data clustering problem can be formally described as the following
discrete optimization problem. Let $\mathrm{sim}$ be a similarity measure, e.g. simFV or
simOnt above. Also let $n \ge 1$ be the predetermined number of clusters into
which the data is to be grouped and denote by $C_1, C_2, \ldots, C_n \in \Omega$ the variables
to be determined as the centers of each cluster. By defining the variable $w_{ij}$
through
\[ w_{ij} := \begin{cases} 1, & \text{if the individual } X_i \text{ belongs to cluster } j, \\ 0, & \text{otherwise,} \end{cases} \tag{1} \]
for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, the aim is to
\[ \text{Maximize} \quad \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} \, \mathrm{sim}(X_i, C_j), \tag{2} \]
such that each individual belongs to only one cluster,
\[ \sum_{j=1}^{n} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{3} \]
and there are no empty clusters,
\[ \sum_{i=1}^{m} w_{ij} \ge 1, \quad j = 1, \ldots, n. \tag{4} \]
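As an illustration of the formulation (1)-(4), the following minimal Python
sketch evaluates the objective (2) and checks the constraints (3) and (4) for a
given assignment matrix. The function names and the toy similarity values are
assumptions made for this example only; they are not taken from [5] or [7].

```python
def is_feasible(w):
    """Check constraints (3) and (4) for a 0/1 assignment matrix w[i][j]."""
    # (3): every individual i belongs to exactly one cluster.
    one_cluster_each = all(sum(row) == 1 for row in w)
    # (4): every cluster j contains at least one individual.
    no_empty_cluster = all(sum(col) >= 1 for col in zip(*w))
    return one_cluster_each and no_empty_cluster


def objective(w, sim_to_center):
    """Objective (2): sum over i, j of w[i][j] * sim(X_i, C_j)."""
    return sum(
        w_ij * s_ij
        for w_row, s_row in zip(w, sim_to_center)
        for w_ij, s_ij in zip(w_row, s_row)
    )


# Toy data (hypothetical): 3 individuals, 2 cluster centers.
# sim_to_center[i][j] stands for sim(X_i, C_j).
sim_to_center = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
w = [[1, 0], [1, 0], [0, 1]]

print(is_feasible(w), objective(w, sim_to_center))
```

For this toy assignment the check succeeds and the objective is approximately
2.4, i.e. the sum of the similarities of each individual to its own center.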
To the best of the authors’ knowledge, there is no proof that this general
clustering problem is NP-hard. The most recent result on this subject is the
article [8], which proves that the clustering problem, also known as the
k-means problem, is NP-hard even when restricted to points in the plane.
However, as is the case with most discrete optimization problems, clustering
of RDF data is computationally expensive and solution approximation methods
are preferred.
One of the most promising algorithms for solving the previous optimization
problem is Ant Colony Optimization for Clustering (ACOC), introduced in [7],
which is an alternative to the classic k-means algorithm, known to have several
drawbacks. The numerical results in [7] show that ACOC obtains the best
results, on several test cases, among various approximation methods, including
the k-means algorithm. It also achieves this with the highest convergence rate,
therefore only requiring a few iteration steps to detect the optimum. Since ACOC
is part of the Ant Colony Optimization framework, the idea is to have several
ants “foraging” for the optimum, thus avoiding premature convergence due to
local optima. Apart from using the idea of pheromone trails, each node to be
explored also contains a heuristic value, representing the estimated global gain
from picking that node; this is used to accelerate the convergence of the algo-
rithm. Eventually, ants are grouped into clusters and a solution to the original
RDF clustering problem can be obtained through a decoding algorithm.
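The interplay between pheromone trails and heuristic values can be sketched in
Python as follows. This is not the ACOC algorithm of [7], only a simplified
illustration in its spirit, under several assumptions: similarities are given as
a matrix with values in (0, 1], cluster centers are restricted to members of the
data set (medoids), and the parameter names alpha, beta, rho, n_ants and n_iter
as well as the toy data are hypothetical.

```python
import random


def score(sim, centers, assign):
    """Objective (2): total similarity of each individual to its center."""
    return sum(sim[i][centers[j]] for i, j in enumerate(assign))


def medoids(sim, assign, n, old_centers):
    """In each cluster, pick the member most similar to the other members."""
    centers = list(old_centers)
    for j in range(n):
        members = [i for i, a in enumerate(assign) if a == j]
        if members:
            centers[j] = max(members, key=lambda c: sum(sim[i][c] for i in members))
    return centers


def aco_cluster(sim, n, n_ants=10, n_iter=20, alpha=1.0, beta=2.0, rho=0.1, seed=0):
    rng = random.Random(seed)
    m = len(sim)
    tau = [[1.0] * n for _ in range(m)]  # pheromone on "individual i -> cluster j"
    centers = rng.sample(range(m), n)    # initial centers chosen at random
    best_assign, best_centers, best_score = None, None, float("-inf")
    for _ in range(n_iter):
        iter_best = None
        for _ in range(n_ants):
            assign = []
            for i in range(m):
                # Pheromone times heuristic (similarity to the current center).
                weights = [
                    tau[i][j] ** alpha * (sim[i][centers[j]] + 1e-9) ** beta
                    for j in range(n)
                ]
                assign.append(rng.choices(range(n), weights=weights)[0])
            if len(set(assign)) < n:  # constraint (4): discard empty clusters
                continue
            s = score(sim, centers, assign)
            if iter_best is None or s > iter_best[0]:
                iter_best = (s, assign)
        for row in tau:  # evaporation
            for j in range(n):
                row[j] *= 1.0 - rho
        if iter_best is not None:
            s, assign = iter_best
            for i, j in enumerate(assign):  # the best ant reinforces its choices
                tau[i][j] += s / m
            if s > best_score:
                best_assign, best_centers, best_score = assign, centers, s
            centers = medoids(sim, assign, n, centers)
    return best_centers, best_assign, best_score


# Toy data (hypothetical): two well separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
sim = [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]
centers, assign, total = aco_cluster(sim, n=2)
print(centers, assign, round(total, 3))
```

Sampling each assignment with a probability proportional to the product of
pheromone and heuristic value is the standard ACO construction rule; the
medoid update is a design choice made here to keep the centers inside the set
of individuals, as required by the formulation above.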
2.2 Variable number of clusters (from SSCFL to RDF clustering)
In the case when the number of clusters is not predetermined, but only a fixed
number of individuals are allowed to live in each cluster, the previous problem
can be formulated as a Single Source Capacitated Facility Location (SSCFL)
problem, which can be described as follows. Consider several facilities (e.g. med-
ical or telecommunications facilities) that are installed at different locations in
a city. These facilities provide goods to a number of customers, whose demands
are known beforehand. Each facility comes with the necessary logistics to create
a physical network that would allow customers to connect to the facility. How-
ever, each facility only provides a fixed amount of resources to the customers
who connect to it. The available amount of resources corresponding to a facility
is also called its capacity; hence the adjective capacitated in the name of this
optimization problem. The question is which of the facilities to open and which
customers should be assigned to each open facility, so that the total costs of
opening the facilities and of creating the physical networks are minimized, while
making sure that each customer’s demand is satisfied by exactly one facility.
In Figure 1, a solution to a particular SSCFL problem is represented. The
customers are the light green round rectangles, while the facilities are the light
red circles. The arrows denote assignment relations - the tip of the arrow points
to the facility to which the customer is assigned. The number on each facility
node designates its capacity, while the number on each customer node represents
its demand. Notice that the given solution is feasible, i.e. the total demand of
the customers assigned to a facility does not exceed its maximum capacity and
no customers are left unassigned. Also, in this case, it was decided that three
facilities (having capacities 1, 6, 10) remain closed.
Fig. 1. Solution to a particular SSCFL problem
In order to adapt the SSCFL problem to RDF clustering, the customers are
identified with the individuals that need to be grouped and the facilities
represent the candidate cluster centers, which can be activated or not. Thus,
consider the variable $w_{ij}$ defined as in the previous subsection and let
$y_j \in \{0, 1\}$ be the Boolean variable specifying whether the $j$-th
facility is to be opened or not, for all $j$. Also, denote by $\alpha_j$ the
cost of opening the $j$-th facility, which is the same as the cost of taking
the candidate $C_j$ to be a cluster center, by $d_i$ the demand of the $i$-th
customer, and by $\alpha_{ij}$ the cost of assigning the $i$-th customer to the
$j$-th facility, for all $i, j$ with $1 \le i \le m$ and $1 \le j \le n$. In
the case of RDF data clustering, the costs $\alpha_{ij}$ represent the opposite
of the similarity measure between the individual $X_i$ and the cluster center
$C_j$ and they are given by:
\[ \alpha_{ij} = -\mathrm{sim}(X_i, C_j), \quad i = 1, \ldots, m, \quad j = 1, \ldots, n. \tag{5} \]
Provided that the facilities (the potential cluster centers) have corresponding
capacities $u_1, u_2, \ldots, u_n \in \mathbb{R}_+$, the aim of this adapted
SSCFL problem is then to
\[ \text{Minimize} \quad \sum_{j=1}^{n} \alpha_j y_j + \sum_{i=1}^{m} \sum_{j=1}^{n} \alpha_{ij} w_{ij}, \tag{6} \]
subject to the following constraints:
- each customer is assigned to exactly one facility (each individual $X_i$ is
assigned to exactly one cluster),
\[ \sum_{j=1}^{n} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{7} \]
- provided that a facility is open (a cluster center is activated), the total
demand of the customers assigned to it (the demand of a group of individuals
to belong to the corresponding cluster) cannot exceed its capacity; also, a
customer cannot be assigned to a facility that is closed (an individual cannot
be represented by a cluster center that is not activated),
\[ \sum_{i=1}^{m} d_i w_{ij} \le u_j y_j, \quad j = 1, \ldots, n, \tag{8} \]
- a customer can either be assigned or not to a facility (an individual can
either be included or not in a group),
\[ w_{ij} \in \{0, 1\}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n, \tag{9} \]
- facilities can either be open or closed (cluster centers can either be
activated or not),
\[ y_j \in \{0, 1\}, \quad j = 1, \ldots, n. \tag{10} \]
Note: Before carrying on, notice that in a solution to this problem, there may
be individuals that remain ungrouped, which is not necessarily a drawback. On
the contrary, this may provide more realistic solutions to the clustering problem.
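To make the adapted SSCFL model more concrete, the following Python sketch
evaluates the objective (6) and checks the constraints (7)-(10) for a candidate
solution. It takes i to index the individuals (customers) and j to index the
candidate cluster centers (facilities), as in constraints (7) and (8). The
function names and all numerical values are assumptions chosen for
illustration only.

```python
def sscfl_cost(open_cost, assign_cost, y, w):
    """Objective (6): opening costs plus assignment costs."""
    opening = sum(a_j * y_j for a_j, y_j in zip(open_cost, y))
    assigning = sum(
        assign_cost[i][j] * w[i][j]
        for i in range(len(w))
        for j in range(len(y))
    )
    return opening + assigning


def sscfl_feasible(demand, capacity, y, w):
    """Constraints (7)-(10) for 0/1 vectors y[j] and matrix w[i][j]."""
    m, n = len(w), len(y)
    # (9), (10): all decision variables are binary.
    binary = all(v in (0, 1) for v in y) and all(v in (0, 1) for row in w for v in row)
    # (7): each individual is assigned to exactly one candidate center.
    assigned_once = all(sum(w[i]) == 1 for i in range(m))
    # (8): demand served by center j fits its capacity; closed centers serve nobody.
    within_capacity = all(
        sum(demand[i] * w[i][j] for i in range(m)) <= capacity[j] * y[j]
        for j in range(n)
    )
    return binary and assigned_once and within_capacity


# Toy instance (hypothetical): 3 individuals, 2 candidate centers.
similarity = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
assign_cost = [[-s for s in row] for row in similarity]  # equation (5)
open_cost = [0.5, 0.5]
demand = [1, 1, 1]
capacity = [2, 2]
y = [1, 1]
w = [[1, 0], [1, 0], [0, 1]]

print(sscfl_feasible(demand, capacity, y, w), sscfl_cost(open_cost, assign_cost, y, w))
```

In this toy instance both centers are opened, so the cost is the sum of the two
opening costs minus the similarities of the chosen assignments. Closing the
second center while keeping the same assignments violates constraint (8).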
The previous integer programming problem is proven in [9] to be NP-hard
and therefore, heuristic solution techniques need to be created to handle its com-
plexity. A survey of the more recent heuristics is given in [1], where the methods
of Tabu Search, Simulated Annealing and Genetic Algorithms are compared
on account of their efficiency with respect to different parameters. An alterna-
tive solution based on Genetic Algorithms is also the subject of [2], in which
two special crossover operators are defined, guaranteeing the feasibility of the
approximations. Also, the Particle Swarm Optimization algorithm described in
[11] and the Ant Colony Optimization algorithm in [13] have the potential to be
adapted to the RDF clustering problem.
3 Web page classification using Ant Colony Optimization
The Semantic Web is a combination of data from different sources integrated in
a common format, as opposed to the original Web, which concentrated mainly on
the exchange of documents. It also has a format that connects data to objects
from the real world. As a result, the information seeker may jump from one
database to another simply because the two are linked by sharing knowledge
about the same thing [12].
However, all of this content is produced by people, so subjectivity must be
taken into account, as well as the errors that may occur in the placement,
content or classification of knowledge. For Web pages without contributing
users (such as portfolio sites or advertising pages), the quality of the content
lies only in the hands of the site owner, who may or may not be aware of the
mistakes. Once other users appear (with rights to upload, tag or write content),
the task of keeping the information as accurate as possible becomes much
harder.
The study [6] shows how general Web content can be classified using an Ant
Colony algorithm and reports the results obtained. We present the study and
try to connect its findings with what may apply to the Semantic Web as well.
3.1 Preprocessing
The challenge when dealing with Web pages is that developers do not always
follow a standardized way of creating them. There are many reasons for this:
design implementation issues that may require certain tricks (fully Flash-based
sites have no <h1> tag), lack of interest in or knowledge of the standards,
missing or badly chosen <meta> tags (too many, or unrelated to the page
content), and generic <title> tags (all pages have the same title). At least
regarding meta tags, things started to improve once site owners realised the
advantages of being well ranked by search engines. This led to more attention
being paid to the content of those tags and to a very high interest in SEO
(search engine optimisation). In general, this would not be an issue for the
Semantic Web, because its formats are standardized and not yet very popular
so that, in theory at least, exceptions from the rules are few.
The contents of Web pages can be filtered using text preprocessing methods
to obtain a smaller set of relevant words to search for and a more human-like
understanding of the given text. The most difficult requirement for such
methods is the ability to handle homographs well. A homograph is one of a
group of words that share the same spelling but have different meanings [14],
e.g. stalk (part of a plant) and stalk (follow/harass a person), or left
(opposite of right) and left (past tense of leave) [15].
For the study, its authors used WordNet (an electronic lexical database that
offers relationships between words [4]) to filter the information. From it,
they used:
- the morphological preprocessor (which maps words like make, made and making
to the single word make), to reduce the number of words to search in;
- the identification of all nouns in the text, as they may offer relevant search
information. Note, however, that nouns may have the same spelling as verbs
(a large number of examples may be found in [16]);
- the word's lexical family. If the text has words like roof, window and door,
they may all relate to house. The benefit of this technique is debatable: for
some associated words the result may not reflect a real link between them
(this is especially the case for homographs), while in other cases (such as
the one described above) it may bring a significant increase in efficiency.
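The three preprocessing steps listed above can be sketched in Python as
follows. WordNet would normally supply the lemmas, the part-of-speech
information and the word relations; to keep the example self-contained, they
are replaced here by three tiny hand-written tables (LEMMAS, NOUNS and
GENERALISATION), which are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical stand-ins for the information WordNet would provide.
LEMMAS = {"made": "make", "making": "make", "doors": "door", "windows": "window"}
NOUNS = {"roof", "window", "door", "house"}
GENERALISATION = {"roof": "house", "window": "house", "door": "house"}


def preprocess(text):
    """Return (lemmas, nouns, generalised concepts) for a piece of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 1: morphological preprocessing (made, making -> make).
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # Step 2: keep only the words known to be nouns.
    nouns = [word for word in lemmas if word in NOUNS]
    # Step 3: replace related nouns by a common, more general concept.
    concepts = sorted({GENERALISATION[word] for word in nouns if word in GENERALISATION})
    return lemmas, nouns, concepts


lemmas, nouns, concepts = preprocess("Making doors and windows for the roof")
print(lemmas)
print(nouns)
print(concepts)
```

With these toy tables, the nouns door, window and roof are all mapped to the
single concept house, which illustrates both the potential gain and the risk
discussed above: a homograph present in the tables would be generalised in the
same way, whether or not the link is real.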
As far as the Semantic Web is concerned, all three methods may offer interesting
alternatives for the end results:
- The morphological preprocessor is an interesting option, as a word written in
natural language may be linked to another, and only the latter is relevant.
However, a word like left may not be preserved by this step, but may become
leave. With this in mind, it is probably a good idea to keep both forms when
dealing with the Semantic Web.
- The distinction between nouns and verbs is not so relevant when searching for
a word in the Semantic Web, but it becomes significant for SPARQL queries.
The advantage here is that the way the query syntax is formed indicates which
term is the noun and which is the verb.
- The connections between different types of words are relevant only if
multiple words are searched for at the same time; some common denominators
may then be used to provide results that match as many of the given items
as possible.
For both search types, the end result should be a list of search words. For Web
mining, it should only contain the most relevant words; for the Semantic Web, it
should contain first the words obtained by joining the semantics, then the
morphologically obtained values (if any), and finally the words themselves.
This may seem an unnecessary overhead, but it may help the end user to better
understand the results, the first of which would be the most relevant.
3.2 Algorithm
The Ant-Miner algorithm is a variation of the Ant Colony paradigm, used in data
mining. In the beginning, it initialises the training set with all available
training cases (Web pages) and creates an empty rule list. In a Repeat-Until
loop, one classification rule at a time is discovered. First, all trails are
initialised with the same quantity of pheromone (giving them the same chance to
be selected), and an inner loop lets the ants select the best option. Each ant
selects the path to follow based on the paths followed by the previous ants, as
indicated by the pheromone traces: the higher the amount of pheromone, the
better the path. In the second step, the irrelevant terms are removed from the
rule, and in the third step the pheromone values are updated. The inner loop
continues until a condition is fulfilled (e.g. a maximum number of paths has
been generated).
After the inner loop finishes, the highest-quality rule is chosen and added to
the discovered rule list. All training cases that satisfy the rule are removed,
which ensures that the next inner loop runs on fewer training cases than the
previous one. The outer loop continues its execution until a criterion is
satisfied (e.g. the number of uncovered cases falls below a given maximum).
The algorithm returns the discovered rule list.
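The two nested loops described above can be sketched in Python as follows.
This is a simplified illustration and not the implementation used in [6]: the
information-theoretic heuristic of Ant-Miner is omitted (terms are chosen from
pheromone alone), the rule class is not re-evaluated during pruning, and the
parameter names n_ants, max_uncovered and min_cases, as well as the toy pages,
are assumptions. Rule quality is measured as sensitivity times specificity.

```python
import random
from collections import Counter


def covers(terms, case):
    """A rule antecedent covers a case if all its (attribute, value) terms match."""
    return all(case[0].get(a) == v for a, v in terms)


def quality(terms, label, cases):
    """Sensitivity * specificity of the rule 'IF terms THEN label'."""
    tp = fp = tn = fn = 0
    for case in cases:
        hit, pos = covers(terms, case), case[1] == label
        tp += hit and pos
        fp += hit and not pos
        fn += (not hit) and pos
        tn += (not hit) and not pos
    if tp + fn == 0 or fp + tn == 0:
        return 0.0
    return (tp / (tp + fn)) * (tn / (fp + tn))


def build_rule(cases, pheromone, rng, min_cases):
    """One ant adds terms (one per attribute) with probability ~ pheromone."""
    terms, used = [], set()
    while True:
        options = [t for t in pheromone if t[0] not in used
                   and sum(covers(terms + [t], c) for c in cases) >= min_cases]
        if not options:
            break
        term = rng.choices(options, weights=[pheromone[t] for t in options])[0]
        terms.append(term)
        used.add(term[0])
    covered = [c for c in cases if covers(terms, c)]
    label = Counter(lbl for _, lbl in covered).most_common(1)[0][0]
    return terms, label


def prune(terms, label, cases):
    """Drop terms one at a time while the rule quality does not decrease."""
    best_q, improved = quality(terms, label, cases), True
    while improved and len(terms) > 1:
        improved = False
        for t in terms:
            trial = [x for x in terms if x != t]
            q = quality(trial, label, cases)
            if q >= best_q:
                terms, best_q, improved = trial, q, True
                break
    return terms, best_q


def ant_miner(cases, n_ants=10, max_uncovered=0, min_cases=1, seed=0):
    rng, cases, rules = random.Random(seed), list(cases), []
    while len(cases) > max_uncovered:                      # outer loop
        all_terms = sorted({(a, v) for attrs, _ in cases for a, v in attrs.items()})
        pheromone = {t: 1.0 for t in all_terms}            # equal initial trails
        best = None
        for _ in range(n_ants):                            # inner loop
            terms, label = build_rule(cases, pheromone, rng, min_cases)
            terms, q = prune(terms, label, cases)
            for t in terms:                                # reinforce used terms
                pheromone[t] += pheromone[t] * q
            total = sum(pheromone.values())                # normalise the trails
            pheromone = {t: p / total for t, p in pheromone.items()}
            if best is None or q > best[2]:
                best = (terms, label, q)
        rules.append(best)
        remaining = [c for c in cases if not covers(best[0], c)]
        if len(remaining) == len(cases):                   # safety guard
            break
        cases = remaining
    return rules


# Toy training cases (hypothetical): (attributes, class label).
pages = [
    ({"title": "match", "meta": "football"}, "sport"),
    ({"title": "match", "meta": "tennis"}, "sport"),
    ({"title": "election", "meta": "vote"}, "politics"),
    ({"title": "election", "meta": "party"}, "politics"),
]
rules = ant_miner(pages)
for terms, label, q in rules:
    print(terms, "->", label, q)
```

Because covered cases are removed after each discovered rule, every training
case ends up covered by at least one rule in the returned list when
max_uncovered is 0.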
3.3 Experiment
The study took into account the <meta> and <title> contents of the BBC site.
The authors chose this site because of its high coding standards and its very
well structured information, which improved the chance of making very good
connections between <meta> tags and content.
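As an illustration of how the <title> and <meta> contents of a page can be
collected before classification, the following Python sketch uses the
HTMLParser class from the standard library. The class name TitleMetaExtractor
and the sample page are assumptions made for this example; the study does not
describe its extraction code.

```python
from html.parser import HTMLParser


class TitleMetaExtractor(HTMLParser):
    """Collect the <title> text and the name/content pairs of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name, content = attrs.get("name"), attrs.get("content")
            if name and content:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Sample page (hypothetical).
page = (
    "<html><head><title>News - Sport</title>"
    '<meta name="Keywords" content="football, tennis">'
    '<meta name="description" content="Latest sport news"/>'
    "</head><body><h1>Sport</h1></body></html>"
)
extractor = TitleMetaExtractor()
extractor.feed(page)
print(extractor.title)
print(extractor.meta)
```

The extracted strings could then be passed through the preprocessing steps of
Section 3.1 to obtain the terms used by the classification rules.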
4 Conclusions and further research
This paper shows how nature inspired optimization methods can be more effi-
cient than classical, exact methods, when implementing Semantic Web mining
algorithms. Among all, the Ant Colony Optimization metaheuristic proves to
be one of the best solution techniques. As future work, the ideas described in
the previous sections need to be implemented and thoroughly tested, as nature
inspired methods have rarely been used in the context of mining the Semantic
Web. Such an implementation would then allow the clustering of resources based
on their associated metadata, e.g. their FOAF description, the microformat in-
formation they contain, etc.
References
1. Arostegui, M.A., Jr., Kadipasaoglu, S.N., Khumawala, B.M., An empirical
comparison of Tabu Search, Simulated Annealing, and Genetic Algorithms for facilities
location problems, International Journal of Production Economics, Vol. 103, No. 2,
742-754, 2006.
2. Cortinhal, M.J., Captivo, M.E., Genetic Algorithms for the Single Source Capac-
itated Location Problem: a Computational Study, in the Proceedings of the 4th
Metaheuristics International Conference, 355-359, Porto, Portugal 2001.
3. Dorigo, M., Stützle, T., Ant Colony Optimization, MIT Press, 2004.
4. Fellbaum, C. (Ed.), WordNet - an electronic lexical database, MIT, 1998.
5. Grimnes, G.A., Edwards, P., Preece, A., Instance based Clustering of Semantic Web
Resources, in the Proceedings of the 5th European Semantic Web Conference, LNCS
5021, Springer-Verlag, pp. 303-317, 2008.
6. Holden, N., Freitas, A.A., Web Page Classification with an Ant Colony algorithm,
in the Proceedings of the 8th International Conference on Parallel Problem Solving
from Nature, LNCS 3242, Springer-Verlag, pp. 1092-1102, 2004.
7. Kao, Y., Cheng, K., An ACO-Based Clustering Algorithm, in the Proceedings of
the Ant Colony Optimization and Swarm Intelligence Conference, LNCS 4150, pp.
340-347, 2006.
8. Mahajan, M., Nimbhorkar, P., Varadarajan, K., The Planar k-means Problem is
NP-hard, in the Proceedings of the 3rd International Workshop on Algorithms and
Computation, LNCS 5431, pp. 274-285, 2009.
9. Mirchandani, P.B., Francis, R.L., Discrete location theory, New York: Wiley, 1990.
10. Montes-y-Gómez, M., Gelbukh, A., López-López, A., Comparison of Conceptual
Graphs, in Lecture Notes in Artificial Intelligence, Volume 1793, Springer-Verlag,
pp. 548-556, 2000.
11. Sevkli, M., Guner, A.R., A Continuous Particle Swarm Optimization Algorithm
for the Uncapacitated Facility Location Problem, in the Proceedings of the 5th In-
ternational Workshop on Ant Colony Optimization and Swarm Intelligence, ANTS
2006, 316-323, Brussels, Belgium 2006.
12. The official W3C Semantic Web Activity page at http://www.w3.org/2001/sw/.
13. Venables, H., Moscardini, A., An Adaptive Search Heuristic for the Capacitated
Fixed Charge Location Problem, in the Proceedings of the 5th International Work-
shop on Ant Colony Optimization and Swarm Intelligence, ANTS 2006, 348-355,
Brussels, Belgium 2006.
14. The wapedia page on homographs at http://wapedia.mobi/en/Homograph.
15. The wapedia page on homonyms at http://wapedia.mobi/en/Homonyms.
16. Words that can be used both as nouns and verbs,
http://www.dailywritingtips.com/careful-with-words-used-as-noun-and-verb/