Ontology-Based Semantic Context Framework (OBSC) for Arabic Web Contents

Dr. Bassel AlKhatib (balkhatib@svuonline.org)
Eng. Mouhamad Kawas (kawas@w.cn)
Eng. Wajdi Bshara (w_bshara@hotmail.com)
Eng. Mhd. Talal Kallas (ekallas@w.cn)

Faculty of Information Technology Engineering, University of Damascus, Syria



        Abstract
        Several research efforts have developed optimized ontology-based semantic
        frameworks for English content. The methodologies used in those approaches
        cannot be applied to Arabic content because of the complexity of the syntax,
        semantics, and ontology of the Arabic language; they do not work properly or
        efficiently with Arabic. To comprehend Arabic web content correctly and
        accurately, new concepts were developed, and existing methodologies and
        components, such as tokenization, word sense disambiguation (WSD), and the
        Arabic WordNet (AWN), were extensively modified.

        Keywords: semantics, ontology, WSD (word sense disambiguation), tokenization,
        AWN (Arabic WordNet), context, clustering, part of speech (POS).



1. Introduction

The goal of the proposed framework is to build a core system that can serve many semantic
applications, such as semantic search engines, semantic encyclopedias, Arabic question answering
systems, and semantic dictionaries.
The framework "understands" Arabic web content using the AWN ontology.
Most previous work tackling this problem relies on information retrieval and statistical
approaches that do not go deep into semantics and contextual meaning; this is especially true of
research on Arabic language applications.
This paper illustrates the new approaches, measures, and algorithms the proposed framework uses
to achieve semantic understanding of Arabic web content.
The paper is organized as follows: Section 2 illustrates the basic components of the OBSC framework.
Section 3 gives a short overview of the customized Arabic ontology. Section 4 describes the
framework architecture and how its modules work. The last two sections conclude the paper and
discuss proposed future work.




2. Framework Concepts

Arabic content exists in many forms on the web: HTML, Word documents, XML, PDF, etc. The main
goal of the OBSC framework is to transform any of this content into conceptual structures that can
be understood by the machine and used in many semantic applications.



[Figure -1- OBSC Framework Architecture: content flows through a preprocessing stage
(Tokenization and Indexing, then WSD) that produces conceptual content, which is stored; a
postprocessing stage (Similarity Measuring, then Clustering) operates on multiple conceptual
contents.]




The main components of the OBSC framework, as illustrated in the figure, are:
    1- Tokenization and Indexing.
    2- Word Sense Disambiguation (WSD), based on Arabic ontology.
    3- Measuring similarities between Arabic contents.
    4- Clustering groups of Arabic contents using the previous measures.

3. Arabic ontology

The OBSC Framework uses the AWN, which was constructed according to the same rules used in
EuroWordNet.
Each word is represented by a synset, and each item of the synset can be any part of speech
(POS): verb, noun, subject, adjective, or adverb. For example, the word "       " has the synset
{ ," ", …}, and each item of this set can be any POS; for instance, "        " can be a noun or
a verb.
The sense is the exact spelling (which gives the precise meaning) of each item in the synset for
each of the POSs. For example, the item           appears as        or     when the POS is a
noun, and as "    " or " " when the POS is a verb.
The ontology was designed to connect synsets through explicit semantic relations, such as
hypernym, hyponym, meronym, and troponym.

3.1. Customizing the Arabic WordNet (AWN)

The AWN has been implemented by several authors, each using a different strategy to store the
stem of each word depending on the author's algorithm. The existing AWN did not meet our needs,
so a decision was made to customize it using the following steps:


-   Use a specific stemming algorithm to store the stems.
-   Merge the AWN with the Princeton WordNet (the English version) in order to find the
    translation of a word into English, when needed.
-   Denormalize the AWN in order to speed up retrieval.
-   In the sense list the word is traditionally stored with soft vowels (   ). This leads to a
    problem because not all content is written with soft vowels. The issue was resolved by
    pairing the word without soft vowels with the word with soft vowels.

3.2. How to use the customized AWN

To enhance the proposed framework's performance and efficiency, the Arabic ontology was
transformed into a data structure, referred to as the "Ontology data structure", that can be
loaded into main memory. The following figure illustrates the Ontology data structure:

[Figure -2- Ontology data structure: the top level indexes entries by POS (nouns, verbs,
subjectives, adjectives, adverbs); each POS maps to synset entries such as mugaAdarap_n1AR
(Departure, &%Motion+) and $aHon_n1AR (Dispatch, &%Transfer+), and each entry holds an
ArrayList of glosses (e.g. "the act of sen…").]
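
As a rough illustration, the in-memory layout can be sketched as nested dictionaries. The
following is a hypothetical minimal sketch in Python; the field names and the lookup helper are
assumptions, not the authors' exact schema:

    from dataclasses import dataclass, field

    @dataclass
    class Sense:
        sense_id: str                                   # e.g. "mugaAdarap_n1AR"
        spellings: list = field(default_factory=list)   # with and without soft vowels
        english: str = ""                               # translation via Princeton WordNet
        sumo: str = ""                                  # upper-ontology link, e.g. "&%Motion+"
        glosses: list = field(default_factory=list)

    # POS -> stem -> list of synsets; a synset is itself a list of Sense objects.
    ontology = {
        "noun": {}, "verb": {}, "subjective": {}, "adjective": {}, "adverb": {},
    }

    def lookup(pos: str, stem: str) -> list:
        """Return all synsets stored under a stem for a given POS."""
        return ontology.get(pos, {}).get(stem, [])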

4. Framework Architecture
The OBSC framework has four modules:

    4.1. Tokenization and Indexing

    In this module a list of words is obtained from the Arabic content and the stop words are
    eliminated. Each word is then paired with its stem, and the document properties, along with
    the obtained list, are stored in the "Tokenization and Indexing data structure".
    The word and its stem are used to retrieve all the synsets, for any part of speech, from the
    Ontology data structure.




Two major problems were faced with some Arabic plurals and conjugations. The first arises when
the word is not found in any of the synsets of its stem, as derived from the ontology. For
example, the word "     " has the stem "     ", but none of the synsets of " " contain it;
likewise, the word "        " is not found in any of the synsets derived from the stem "    ".

This problem was solved by going through the items of the synsets derived from the stem and
selecting the item that most contains the word "    "; using the synsets in Figure 2, the
synset        will be retrieved.

This solution introduced a second problem with the special plurals in Arabic, such as "           ".
For example, the plural word "        " has " " as its stem; applying the "most contained"
algorithm retrieves the synset for   , when it should have retrieved the synset for      .

To solve this, the algorithm was modified with an additional rule: if the retrieved item exactly
equals the stem, the "most contained" algorithm is not used, and all the synsets of the stem are
returned.

Thus the developed algorithm for this module, sketched below, can be summarized as follows:
-  If the word is found in the synsets derived from the stem, return the found synset.
-  Else, apply the "most contained" algorithm and return its synset, unless the retrieved item
   is the stem word itself.
-  Else (the "most contained" algorithm returned the stem word), return all the synsets of the
   stem.
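
A minimal sketch of this lookup logic, assuming the hypothetical lookup(pos, stem) accessor from
Section 3.2 and approximating "most contained" by longest-common-substring overlap (the paper
does not define the matching criterion precisely):

    def containment(a: str, b: str) -> int:
        """Crude overlap score: length of the longest common substring of a and b."""
        best = 0
        for i in range(len(a)):
            for j in range(i + 1, len(a) + 1):
                if a[i:j] in b:
                    best = max(best, j - i)
        return best

    def synsets_for_word(word: str, stem: str, pos: str) -> list:
        """Resolve a word to synsets via its stem, following the three rules above."""
        synsets = lookup(pos, stem)

        # Rule 1: the word itself appears in some synset of the stem.
        found = [s for s in synsets
                 if any(word in item.spellings for item in s)]
        if found:
            return found

        # Rule 2: "most contained" -- pick the synset whose item overlaps the word most.
        best_synset, best_item, best_score = None, "", 0
        for s in synsets:
            for item in s:
                for spelling in item.spellings:
                    score = containment(spelling, word)
                    if score > best_score:
                        best_synset, best_item, best_score = s, spelling, score

        # Rule 3: if the best match is the bare stem (special plurals case),
        # fall back to all synsets of the stem.
        if best_synset is not None and best_item != stem:
            return [best_synset]
        return synsets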

The output of this module is passed to the next module in the OBSC framework: the WSD module,
where the best sense, based on the POS, is selected for the retrieved synset(s).

4.2. Word Sense Disambiguation (WSD)

Each word from the content document may be associated with one or more synsets, which leads to
ambiguity in analyzing the content.

For example, the word         will retrieve several synsets, such as
                             , etc.
Each item in a synset can be associated with five parts of speech, and each POS is associated
with many senses. For example,           as a noun can be               , so the synsets need to
be disambiguated based on the best sense.

The implemented disambiguation algorithm transforms the content from a list of synsets into a
list of senses, producing the conceptual content.

The WSD process is based on finding the closest and most appropriate meaning of a word in a
specific context. Michael Lesk's algorithm was used as the basis of the WSD algorithm.

Lesk's algorithm uses a dictionary to perform WSD.
“The Lesk algorithm is based on the assumption that words in a given neighborhood will tend to
share a common topic. A naive implementation of the Lesk algorithm would be:

    1.   Choose pairs of ambiguous words within a neighborhood
    2.   Check their definition in the dictionary
    3.   Choose the senses as to maximize the number of common terms in the definitions of the
         chosen words” (1)

Thus, using Lesk's algorithm, the meaning with the highest count is the closest to the actual
meaning in the context.
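
For concreteness, a naive implementation along the lines of the quoted description might look as
follows. This is a sketch, not the paper's code; definitions(word) is a hypothetical accessor
returning candidate senses, each carrying a gloss string:

    from itertools import product

    def shared_terms(gloss_a: str, gloss_b: str) -> int:
        """Number of distinct terms two definitions have in common."""
        return len(set(gloss_a.split()) & set(gloss_b.split()))

    def naive_lesk(word_a: str, word_b: str, definitions):
        """Choose the sense pair whose definitions share the most terms."""
        best_score, best_pair = -1, None
        for sense_a, sense_b in product(definitions(word_a), definitions(word_b)):
            score = shared_terms(sense_a.gloss, sense_b.gloss)
            if score > best_score:
                best_score, best_pair = score, (sense_a, sense_b)
        return best_pair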



This algorithm has a major performance problem as the number of words in a sentence increases.
In addition, "dictionary glosses are often quite brief, and may not include sufficient vocabulary
to identify related senses." (2)

A more effective algorithm, which does not rely solely on the dictionary definition of the word,
was implemented. It takes the full glosses of the parents and children of the target word in the
ontology. To reduce the processing overhead, only K words around the word being disambiguated
are examined, without any redundant recalculations.

The modified WSD algorithm
    Input: list of N words represented by synset(s).
    Output: list of N Senses (conceptual content).
    Lookup: search in the Ontology data structure instead of a dictionary.

    The process to get the desired output is as follows:

    Step 1
    A window of K words is created: up to three words before the word being disambiguated, the
    focus word itself, and up to three words after, so K is between 4 and 7 words. For example,
    if the focus word "  " sits in the middle of the sentence, K is 7: the three words before the
    focus word, the focus word, and the three after. If the focus word "     " is the first word
    of the sentence, K is 4: the focus word and the three subsequent words.




[Figure -3- An example of how the slide window works: the focus word being disambiguated sits in
the middle; the not-yet-disambiguated words on one side and the already disambiguated words on
the other both contribute to its disambiguation.]
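
    The window construction itself is simple; as a sketch (assuming the content is a plain list
    of words and the window clips at the sentence boundaries):

        def slide_window(words: list, i: int, before: int = 3, after: int = 3) -> list:
            """Up to `before` words, the focus word at index i, and up to `after` words."""
            return words[max(0, i - before): i + after + 1]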

    Step 2
    For each word in the slider window, full detail is prepared for each sense of the word by
    obtaining the definition, hypernyms, hyponyms, meronyms, and troponyms of that word from the
    ontology.

    These definitions are concatenated into one string for each sense of each word.
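
    A sketch of this expansion step, assuming hypothetical accessors hypernyms, hyponyms,
    meronyms, and troponyms that each return the related senses from the Ontology data structure:

        def expanded_gloss(sense) -> str:
            """Concatenate a sense's gloss with the glosses of its ontology neighbours."""
            related = (hypernyms(sense) + hyponyms(sense)
                       + meronyms(sense) + troponyms(sense))
            return " ".join([" ".join(sense.glosses)]
                            + [" ".join(r.glosses) for r in related])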

    Step 3
    Starting with the string of the first sense of the focus word, the stop words are eliminated,
    and then the string is split into words.
    For each word, its occurrences are counted in the strings of the senses of the other words in
    the slider window. This count is weighted based on Zipf's law. Once the weighted counts for
    all the words of this sense of the focus word have been calculated, they are summed and the
    total is associated with the sense.
    The same steps are repeated for the remaining senses of the focus word.

    Zipf's law implies that longer expressions are rarer, and thus more informative, so each
    match consisting of N consecutive words contributes N² to the final result.
    For example, a matching phrase of three consecutive words is weighted as 3² = 9, but a
    two-word match plus a separate one-word match is weighted as 2² + 1² = 5.
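
    A sketch of this squared-overlap scoring (one reading of the step above; the exact
    phrase-matching procedure is not fully specified in the paper):

        def contains_phrase(words: list, phrase: list) -> bool:
            """True if `phrase` occurs as a consecutive run inside `words`."""
            k = len(phrase)
            return any(words[j:j + k] == phrase for j in range(len(words) - k + 1))

        def squared_overlap_score(gloss_a: list, gloss_b: list) -> int:
            """Sum N*N over maximal runs of consecutive words shared by both glosses."""
            score, i, n = 0, 0, len(gloss_a)
            while i < n:
                length = 0
                while (i + length < n
                       and contains_phrase(gloss_b, gloss_a[i:i + length + 1])):
                    length += 1
                if length:
                    score += length * length
                    i += length
                else:
                    i += 1
            return score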

    Step 4
    Once a weighted count has been obtained for each sense of the focus word, the sense with the
    highest weighted count is selected.


    The resulting sense has the highest probability of being the correct one in this context.
    Furthermore, it carries the correct spelling, the part of speech, and even the English
    translation of the word.

    After the focus word is disambiguated, the slider window is moved forward to disambiguate
    the next word (returning to Step 2) until the sentence is finished.
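
    Putting the four steps together, the module's outer loop might be sketched as follows,
    building on the helper sketches above (senses_of(word) is assumed to return a word's
    candidate senses):

        def disambiguate(words: list, senses_of) -> list:
            """Assign each word its best sense using the modified Lesk procedure."""
            chosen = []
            for i, word in enumerate(words):
                context = [w for w in slide_window(words, i) if w != word]
                best_sense, best_score = None, -1
                for sense in senses_of(word):
                    focus_gloss = expanded_gloss(sense).split()
                    score = sum(squared_overlap_score(focus_gloss,
                                                      expanded_gloss(s).split())
                                for w in context for s in senses_of(w))
                    if score > best_score:
                        best_sense, best_score = sense, score
                chosen.append(best_sense)
            return chosen
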
4.3. Measurement of related similarity between two conceptual contents

Input: two conceptual contents.
Output: related similarity ratio.

This module determines the similarity of two contents. The two contents are first preprocessed by
modules 1 and 2 to obtain their conceptual contents (lists of senses). The process to get the
related similarity ratio is as follows:

Assume the two contents are: C1 and C2.
-   The length of C1 is m (m: the number of concepts in C1).
-   The length of C2 is n (n: the number of concepts in C2).

To measure the similarity between C1 and C2, the similarity between any two concepts in these
contents must be measured first. This process is based on calculating the distance between the
concepts in the ontology.
P. Resnik observed that the shorter the path between two nodes, the more similar the nodes are.
The Wu & Palmer similarity measure was used:
             Sim(s,t) = 2 * depth(LCS) / [ depth(s) + depth(t) ]

Where:
    s, t: the source and destination nodes (senses) being compared.
    depth(s): the shortest path between the root and the node s.
    LCS: the deepest common parent (least common subsumer) of s and t.
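
As a sketch, assuming each sense node exposes its depth and its set of ancestors (hypothetical
attributes, with a node counted among its own ancestors; the paper does not give the node layout):

    def wu_palmer(s, t) -> float:
        """Wu & Palmer similarity: 2 * depth(LCS) / (depth(s) + depth(t))."""
        common = set(s.ancestors) & set(t.ancestors)
        if not common:
            return 0.0
        lcs_depth = max(node.depth for node in common)   # deepest common parent
        return 2.0 * lcs_depth / (s.depth + t.depth)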
The process uses the following steps:

Step 1:
Build the semantic similarity matrix R[m,n] over all pairs of concepts in the contents, where
R[i,j] represents the semantic similarity between concept i in content C1 and concept j in
content C2; R[i,j] is the weight of the path that connects i and j, computed with the Wu &
Palmer measure.
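
Step 1 can then be sketched directly, with C1 and C2 given as lists of sense nodes:

    def similarity_matrix(c1: list, c2: list) -> list:
        """R[i][j] = Wu & Palmer similarity of concept i in C1 and concept j in C2."""
        return [[wu_palmer(s, t) for t in c2] for s in c1]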

Step 2:
After building the matrix, the similarity between the two contents has to be found. This problem
is similar to finding a maximum-weight matching in a weighted bipartite graph whose two node sets
are the concepts of C1 and C2.
The algorithm used must take processing speed into consideration: practical usability of this
framework requires fast determination of the similarity between two contents.

For example:
C1:                           .
C2:                       .
R[m,n] is shown in Figure 4.

[Figure -4- The similarity matrix after using WSD.]

These contents have no similarity.

The value of this approach is that it adds intelligence to the calculation of similarities.
Approaches that do not use all the steps included in the OBSC framework often produce misleading
or incorrect similarity measures. For example, if the same two contents are compared without
using WSD, the resulting matrix is:


                                              6
                                   0.14       0.73      0.67

                                   0.12       0.46      0.43

                                   0.62       0.77        1

                       Figure -5- The similarity matrix without using WSD

Sum of the maximum values per row = 0.73 + 0.46 + 1 = 2.19
Sum of the maximum values per column = 1 + 0.77 + 0.62 + 0 = 2.39 (the fourth term is the
maximum of a fourth, all-zero column)

Averaging these maxima over the 3 rows and 4 columns yields (2.19 + 2.39) / (3 + 4) ≈ 0.65,
which is where the 65% figure below comes from.

It is obvious to any reader that, in fact, the two contents have no similarity, and thus 65% is
definitely wrong.
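
A sketch of this Step 2 scoring, reading it as the average of the per-row and per-column maxima
(an interpretation consistent with the 65% figure above; the paper does not state the formula
explicitly):

    def similarity_ratio(R: list) -> float:
        """Average the best match of every concept on both sides of the matrix."""
        m, n = len(R), len(R[0])
        row_max = sum(max(row) for row in R)
        col_max = sum(max(R[i][j] for i in range(m)) for j in range(n))
        return (row_max + col_max) / (m + n)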

4.4. Clustering

Input: conceptual contents.
Output: hierarchical clusters, based on the previous similarity measurement, containing these
contents.

This module has several promising applications, all concerned with improving the efficiency and
effectiveness of the OBSC framework. Some of the more interesting include:

    Finding similar documents;
    Search result clustering;
    Guided/interactive search;
    Organizing site content into categories;
    Recommender systems;
    Faster/better search.

K-means has long been considered the standard clustering algorithm and remains a very strong
player in the field.

The research found that the K-means clustering algorithm has several limitations:
    It cannot determine the optimal number of clusters K;
    The cluster centers are selected randomly;
    It produces a hard, non-hierarchical clustering.

Its advantages include:
    A definite and good upper bound on run-time;
    Simplicity;
    Near-linear behavior, which yields good performance.

The Bisecting K-means clustering algorithm was selected because it has the same advantages as
K-means but resolves the main disadvantages.

Bisecting K-means adds:
    The ability to generate the optimal number of clusters;
    Flexibility and hierarchy.

The randomness of selecting the cluster centers was resolved by replacing the cosine similarity
that Bisecting K-means uses with the similarity ratio from module 3: the first center is selected
randomly, and the second center is the content with the least similarity to the first one.

These steps are repeated as we progress through the hierarchy, as sketched below.
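
A sketch of the customized bisecting step under these choices, with similarity(a, b) standing for
the conceptual-content similarity ratio of Section 4.3 and a simple size-based stopping rule (an
assumption; the paper does not state when the recursion stops):

    import random

    def bisect(cluster: list, similarity):
        """Split a cluster in two: a random first center, then the least-similar
        content as the second center; assign each item to its closer center."""
        c1 = random.choice(cluster)
        c2 = min(cluster, key=lambda x: similarity(x, c1))
        left, right = [], []
        for content in cluster:
            if similarity(content, c1) >= similarity(content, c2):
                left.append(content)
            else:
                right.append(content)
        return left, right

    def bisecting_kmeans(contents: list, similarity, min_size: int = 2) -> list:
        """Recursively bisect the contents into a hierarchy of clusters."""
        if len(contents) <= min_size:
            return [contents]
        left, right = bisect(contents, similarity)
        if not left or not right:   # degenerate split: stop here
            return [contents]
        return (bisecting_kmeans(left, similarity, min_size)
                + bisecting_kmeans(right, similarity, min_size))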

5. Conclusion

Using all the previous steps, the OBSC framework can be used to build many semantic applications,
such as dictionaries, QA systems, and encyclopedias.
To test the framework, we developed an encyclopedia named Arapedia, which can ingest any content
from the web, or any written content from other sources. It offers many services, such as
translating any disambiguated word into English, showing all contents related to the current
content, and semantic search over contents. For example, searching for              will return
contents based on "      " as a noun meaning gold, whereas other approaches may return content
such as                , in which "      " is a verb meaning to go.

6. Future Work

         Expand the AWN to enhance the results of the framework.
         Add a morphological and lexical analyzer to the framework.
         Extend the framework to facilitate machine learning.

7. References

1.  Lesk algorithm. http://en.wikipedia.org/wiki/Lesk_algorithm
2.  Lesk Algorithm. http://www.gabormelli.com/RKB/Lesk_Algorithm
3.  William B. Frakes and Ricardo Baeza-Yates. Information Retrieval: Data Structures and
    Algorithms. Prentice Hall, 1992.
4.  S. M. Rueger and S. E. Gauch. Feature reduction for document clustering and classification.
    Technical report, Computing Department, Imperial College, London, UK, 2000.
5.  Daniel Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery.
6.  A. Hotho, S. Staab, and G. Stumme. Explaining text clustering results using semantic
    structures. In 7th European Conference on Principles of Data Mining and Knowledge Discovery
    (PKDD 2003), 2003.
7.  Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In
    Proceedings of the Conference on Empirical Methods in Natural Language Processing
    (EMNLP'03), July 2003.
8.  William J. Black and Sabri El Kateb. A Prototype English-Arabic Dictionary Based on WordNet.
9.  Sabri Elkateb and William Black (The University of Manchester), and David Farwell
    (Polytechnic University of Catalonia). The Challenge of Arabic for NLP/MT: Arabic WordNet
    and the Challenges of Arabic.
10. Lahsen Abouenour, Karim Bouzoubaa, and Paolo Rosso. Improving Q/A Using Arabic WordNet.
11. Tim Berners-Lee, Yuhsin Chen, Lydia Chilton, Dan Connolly, Ruth Dhanaraj, James Hollenbach,
    Adam Lerer, and David Sheets. Tabulator: Exploring and Analyzing Linked Data on the Semantic
    Web.
12. Brian Matthews. Semantic Web Technologies. CCLRC Rutherford Appleton Laboratory.
13. Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American,
    May 17, 2001.




