Developing an Interactive Web Information Retrieval and
Visualization System
Research Project
Master Artificial Intelligence
Department of Knowledge Engineering
R. Atachiants J. Meyer J. Steinhauer T. Stocksmeier
June 24, 2009
Abstract
Finding the needed information (images, articles or other) is not always as simple as going to a search
engine. This paper aims to develop an interactive presentation system able to cope with live presentation
challenges: speech recognition, information retrieval, filtering unstructured data and its visualization.
Speech recognition is achieved using Microsoft SAPI; information is retrieved using various text mining
techniques in order to derive the most relevant keywords, and then a search using expanded Google
queries is performed. The information is then filtered: HTML tags are cleared and text summarization
is performed in order to extract the most relevant information. Such a system has been tested and per-
forms satisfactorily, and such interactive systems can be built with modern hardware and faster internet
connections, but several challenges still need to be faced and some improvement is required in order to
create a smooth presentation experience. This paper also presents a novel approach to query expansion
using Cyc concepts, based on the CycFoundation.org knowledge base; it presents various query expansion
patterns such as Or-Concept and Or-Most-Relevant-Alias and their respective results. It also presents two
word matching algorithms which try to match a word to the most relevant concept, using the WordNet
similarity measurement to accomplish this goal at the semantic level. It shows that the Or-Concept pattern
improved query precision by around a factor of 3.
Contents
1 Introduction
2 Web Retrieval
  2.1 Architectural Overview
  2.2 Speech Recognition
  2.3 Keyword Extraction
  2.4 Query Construction
    2.4.1 WordNet Similarity Measurement
    2.4.2 Cyc Concepts Matching
    2.4.3 Query Expansion
    2.4.4 Query for Google Search
  2.5 Results Filtering
    2.5.1 Retrieving Unstructured Web Pages
    2.5.2 Summarizing the Web Content
  2.6 Visualization
3 Tests and Results
  3.1 Speech Recognition
  3.2 Keyword Extraction
  3.3 Query Construction and Expansion
  3.4 Results Filtering
4 Discussion
5 Conclusions
A Appendix: Query Expansion using Cyc
  A.1 WordNet Similarity Measurement
  A.2 CycFoundation.org REST API
  A.3 Cyc Concepts Matching
    A.3.1 DisplayName Matching Algorithm
    A.3.2 Comment Matching Algorithm
    A.3.3 Experiments with both algorithms
  A.4 Query Expansion Patterns
  A.5 Results and Discussion
    A.5.1 Google Search on Wikipedia KB
    A.5.2 Lemur Search in AP Corpus
1 Introduction
Why is there a need for an Interactive Web Infor-
mation Retrieval and Visualization System?
A person who is teaching at school, at university
or somewhere else is often required to present vari-
ous types of information (his own knowledge, texts
written by others, pictures, diagrams etc.) more or
less at once. In order to do this, he will collect his
data beforehand, arrange it and set up a presenta-
tion, using an ordinary presentation tool, such as
Microsoft PowerPoint. During his lecture, he will
start the presentation, showing one slide after an-
other and explaining what can be seen on it.
This is, compared to what lessons were like some
decades ago, of course a great increase in possibil-
ities of how information can be presented and how
the listeners’ attention can be kept. But still, there
are a lot of problems with this approach. First, all
the work has to be done beforehand, resulting in
a very long time one has to spend on the presen-
tation before he can show something useful. An-
other problem is that the prepared presentation is
very static. Presentations about topics that can
change over time will have to
be updated again and again with new information
and media from the web. Further, during the pre-
sentation one will learn that the audience’s knowl-
edge about some parts of the covered subjects is
different from what one had thought before. If the
listeners know more than expected, that won’t be
a real problem - some slides can be skipped. But
what if they know less than expected? It is impos-
sible to make new slides during the presentation, so
either the problem would have to be ignored or one
would have to search for the missing information during
the presentation.
Another hardship with the usual presentations is
the research itself. Finding information is nowa-
days very easy. One just goes to Google or another
search engine, enters what he is looking for and
looks at the first results where he finds what he is
looking for. But it is not always so easy, because
one doesn’t always know what he is searching for
and where he can find it. In addition, one often
receives a lot of results and the precision is too low
to find the relevant information very fast.
In our project, we want to address all these prob-
lems and difficulties. Our aim is to develop a sys-
tem that both makes the research easier and more
efficient and also increases the quality and change-
ability of the presentation.
To enhance the research itself, the program
should increase the precision of the search by
searching for concepts instead of keywords. If done
properly, the results of the search are all relevant
for the topic and can be used in some way. This
can reduce the time spent on evaluating each of
the search results. In addition, the texts on the re-
sulting pages can be summarized, further reducing
the time needed to evaluate the results. Another
possibility is to do the usual web search and the
image search at once, because e.g. tables can be
found in both formats.
To provide the possibility of using the system
also during the presentation itself, the search query
should be constructed not solely from key-
board input, but also directly from spoken text,
in a way that provides a possibility to choose the
currently wanted way of communication. The re-
sults have to be displayed in a way that the impor-
tant parts are visible directly, while other things
are suppressed. The user of the system has to be
able to decide which contents (texts, images etc.)
are shown and which not.
To meet all these requirements, the system has
to consist of the following steps:
1. Recognize the spoken text
2. Retrieve keywords from the text
3. Create a powerful search query
4. Filter and summarize the results
5. Visualize the results in a proper way
These steps will be further described in the next
chapters.
2 Web Retrieval
2.1 Architectural Overview
The spoken words of the presentation are recog-
nized and translated by a speech recognition en-
gine. The output of the speech recognition is fed
to a keyword extractor to create the keywords of
the text. These are used by a query construction
module to produce an internet search query. The picture
results of the search are directly transferred to the
visualisation module. The text results are summa-
rized by a summarizer module and then also dis-
played by the visualisation module. The workflow
is shown in figure 1.
Figure 1: workflow
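To make this dataflow concrete, the following is a minimal C# sketch of how the modules could be wired together; the interface and method names are our illustration, not the actual project code:

    using System;

    // Hypothetical module contracts mirroring figure 1.
    interface ISpeechRecognizer { string Listen(); }
    interface IKeywordExtractor { string[] Extract(string text); }
    interface IQueryBuilder     { string Build(string[] keywords); }
    interface ISummarizer       { string Summarize(string pageText); }

    class Pipeline
    {
        // One pass through the workflow: speech -> keywords -> query ->
        // search -> summarization -> visualization.
        public void RunOnce(ISpeechRecognizer asr, IKeywordExtractor extractor,
                            IQueryBuilder builder, ISummarizer summarizer,
                            Func<string, string[]> search,   // returns page texts
                            Action<string> visualize)
        {
            string spoken = asr.Listen();
            string query = builder.Build(extractor.Extract(spoken));
            foreach (string page in search(query))
                visualize(summarizer.Summarize(page));
        }
    }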
2.2 Speech Recognition
The user of the information retrieval system has
several possibilities of interaction with the system,
but the most important way of communication is
speech. It is used to get a starting point as well
as to define the wanted content more precisely.
Thus, our application needs a well-working speech
recognition engine, to ”understand” as much of the
spoken text as possible. It would be interesting to
write our own speech recognition engine, but the
effort for this would be too high for our project.
So the main task here is to choose a feasible speech
recognition engine and to use it in a proper way.
There are several speech recognition engines avail-
able with different strengths and weaknesses. For
example, the program ”Dragon Naturally Speak-
ing” is an established commercial software product
that works quite well. But due to the fact that it is
commercial, it is very expensive and thus not the
right solution for a research project. Besides the
one mentioned, there are several other commercial
products that include a speech recognition engine.
Another group of engines are open-source freeware
tools, found spread over the web. Unfortunately, the
tools tested turned out not to have the efficiency
and the quality that we would need for the project.
So we decided to use a product of the third group
- speech recognition engines built into operating systems.
These engines are likely to have a high quality
and a good performance, while they don’t have
to be bought for a lot of money, because they are
included in the operating system that everyone
has to buy anyway.
After some tests and some problems with the
usage of some of the mentioned software products,
we chose the Windows Speech Recognition API
(SAPI 5.1) [9], [10] as a part of our project. The
Windows speech module offers different kinds of
functionality, including a dictation mode (beside
others such as voice commands, text to speech and
others). In this dictation mode, the spoken words
are simply put into plain text, which can be used
by other programs via the programming interface
[8].
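As a minimal sketch, the dictation mode can be reached from C# through the managed System.Speech wrapper over SAPI; the engine configuration actually used in the project is not reproduced here:

    using System;
    using System.Speech.Recognition;   // managed wrapper over SAPI

    class DictationDemo
    {
        static void Main()
        {
            using (var engine = new SpeechRecognitionEngine())
            {
                engine.LoadGrammar(new DictationGrammar());   // free-form dictation mode
                engine.SetInputToDefaultAudioDevice();
                // Recognized phrases arrive as plain text, which the keyword
                // extractor can consume directly.
                engine.SpeechRecognized += (s, e) => Console.WriteLine(e.Result.Text);
                engine.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine();   // keep dictating until Enter is pressed
            }
        }
    }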
2.3 Keyword Extraction
For keyword generation from an input text, the
open-source Natural Language Processing tool
SharpNLP is used [3].
SharpNLP is a collection of natural language pro-
cessing tools written in C#. It provides the
following NLP tools [4]:
• a sentence splitter (to identify sentence bound-
aries)
• a tokenizer (to find tokens or word segments)
• a part-of-speech tagger
• a chunker (to find non-recursive syntactic an-
notation such as noun phrase chunks)
• a parser
• a name finder (to find proper names and nu-
meric amounts)
• a coreference tool (to perform coreference res-
olution)
• an interface to the WordNet lexical database
[14]
This toolkit turned out to be tricky to set up, but
subsequently offers good performance and useful
features. To generate keywords the following fea-
tures are used:
• Sentence Splitting:
Splitting the text based on sentence-ending
punctuation and additional rules (it will not split
"and Mr. Smith said" into two sentences); uses a
model trained on English data
• Tokenizer:
Resolves the sentences into independent to-
kens, based on Maximum Entropy Modeling
[5]
• POS-Tagging (Part Of Speech-Tagging): As-
sociating tokens with corresponding tags, de-
nominating what grammatical part of a sen-
tence it constitutes, based on a model trained
on English data from the Wall Street Journal
and the Brown corpus [7]
• POS-Filtering:
Filtering relevant word categories (nouns and
foreign words) based on the POS-Tags (NN,
NNS, NNP, NNPS, FW)
• Stemming [13]:
Reducing the different morphological versions
of a word to their common word stem, based
on the Porter-Stemmer-Algorithm [2]
The relevance rating of the keywords is calculated
based on word count, actuality and word length.
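The exact relevance formula is not reproduced here; the sketch below is one plausible reading of a score combining word count, actuality (how recently the word occurred) and word length, with illustrative weights of our own choosing:

    using System;

    static class KeywordScoring
    {
        // Hypothetical relevance score; the 0.5/0.3/0.2 weights are assumptions.
        public static double Relevance(string stem, int frequency,
                                       int lastPosition, int totalTokens)
        {
            double count     = frequency;                           // word count
            double actuality = (double)lastPosition / totalTokens;  // later = fresher
            double length    = Math.Min(stem.Length, 10) / 10.0;    // longer words favored
            return 0.5 * count + 0.3 * actuality + 0.2 * length;
        }
    }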
2.4 Query Construction
In order to construct the query, which the system
uses, several steps are performed:
1. each keyword is mapped to its context sen-
tence;
2. using the keyword and its context, the keywords
are matched with a particular Cyc concept;
3. the query is expanded using keywords and con-
cepts;
4. the query is submitted to the Google search engine.
A more detailed description of concept matching
and expansion can be found in Appendix A.
2.4.1 WordNet Similarity Measurement
In order to be able to compare two words or sen-
tences together, a semantic similarity measurement
was needed. The WordNet similarity measurement
was used [11].
2.4.2 Cyc Concepts Matching
Cyc[12] is a very large Knowledge Base (KB) which
tries to gather a formalized representation of a vast
quantity of fundamental human knowledge: facts,
rules, etc. The Cyc KB contains thousands of dif-
ferent assertions and can be used in many different
ways. For the system described in this paper, a
particular subset of Cyc has been used: concepts
and different relations.
In Cyc, every concept is linked to additional knowl-
edge, such as:
• a display name (readable name of the concept)
• a comment (short description)
• general concepts, for example: HumanAdult is
linked to AdultAnimal and HomoSapiens
• specific concepts, for example: HumanAdult is
linked to Professional-Adult, SoftwareEngi-
neer...
• aliases, for example: for HumanAdult there
are: ”adult homo sapiens”, ”grownup” ...
Unfortunately the whole Cyc KB was unavail-
able to the public when our research was done,
therefore CycFoundation.org [6] REST APIs have
been used in order to interact with Cyc. More
about the actual REST API implementation, as
well as the explanation of different algorithms and
specific results can be found in Appendix A.
Since Cyc contains semantic knowledge about
the concepts and not words, a word-to-concept
matching algorithm was created. The algorithm
is built in such a way that it operates only at the
semantic level; it therefore uses the WordNet similarity
measurement [11] (also described in section 2.4.1)
in order to compute a similarity score between two
sentences. The algorithm takes a keyword and
its corresponding sentence; then, using the CycFoun-
dation APIs, the set of relevant Cyc Concepts
(aka Constants) is retrieved. The retrieval
returns only concepts containing the keyword in
their names; for example, for the keyword
"human", the set of concepts would contain "Hu-
manAdult", "HumanActivity" and "HumanBody".
Next, for each item in the set of concepts, the sim-
ilarity score with the keyword’s context (sentence)
is computed and the best one is used.
The following straightforward pseudo-code imple-
mentation illustrates the approach (more about
matching types in A.3):
Input: Keyword, its context (sentence) and a matching type
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    Distance ⇐ ∞
    if MatchingType = DisplayNameMatching then
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC) / 2
    end
    if MatchingType = CommentMatching then
        dCK ⇐ 0
        Comment ⇐ GetComment(Constant)
        Keywords ⇐ GetKeywords(Comment)
        foreach CK in Keywords do
            dCK ⇐ dCK + GetDistance(Keyword, CK)
        end
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
    end
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end
2.4.3 Query Expansion
After the keywords and their respective context are
matched to a particular Cyc concept, the actual
query expansion can be done. One can think about
many different possible query expansions using the
additional Cyc knowledge. In our research we have
chosen several expansion methods, most of them
are quite straightforward. After some experiments,
one particular structured query expansion was cho-
sen to be used in the system: Or-Concept.
The algorithm constructs a structured query, which
can be described using the following formula, where
K is the set of keywords, k ∈ K and c(k) is a func-
tion which gets a matched concept for a keyword
(using a matching algorithm):

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ c(k_i)\bigr) \qquad (1)$$
The following pseudo-code illustrates the approach
used in the system in order to construct the
structured query:
Input: Keywords tagged with Cyc concepts
Output: Set of expanded keywords for Google

foreach TaggedWord in Query do
    Pattern ⇐ "( {0} OR {1} ) + "
    PatternEmpty ⇐ "{0} + "
    Keyword ⇐ TaggedWord.Word
    BestGeneral ⇐ ""
    BestDistance ⇐ ∞
    if TaggedWord.Concept ≠ null then
        GeneralConcepts ⇐ GetGenerals(TaggedWord.Concept)
        foreach General in GeneralConcepts do
            Distance ⇐ GetDistance(TaggedWord.Word, General.DisplayName)
            if Distance < BestDistance then
                BestGeneral ⇐ General.DisplayName
                BestDistance ⇐ Distance
            end
        end
    end
    if BestGeneral.Length = 0 then
        Pattern ⇐ PatternEmpty
    end
    Expanded.Add(String.Format(Pattern, Keyword, BestGeneral))
end
2.4.4 Query for Google Search
In order to receive real data, we need to choose a search engine as well as a corpus. For several reasons, we quickly arrived at the decision to use the Google web search for our application. First of all, Google is the most popular search engine that one can find on the web. This is a great advantage, because an application programming interface (API) is required in order to make interaction between our program and the search engine possible. The bigger and the more popular a search engine is, the more likely it is that a good API can be found for it. This turned out to be only partially true. Google provides a lot of programming interfaces, most of which are specialized for a specific context. So it isn't easy to find the best fitting programming interface for an application, in addition to the "usual" problems one may encounter during the implementation.

Another reason to choose the Google web search is the fact that it can easily be restricted to one or a set of several web pages. This can be done by simply adding the tag "site:", followed by the desired web page, to the keywords one wants to search for. As a result, the search only finds results that are somewhere on the indicated pages. As we don't want to find arbitrary results, but results from specific, serious pages, we really want to use this feature of the Google web search. If one does a web search (either with the Google engine or somewhere else), a lot of results are found on very different pages and types of pages. This is of course important, since that is what most people want when they search the web. But with the information retrieval system, the aim is different. We don't want to display content from any website that contains the keywords, but only from websites with trustable content. So a white list with the sites that will be searched is the best way of avoiding useless results. Due to the fact that the results of the web search contain lots of HTML tags and similar things that are (for our purpose) useless, some knowledge about the structure of the found results is also required. By limiting the search to defined web pages, we can use our information about these pages to parse the results, depending on the web site they were found on.

In a first approach, we limited the search only to articles on en.wikipedia.org. Wikipedia is a very well-known and well-accepted page that contains useful information about a lot of domains. Retrieving data from only one page may seem too little, but it is enough in order to set up a working and useful system. In addition, the system can easily be enhanced by adding other sites.
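A minimal sketch of assembling such a restricted query; the joining convention follows the example queries shown in Appendix A.5.1:

    using System;

    static class GoogleQuery
    {
        // Joins (possibly expanded) keyword groups with "+" and appends the
        // whitelist restriction, e.g. "species + theory + site:en.wikipedia.org".
        public static string Build(string[] terms, string site)
        {
            return string.Join(" + ", terms) + " site:" + site;
        }
    }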
2.5 Results Filtering
2.5.1 Retrieving Unstructured Web Pages
The search on the web engine gives a list of URLs
as its result. Each of the pages can be downloaded
as text, but is then an HTML-formatted web page.
There is some work left to do to receive only the
real content. First, we want to get rid of the
HTML tags and any other formatting data. This
task has to be performed by applications of very
different types. For example, every web browser
has to use functionality that distinguishes between
content that will be displayed and content that
will not. It is usually best to use existing,
tested and working packages, which can be found
in web browsers.
The Microsoft Internet Explorer performs this
task using a dynamic link library (DLL) called
”mshtml”. This library can easily be used to parse
a complex HTML file into plain text that contains
only those parts that would be displayed when one
visits this web page with the Internet Explorer
(or another browser). This does the main part of
this step.
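A sketch of this step, assuming a COM reference to the Microsoft HTML Object Library (mshtml); the interop details in the actual project may differ:

    using mshtml;   // COM reference: Microsoft HTML Object Library

    static class HtmlText
    {
        // Let the Internet Explorer engine build the DOM, then read back only
        // the text a browser would actually display.
        public static string Strip(string html)
        {
            var doc = (IHTMLDocument2)new HTMLDocument();
            doc.write(html);
            doc.close();
            return doc.body != null ? doc.body.innerText : string.Empty;
        }
    }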
Many web pages don’t only contain the ”real”
information, but also a lot of other things, such
as navigation links, menus, advertisements, and so
on. These are displayed by the browser and thus
seen by the user. If the user of the information
retrieval system chooses to display one of these
sites, he may want to see this content. But
due to the fact that the texts from the results are
summarized in the next step, we have to remove
them from the text, because they don’t contain any
information that is relevant for the information
contained in the page itself.
The format of these menus etc. depends on the
web page where the result was found. The results
on Wikipedia are formatted in a different way from
the results on other lexicographical web pages. Thus,
it is useful to handle the results depending on the
web page they were found on.
As the web search in this approach has been
limited to only one page, namely en.wikipedia.org,
we have to consider only one page in order to
remove ”waste” like links and menus from it. An
analysis of the structure revealed an easy way
of parsing the Wikipedia articles. All articles on
Wikipedia start with the article itself - beside some
HTML-tags and a lot of scripts. The menus and
navigation links follow after it. That means that
we can simply work on the first part of the article
and skip everything that follows afterwards. The
edge between article and links is clearly defined by
the HTML tag '<div class="printfooter">', which is
contained in every article on Wikipedia. So we can
simply search the results from Wikipedia for the
tag mentioned above, and remove everything that
follows this tag. It should be mentioned that
this truncation has to be done before the parsing
step performed by the Microsoft dynamic link
library, because otherwise one would not be able
to find the HTML tag used as the edge.
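A minimal sketch of this truncation step; it must run on the raw HTML, before the mshtml text extraction described above:

    using System;

    static class WikipediaCleaner
    {
        // Everything after the printfooter marker is navigation, menus and
        // links, so it is simply cut off.
        public static string TruncateAtPrintFooter(string rawHtml)
        {
            const string marker = "<div class=\"printfooter\">";
            int cut = rawHtml.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
            return cut >= 0 ? rawHtml.Substring(0, cut) : rawHtml;
        }
    }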
2.5.2 Summarizing the Web Content
A document contains important, but also a lot
of irrelevant, information. Additionally the screen
space to present the information is intentionally
limited to allow a quick evaluation by the user.
Because of this the documents are summarized in
order to present only the important information to
the user. For this we use the Open Text Summa-
rizer [1]. This tool doesn't abstract the document
in a natural way, because it does not rephrase the
text. It just produces a condensed version of the
original by keeping only the important sentences.
However, it shows good results for non-fictional
text and can be used with unformatted and HTML-
formatted text. It has received favourable mention
from several academic publications and it is at
least as good as commercial tools such as Copernic
and Subject Search Summarizer.
The Open Text Summarizer parses the text
and utilizes Porter stemming. For this, an XML file
with the parsing and stemming rules is used. It
calculates the term frequency for each word and
stores this in a list. After this, a stop-word filter
is applied. It removes all redundant common
words in the list by using a stop word dictionary,
which is also stored in the XML file. After sorting
the list by frequency the keywords are determined,
which are the most frequently occurring words.
The original sentences are scored based on these
keywords. A sentence that holds many important
words, the keywords, is given a high grade. The
result is a text with only the highest scored
sentences. For this, the limiting factor can be set to
either a percentage of the original text or a number
of sentences.
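The sketch below condenses that pipeline into a few lines of C#. It mimics the frequency-based scoring idea (stop-word removal, stemming, term frequency, sentence scoring) rather than reproducing the actual Open Text Summarizer code; the stemmer and stop-word list are passed in as assumptions:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class NaiveSummarizer
    {
        public static IEnumerable<string> Summarize(
            IList<string> sentences, Func<string, string> stem,
            ISet<string> stopWords, double ratio)
        {
            // Term frequency over stemmed, stop-word-filtered tokens.
            var freq = new Dictionary<string, int>();
            foreach (var word in sentences.SelectMany(Tokens))
                if (!stopWords.Contains(word))
                {
                    string t = stem(word);
                    freq[t] = freq.ContainsKey(t) ? freq[t] + 1 : 1;
                }

            // Score each sentence by the frequency of its keywords and keep
            // the requested fraction of top-scoring sentences.
            int keep = Math.Max(1, (int)(sentences.Count * ratio));
            return sentences
                .OrderByDescending(s => Tokens(s).Sum(
                    w => freq.ContainsKey(stem(w)) ? freq[stem(w)] : 0))
                .Take(keep);
        }

        static IEnumerable<string> Tokens(string sentence)
        {
            return sentence.ToLower().Split(
                new[] { ' ', ',', '.', ';', ':' },
                StringSplitOptions.RemoveEmptyEntries);
        }
    }

A real summarizer would also restore the selected sentences to their original order before display; the sketch only selects them.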
2.6 Visualization
Visualization is probably one of the most impor-
tant parts of the system. It consists of two main
parts:
1. The ”Presenter-View”, the view that the pre-
senter sees on his computer during the pre-
sentation. The presenter should be able to see
the dataflow, manipulate it and give simple com-
mands such as: "show this image" or "show that
article". The presenter also needs a preview of
what is shown to the public.
2. The "Public-View", the simplified view that
shows only the needed information and mir-
rors the lecturer's interactions with the presen-
ter's view pre-rendering zone.
Figure 2: Presenter’s view layout structure
For the Graphical User Interface (GUI), several
crucial specifications were defined, most impor-
tantly:
• rendering and screen mirroring should be done
in real-time;
• layout of the GUI should be as simple as pos-
sible, presenting the retrieved data in the sim-
plest and fastest way for the presenter;
• it should also be interactive; the images and
articles should be manipulable in real-time.
Figure 3: Real-time parallelization workflow
In order to achieve a smooth experience, several things must be done at the same time. As figure 3 shows, the two main components, rendering and information extraction, were completely separated and run in several threads. Moreover, the speech recognition part and the Google search were also separated, and all searches are done asynchronously. Therefore, the system leverages modern multi-core architectures in order to achieve a significant speedup. In order to present the data in the simplest way, several layout prototypes have been tested. Finally, the one shown in figure 2 and screenshots 4 and 6 has proven to be a good way to present the information. The system supports rendering to several screens, and the actual mirroring and the development of the user interfaces have been achieved using the novel Microsoft Windows Presentation Foundation (WPF) technology.
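A minimal sketch of this decoupling, assuming a WPF Dispatcher for marshalling results back to the rendering thread; SearchAndSummarize and ShowResults are placeholders for the pipeline and view described above:

    using System;
    using System.Threading;
    using System.Windows.Threading;

    class SearchCoordinator
    {
        readonly Dispatcher uiDispatcher;   // the WPF rendering thread's dispatcher
        public SearchCoordinator(Dispatcher d) { uiDispatcher = d; }

        // Fire the retrieval pipeline on a worker thread so recognition and
        // rendering never block on network or summarization work.
        public void SearchAsync(string query)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                var results = SearchAndSummarize(query);   // hypothetical pipeline call
                uiDispatcher.BeginInvoke((Action)(() => ShowResults(results)));
            });
        }

        string[] SearchAndSummarize(string q) { return new string[0]; }  // placeholder
        void ShowResults(string[] r) { }                                 // placeholder
    }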
WPF technology is especially designed for rich user interface development, and is built on top of the .Net Framework. It uses a markup language known as XAML to provide a clear separation between the code and the design definition, which also greatly helped in the parallelization process. It also features 3D and video/audio rendering capabilities and animation storyboards (similar to Flash). Therefore our system can be extended in order to perform
Figure 4: Screenshot of the GUI; the rendering
view is split automatically between several im-
ages and an article
video search or do 3D enhanced presentations. The
rendering engine takes advantage of modern hard-
ware (dedicated GPUs ...). One of the features of
Figure 5: Screenshot of the GUI, the public view
our system is the dynamic layout management of
the render-view. As shown in figures 4 and 6, the
rendering zone automatically adjusts itself depend-
ing on the quantity of content presented. With such
capabilities the system makes sure to use the whole
screen without leaving much empty space.
Figure 6: Screenshot of the GUI, the rendering view
shows only images since no articles have been se-
lected by the lecturer
3 Tests and Results
3.1 Speech Recognition
In the context of this project it is important that
the speech recognition engine has a high quality
of recognition and works fast. For this reason
these two aspects have been evaluated.
After a two hour training phase the engine
recognized about 70% of the spoken words cor-
rectly. The recognition rate can be improved by
further training. To achieve a good recognition
rate it is necessary to speak loudly and articulately
and to pronounce the words always in the same
way. Additionally, punctuation marks are not
recognized automatically but have to be spoken
explicitly. This is, however, not a natural way to
speak at a presentation.
After speaking some sentences a break of a
few seconds has to be made to initiate the recog-
nition process for these sentences. The recognition
process also needs a few seconds depending on the
number of words. The tests showed a recognition
speed of about four words per second. Again, these
frequent breaks of a few seconds are not a natural
way to speak at a presentation.
3.2 Keyword Extraction
The keywords extraction module has been tested
with different features and numbers of words. The
computing time and the quality of the extracted
keywords have been evaluated.
The results are shown in figures 7 and 8. The
tests have been made for the complete feature set
(explained in chapter 2.3) and with the stemmer
disregarded. Due to the limitations of the speech
recognition engine (no punctuation marks if not
pronounced explicitly) the keyword extraction has
also been evaluated for texts without punctuation
marks.
It shows that the quality of the keyword extraction
without stemmer is generally lower than with the
complete feature set. Especially if the keywords
are to be used in the query for the internet search, the
use of the stemmer shows advantages. It avoids, for
example, searching for the singular and plural of the
same keyword. The relevancy of the
keywords is also improved. It also shows that
the gain in computation time by disregarding
the stemmer is minimal. With 1200 words the
gain is only 120 ms. For the expected relatively
short sentences usually provided by the speech
recognition engine the gain is even less.
The tests also showed that for text without
punctuation marks the quality is nearly the same
as for the same text with punctuation marks.
3.3 Query Construction and Expan-
sion
As figure 9 shows, the Or-Concept algorithm for
query expansion yields a significant improvement
in precision. In that figure, a 10-word query is
compared to an expanded 10-word query, and at N = 5
one can see an improvement of almost 3 times. Cyc-
enhanced queries give better results, but should be
evaluated further (more about the different experiments
and query enhancements in Appendix A).
Figure 7: time results for an extraction of ten key-
words (with different numbers of words)
Figure 8: quality of results for an extraction of ten
keywords (with different numbers of words)
3.4 Results Filtering
The summarizer module has been tested with
and without stemmer for different summarizing
Figure 9: Precision at N, 10 word query with no
keyword stemming
percentages of the original unformatted text. The
quality of the results represents their usability,
which has been evaluated from the meaningfulness
of the summary and its length. A shorter length
was considered more usable because it allows
a quicker evaluation by the user.
Figure 10 shows that in these tests with the stemmer
the best-quality results were achieved for
a 15% summary. Without the stemmer the quality
was usually a little lower.
The calculation time with and without stemmer
for different summarizer percentages is shown in
figure 11. The difference between the different
configurations is minimal.
When comparing unformatted text and HTML-
formatted text (with a similar amount of content),
the tests show that the HTML-formatted text can
take significantly longer to process.
4 Discussion
As explained in the preceding chapters, the sys-
tem works in its main parts. It is clearly possible
to enrich web search as well as presentation of
contents with different tools related to text mining,
Figure 10: quality of results for different summary
percentages for a text with 1200 content words
Figure 11: Calculation times for different summary
percentages for a text with 1200 content words
information retrieval, and others. But many
problems also turned up that had either to be solved
or to be worked around.
The speech recognition module did work, but still
had some problems, which lie in the implementa-
tion of the used software. The speech recognition
itself works fine, but the chosen engine has some
problems e.g. in recognizing accents, breaks
and similar. These often result in an incorrectly
recognized word or sentence. So either another,
better engine has to be found or another solution has
to be found in order to meet the requirements.
The keyword extraction from the spoken text
works well and is also very efficient. But still,
this step can be enhanced. For example, the
user may want to influence the chosen keywords
more directly than he does by speaking. So it
would be a useful feature if the user could decide
manually which keywords are good and important
and which are not relevant. The
keywords provided by the system could then be
viewed as suggestions that the user can accept or
decline. Another nice feature related to the first
one mentioned is a self-learning algorithm that
tries to predict the user’s decision.
Another part that works fine but can still be
improved is the summarization of the found texts.
It produces good summaries of the found web
pages without consuming too much computation
time. But the result could perhaps be enhanced
by using features such as coreference resolution,
leading to a better estimation of which referents
are important and which are not. Before
putting a big effort into it, however, it should first
be checked whether it is worth the effort.
5 Conclusions
In this paper the design and implementation of an In-
teractive Web Retrieval and Visualization System
have been discussed. The system has proven robust
and satisfies the real-time constraints of a live presen-
tation. Tests and results have shown that both key-
word extraction and query expansion return satis-
fying results, keeping good precision. During the
experiments it has also been shown that several
parts of such a system still need improvement: speech
recognition needs to be enhanced and properly pre-
trained, unstructured web data should be properly
cleared of all noise, and summarization can be
enhanced, creating more concise articles.
Overall, such systems can be built with modern
hardware and parallelization, and hopefully, more
of those will be seen as commercial products in the
near future.
References

[1] Open Text Summarizer. http://libots.sourceforge.net.

[2] Porter stemmer. http://tartarus.org/~martin/PorterStemmer/.

[3] SharpNLP - open source natural language processing tools. http://www.codeplex.com/sharpnlp.

[4] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 60. Cambridge University Press, 2006.

[5] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 318. Cambridge University Press, 2006.

[6] Cyc Foundation. http://www.cycfoundation.org/.

[7] W.N. Francis and H. Kucera. Brown corpus manual.

[8] M. Harrington. Giving computers a voice. http://blogs.msdn.com/coding4fun/archive/2006/10/31/909044.aspx.

[9] J. Moskowitz. Speech recognition with Windows XP. http://www.microsoft.com/windowsxp/using/setup/expert/moskowitz_02september23.mspx.

[10] J. Moskowitz. Windows speech recognition. http://www.microsoft.com/windows/windows-vista/features/speech-recognition.aspx.

[11] T. Ngoc Dao and T. Simpson. Measuring similarity between sentences. CodeProject article.

[12] Open Cyc Project. http://www.cyc.com/opencyc.

[13] J. C. Scholtes. Text mining: preprocessing techniques, part 1.

[14] Princeton University. WordNet, a lexical database for the English language. http://wordnet.princeton.edu/.
A Appendix: Query Expansion using Cyc

A.1 WordNet Similarity Measurement
In order to be able to compare two words or sen-
tences together, a semantic similarity measurement
was needed. For this the WordNet similarity mea-
surement was used.
The following steps are performed to compute
the semantic similarity between two sentences [11]:
• each sentence is partitioned into a list of tokens
and the stop words are removed;
• words are stemmed;
• part of speech tagging is performed;
• the most appropriate sense for every word in
a sentence is found (Word Sense Disambigua-
tion). To find out the most appropriate sense
of a word, the original Lesk algorithm was used
and expanded with the hypernym, hyponym,
meronym, troponym relations from WordNet.
The possible senses are scored with a new scor-
ing mechanism based on Zipf's law and the
sense with the highest score is chosen.
• the similarity of the sentences is computed based
on the similarity of the pairs of words. In order
to do this, a semantic similarity relative matrix
is created, consisting of the semantic similarity
between pairs of word senses (the most appropriate
sense of each word). The Hungarian method is
used to get the semantic similarity between
sentences. The match results are combined to
compute a single similarity value for the two
sentences. The matching average is used to compute
the semantic similarity between two word senses.
This similarity is computed by dividing the sum of
similarity values of all match candidates of both
sentences by the total number of set tokens.
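Under our reading of [11], this final step can be written as a matching average over the two token sets (a formalization we add for clarity, not quoted from the source):

$$\mathrm{sim}(S_1, S_2) = \frac{2 \sum_{(w_i, w_j) \in M} \mathrm{sim}(w_i, w_j)}{|S_1| + |S_2|}$$

where $S_1$ and $S_2$ are the token sets of the two sentences and $M$ is the set of word-sense pairs matched by the Hungarian method.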
A.2 CycFoundation.org REST API
As mentioned in section 2.4.2, the system used
REST APIs in order to access the CycFounda-
tion.org Cyc KB. REST stands for Representa-
tional State Transfer, a way to build a service-ori-
ented architecture based on HTTP and XML, gen-
erally using the GET or POST methods of the
HTTP protocol.
The CycFoundation web services expose only a sub-
set of Cyc's capabilities; the API implemented in
our system is therefore rather small. It contains the
following queries:
• GetConstants(Keyword) - performs a search
for Cyc concepts for a keyword
• GetComment(Concept) - returns a comment
for a particular concept
• GetCanonicalPrettyString(Concept) - returns a
simplified name for a concept
• GetDenotation(Concept) - returns a denota-
tion for a particular concept
• GetGenerals(Concept) - returns a set of gen-
eral concepts for a particular concept
• GetSpecifics(Concept) - returns a set of spe-
cific concepts for a particular concept
• GetInstances(Concept) - returns a set of in-
stances (concepts) for a particular concept
• GetIsA(Concept) - returns a set of Is A con-
cepts for a particular concept
• GetAliases(Concept) - returns a set of aliases
(words or phrases) for a particular concept
• GetSiblings(Concept) - returns a set of sibling
concepts for a particular concept
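As an illustration only, a GetConstants-style call could look like the following; the endpoint URL and XML element names are assumptions, since the exact CycFoundation.org REST paths are not reproduced in this paper:

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Xml;

    static class CycClient
    {
        public static List<string> GetConstants(string keyword)
        {
            // Hypothetical endpoint and response layout; adjust both to the
            // real CycFoundation.org service description.
            string url = "http://ws.cycfoundation.org/getConstants?keyword="
                         + Uri.EscapeDataString(keyword);
            using (var client = new WebClient())
            {
                var doc = new XmlDocument();
                doc.LoadXml(client.DownloadString(url));
                var names = new List<string>();
                foreach (XmlNode node in doc.SelectNodes("//constant"))
                    names.Add(node.InnerText);
                return names;
            }
        }
    }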
A.3 Cyc Concepts Matching
The general algorithm described in section
2.4.2 was derived from two algorithms:
display name matching and comment matching.
These algorithms have been tested separately.
A.3.1 DisplayName Matching Algorithm
The display name used in Cyc is the shortest con-
cept description. For example, the display name for
the "HumanAdult" concept is simply "a human adult".
The matching algorithm therefore computes the
distance between the keyword context and this
description, as follows:
Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC) / 2
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end
This algorithm has proven to be quite efficient since
less computation needs to be done, but on the other
hand it tends to provide quite poor results and
should be enhanced for proper use.
A.3.2 Comment Matching Algorithm
The comment matching algorithm uses the addi-
tional knowledge about a Cyc concept called the
comment. It is a long description which tends to be
as specific as possible. For example, again for the
"HumanAdult" concept, the comment is:
”A specialization of Person, and an in-
stance of HumanTypeByLifeStageType.
Each instance of this collection is a per-
son old enough to participate as an inde-
pendent, mature member of society. In
most modern Western contexts it is as-
sumed that anyone over 18 is an adult.
However, in many cultures, adulthood oc-
curs when one reaches puberty. Adult-
hood is contiguousAfter (q.v.) childhood.
Notable specializations of this collection
include AdultMaleHuman, AdultFemale-
Human, MiddleAgedHuman and OldHu-
man.”
The pseudo-code implementation of this
algorithm, where the GetComment method is an
actual REST API call, is as follows:
Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dCK ⇐ 0
    Comment ⇐ GetComment(Constant)
    Keywords ⇐ GetKeywords(Comment)
    foreach CK in Keywords do
        dCK ⇐ dCK + GetDistance(Keyword, CK)
    end
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end
A.3.3 Experiments with both algorithms
After running several experiments (Fig. 13) com-
paring the Comment Matching and Display Name
algorithms, we found that Comment Matching is
the one that gives significantly better results, but it
can still be enhanced to get even better matching.
This may be done by using general/specific con-
cepts in the distance calculation formula or by
learning weights.
A.4 Query Expansion Patterns
In our research, several different query expan-
sion algorithms have been implemented, all with
quite straightforward implementations, as explained
in section 2.4.3. The algorithms, which can be de-
scribed using the following formulas, where K is the
set of keywords and k ∈ K, have been implemented
and evaluated:
Or-Concept Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ c(k_i)\bigr) \qquad (2)$$

where c(k) is a concept linked with the keyword.

Or-Aliases Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\Bigl(k_i,\ \bigvee_{j=1}^{m} a_j(k_i)\Bigr) \qquad (3)$$

where a(k) is an alias linked with the keyword's concept.

Or-Most-Relevant-Alias Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ a(k_i)\bigr) \qquad (4)$$

where a(k) is the most relevant alias linked with the keyword's concept.

Or-Most-Relevant-General Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ g(k_i)\bigr) \qquad (5)$$

where g(k) is the most relevant general concept linked with the keyword's concept.

Or-Most-Relevant-Specific Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ s(k_i)\bigr) \qquad (6)$$

where s(k) is the most relevant specific concept linked with the keyword's concept.

Or-Is-A-Concept Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ isa(k_i)\bigr) \qquad (7)$$

where isa(k) is the most relevant 'Is A ...' concept linked with the keyword's concept.

Or-Most-Relevant-AGS Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ a(k_i),\ g(k_i),\ s(k_i)\bigr) \qquad (8)$$

where a(k) is the most relevant alias, g(k) the most relevant general concept, and s(k) the most relevant specific concept linked with the keyword's concept.
A.5 Results and Discussion
A.5.1 Google Search on Wikipedia KB
Our system has been tested and is able to generate
queries for the Google search engine; however, it is
very difficult to evaluate such results, therefore this
section of the paper only gives an example of
results.
Using as input an introductory text of the Stanford
Encyclopedia for the topic of 'evolution', the
system derived several keywords:
• ”species”, matched to ”speciesImmunity”
• ”theory”, matched to ”theoryOfBeliefSystem”
• ”evolution”, matched to ”Evolution”
• ”change”, matched to ”changesSlot”
• ”term”, matched to ”termExternalIDString”
Next, several queries have been constructed and
restricted to the Wikipedia KB:
• Google Non-Expanded Query: ”species theory
evolution change term site:en.wikipedia.org”
• Google Or-Concepts Query: ”( species OR
species immunity ) + ( theory OR Theory Of
Belief System ) + ( evolution OR biological
evolution ) + ( change OR Changes Slot )
+ ( term OR Term External ID String ) +
site:en.wikipedia.org”
Figure 12: Precision at N, results of Or-Concept
expanded query with and without stemming com-
pared
• Google Or-Aliases Query: ”species + theory +
( evolution OR (biologically will have evolved
OR biologically had evolved OR biologically
will evolve OR biologically has evolved OR bi-
ologically have evolved OR biologically evolv-
ing OR biologically evolves OR biologically
evolved OR biologically evolve OR most evo-
lutionary OR more evolutionary OR evolu-
tionary OR evolution) ) + change + term +
site:en.wikipedia.org”
• Google Or-Most-Relevant-General Query:
”species + theory + ( evolution OR
”development” ) + change + term +
site:en.wikipedia.org”
• ...
After analyzing the results, we compared several
of the results returned by Google. For the normal,
non-expanded query the results are quite satisfy-
ing:
* Evolution
* Punctuated equilibrium
* Evolution as theory and fact
* Macroevolution
* History of evolutionary thought
* On the Origin of Species
Figure 13: Precision at N, results of Or-Concept
expanded query using different matching
* Species
* Hopeful Monster
However, the source article mentioned Charles
Darwin and his name was not among the derived
keywords, so no direct Wikipedia link to his theory
of natural selection was found in the non-expanded
results. After deeper analysis, it was actually
found in the results of the expanded query (Or-Most-
Relevant-General):
* Evolution
* Punctuated equilibrium
* Macroevolution
* Evolution as theory and fact
* Charles Darwin
* On the Origin of Species
* Hopeful Monster
* Natural selection
A.5.2 Lemur Search in AP Corpus
In order to perform an evaluation of the query ex-
pansion methods, structured query generation for
the Lemur search engine has been implemented too.
The queries have been automatically derived from
the narratives of topics 101 to 115 of the AP Corpus
Figure 14: Precision at N, expansions of a 5 word
query compared (part 1)
and a batch evaluation of those queries has been
performed.
Several different sets of results have been produced
by the IREval function of Lemur; some of them are:
• 5-10 word query with no keyword stemming;
• 5-10 word query with keyword stemming;
• 5-10 word query using name matching.
Most of the experiments have been performed
with the Comment Matching Algorithm and the
whole set of query expansion patterns. Some experiments
have been performed in order to compare the Dis-
play Name and Comment Matching Algorithms.
The keyword stemming part was especially tested
in order to suppress plurals, since the keywords
derived from the AP Corpus topics usually kept
them. So, for example, "changes" was stemmed to
"change", which gives a larger set of possible con-
cepts to match. During our experiments (Figure 12)
we found that using keyword stemming in
those algorithms significantly decreases the preci-
sion for larger queries. One possible explanation
for this is that the matching algorithms
make more errors, since the possible concept
set is larger.
In figures 14 and 15 one can see that only
the Or-Concept algorithm gives better precision
Figure 15: Precision at N, expansions of a 5 word
query compared (part 2)
Figure 16: Precision at N, expansions of a 10 word
query compared (part 1)
at N = 5 and above. The Or-Most-Relevant-Alias
and Or-Most-Relevant-General give a precision
improvement only at N > 20.
Finally, it is in figures 16 and 17 that
one can really see the improvement made by
stemming and especially by Or-Concept query ex-
Figure 17: Precision at N, expansions of a 10 word
query compared (part 2)
pansion. While the 5-word query expansion (at N = 5)
in this particular algorithm gives an improvement
of almost 50 percent, with a 10-word query the
improvement is almost of the order of 200 percent
(a significant increase in precision).
The results speak for themselves; we think that
Cyc-enabled query expansion is definitely the way
to go, but better patterns still need to be built.
In our experiments, out of 7 different patterns only
one actually yields a good precision in-
crease: the Or-Concept pattern.
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Dernier (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

A.1 WordNet Similarity Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
A.2 CycFoundation.org REST API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
A.3 Cyc Concepts Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
A.3.1 DisplayName Matching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
A.3.2 Comment Matching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
A.3.3 Experiments with both algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
A.4 Query Expansion Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
A.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
A.5.1 Google Search on Wikipedia KB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
A.5.2 Lemur Search in AP Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1 Introduction

Why is there a need for an Interactive Web Information Retrieval and Visualization System?

A person who teaches at school, at university or elsewhere is often required to present various types of information (his own knowledge, texts written by others, pictures, diagrams etc.) more or less at once. To do this, he collects his data beforehand, arranges it and sets up a presentation using an ordinary presentation tool such as Microsoft PowerPoint. During his lecture, he starts the presentation, showing one slide after another and explaining what can be seen on it.

Compared to what lessons were like some decades ago, this is of course a great increase in the possibilities of how information can be presented and how the listeners' attention can be kept. But there are still several problems with this approach. First, all the work has to be done beforehand, so a long time must be spent on the presentation before anything useful can be shown. Another problem is that the prepared presentation is very static: presentations about topics that change over time have to be updated again and again with new information and media from the web. Further, during the presentation one may learn that the audience's knowledge about some of the covered subjects differs from what one had expected. If the listeners know more than expected, that is not a real problem - some slides can be skipped. But what if they know less than expected? It is impossible to make new slides during the presentation, so either the problem has to be ignored or one has to search for the missing information during the presentation.

Another hardship with the usual presentations is the research itself. Finding information is nowadays very easy: one goes to Google or another search engine, enters a query and looks through the first results. But it is not always so simple, because one does not always know what one is searching for and where it can be found. In addition, one often receives a lot of results, and the precision is too low to find the relevant information quickly.

In our project, we want to address all these problems and difficulties. Our aim is to develop a system that makes the research easier and more efficient, and that also increases the quality and changeability of the presentation.

To enhance the research itself, the program should increase the precision of the search by searching for concepts instead of keywords. If done properly, all the results of the search are relevant for the topic and can be used in some way. This reduces the time spent on evaluating each of the search results. In addition, the texts on the resulting pages can be summarized, further reducing the time needed to evaluate the results. Another possibility is to do the usual web search and the image search at once, because e.g. tables can be found in both formats.

To make the system usable during the presentation itself, the search query should be constructed not solely from keyboard input, but also directly from spoken text, in a way that lets the presenter choose the currently wanted mode of communication. The results have to be displayed in a way that makes the important parts directly visible, while other things are suppressed.
The user of the system has to be able to decide which contents (texts, images etc.) are shown and which are not.

To meet all these requirements, the system has to consist of the following steps:

1. Recognize the spoken text
2. Retrieve keywords from the text
3. Create a powerful search query
4. Filter and summarize the results
5. Visualize the results in a proper way

These steps are further described in the following chapters.

2 Web Retrieval

2.1 Architectural Overview

The spoken words of the presentation are recognized and translated by a speech recognition engine. The output of the speech recognition is fed to a keyword extractor to create the keywords of the text. These are used by a query construction module to produce an internet search query. The picture results of the search are directly transferred to the visualization module.
The text results are summarized by a summarizer module and then also displayed by the visualization module. The workflow is shown in figure 1.

Figure 1: workflow

2.2 Speech Recognition

The user of the information retrieval system has several ways of interacting with the system, but the most important one is speech. It is used to obtain a starting point as well as to define the wanted content more precisely. Thus, our application needs a well-working speech recognition engine to "understand" as much of the spoken text as possible. It would be interesting to write our own speech recognition engine, but the effort for this is too high for our project. So the main task here is to choose a feasible speech recognition engine and to use it in a proper way.

There are several speech recognition engines available, with different strengths and weaknesses. For example, "Dragon NaturallySpeaking" is an established commercial software product that works quite well. But being commercial, it is very expensive and thus not the right solution for a research project. Besides the one mentioned, there are several other commercial products that include a speech recognition engine. Another group of engines are open-source tools spread across the web; unfortunately, the tools we tested turned out not to have the efficiency and quality that we would need for the project. So we decided to use a product of a third group: engines built into operating systems. These engines are likely to have high quality and good performance, and they do not have to be bought separately, because they are included in the operating system that everyone has to buy anyway.

After some tests and some problems with the usage of some of the mentioned software products, we chose the Windows Speech Recognition API (SAPI 5.1) [9], [10] for our project. The Windows speech module offers different kinds of functionality, including a dictation mode (besides others such as voice commands and text to speech). In this dictation mode, the spoken words are simply turned into plain text, which can be used by other programs via the programming interface [8].
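To make the dictation setup concrete, the following minimal C# sketch wires a dictation grammar to the default microphone and prints each recognized utterance. It uses the managed System.Speech wrapper (which sits on top of the Windows speech engine) rather than the raw SAPI 5.1 interfaces available in 2009, so it illustrates the idea rather than reproducing the system's actual code; it requires a reference to the System.Speech assembly.

    using System;
    using System.Speech.Recognition; // managed wrapper around the Windows speech engine

    class DictationDemo
    {
        static void Main()
        {
            // Create a recognizer bound to the default microphone.
            using (var recognizer = new SpeechRecognitionEngine())
            {
                // Dictation mode: free-form text rather than a fixed command grammar.
                recognizer.LoadGrammar(new DictationGrammar());
                recognizer.SetInputToDefaultAudioDevice();

                // Each recognized utterance arrives as plain text, which the
                // pipeline can forward to the keyword extractor.
                recognizer.SpeechRecognized += (sender, e) =>
                    Console.WriteLine("Recognized: " + e.Result.Text);

                recognizer.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine(); // keep listening until Enter is pressed
            }
        }
    }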
2.3 Keyword Extraction

For the keyword generation from an input text, the open source Natural Language Processing toolkit SharpNLP is used [3]. SharpNLP is a collection of natural language processing tools written in C#. It provides the following NLP tools [4]:

• a sentence splitter (to identify sentence boundaries)
• a tokenizer (to find tokens or word segments)
• a part-of-speech tagger
• a chunker (to find non-recursive syntactic annotation such as noun phrase chunks)
• a parser
• a name finder (to find proper names and numeric amounts)
• a coreference tool (to perform coreference resolution)
• an interface to the WordNet lexical database [14]

This toolkit turned out to be tricky to set up, but subsequently offers good performance and useful features. To generate keywords, the following features are used:

• Sentence splitting: splitting the text based on sentence punctuation and additional rules (it will not split "and Mr. Smith said" into two sentences), using a model trained on English data.
• Tokenization: resolving the sentences into independent tokens, based on maximum entropy modeling [5].
• POS tagging (part-of-speech tagging): associating tokens with corresponding tags denominating what grammatical part of a sentence they constitute, based on a model trained on English data from the Wall Street Journal and the Brown corpus [7].
• POS filtering: keeping relevant word categories (nouns and foreign words) based on the POS tags (NN, NNS, NNP, NNPS, FW).
• Stemming [13]: reducing the different morphological versions of a word to their common word stem, based on the Porter stemmer algorithm [2].

The relevance rating of the keywords is calculated based on word count, actuality and word length. A sketch of this pipeline is given below.
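The following sketch illustrates the extraction pipeline just described: POS filtering on the listed tags, stemming, and a count-plus-length relevance score. The IPosTagger and IStemmer interfaces are hypothetical stand-ins for the SharpNLP tagger and the Porter stemmer, and the scoring weights are illustrative; the paper's actual rating additionally considers actuality, which is omitted here.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical stand-ins for the SharpNLP tagger [3] and Porter stemmer [2].
    interface IPosTagger { string[] Tag(string[] tokens); }
    interface IStemmer { string Stem(string word); }

    class KeywordExtractor
    {
        // POS tags kept by the filter, as listed in section 2.3.
        static readonly HashSet<string> KeptTags =
            new HashSet<string> { "NN", "NNS", "NNP", "NNPS", "FW" };

        readonly IPosTagger tagger;
        readonly IStemmer stemmer;

        public KeywordExtractor(IPosTagger tagger, IStemmer stemmer)
        {
            this.tagger = tagger;
            this.stemmer = stemmer;
        }

        // Returns the top-n stems scored by frequency and word length
        // (a simplified version of the paper's relevance rating).
        public IEnumerable<string> Extract(string[] tokens, int n)
        {
            string[] tags = tagger.Tag(tokens);
            var scores = new Dictionary<string, double>();

            for (int i = 0; i < tokens.Length; i++)
            {
                if (!KeptTags.Contains(tags[i])) continue;       // POS filtering
                string stem = stemmer.Stem(tokens[i].ToLower()); // stemming
                scores.TryGetValue(stem, out double s);
                scores[stem] = s + 1.0 + 0.1 * tokens[i].Length; // count + length bonus
            }

            return scores.OrderByDescending(kv => kv.Value)
                         .Take(n)
                         .Select(kv => kv.Key);
        }
    }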
2.4 Query Construction

To construct the query used by the system, several steps are performed:

1. each keyword is mapped to its context sentence;
2. using the keyword and its context, each keyword is matched with a particular Cyc concept;
3. the query is expanded using the keywords and concepts;
4. the query is submitted to the Google search engine.

A more detailed description of concept matching and expansion can be found in Appendix A.

2.4.1 WordNet Similarity Measurement

In order to be able to compare two words or sentences, a semantic similarity measurement was needed. The WordNet similarity measurement was used [11].

2.4.2 Cyc Concepts Matching

Cyc [12] is a very large Knowledge Base (KB) which tries to gather a formalized representation of a vast quantity of fundamental human knowledge: facts, rules, etc. The Cyc KB contains thousands of different assertions and can be used in many different ways. For the system described in this paper, a particular subset of Cyc has been used: concepts and different relations.

In Cyc, every concept is linked to additional knowledge, such as:

• a display name (readable name of the concept)
• a comment (short description)
• general concepts, for example: HumanAdult is linked to AdultAnimal and HomoSapiens
• specific concepts, for example: HumanAdult is linked to Professional-Adult, SoftwareEngineer, ...
• aliases, for example: for HumanAdult there are "adult homo sapiens", "grownup", ...

Unfortunately, the whole Cyc KB was unavailable to the public when our research was done, therefore the CycFoundation.org [6] REST APIs have been used in order to interact with Cyc. More about the actual REST API implementation, as well as the explanation of the different algorithms and specific results, can be found in Appendix A.

Since Cyc contains semantic knowledge about concepts, not words, a word-to-concept matching algorithm was created. The algorithm is built in such a way that it operates only at the semantic level, therefore it uses the WordNet similarity measurement [11] (also described in section 2.4.1) to compute a similarity score between two sentences. The algorithm takes a keyword and its corresponding sentence; then, using the CycFoundation APIs, the set of relevant Cyc concepts (aka constants) is retrieved. The retrieval returns only concepts containing the keyword in their name; for example, for the keyword "human", the set of concepts would contain "HumanAdult", "HumanActivity", "HumanBody". Next, for each item in the set of concepts, the similarity score with the keyword's context (sentence) is computed and the best one is used.

The following straightforward pseudo-code illustrates the approach (more about the matching types in A.3):

    Input:  Keyword, its context (sentence) and a matching type
    Output: Cyc concept matched (BestConstant)

    Constants    ⇐ GetConstants(Keyword)
    BestConstant ⇐ new CycConstant()
    BestDistance ⇐ ∞
    foreach Constant in Constants do
        Distance ⇐ ∞
        if MatchingType = DisplayNameMatching then
            dK ⇐ GetDistance(Keyword, Constant.DisplayName)
            dC ⇐ GetDistance(Context, Constant.DisplayName)
            Distance ⇐ (dK + dC) / 2
        end
        if MatchingType = CommentMatching then
            dCK      ⇐ 0
            Comment  ⇐ GetComment(Constant)
            Keywords ⇐ GetKeywords(Comment)
            foreach CK in Keywords do
                dCK ⇐ dCK + GetDistance(Keyword, CK)
            end
            dK ⇐ GetDistance(Keyword, Constant.DisplayName)
            dC ⇐ GetDistance(Context, Constant.DisplayName)
            Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
        end
        if Distance < BestDistance then
            BestDistance ⇐ Distance
            BestConstant ⇐ Constant
        end
    end
2.4.3 Query Expansion

After the keywords and their respective contexts are matched to particular Cyc concepts, the actual query expansion can be done. One can think of many different possible query expansions using the additional Cyc knowledge. In our research we have chosen several expansion methods, most of them quite straightforward. After some experiments, one particular structured query expansion was chosen to be used in the system: Or-Concept.

The algorithm constructs a structured query, which can be described by the following formula, where K is the set of keywords, k_i ∈ K, and c(k) is a function which gets the matched concept for a keyword (using a matching algorithm):

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, c(k_i))    (1)

The following pseudo-code illustrates the approach used in the system to construct the structured query:

    Input:  Keywords tagged with Cyc concepts
    Output: Set of expanded keywords for Google

    foreach TaggedWord in Query do
        Pattern      ⇐ "( {0} OR {1} ) + "
        PatternEmpty ⇐ "{0} + "
        Keyword      ⇐ TaggedWord.Word
        BestGeneral  ⇐ ""
        BestDistance ⇐ ∞
        if TaggedWord.Concept ≠ null then
            GeneralConcepts ⇐ GetGenerals(TaggedWord.Concept)
            foreach General in GeneralConcepts do
                Distance ⇐ GetDistance(TaggedWord.Word, General.DisplayName)
                if Distance < BestDistance then
                    BestGeneral  ⇐ General.DisplayName
                    BestDistance ⇐ Distance
                end
            end
        end
        if BestGeneral.Length = 0 then
            Pattern ⇐ PatternEmpty
        end
        Expanded.Add(String.Format(Pattern, Keyword, BestGeneral))
    end

2.4.4 Query for Google Search

In order to retrieve the real data, we need to choose a search engine as well as a corpus. For several reasons, we quickly arrived at the decision to use the Google web search for our application. First of all, Google is the most popular search engine on the web. This is a great advantage, because an application programming interface (API) is required in order to make the interaction between our program and the search engine possible, and the bigger and more popular a search engine is, the more likely it is that a good API can be found for it. This turned out to be only partially true: Google provides a lot of programming interfaces, but most of them are specialized for a specific context. So it is not easy to find the best-fitting programming interface for an application, in addition to the "usual" problems one may encounter during the implementation.

Another reason to choose the Google web search is the fact that it can easily be restricted to one or a set of web pages. This can be done by simply adding the operator "site:", followed by the desired web page, to the keywords one wants to search for. As a result, the search only returns results from the indicated pages. As we do not want to find arbitrary results, but results from specific, serious pages, we really want to use this feature of the Google web search. If one does a web search (with the Google engine or elsewhere), a lot of results are found on very different pages and types of pages. This is of course important, since that is what most people want when they search the web. But with the information retrieval system, the aim is different: we do not want to display content from any website that contains the keywords, but only from websites with trustworthy content. So a white list of the sites that will be searched is the best way of avoiding useless results. Also, because the results of a web search contain lots of HTML tags and similar things that are (for our purpose) useless, some knowledge about the structure of the found results is needed. By limiting the search to defined web pages, we can use our information about these pages to parse the results, depending on the website they were found on. In a first approach, we limited the search to articles on en.wikipedia.org. Wikipedia is a very well-known and well-accepted page that contains useful information about a lot of domains.
Retrieving data from only one site may seem too limited, but it is enough to set up a working and useful system. In addition, the system can easily be extended by adding other sites. A sketch of the resulting query construction is shown below.
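As an illustration of how the Or-Concept pattern (section 2.4.3) and the site restriction combine into a final query string, consider the following sketch. The term layout mirrors the expanded-query examples in appendix A.5.1; the helper names and the hard-coded white list are assumptions for illustration, not the paper's code.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class QueryBuilder
    {
        // Builds one "( keyword OR concept )" term per keyword (the Or-Concept
        // pattern), falling back to the bare keyword when no concept matched.
        public static string BuildOrConceptQuery(
            IEnumerable<(string Keyword, string Concept)> taggedWords,
            string site) // e.g. "en.wikipedia.org", the white-listed source
        {
            var terms = taggedWords.Select(t =>
                string.IsNullOrEmpty(t.Concept)
                    ? t.Keyword
                    : $"( {t.Keyword} OR {t.Concept} )");

            return string.Join(" + ", terms) + " + site:" + site;
        }

        static void Main()
        {
            var tagged = new[]
            {
                ("species",   "species immunity"),
                ("evolution", "biological evolution"),
                ("change",    "") // no concept matched: keep the bare keyword
            };

            // Prints: ( species OR species immunity ) + ( evolution OR
            // biological evolution ) + change + site:en.wikipedia.org
            Console.WriteLine(BuildOrConceptQuery(tagged, "en.wikipedia.org"));
        }
    }

Keeping the white-listed site a parameter makes it easy to extend the system with further trusted sites, as suggested above.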
2.5 Results Filtering

2.5.1 Retrieving Unstructured Web Pages

The search on the web engine returns a list of URLs as its result. Each of the pages can be downloaded as text, but what is downloaded is an HTML-formatted web page. There is some work left to do to obtain only the real content.

First, we want to get rid of the HTML tags and any other formatting data. This task has to be performed by applications of very different types; for example, every web browser needs functionality that distinguishes between content that will be displayed and content that will not. It is usually best to use existing, tested and working packages, which can be found in web browsers. The Microsoft Internet Explorer performs this task using a dynamic link library (DLL) called "mshtml". This library can easily be used to parse a complex HTML file into plain text that contains only the parts that would be displayed when one visits the web page with the Internet Explorer (or another browser). This does the main part of this step.

Many web pages do not only contain the "real" information, but also a lot of other things, such as navigation links, menus, advertisements, and so on. These are displayed by the browser and thus seen by the user. If the user of the information retrieval system chooses to display one of these pages, he may want to see this content. But because the texts from the results are summarized in the next step, we have to remove these elements from the text: they do not carry any information that is relevant to the page itself. The format of these menus etc. depends on the web page where the result was found; results on Wikipedia are formatted differently from results on other lexicographical web pages. Thus, it is useful to handle the results depending on the web page they were found on. As the web search in this approach has been limited to only one site, namely en.wikipedia.org, we have to consider only one page format in order to remove "waste" like links and menus.

An analysis of the structure revealed an easy way of parsing the Wikipedia articles. All articles on Wikipedia start with the article itself - besides some HTML tags and a lot of scripts. The menus and navigation links follow after it. That means that we can simply keep the first part of the page and skip everything that follows. The boundary between article and links is clearly marked by the HTML tag '<div class="printfooter">', which is contained in every article on Wikipedia. So we can simply search the results from Wikipedia for this tag and remove everything that follows it. It should be mentioned that this cut targets the input of the tag-stripping done by the Microsoft dynamic link library and has to happen before that stripping step, because otherwise the HTML tag used as the boundary could no longer be found. A simplified sketch of this cleanup is shown below.
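A simplified version of this cleanup might look as follows. The printfooter cut matches the boundary described above; the regex-based tag stripping is only a rough stand-in for the mshtml library that the system actually uses.

    using System;
    using System.Text.RegularExpressions;

    class WikipediaCleaner
    {
        // Everything after this marker is menus, navigation and footer material.
        const string Footer = "<div class=\"printfooter\">";

        public static string ExtractArticleText(string html)
        {
            // Step 1: cut at the printfooter boundary while the markup is
            // still intact (this must happen before the tags are stripped).
            int cut = html.IndexOf(Footer, StringComparison.OrdinalIgnoreCase);
            if (cut >= 0)
                html = html.Substring(0, cut);

            // Step 2: remove scripts/styles, then strip the remaining tags.
            // The real system delegates this to the mshtml DLL; a regex is
            // only a rough stand-in for illustration.
            html = Regex.Replace(html, @"<(script|style)[\s\S]*?</\1>", " ",
                                 RegexOptions.IgnoreCase);
            html = Regex.Replace(html, @"<[^>]+>", " ");

            // Collapse the leftover whitespace into readable plain text.
            return Regex.Replace(html, @"\s+", " ").Trim();
        }
    }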
2.5.2 Summarizing the Web Content

A document contains important, but also a lot of irrelevant information. Additionally, the screen space to present the information is intentionally limited to allow a quick evaluation by the user. Because of this, the documents are summarized in order to present only the important information to the user. For this we use the Open Text Summarizer [1]. This tool does not abstract the document in a natural way, because it does not rephrase the text: it just produces a condensed version of the original by keeping only the important sentences. However, it shows good results for non-fictional text and can be used with unformatted and HTML-formatted text. It has received favourable mention in several academic publications, and it is at least as good as commercial tools such as Copernic and Subject Search Summarizer.

The Open Text Summarizer parses the text and applies Porter stemming; an XML file with the parsing and stemming rules is used for this. It calculates the term frequency for each word and stores it in a list. After this, a stop word filter is applied, which removes all redundant common words from the list using a stop word dictionary that is also stored in the XML file. After sorting the list by frequency, the keywords are determined: the most frequently occurring words. The original sentences are then scored based on these keywords; a sentence that holds many important words (the keywords) is given a high grade. The result is a text with only the highest-scored sentences. The limiting factor can be set to either a percentage of the original text or a number of sentences. The sketch below illustrates this frequency-based scoring.
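The following sketch condenses the scoring scheme just described into code: term frequencies are counted, stop words removed, sentences scored by the frequency of the words they contain, and the top fraction kept. It is a toy version for illustration; OTS itself loads its stop word dictionary and stemming rules from an XML file and stems before counting.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class FrequencySummarizer
    {
        // A toy stop word list; OTS loads a full dictionary from its XML rules.
        static readonly HashSet<string> StopWords =
            new HashSet<string> { "the", "a", "of", "and", "to", "in", "is" };

        // Keeps the top fraction of sentences by keyword-frequency score.
        public static string Summarize(string text, double keepRatio)
        {
            string[] sentences = text.Split('.', '!', '?')
                                     .Where(s => s.Trim().Length > 0)
                                     .ToArray();

            // Term frequencies over non-stop words (stemming omitted here).
            var freq = new Dictionary<string, int>();
            foreach (string w in Tokens(text))
            {
                freq.TryGetValue(w, out int n);
                freq[w] = n + 1;
            }

            // Score each sentence by the summed frequency of its words,
            // then keep the highest-scored ones in their original order.
            var kept = sentences
                .Select((s, i) => (Index: i, Text: s,
                                   Score: Tokens(s).Sum(w =>
                                       freq.TryGetValue(w, out int n) ? n : 0)))
                .OrderByDescending(x => x.Score)
                .Take(Math.Max(1, (int)(sentences.Length * keepRatio)))
                .OrderBy(x => x.Index)
                .Select(x => x.Text.Trim());

            return string.Join(". ", kept) + ".";
        }

        static IEnumerable<string> Tokens(string s) =>
            s.ToLower().Split(' ', ',', ';', ':', '.', '!', '?', '\n', '\t')
             .Where(w => w.Length > 0 && !StopWords.Contains(w));
    }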
2.6 Visualization

Visualization is probably one of the most important parts of the system. It consists of two main parts:

1. The "presenter view": the view that the presenter sees on his computer during the presentation. The presenter should be able to see the dataflow, manipulate it, and give simple commands such as "show this image" or "show that article". The presenter also needs a preview of what is shown to the public.

2. The "public view": the simplified view that shows only the needed information and mirrors the lecturer's interactions with the presenter view's pre-rendering zone.

Figure 2: Presenter's view layout structure

For the graphical user interface (GUI), several crucial specifications were defined, most importantly:

• rendering and screen mirroring should be done in real time;
• the layout of the GUI should be as simple as possible, presenting the retrieved data in the simplest and fastest way for the presenter;
• it should also be interactive: the images and articles should be manipulable in real time.

Figure 3: Real-time parallelization workflow

In order to achieve a smooth experience, several things must be done at the same time. As figure 3 shows, the two main components - rendering and information extraction - are completely separated and run in several threads. Moreover, the speech recognition part and the Google search are also separated, and all searches are performed asynchronously. The system therefore leverages current multi-core architectures to achieve a significant speedup (a sketch of the background/UI thread hand-off is given at the end of this section).

In order to present the data in the simplest way, several layout prototypes have been tested. Finally, the one shown in figure 2 and in screenshots 4 and 6 proved to be a good way of presenting the information. The system supports rendering to several screens, and the actual mirroring and the development of the user interfaces have been achieved using the Microsoft Windows Presentation Foundation (WPF) technology.

WPF is especially designed for rich user interface development and is built on top of the .NET Framework. It uses a markup language known as XAML to provide a clear separation between the code and the design definition, which also greatly helped in the parallelization process. It also features 3D and video/audio rendering capabilities and animation storyboards (similar to Flash). Our system can therefore be extended to perform video search or to do 3D-enhanced presentations. The rendering engine takes advantage of modern hardware (dedicated GPUs, ...).

Figure 4: Screenshot of the GUI; the rendering view is split automatically between several images and an article

Figure 5: Screenshot of the GUI, the public view

One of the features of our system is the dynamic layout management of the render view. As shown in figures 4 and 6, the rendering zone automatically adjusts itself depending on the quantity of content presented. With these capabilities, the system makes use of the whole screen without leaving much empty space.

Figure 6: Screenshot of the GUI; the rendering view shows only images, since no articles have been selected by the lecturer
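The following is a minimal sketch of the background/UI hand-off mentioned above, assuming a search callback and a render callback: the retrieval work runs on a worker thread while results are marshalled to the WPF dispatcher thread for rendering. Task.Run stands in for the thread-pool scheduling a 2009 implementation would have used; the class and parameter names are illustrative.

    using System;
    using System.Threading.Tasks;
    using System.Windows.Threading;

    class SearchCoordinator
    {
        readonly Dispatcher uiDispatcher; // WPF UI thread dispatcher

        public SearchCoordinator(Dispatcher uiDispatcher)
        {
            this.uiDispatcher = uiDispatcher;
        }

        // Runs the (slow) retrieval pipeline on a worker thread and marshals
        // each result back to the UI thread, so rendering never blocks.
        public void SearchAsync(string query, Func<string, string[]> search,
                                Action<string> render)
        {
            Task.Run(() =>
            {
                foreach (string result in search(query)) // network-bound work
                {
                    // Only the dispatcher thread may touch WPF controls.
                    uiDispatcher.BeginInvoke((Action)(() => render(result)));
                }
            });
        }
    }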
3 Tests and Results

3.1 Speech Recognition

In the context of this project, it is important that the speech recognition engine has a high recognition quality and works fast. For this reason, these two aspects have been evaluated.

After a two-hour training phase, the engine recognized about 70% of the spoken words correctly. The recognition rate can be improved by further training. To achieve a good recognition rate, it is necessary to speak loudly and articulately and to pronounce the words always in the same way. Additionally, punctuation marks are not inserted automatically but have to be spoken explicitly. This, however, is not a natural way to speak at a presentation.

After speaking some sentences, a break of a few seconds has to be made to initiate the recognition process for these sentences. The recognition process also needs a few seconds, depending on the number of words. The tests showed a recognition rate of about four words per second. Again, these frequent breaks of a few seconds are not a natural way to speak at a presentation.
3.2 Keyword Extraction

The keyword extraction module has been tested with different feature sets and numbers of words. The computing time and the quality of the extracted keywords have been evaluated; the results are shown in figures 7 and 8. The tests have been made with the complete feature set (explained in chapter 2.3) and with the stemmer disabled. Due to the limitations of the speech recognition engine (no punctuation marks unless pronounced explicitly), the keyword extraction has also been evaluated on texts without punctuation marks.

The results show that the quality of the keyword extraction without the stemmer is generally lower than with the complete feature set. Especially when the keywords are used in the query for the internet search, the stemmer shows advantages: it avoids, for example, searching for both the singular and the plural of the same keyword, and the quality of the relevance ranking of the keywords is also improved. The results further show that the gain in computation time from disregarding the stemmer is minimal: with 1200 words, the gain is only 120 ms. For the relatively short sentences usually provided by the speech recognition engine, the gain is even smaller. The tests also showed that for text without punctuation marks, the quality is nearly the same as for the same text with punctuation marks.

Figure 7: time results for an extraction of ten keywords (with different numbers of words)

Figure 8: quality of results for an extraction of ten keywords (with different numbers of words)

3.3 Query Construction and Expansion

As figure 9 shows, the Or-Concept query expansion algorithm yields a significant improvement in precision. In that figure, a 10-word query is compared to its expanded counterpart, and at N = 5 one can see an improvement of almost a factor of 3. Cyc-enhanced queries give better results, but should be evaluated further (more about the different experiments and query enhancements in Appendix A). For reference, the precision-at-N measure used in these comparisons is defined below.
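Since the paper does not define it explicitly, precision at N is the standard information retrieval measure

    P@N = \frac{\left|\{\text{relevant documents among the top } N \text{ results}\}\right|}{N}

so the reported factor-of-3 improvement at N = 5 means roughly three times as many of the first five results were relevant after expansion.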
3.4 Results Filtering

The summarizer module has been tested with and without the stemmer, for different summarizing percentages of the original unformatted text. The quality of the results represents their usability, which has been evaluated from the meaningfulness of the summary and its length; a shorter length was considered more usable because it allows a quicker evaluation by the user. Figure 10 shows that, in these tests, the best-quality results with the stemmer were achieved for a 15% summary. Without the stemmer, the quality was usually a little lower. The calculation time with and without the stemmer for different summarizer percentages is shown in figure 11; the difference between the configurations is minimal. When comparing unformatted text and HTML-formatted text (with a similar amount of content), the tests show that the HTML-formatted text can take significantly longer to process.

Figure 9: Precision at N, 10 word query with no keyword stemming

Figure 10: quality of results for different summary percentages for a text with 1200 content words

Figure 11: Calculation times for different summary percentages for a text with 1200 content words

4 Discussion

As explained in the preceding chapters, the system works in its main parts. It is clearly possible to enrich web search, as well as the presentation of contents, with different tools related to text mining, information retrieval, and others. But many problems also turned up that had either to be solved or to be worked around.
The speech recognition module did work, but still had some problems, which lie in the implementation of the software used. The speech recognition itself works fine, but the chosen engine has problems e.g. with recognizing accents, breaks and the like. These often result in an incorrectly recognized word or sentence. So either another, better engine has to be found, or another solution has to be devised in order to meet the requirements.

The keyword extraction from the spoken text works well and very efficiently. Still, this step can be enhanced. For example, the user may want to influence the chosen keywords more directly than he does by speaking. So it would be a useful feature if the user could decide manually which keywords are good and important and which are not relevant. The keywords provided by the system could then be viewed as suggestions that the user can accept or decline. Another nice feature, related to the first one, is a self-learning algorithm that tries to predict the user's decision.

Another part that works fine but can still be improved is the summarization of the found texts. It produces good summaries of the found web pages without consuming too much computation time. But the result could perhaps be enhanced by using features such as coreference resolution, leading to a better estimation of which referents are important and which are not. Before putting a big effort into this, however, it should be checked whether it is worth the effort.

5 Conclusions

In this paper, the design and implementation of an Interactive Web Retrieval and Visualization System have been discussed. The system has proven robust and satisfies the real-time constraints of a live presentation. Tests and results have shown that both keyword extraction and query expansion return satisfying results while keeping good precision. The experiments have also shown that several parts of such a system still need improvement: speech recognition needs to be enhanced and properly pre-trained, unstructured web data should be properly cleared of all noise, and summarization can be improved to create more concise articles. Overall, such systems can be built with modern hardware and parallelization, and, hopefully, more of them will be seen as commercial products in the near future.
References

[1] Open Text Summarizer. http://libots.sourceforge.net.

[2] Porter stemmer. http://tartarus.org/martin/PorterStemmer/.

[3] SharpNLP - open source natural language processing tools. http://www.codeplex.com/sharpnlp.

[4] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 60. Cambridge University Press, 2006.

[5] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 318. Cambridge University Press, 2006.

[6] Cyc Foundation. http://www.cycfoundation.org/.

[7] W.N. Francis and H. Kucera. Brown corpus manual.

[8] M. Harrington. Giving computers a voice. http://blogs.msdn.com/coding4fun/archive/2006/10/31/909044.aspx.

[9] J. Moskowitz. Speech recognition with Windows XP. http://www.microsoft.com/windowsxp/using/setup/expert/moskowitz_02september23.mspx.

[10] J. Moskowitz. Windows speech recognition. http://www.microsoft.com/windows/windows-vista/features/speech-recognition.aspx.

[11] T. Ngoc Dao and T. Simpson. Measuring similarity between sentences. CodeProject article.

[12] OpenCyc Project. http://www.cyc.com/opencyc.

[13] J. C. Scholtes. Text mining: preprocessing techniques, part 1.

[14] Princeton University. WordNet, a lexical database for the English language. http://wordnet.princeton.edu/.

A Appendix: Query Expansion using Cyc

A.1 WordNet Similarity Measurement

In order to be able to compare two words or sentences, a semantic similarity measurement was needed; for this, the WordNet similarity measurement was used. The following steps are performed to compute the semantic similarity between two sentences [11]:

• each sentence is partitioned into a list of tokens and the stop words are removed;
• the words are stemmed;
• part-of-speech tagging is performed;
• the most appropriate sense for every word in a sentence is found (word sense disambiguation). To find the most appropriate sense of a word, the original Lesk algorithm was used and expanded with the hypernym, hyponym, meronym and troponym relations from WordNet. The possible senses are scored with a new scoring mechanism based on Zipf's law, and the sense with the highest score is chosen;
• the similarity of the sentences is computed based on the similarity of the pairs of words. For this, a semantic similarity relative matrix is created, consisting of pairs of word senses and holding the semantic similarity between the most appropriate senses of the words. The Hungarian method is used to get the semantic similarity between the sentences; its match results are combined into a single similarity value for the two sentences. The matching average is used to compute the semantic similarity between two word senses. This similarity is computed by dividing the sum of the similarity values of all match candidates of both sentences by the total number of tokens.

A.2 CycFoundation.org REST API

As mentioned in section 2.4.2, the system uses REST APIs to access the CycFoundation.org Cyc KB. REST stands for Representational State Transfer, a way to build a service-oriented architecture based on HTTP and XML, generally using the GET or POST methods of the HTTP protocol.

The CycFoundation web services expose only a subset of Cyc's capabilities, therefore the API implemented in our system is rather small. It contains the following queries:

• GetConstants(Keyword) - performs a search for Cyc concepts for a keyword
• GetComment(Concept) - returns the comment for a particular concept
• GetCanonicalPrettyString(Concept) - returns a simplified name for a concept
• GetDenotation(Concept) - returns the denotation for a particular concept
• GetGenerals(Concept) - returns the set of general concepts for a particular concept
• GetSpecifics(Concept) - returns the set of specific concepts for a particular concept
• GetInstances(Concept) - returns the set of instances (concepts) for a particular concept
• GetIsA(Concept) - returns the set of "is a" concepts for a particular concept
• GetAliases(Concept) - returns the set of aliases (words or phrases) for a particular concept
• GetSiblings(Concept) - returns the set of sibling concepts for a particular concept

A sketch of such a wrapper is shown below.
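A thin client for these calls could look like the following sketch. The base URL, query-string layout and XML tag names are placeholders (the real CycFoundation.org endpoints are not reproduced here); only the general GET-then-parse-XML shape of the wrapper is meant to be illustrative.

    using System;
    using System.Net;
    using System.Xml;

    // Minimal REST client sketch. BaseUrl and the query-string layout are
    // hypothetical stand-ins for the actual CycFoundation.org endpoints.
    class CycClient
    {
        const string BaseUrl = "http://example.org/cycfoundation/api"; // hypothetical

        // Issues a GET request and returns the text content of the matching
        // XML elements, e.g. GetValues("getConstants", "keyword", "human").
        public static string[] GetValues(string method, string param, string value)
        {
            string url = $"{BaseUrl}/{method}?{param}={Uri.EscapeDataString(value)}";

            using (var client = new WebClient())
            {
                var doc = new XmlDocument();
                doc.LoadXml(client.DownloadString(url)); // HTTP GET, XML body

                var nodes = doc.GetElementsByTagName("result"); // assumed tag name
                var results = new string[nodes.Count];
                for (int i = 0; i < nodes.Count; i++)
                    results[i] = nodes[i].InnerText;
                return results;
            }
        }
    }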
A.3 Cyc Concepts Matching

The general algorithm described in section 2.4.2 was derived from two algorithms: display name matching and comment matching. These algorithms have been tested separately.

A.3.1 DisplayName Matching Algorithm

The display name used in Cyc is the shortest description of a concept; for example, the display name for the "HumanAdult" concept is simply "a human adult". The matching algorithm therefore computes the distance between the keyword, its context and this description, as follows:

    Input:  Keyword, its context (sentence)
    Output: Cyc concept matched (BestConstant)

    Constants    ⇐ GetConstants(Keyword)
    BestConstant ⇐ new CycConstant()
    BestDistance ⇐ ∞
    foreach Constant in Constants do
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC) / 2
        if Distance < BestDistance then
            BestDistance ⇐ Distance
            BestConstant ⇐ Constant
        end
    end

This algorithm has proven to be quite efficient, since less computation needs to be done, but on the other hand it tends to give rather poor results and would need to be enhanced for proper use.

A.3.2 Comment Matching Algorithm

The comment matching algorithm uses an additional piece of knowledge about a Cyc concept, called the comment. It is a long description which tends to be as specific as possible. For example, again for the "HumanAdult" concept, the comment is:

    "A specialization of Person, and an instance of HumanTypeByLifeStageType. Each instance of this collection is a person old enough to participate as an independent, mature member of society. In most modern Western contexts it is assumed that anyone over 18 is an adult. However, in many cultures, adulthood occurs when one reaches puberty. Adulthood is contiguousAfter (q.v.) childhood. Notable specializations of this collection include AdultMaleHuman, AdultFemaleHuman, MiddleAgedHuman and OldHuman."

The pseudo-code implementation of this algorithm, where the GetComment method is an actual REST API call, is as follows:
    Input:  Keyword, its context (sentence)
    Output: Cyc concept matched (BestConstant)

    Constants    ⇐ GetConstants(Keyword)
    BestConstant ⇐ new CycConstant()
    BestDistance ⇐ ∞
    foreach Constant in Constants do
        dCK      ⇐ 0
        Comment  ⇐ GetComment(Constant)
        Keywords ⇐ GetKeywords(Comment)
        foreach CK in Keywords do
            dCK ⇐ dCK + GetDistance(Keyword, CK)
        end
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
        if Distance < BestDistance then
            BestDistance ⇐ Distance
            BestConstant ⇐ Constant
        end
    end

A.3.3 Experiments with both algorithms

After running several experiments (Fig. 13) comparing the comment matching and display name matching algorithms, we found that comment matching gives significantly better results, but it can still be enhanced to get even better matching. This may be done by using general/specific concepts in the distance calculation formula, or by learning weights.

A.4 Query Expansion Patterns

In our research, several different query expansion algorithms have been implemented, all with a quite straightforward implementation, as explained in section 2.4.3. The algorithms, which can be described by the following formulas, where K is the set of keywords and k_i ∈ K, have been implemented and evaluated.

Or-Concept expansion, where c(k) is the concept linked with the keyword:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, c(k_i))    (2)

Or-Aliases expansion, where a_j(k) are the aliases linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, \bigvee_{j=1}^{m} a_j(k_i))    (3)

Or-Most-Relevant-Alias expansion, where a(k) is the most relevant alias linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, a(k_i))    (4)

Or-Most-Relevant-General expansion, where g(k) is the most relevant general concept linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, g(k_i))    (5)

Or-Most-Relevant-Specific expansion, where s(k) is the most relevant specific concept linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, s(k_i))    (6)

Or-Is-A-Concept expansion, where isa(k) is the most relevant "is a" concept linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, isa(k_i))    (7)
Or-Most-Relevant-AGS expansion, combining the most relevant alias a(k), the most relevant general concept g(k) and the most relevant specific concept s(k) linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, a(k_i), g(k_i), s(k_i))    (8)

A.5 Results and Discussion

A.5.1 Google Search on Wikipedia KB

Our system has been tested and is able to generate queries for the Google search engine; however, it is very difficult to evaluate such results, therefore this section of the paper only gives an example of the results. Using as input an introductory text from the Stanford Encyclopedia on the topic of 'evolution', the system derived several keywords:

• "species", matched to "speciesImmunity"
• "theory", matched to "theoryOfBeliefSystem"
• "evolution", matched to "Evolution"
• "change", matched to "changesSlot"
• "term", matched to "termExternalIDString"

Next, several queries have been constructed and restricted to the Wikipedia KB:

• Google non-expanded query: "species theory evolution change term site:en.wikipedia.org"

• Google Or-Concepts query: "( species OR species immunity ) + ( theory OR Theory Of Belief System ) + ( evolution OR biological evolution ) + ( change OR Changes Slot ) + ( term OR Term External ID String ) + site:en.wikipedia.org"

• Google Or-Aliases query: "species + theory + ( evolution OR (biologically will have evolved OR biologically had evolved OR biologically will evolve OR biologically has evolved OR biologically have evolved OR biologically evolving OR biologically evolves OR biologically evolved OR biologically evolve OR most evolutionary OR more evolutionary OR evolutionary OR evolution) ) + change + term + site:en.wikipedia.org"

• Google Or-Most-Relevant-General query: "species + theory + ( evolution OR "development" ) + change + term + site:en.wikipedia.org"

• ...

Figure 12: Precision at N, results of Or-Concept expanded query with and without stemming compared

After analyzing the results, we compared several of the result sets returned by Google. For the normal, non-expanded query the results are quite satisfying:

* Evolution
* Punctuated equilibrium
* Evolution as theory and fact
* Macroevolution
* History of evolutionary thought
* On the Origin of Species
* Species
* Hopeful Monster

However, the source article mentioned Charles Darwin, and his name was not derived among the keywords, so no direct Wikipedia link to his theory of natural selection was found in the non-expanded results. After a deeper analysis, it was actually found with the expanded query (Or-Most-Relevant-General):

* Evolution
* Punctuated equilibrium
* Macroevolution
* Evolution as theory and fact
* Charles Darwin
* On the Origin of Species
* Hopeful Monster
* Natural selection

Figure 13: Precision at N, results of Or-Concept expanded query using different matching

A.5.2 Lemur Search in AP Corpus

In order to evaluate the query expansion methods, a structured query generation for the Lemur search engine has been implemented as well. The queries have been automatically derived from the narratives of topics 101 to 115 of the AP Corpus, and a batch evaluation of those queries has been performed. Several different sets of results have been produced by the IREval function of Lemur, among them:

• 5-10 word queries with no keyword stemming;
• 5-10 word queries with keyword stemming;
• 5-10 word queries using name matching.

Most of the experiments have been performed with the comment matching algorithm and the whole set of query expansion patterns. Some experiments have been performed in order to compare the display name and comment matching algorithms.

The keyword stemming part was especially tested in order to suppress plurals, since the keywords derived from the AP Corpus topics usually kept them. So, for example, "changes" was stemmed to "change" and therefore gave a larger set of possible concepts to match. During our experiments (figure 12) we found that using keyword stemming in those algorithms significantly decreases the precision for larger queries. One possible explanation is that the matching algorithms make more errors, since the set of possible concepts is larger.

Figure 14: Precision at N, expansions of a 5 word query compared (part 1)
Figure 15: Precision at N, expansions of a 5 word query compared (part 2)

In figures 14 and 15 one can see that only the Or-Concept algorithm gives better precision at N = 5 and above. The Or-Most-Relevant-Alias and Or-Most-Relevant-General patterns give a precision improvement only at N > 20.

Figure 16: Precision at N, expansions of a 10 word query compared (part 1)

Figure 17: Precision at N, expansions of a 10 word query compared (part 2)

Finally, it is in figures 16 and 17 that one can really see the improvement made by stemming and especially by the Or-Concept query expansion. While the 5-word query expansion (at N = 5) with this particular algorithm gives an improvement of almost 50 percent, with the 10-word query the improvement is almost of the order of 200 percent (a significant increase in precision).

The results speak for themselves; we think that Cyc-enabled query expansion is definitely the way to go, but better patterns still need to be built. In our experiments, out of 7 different patterns only one actually produced a good precision increase: the Or-Concept pattern.