Developing an Interactive Web Information Retrieval and
Visualization System
Research Project
Master Artificial Intelligence
Department of Knowledge Engineering
R. Atachiants J. Meyer J. Steinhauer T. Stocksmeier
June 24, 2009
Abstract
Finding the needed information (images, articles or other) is not always as simple as going to a search
engine. This paper aims to develop an interactive presentation system able to cope with live presentation
challenges: speech recognition, information retrieval, filtering unstructured data and its visualization.
Speech recognition is achieved using Microsoft SAPI; information is retrieved using various text mining
techniques in order to derive the most relevant keywords, and then a search using expanded Google
queries is performed. The information is then filtered: HTML tags are cleared and text summarization
is performed in order to extract the most relevant information. Such a system has been tested and per-
forms satisfactorily, and such interactive systems can be built with modern hardware and faster internet
connections, but several challenges still need to be faced and some improvement is required in order to
create a smooth presentation experience. This paper also presents a novel approach to query expansion
using Cyc concepts, based on the CycFoundation.org knowledge base; it presents various query expansion
patterns such as Or-Concept and Or-Most-Relevant-Alias and their respective results. It also presents two
word matching algorithms which try to match a word to the most relevant concept, using the WordNet
similarity measurement to accomplish this goal at the semantic level. It shows that the Or-Concept pattern
improved query precision by around a factor of 3.
Contents
1 Introduction
2 Web Retrieval
  2.1 Architectural Overview
  2.2 Speech Recognition
  2.3 Keyword Extraction
  2.4 Query Construction
    2.4.1 WordNet Similarity Measurement
    2.4.2 Cyc Concepts Matching
    2.4.3 Query Expansion
    2.4.4 Query for Google Search
  2.5 Results Filtering
    2.5.1 Retrieving Unstructured Web Pages
    2.5.2 Summarizing the Web Content
  2.6 Visualization
3 Tests and Results
  3.1 Speech Recognition
  3.2 Keyword Extraction
  3.3 Query Construction and Expansion
  3.4 Results Filtering
4 Discussion
5 Conclusions
A Appendix: Query Expansion using Cyc
  A.1 WordNet Similarity Measurement
  A.2 CycFoundation.org REST API
  A.3 Cyc Concepts Matching
    A.3.1 DisplayName Matching Algorithm
    A.3.2 Comment Matching Algorithm
    A.3.3 Experiments with both algorithms
  A.4 Query Expansion Patterns
  A.5 Results and Discussion
    A.5.1 Google Search on Wikipedia KB
    A.5.2 Lemur Search in AP Corpus
1 Introduction
Why is there a need for an Interactive Web Infor-
mation Retrieval and Visualization System?
A person who is teaching at school, at university
or somewhere else is often required to present vari-
ous types of information (his own knowledge, texts
written by others, pictures, diagrams etc.) more or
less at once. In order to do this, he will collect his
data beforehand, arrange it and set up a presenta-
tion, using an ordinary presentation tool, such as
Microsoft PowerPoint. During his lecture, he will
start the presentation, showing one slide after an-
other and explaining what can be seen on it.
This is, compared to what lessons were like some
decades ago, of course a great increase in possibil-
ities of how information can be presented and how
the listeners’ attention can be kept. But still, there
are a lot of problems with this approach. First, all
the work has to be done beforehand, resulting in
a very long time one has to spend on the presen-
tation before he can show something useful. An-
other problem is that the prepared presentation is
very static. Presentations about topics that can
change over time will have to
be updated again and again with new information
and media from the web. Further, during the pre-
sentation one will learn that the audience’s knowl-
edge about some parts of the covered subjects is
different from what one had thought before. If the
listeners know more than expected, that won’t be
a real problem - some slides can be skipped. But
what if they know less than expected? It is impos-
sible to make new slides during the presentation, so
either the problem would have to be ignored or one
would have to search for the missing information during
the presentation.
Another hardship with the usual presentations is
the research itself. Finding information is nowa-
days very easy. One just goes to Google or another
search engine, enters what he is looking for and
looks at the first results where he finds what he is
looking for. But it is not always so easy, because
one doesn’t always know what he is searching for
and where he can find it. In addition, one often
receives a lot of results and the precision is too low
to find the relevant information very fast.
In our project, we want to address all these prob-
lems and difficulties. Our aim is to develop a sys-
tem that both makes the research easier and more
efficient and also increases the quality and change-
ability of the presentation.
To enhance the research itself, the program
should increase the precision of the search by
searching for concepts instead of keywords. If done
properly, the results of the search are all relevant
for the topic and can be used in some way. This
can reduce the time spent on evaluating each of
the search results. In addition, the texts on the re-
sulting pages can be summarized, further reducing
the time needed to evaluate the results. Another
possibility is to do the usual web search and the
image search at once, because e.g. tables can be
found in both formats.
To provide the possibility of using the system
also during the presentation itself, the search query
should be constructed not solely from key-
board input, but also directly from spoken text,
in a way that provides a possibility to choose the
currently wanted way of communication. The re-
sults have to be displayed in a way that the impor-
tant parts are visible directly, while other things
are suppressed. The user of the system has to be
able to decide which contents (texts, images etc.)
are shown and which not.
To meet all these requirements, the system has
to consist of the following steps:
1. Recognize the spoken text
2. Retrieve keywords from the text
3. Create a powerful search query
4. Filter and summarize the results
5. Visualize the results in a proper way
These steps will be further described in the next
chapters.
2 Web Retrieval
2.1 Architectural Overview
The spoken words of the presentation are recog-
nized and translated by a speech recognition en-
gine. The output of the speech recognition is fed
to a keyword extractor to create the keywords of
the text. These are used by a query construction
module to produce an internet search query. The picture
results of the search are directly transferred to the
visualisation module. The text results are summa-
rized by a summarizer module and then also dis-
played by the visualisation module. The workflow
is shown in figure 1.
Figure 1: workflow
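To make this dataflow concrete, the following is a minimal C# sketch of how the modules could be wired together; the interface and method names are our illustration, not the actual project code:

    using System;

    // Hypothetical module contracts mirroring figure 1.
    interface ISpeechRecognizer { string Listen(); }
    interface IKeywordExtractor { string[] Extract(string text); }
    interface IQueryBuilder     { string Build(string[] keywords); }
    interface ISummarizer       { string Summarize(string pageText); }

    class Pipeline
    {
        // One pass through the workflow: speech -> keywords -> query ->
        // search -> summarization -> visualization.
        public void RunOnce(ISpeechRecognizer asr, IKeywordExtractor extractor,
                            IQueryBuilder builder, ISummarizer summarizer,
                            Func<string, string[]> search,   // returns page texts
                            Action<string> visualize)
        {
            string spoken = asr.Listen();
            string query = builder.Build(extractor.Extract(spoken));
            foreach (string page in search(query))
                visualize(summarizer.Summarize(page));
        }
    }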
2.2 Speech Recognition
The user of the information retrieval system has
several possibilities of interaction with the system,
but the most important way of communication is
speech. It is used to get a starting point as well
as to define the wanted content more precisely.
Thus, our application needs a well-working speech
recognition engine, to ”understand” as much of the
spoken text as possible. It would be interesting to
write our own speech recognition engine, but the
effort for this would be too high for our project.
So the main task here is to choose a feasible speech
recognition engine and to use it in a proper way.
There are several speech recognition engines avail-
able with different strengths and weaknesses. For
example, the program ”Dragon Naturally Speak-
ing” is an established commercial software product
that works quite well. But due to the fact that it is
commercial, it is very expensive and thus not the
right solution for a research project. Besides the
one mentioned, there are several other commercial
products that include a speech recognition engine.
Another group of engines are open-source freeware
tools, found spread over the web. Unfortunately, the
tools tested turned out not to have the efficiency
and the quality that we would need for the project.
So we decided to use a product of the third group
- speech recognition engines built into operating systems.
These engines are likely to have a high quality
and a good performance, while they don’t have
to be bought for a lot of money, because they are
included in the operating system that everyone
has to buy anyway.
After some tests and some problems with the
usage of some of the mentioned software products,
we chose the Windows Speech Recognition API
(SAPI 5.1) [9], [10] as a part of our project. The
Windows speech module offers different kinds of
functionality, including a dictation mode (beside
others such as voice commands, text to speech and
others). In this dictation mode, the spoken words
are simply put into plain text, which can be used
by other programs via the programming interface
[8].
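As a minimal sketch, the dictation mode can be reached from C# through the managed System.Speech wrapper over SAPI; the engine configuration actually used in the project is not reproduced here:

    using System;
    using System.Speech.Recognition;   // managed wrapper over SAPI

    class DictationDemo
    {
        static void Main()
        {
            using (var engine = new SpeechRecognitionEngine())
            {
                engine.LoadGrammar(new DictationGrammar());   // free-form dictation mode
                engine.SetInputToDefaultAudioDevice();
                // Recognized phrases arrive as plain text, which the keyword
                // extractor can consume directly.
                engine.SpeechRecognized += (s, e) => Console.WriteLine(e.Result.Text);
                engine.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine();   // keep dictating until Enter is pressed
            }
        }
    }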
2.3 Keyword Extraction
For keyword generation from an input text, the
open-source Natural Language Processing tool
SharpNLP is used [3].
SharpNLP is a collection of natural language pro-
cessing tools written in C#. It provides the
following NLP tools [4]:
• a sentence splitter (to identify sentence bound-
aries)
• a tokenizer (to find tokens or word segments)
• a part-of-speech tagger
• a chunker (to find non-recursive syntactic an-
notation such as noun phrase chunks)
• a parser
• a name finder (to find proper names and nu-
meric amounts)
• a coreference tool (to perform coreference res-
olution)
• an interface to the WordNet lexical database
[14]
This toolkit turned out to be tricky to set up, but
subsequently offers good performance and useful
features. To generate keywords the following fea-
tures are used:
• Sentence Splitting:
Splitting the text based on sentence-ending
punctuation and additional rules (it will not split
"and Mr. Smith said" into two sentences); uses a
model trained on English data
• Tokenizer:
Resolves the sentences into independent to-
kens, based on Maximum Entropy Modeling
[5]
• POS-Tagging (Part Of Speech-Tagging): As-
sociating tokens with corresponding tags, de-
nominating what grammatical part of a sen-
tence it constitutes, based on a model trained
on English data from the Wall Street Journal
and the Brown corpus [7]
• POS-Filtering:
Filtering relevant word categories (nouns and
foreign words) based on the POS-Tags (NN,
NNS, NNP, NNPS, FW)
• Stemming [13]:
Reducing the different morphological versions
of a word to their common word stem, based
on the Porter-Stemmer-Algorithm [2]
The relevance rating of the keywords is calculated
based on word count, actuality and word length.
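The exact relevance formula is not reproduced here; the sketch below is one plausible reading of a score combining word count, actuality (how recently the word occurred) and word length, with illustrative weights of our own choosing:

    using System;

    static class KeywordScoring
    {
        // Hypothetical relevance score; the 0.5/0.3/0.2 weights are assumptions.
        public static double Relevance(string stem, int frequency,
                                       int lastPosition, int totalTokens)
        {
            double count     = frequency;                           // word count
            double actuality = (double)lastPosition / totalTokens;  // later = fresher
            double length    = Math.Min(stem.Length, 10) / 10.0;    // longer words favored
            return 0.5 * count + 0.3 * actuality + 0.2 * length;
        }
    }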
2.4 Query Construction
In order to construct the query, which the system
uses, several steps are performed:
1. each keyword is mapped to its context sen-
tence;
2. using the keyword and its context, the keywords
are matched with a particular Cyc concept;
3. the query is expanded using keywords and con-
cepts;
4. the query is submitted to the Google search engine.
A more detailed description of concept matching
and expansion can be found in Appendix A.
2.4.1 WordNet Similarity Measurement
In order to be able to compare two words or sen-
tences together, a semantic similarity measurement
was needed. The WordNet similarity measurement
was used [11].
2.4.2 Cyc Concepts Matching
Cyc[12] is a very large Knowledge Base (KB) which
tries to gather a formalized representation of a vast
quantity of fundamental human knowledge: facts,
rules, etc. The Cyc KB contains thousands of dif-
ferent assertions and can be used in many different
ways. For the system described in this paper, a
particular subset of Cyc has been used: concepts
and different relations.
In Cyc, every concept is linked to additional knowl-
edge, such as:
• a display name (readable name of the concept)
• a comment (short description)
• general concepts, for example: HumanAdult is
linked to AdultAnimal and HomoSapiens
• specific concepts, for example: HumanAdult is
linked to Professional-Adult, SoftwareEngi-
neer...
• aliases, for example: for HumanAdult there
are: ”adult homo sapiens”, ”grownup” ...
Unfortunately the whole Cyc KB was unavail-
able to the public when our research was done,
therefore CycFoundation.org [6] REST APIs have
been used in order to interact with Cyc. More
about the actual REST API implementation, as
well as the explanation of different algorithms and
specific results can be found in Appendix A.
Since Cyc contains semantic knowledge about
the concepts and not words, a word-to-concept
matching algorithm was created. The algorithm
is built in such a way that it operates only at the
semantic level; it therefore uses the WordNet similarity
measurement [11] (also described in section 2.4.1)
in order to compute a similarity score between two
sentences. The algorithm takes a keyword and
its corresponding sentence; then, using the CycFoun-
dation APIs, the set of relevant Cyc Concepts
(aka Constants) is retrieved. The retrieval
returns only concepts containing the keyword in
their names; for example, for the keyword
"human", the set of concepts would contain "Hu-
manAdult", "HumanActivity" and "HumanBody".
Next, for each item in the set of concepts, the sim-
ilarity score with the keyword’s context (sentence)
is computed and the best one is used.
The following straightforward pseudo-code imple-
mentation illustrates the approach (more about
matching types in A.3):
Input: Keyword, its context (sentence) and a matching type
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    Distance ⇐ ∞
    if MatchingType = DisplayNameMatching then
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC) / 2
    end
    if MatchingType = CommentMatching then
        dCK ⇐ 0
        Comment ⇐ GetComment(Constant)
        Keywords ⇐ GetKeywords(Comment)
        foreach CK in Keywords do
            dCK ⇐ dCK + GetDistance(Keyword, CK)
        end
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
    end
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end
2.4.3 Query Expansion
After the keywords and their respective context are
matched to a particular Cyc concept, the actual
query expansion can be done. One can think about
many different possible query expansions using the
additional Cyc knowledge. In our research we have
chosen several expansion methods, most of them
are quite straightforward. After some experiments,
one particular structured query expansion was cho-
sen to be used in the system: Or-Concept.
The algorithm constructs a structured query, which
can be described using the following formula, where
K is the set of keywords, k ∈ K and c(k) is a func-
tion which gets a matched concept for a keyword
(using a matching algorithm):

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ c(k_i)\bigr) \qquad (1)$$
The following pseudo-code illustrates the approach
used in the system in order to construct the
structured query:
Input: Keywords tagged with Cyc concepts
Output: Set of expanded keywords for Google

foreach TaggedWord in Query do
    Pattern ⇐ "( {0} OR {1} ) + "
    PatternEmpty ⇐ "{0} + "
    Keyword ⇐ TaggedWord.Word
    BestGeneral ⇐ ""
    BestDistance ⇐ ∞
    if TaggedWord.Concept ≠ null then
        GeneralConcepts ⇐ GetGenerals(TaggedWord.Concept)
        foreach General in GeneralConcepts do
            Distance ⇐ GetDistance(TaggedWord.Word, General.DisplayName)
            if Distance < BestDistance then
                BestGeneral ⇐ General.DisplayName
                BestDistance ⇐ Distance
            end
        end
    end
    if BestGeneral.Length = 0 then
        Pattern ⇐ PatternEmpty
    end
    Expanded.Add(String.Format(Pattern, Keyword, BestGeneral))
end
2.4.4 Query for Google Search
In order to receive real data, we need to choose a search engine as well as a corpus. For several reasons, we quickly arrived at the decision to use the Google web search for our application. First of all, Google is the most popular search engine that one can find on the web. This is a great advantage, because an application programming interface (API) is required in order to make interaction between our program and the search engine possible. The bigger and the more popular a search engine is, the more likely it is that a good API can be found for it. This turned out to be only partially true. Google provides a lot of programming interfaces, most of which are specialized for a specific context. So it isn't easy to find the best fitting programming interface for an application, in addition to the "usual" problems one may encounter during the implementation.

Another reason to choose the Google web search is the fact that it can easily be restricted to one or a set of several web pages. This can be done by simply adding the tag "site:", followed by the desired web page, to the keywords one wants to search for. As a result, the search only finds results that are somewhere on the indicated pages. As we don't want to find arbitrary results, but results from specific, serious pages, we really want to use this feature of the Google web search. If one does a web search (either with the Google engine or somewhere else), a lot of results are found on very different pages and types of pages. This is of course important, since that is what most people want when they search the web. But with the information retrieval system, the aim is different. We don't want to display content from any website that contains the keywords, but only from websites with trustable content. So a white list with the sites that will be searched is the best way of avoiding useless results. Due to the fact that the results of the web search contain lots of HTML tags and similar things that are (for our purpose) useless, some knowledge about the structure of the found results is also required. By limiting the search to defined web pages, we can use our information about these pages to parse the results, depending on the web site they were found on.

In a first approach, we limited the search only to articles on en.wikipedia.org. Wikipedia is a very well-known and well-accepted page that contains useful information about a lot of domains. Retrieving data from only one page may seem too little, but it is enough in order to set up a working and useful system. In addition, the system can easily be enhanced by adding other sites.
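A minimal sketch of assembling such a restricted query; the joining convention follows the example queries shown in Appendix A.5.1:

    using System;

    static class GoogleQuery
    {
        // Joins (possibly expanded) keyword groups with "+" and appends the
        // whitelist restriction, e.g. "species + theory + site:en.wikipedia.org".
        public static string Build(string[] terms, string site)
        {
            return string.Join(" + ", terms) + " site:" + site;
        }
    }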
2.5 Results Filtering
2.5.1 Retrieving Unstructured Web Pages
The search on the web engine gives a list of URLs
as its result. Each of the pages can be downloaded
as text, but is then an HTML-formatted web page.
There is some work left to do to receive only the
real content. First, we want to get rid of the
HTML tags and any other formatting data. This
task has to be performed by applications of very
different types. For example, every web browser
has to use functionality that distinguishes between
content that will be displayed and content that
will not. It is usually best to use existing,
tested and working packages, which can be found
in web browsers.
The Microsoft Internet Explorer performs this
task using a dynamic link library (DLL) called
”mshtml”. This library can easily be used to parse
a complex HTML file into plain text that contains
only those parts that would be displayed when one
visits this web page with the Internet Explorer
(or another browser). This does the main part of
this step.
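A sketch of this step, assuming a COM reference to the Microsoft HTML Object Library (mshtml); the interop details in the actual project may differ:

    using mshtml;   // COM reference: Microsoft HTML Object Library

    static class HtmlText
    {
        // Let the Internet Explorer engine build the DOM, then read back only
        // the text a browser would actually display.
        public static string Strip(string html)
        {
            var doc = (IHTMLDocument2)new HTMLDocument();
            doc.write(html);
            doc.close();
            return doc.body != null ? doc.body.innerText : string.Empty;
        }
    }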
Many web pages don’t only contain the ”real”
information, but also a lot of other things, such
as navigation links, menus, advertisements, and so
on. These are displayed by the browser and thus
seen by the user. If the user of the information
retrieval system chooses to display one of these
sites, he may want to see this content. But
due to the fact that the texts from the results are
summarized in the next step, we have to remove
them from the text, because they don’t contain any
information that is relevant for the information
contained in the page itself.
The format of these menus etc. depends on the
web page where the result was found. The results
on Wikipedia are formatted in a different way from
the results on other lexicographical web pages. Thus,
it is useful to handle the results depending on the
web page they were found on.
As the web search in this approach has been
limited to only one page, namely en.wikipedia.org,
we have to consider only one page in order to
remove ”waste” like links and menus from it. An
analysis of the structure revealed an easy way
of parsing the Wikipedia articles. All articles on
Wikipedia start with the article itself - beside some
HTML-tags and a lot of scripts. The menus and
navigation links follow after it. That means that
we can simply work on the first part of the article
and skip everything that follows afterwards. The
edge between article and links is clearly defined by
the HTML tag '<div class="printfooter">', which is
contained in every article on Wikipedia. So we can
simply search the results from Wikipedia for the
tag mentioned above, and remove everything that
follows this tag. It should be mentioned that
this truncation has to be done before the parsing
step performed by the Microsoft dynamic link
library, because otherwise one would not be able
to find the HTML tag used as the edge.
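A minimal sketch of this truncation step; it must run on the raw HTML, before the mshtml text extraction described above:

    using System;

    static class WikipediaCleaner
    {
        // Everything after the printfooter marker is navigation, menus and
        // links, so it is simply cut off.
        public static string TruncateAtPrintFooter(string rawHtml)
        {
            const string marker = "<div class=\"printfooter\">";
            int cut = rawHtml.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
            return cut >= 0 ? rawHtml.Substring(0, cut) : rawHtml;
        }
    }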
2.5.2 Summarizing the Web Content
A document contains important, but also a lot
of irrelevant, information. Additionally the screen
space to present the information is intentionally
limited to allow a quick evaluation by the user.
Because of this the documents are summarized in
order to present only the important information to
the user. For this we use the Open Text Summa-
rizer [1]. This tool doesn't abstract the document
in a natural way, because it does not rephrase the
text. It just produces a condensed version of the
original by keeping only the important sentences.
However, it shows good results for non-fictional
text and can be used with unformatted and HTML-
formatted text. It has received favourable mention
from several academic publications and it is at
least as good as commercial tools such as Copernic
and Subject Search Summarizer.
The Open Text Summarizer parses the text
and utilizes Porter stemming. For this, an XML file
with the parsing and stemming rules is used. It
calculates the term frequency for each word and
stores this in a list. After this, a stop-word filter
is applied. It removes all redundant common
words in the list by using a stop word dictionary,
which is also stored in the XML file. After sorting
the list by frequency the keywords are determined,
which are the most frequently occurring words.
The original sentences are scored based on these
keywords. A sentence that holds many important
words, the keywords, is given a high grade. The
result is a text with only the highest scored
sentences. For this, the limiting factor can be set to
either a percentage of the original text or a number
of sentences.
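The sketch below condenses that pipeline into a few lines of C#. It mimics the frequency-based scoring idea (stop-word removal, stemming, term frequency, sentence scoring) rather than reproducing the actual Open Text Summarizer code; the stemmer and stop-word list are passed in as assumptions:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class NaiveSummarizer
    {
        public static IEnumerable<string> Summarize(
            IList<string> sentences, Func<string, string> stem,
            ISet<string> stopWords, double ratio)
        {
            // Term frequency over stemmed, stop-word-filtered tokens.
            var freq = new Dictionary<string, int>();
            foreach (var word in sentences.SelectMany(Tokens))
                if (!stopWords.Contains(word))
                {
                    string t = stem(word);
                    freq[t] = freq.ContainsKey(t) ? freq[t] + 1 : 1;
                }

            // Score each sentence by the frequency of its keywords and keep
            // the requested fraction of top-scoring sentences.
            int keep = Math.Max(1, (int)(sentences.Count * ratio));
            return sentences
                .OrderByDescending(s => Tokens(s).Sum(
                    w => freq.ContainsKey(stem(w)) ? freq[stem(w)] : 0))
                .Take(keep);
        }

        static IEnumerable<string> Tokens(string sentence)
        {
            return sentence.ToLower().Split(
                new[] { ' ', ',', '.', ';', ':' },
                StringSplitOptions.RemoveEmptyEntries);
        }
    }

A real summarizer would also restore the selected sentences to their original order before display; the sketch only selects them.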
2.6 Visualization
Visualization is probably one of the most impor-
tant parts of the system. It consists of two main
parts:
1. The ”Presenter-View”, the view that the pre-
senter sees on his computer during the pre-
sentation. The presenter should be able to see
the dataflow, manipulate it and give simple com-
mands such as: "show this image" or "show that
article". The presenter also needs a preview of
what is shown to the public.
2. The "Public-View", the simplified view that
shows only the needed information and mir-
rors the lecturer's interactions with the presen-
ter's view pre-rendering zone.
Figure 2: Presenter’s view layout structure
For the Graphical User Interface (GUI), several
crucial specifications were defined, most impor-
tantly:
• rendering and screen mirroring should be done
in real-time;
• layout of the GUI should be as simple as pos-
sible, presenting the retrieved data in the sim-
plest and fastest way for the presenter;
• it should also be interactive; the images and
articles should be manipulable in real-time.
Figure 3: Real-time parallelization workflow
In order to achieve a smooth experience, several things must be done at the same time. As figure 3 shows, the two main components, rendering and information extraction, were completely separated and run in several threads. Moreover, the speech recognition part and the Google search were also separated, and all searches are done asynchronously. Therefore, the system leverages modern multi-core architectures in order to achieve a significant speedup. In order to present the data in the simplest way, several layout prototypes have been tested. Finally, the one shown in figure 2 and screenshots 4 and 6 has proven to be a good way to present the information. The system supports rendering to several screens, and the actual mirroring and the development of the user interfaces have been achieved using the novel Microsoft Windows Presentation Foundation (WPF) technology.
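A minimal sketch of this decoupling, assuming a WPF Dispatcher for marshalling results back to the rendering thread; SearchAndSummarize and ShowResults are placeholders for the pipeline and view described above:

    using System;
    using System.Threading;
    using System.Windows.Threading;

    class SearchCoordinator
    {
        readonly Dispatcher uiDispatcher;   // the WPF rendering thread's dispatcher
        public SearchCoordinator(Dispatcher d) { uiDispatcher = d; }

        // Fire the retrieval pipeline on a worker thread so recognition and
        // rendering never block on network or summarization work.
        public void SearchAsync(string query)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                var results = SearchAndSummarize(query);   // hypothetical pipeline call
                uiDispatcher.BeginInvoke((Action)(() => ShowResults(results)));
            });
        }

        string[] SearchAndSummarize(string q) { return new string[0]; }  // placeholder
        void ShowResults(string[] r) { }                                 // placeholder
    }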
WPF technology is especially designed for rich user interface development, and is built on top of the .Net Framework. It uses a markup language known as XAML to provide a clear separation between the code and the design definition, which also greatly helped in the parallelization process. It also features 3D and video/audio rendering capabilities and animation storyboards (similar to Flash). Therefore our system can be extended in order to perform
Figure 4: Screenshot of the GUI; the rendering
view is split automatically between several im-
ages and an article
video search or do 3D enhanced presentations. The
rendering engine takes advantage of modern hard-
ware (dedicated GPUs ...). One of the features of
Figure 5: Screenshot of the GUI, the public view
our system is the dynamic layout management of
the render-view. As shown in figures 4 and 6, the
rendering zone automatically adjusts itself depend-
ing on the quantity of content presented. With such
capabilities the system makes sure to use the whole
screen without leaving much empty space.
Figure 6: Screenshot of the GUI, the rendering view
shows only images since no articles have been se-
lected by the lecturer
3 Tests and Results
3.1 Speech Recognition
In the context of this project it is important that
the speech recognition engine has a high quality
of recognition and works fast. For this reason
these two aspects have been evaluated.
After a two hour training phase the engine
recognized about 70% of the spoken words cor-
rectly. The recognition rate can be improved by
further training. To achieve a good recognition
rate it is necessary to speak loudly and articulately
and to pronounce the words always in the same
way. Additionally, punctuation marks are not
recognized automatically but have to be spoken
explicitly. This is, however, not a natural way to
speak at a presentation.
After speaking some sentences a break of a
few seconds has to be made to initiate the recog-
nition process for these sentences. The recognition
process also needs a few seconds depending on the
number of words. The tests showed a recognition
speed of about four words per second. Again, these
frequent breaks of a few seconds are not a natural
way to speak at a presentation.
3.2 Keyword Extraction
The keywords extraction module has been tested
with different features and numbers of words. The
computing time and the quality of the extracted
keywords have been evaluated.
The results are shown in figures 7 and 8. The
tests have been made for the complete feature set
(explained in chapter 2.3) and with the stemmer
disregarded. Due to the limitations of the speech
recognition engine (no punctuation marks if not
pronounced explicitly) the keyword extraction has
also been evaluated for texts without punctuation
marks.
It shows that the quality of the keyword extraction
without stemmer is generally lower than with the
complete feature set. Especially if the keywords
are to be used in the query for the internet search, the
use of the stemmer shows advantages. It avoids, for
example, searching for the singular and plural of the
same keyword. The relevancy of the
keywords is also improved. It also shows that
the gain in computation time by disregarding
the stemmer is minimal. With 1200 words the
gain is only 120 ms. For the expected relatively
short sentences usually provided by the speech
recognition engine the gain is even less.
The tests also showed that for text without
punctuation marks the quality is nearly the same
as for the same text with punctuation marks.
3.3 Query Construction and Expan-
sion
As figure 9 shows, the Or-Concept algorithm for
query expansion yields a significant improvement
in precision. In that figure, a 10-word query is
compared to an expanded 10-word query, and at N = 5
one can see an improvement of almost 3 times. Cyc-
enhanced queries give better results, but should be
evaluated further (more about the different experiments
and query enhancements in Appendix A).
Figure 7: time results for an extraction of ten key-
words (with different numbers of words)
Figure 8: quality of results for an extraction of ten
keywords (with different numbers of words)
3.4 Results Filtering
The summarizer module has been tested with
and without stemmer for different summarizing
Figure 9: Precision at N, 10 word query with no
keyword stemming
percentages of the original unformatted text. The
quality of the results represents their usability,
which has been evaluated from the meaningfulness
of the summary and its length. A shorter length
was considered more usable because it allows
a quicker evaluation by the user.
Figure 10 shows that in these tests with the stemmer
the best-quality results were achieved for
a 15% summary. Without the stemmer the quality
was usually a little lower.
The calculation time with and without stemmer
for different summarizer percentages is shown in
figure 11. The difference between the different
configurations is minimal.
When comparing unformatted text and HTML-
formatted text (with a similar amount of content),
the tests show that the HTML-formatted text can
take significantly longer to process.
4 Discussion
As explained in the preceding chapters, the sys-
tem works in its main parts. It is clearly possible
to enrich web search as well as presentation of
contents with different tools related to text mining,
Figure 10: quality of results for different summary
percentages for a text with 1200 content words
Figure 11: Calculation times for different summary
percentages for a text with 1200 content words
information retrieval, and others. But many
problems also turned up that had either to be solved
or to be worked around.
The speech recognition module did work, but still
had some problems, which lie in the implementa-
tion of the used software. The speech recognition
itself works fine, but the chosen engine has some
problems e.g. in recognizing accents, breaks
and similar. These often result in an incorrectly
recognized word or sentence. So either another,
better engine has to be found or another solution has
to be found in order to meet the requirements.
The keyword extraction from the spoken text
works well and is also very efficient. But still,
this step can be enhanced. For example, the
user may want to influence the chosen keywords
more directly than he does by speaking. So it
would be a useful feature if the user could decide
manually which keywords are good and important
and which are not relevant. The
keywords provided by the system could then be
viewed as suggestions that the user can accept or
decline. Another nice feature related to the first
one mentioned is a self-learning algorithm that
tries to predict the user’s decision.
Another part that works fine but can still be
improved is the summarization of the found texts.
It produces good summaries of the found web
pages without consuming too much computation
time. But the result could perhaps be enhanced
by using features such as coreference resolution,
leading to a better estimation of which referents
are important and which are not. Before
putting a big effort into it, however, it should first
be checked whether it is worth the effort.
5 Conclusions
In this paper the design and implementation of an In-
teractive Web Retrieval and Visualization System
have been discussed. The system has proven robust
and satisfies the real-time constraints of a live presen-
tation. Tests and results have shown that both key-
word extraction and query expansion return satis-
fying results, keeping good precision. During the
experiments it has also been shown that several
parts of such a system still need improvement: speech
recognition needs to be enhanced and properly pre-
trained, unstructured web data should be properly
cleared of all noise, and summarization can be
enhanced, creating more concise articles.
Overall, such systems can be built with modern
hardware and parallelization, and hopefully, more
of those will be seen as commercial products in the
near future.
References

[1] Open Text Summarizer. http://libots.sourceforge.net.

[2] Porter stemmer. http://tartarus.org/~martin/PorterStemmer/.

[3] SharpNLP - open source natural language processing tools. http://www.codeplex.com/sharpnlp.

[4] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 60. Cambridge University Press, 2006.

[5] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 318. Cambridge University Press, 2006.

[6] Cyc Foundation. http://www.cycfoundation.org/.

[7] W.N. Francis and H. Kucera. Brown corpus manual.

[8] M. Harrington. Giving computers a voice. http://blogs.msdn.com/coding4fun/archive/2006/10/31/909044.aspx.

[9] J. Moskowitz. Speech recognition with Windows XP. http://www.microsoft.com/windowsxp/using/setup/expert/moskowitz_02september23.mspx.

[10] J. Moskowitz. Windows speech recognition. http://www.microsoft.com/windows/windows-vista/features/speech-recognition.aspx.

[11] T. Ngoc Dao and T. Simpson. Measuring similarity between sentences. CodeProject article.

[12] Open Cyc Project. http://www.cyc.com/opencyc.

[13] J. C. Scholtes. Text mining: preprocessing techniques, part 1.

[14] Princeton University. WordNet, a lexical database for the English language. http://wordnet.princeton.edu/.
A Appendix: Query Expansion using Cyc

A.1 WordNet Similarity Measurement
In order to be able to compare two words or sen-
tences together, a semantic similarity measurement
was needed. For this the WordNet similarity mea-
surement was used.
The following steps are performed to compute
the semantic similarity between two sentences [11]:
• each sentence is partitioned into a list of tokens
and the stop words are removed;
• words are stemmed;
• part of speech tagging is performed;
• the most appropriate sense for every word in
a sentence is found (Word Sense Disambigua-
tion). To find out the most appropriate sense
of a word, the original Lesk algorithm was used
and expanded with the hypernym, hyponym,
meronym, troponym relations from WordNet.
The possible senses are scored with a new scor-
ing mechanism based on Zipf's law and the
sense with the highest score is chosen.
• the similarity of the sentences is computed based
on the similarity of the pairs of words. In order
to do this, a semantic similarity relative matrix
is created, consisting of the semantic similarity
between pairs of word senses (the most appropriate
sense of each word). The Hungarian method is
used to get the semantic similarity between
sentences. The match results are combined to
compute a single similarity value for the two
sentences. The matching average is used to compute
the semantic similarity between two word senses.
This similarity is computed by dividing the sum of
similarity values of all match candidates of both
sentences by the total number of set tokens.
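Under our reading of [11], this final step can be written as a matching average over the two token sets (a formalization we add for clarity, not quoted from the source):

$$\mathrm{sim}(S_1, S_2) = \frac{2 \sum_{(w_i, w_j) \in M} \mathrm{sim}(w_i, w_j)}{|S_1| + |S_2|}$$

where $S_1$ and $S_2$ are the token sets of the two sentences and $M$ is the set of word-sense pairs matched by the Hungarian method.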
A.2 CycFoundation.org REST API
As mentioned in section 2.4.2, the system used
REST APIs in order to access the CycFounda-
tion.org Cyc KB. REST stands for Representa-
tional State Transfer, a way to build a service-ori-
ented architecture based on HTTP and XML, gen-
erally using the GET or POST methods of the
HTTP protocol.
The CycFoundation web services expose only a sub-
set of Cyc's capabilities; the API implemented in
our system is therefore rather small. It contains the
following queries:
• GetConstants(Keyword) - performs a search
for Cyc concepts for a keyword
• GetComment(Concept) - returns a comment
for a particular concept
• GetCanonicalPrettyString(Concept) - returns a
simplified name for a concept
• GetDenotation(Concept) - returns a denota-
tion for a particular concept
• GetGenerals(Concept) - returns a set of gen-
eral concepts for a particular concept
• GetSpecifics(Concept) - returns a set of spe-
cific concepts for a particular concept
• GetInstances(Concept) - returns a set of in-
stances (concepts) for a particular concept
• GetIsA(Concept) - returns a set of Is A con-
cepts for a particular concept
• GetAliases(Concept) - returns a set of aliases
(words or phrases) for a particular concept
• GetSiblings(Concept) - returns a set of sibling
concepts for a particular concept
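As an illustration only, a GetConstants-style call could look like the following; the endpoint URL and XML element names are assumptions, since the exact CycFoundation.org REST paths are not reproduced in this paper:

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Xml;

    static class CycClient
    {
        public static List<string> GetConstants(string keyword)
        {
            // Hypothetical endpoint and response layout; adjust both to the
            // real CycFoundation.org service description.
            string url = "http://ws.cycfoundation.org/getConstants?keyword="
                         + Uri.EscapeDataString(keyword);
            using (var client = new WebClient())
            {
                var doc = new XmlDocument();
                doc.LoadXml(client.DownloadString(url));
                var names = new List<string>();
                foreach (XmlNode node in doc.SelectNodes("//constant"))
                    names.Add(node.InnerText);
                return names;
            }
        }
    }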
A.3 Cyc Concepts Matching
The general algorithm described in section
2.4.2 was derived from two algorithms:
display name matching and comment matching.
These algorithms have been tested separately.
A.3.1 DisplayName Matching Algorithm
The display name used in Cyc is the shortest con-
cept description. For example, the display name for
the "HumanAdult" concept is simply "a human adult".
The matching algorithm therefore computes the
distance between the keyword context and this
description, as follows:
Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC) / 2
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end
This algorithm has proven to be quite efficient since
less computation needs to be done, but on the other
hand it tends to provide quite poor results and
should be enhanced for proper use.
A.3.2 Comment Matching Algorithm
The comment matching algorithm uses the addi-
tional knowledge about a Cyc concept called the
comment. It is a long description which tends to be
as specific as possible. For example, again for the
"HumanAdult" concept, the comment is:
”A specialization of Person, and an in-
stance of HumanTypeByLifeStageType.
Each instance of this collection is a per-
son old enough to participate as an inde-
pendent, mature member of society. In
most modern Western contexts it is as-
sumed that anyone over 18 is an adult.
However, in many cultures, adulthood oc-
curs when one reaches puberty. Adult-
hood is contiguousAfter (q.v.) childhood.
Notable specializations of this collection
include AdultMaleHuman, AdultFemale-
Human, MiddleAgedHuman and OldHu-
man.”
The pseudo-code implementation of this
algorithm, where the GetComment method is an
actual REST API call, is as follows:
Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dCK ⇐ 0
    Comment ⇐ GetComment(Constant)
    Keywords ⇐ GetKeywords(Comment)
    foreach CK in Keywords do
        dCK ⇐ dCK + GetDistance(Keyword, CK)
    end
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end
A.3.3 Experiments with both algorithms
After running several experiments (Fig. 13) com-
paring the Comment Matching and Display Name
algorithms, we found that Comment Matching is
the one that gives significantly better results, but it
can still be enhanced to get even better matching.
This may be done by using general/specific con-
cepts in the distance calculation formula or by
learning weights.
A.4 Query Expansion Patterns
In our research, several different query expan-
sion algorithms have been implemented, all with
quite straightforward implementations, as explained
in section 2.4.3. The algorithms, which can be de-
scribed using the following formulas, where K is the
set of keywords and k ∈ K, have been implemented
and evaluated:
Or-Concept Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ c(k_i)\bigr) \qquad (2)$$

where c(k) is a concept linked with the keyword.

Or-Aliases Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\Bigl(k_i,\ \bigvee_{j=1}^{m} a_j(k_i)\Bigr) \qquad (3)$$

where a(k) is an alias linked with the keyword's concept.

Or-Most-Relevant-Alias Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ a(k_i)\bigr) \qquad (4)$$

where a(k) is the most relevant alias linked with the keyword's concept.

Or-Most-Relevant-General Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ g(k_i)\bigr) \qquad (5)$$

where g(k) is the most relevant general concept linked with the keyword's concept.

Or-Most-Relevant-Specific Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ s(k_i)\bigr) \qquad (6)$$

where s(k) is the most relevant specific concept linked with the keyword's concept.

Or-Is-A-Concept Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ isa(k_i)\bigr) \qquad (7)$$

where isa(k) is the most relevant 'Is A ...' concept linked with the keyword's concept.

Or-Most-Relevant-AGS Expansion:

$$Q(K, n) = \bigwedge_{i=1}^{n} \max\bigl(k_i,\ a(k_i),\ g(k_i),\ s(k_i)\bigr) \qquad (8)$$

where a(k) is the most relevant alias, g(k) the most relevant general concept, and s(k) the most relevant specific concept linked with the keyword's concept.
A.5 Results and Discussion
A.5.1 Google Search on Wikipedia KB
Our system has been tested and is able to generate
queries for the Google search engine; however, it is
very difficult to evaluate such results, therefore this
section of the paper only gives an example of
results.
Using as input an introductory text of the Stanford
Encyclopedia for the topic of 'evolution', the
system derived several keywords:
• ”species”, matched to ”speciesImmunity”
• ”theory”, matched to ”theoryOfBeliefSystem”
• ”evolution”, matched to ”Evolution”
• ”change”, matched to ”changesSlot”
• ”term”, matched to ”termExternalIDString”
Next, several queries have been constructed and
restricted to the Wikipedia KB:
• Google Non-Expanded Query: ”species theory
evolution change term site:en.wikipedia.org”
• Google Or-Concepts Query: ”( species OR
species immunity ) + ( theory OR Theory Of
Belief System ) + ( evolution OR biological
evolution ) + ( change OR Changes Slot )
+ ( term OR Term External ID String ) +
site:en.wikipedia.org”
Figure 12: Precision at N, results of Or-Concept
expanded query with and without stemming com-
pared
• Google Or-Aliases Query: ”species + theory +
( evolution OR (biologically will have evolved
OR biologically had evolved OR biologically
will evolve OR biologically has evolved OR bi-
ologically have evolved OR biologically evolv-
ing OR biologically evolves OR biologically
evolved OR biologically evolve OR most evo-
lutionary OR more evolutionary OR evolu-
tionary OR evolution) ) + change + term +
site:en.wikipedia.org”
• Google Or-Most-Relevant-General Query:
”species + theory + ( evolution OR
”development” ) + change + term +
site:en.wikipedia.org”
• ...
After analyzing the results, we compared several
of the results returned by Google. For the normal,
non-expanded query the results are quite satisfy-
ing:
* Evolution
* Punctuated equilibrium
* Evolution as theory and fact
* Macroevolution
* History of evolutionary thought
* On the Origin of Species
Figure 13: Precision at N, results of Or-Concept
expanded query using different matching
* Species
* Hopeful Monster
However, the source article mentioned Charles
Darwin and his name was not among the derived
keywords, so no direct Wikipedia link to his theory
of natural selection was found in the non-expanded
results. After deeper analysis, it was actually
found in the results of the expanded query (Or-Most-
Relevant-General):
* Evolution
* Punctuated equilibrium
* Macroevolution
* Evolution as theory and fact
* Charles Darwin
* On the Origin of Species
* Hopeful Monster
* Natural selection
A.5.2 Lemur Search in AP Corpus
In order to perform an evaluation of the query ex-
pansion methods, structured query generation for
the Lemur search engine has been implemented too.
The queries have been automatically derived from
the narratives of topics 101 to 115 of the AP Corpus
Figure 14: Precision at N, expansions of a 5 word
query compared (part 1)
and a batch evaluation of those queries has been
performed.
Several different sets of results have been produced
by the IREval function of Lemur; some of them are:
• 5-10 word query with no keyword stemming;
• 5-10 word query with keyword stemming;
• 5-10 word query using name matching.
Most of the experiments have been performed
with the Comment Matching Algorithm and the
whole set of query expansion patterns. Some experiments
have been performed in order to compare the Dis-
play Name and Comment Matching Algorithms.
The keyword stemming part was especially tested
in order to suppress plurals, since the keywords
derived from the AP Corpus topics usually kept
them. So, for example, "changes" was stemmed to
"change", which gives a larger set of possible con-
cepts to match. During our experiments (Figure 12)
we found that using keyword stemming in
those algorithms significantly decreases the preci-
sion for larger queries. One possible explanation
for this is that the matching algorithms
make more errors, since the possible concept
set is larger.
In figures 14 and 15 one can see that only
the Or-Concept algorithm gives better precision
Figure 15: Precision at N, expansions of a 5 word
query compared (part 2)
Figure 16: Precision at N, expansions of a 10 word
query compared (part 1)
at N = 5 and above. The Or-Most-Relevant-Alias
and Or-Most-Relevant-General give a precision
improvement only at N > 20.
Finally, it is in figures 16 and 17 that
one can really see the improvement made by
stemming and especially by Or-Concept query ex-
Figure 17: Precision at N, expansions of a 10 word
query compared (part 2)
pansion. While the 5-word query expansion (at N = 5)
in this particular algorithm gives an improvement
of almost 50 percent, with a 10-word query the
improvement is almost of the order of 200 percent
(a significant increase in precision).
The results speak for themselves; we think that
Cyc-enabled query expansion is definitely the way
to go, but better patterns still need to be built.
In our experiments, out of 7 different patterns only
one actually yields a good precision in-
crease: the Or-Concept pattern.
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Dernier (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

A.1 WordNet Similarity Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
A.2 CycFoundation.org REST API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
A.3 Cyc Concepts Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
A.3.1 DisplayName Matching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
A.3.2 Comment Matching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
A.3.3 Experiments with both algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
A.4 Query Expansion Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
A.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
A.5.1 Google Search on Wikipedia KB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
A.5.2 Lemur Search in AP Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1 Introduction

Why is there a need for an Interactive Web Information Retrieval and Visualization System?

A person who teaches at school, at university or elsewhere is often required to present various types of information (his own knowledge, texts written by others, pictures, diagrams etc.) more or less at once. To do this, he collects his data beforehand, arranges it and sets up a presentation using an ordinary presentation tool such as Microsoft PowerPoint. During his lecture, he starts the presentation, showing one slide after another and explaining what can be seen on it.

Compared to what lessons were like some decades ago, this is of course a great increase in the possibilities of how information can be presented and how the listeners' attention can be kept. But there are still several problems with this approach. First, all the work has to be done beforehand, so a long time must be spent on the presentation before anything useful can be shown. Another problem is that the prepared presentation is very static: presentations about topics that change over time have to be updated again and again with new information and media from the web. Further, during the presentation one may learn that the audience's knowledge about some of the covered subjects differs from what one had expected. If the listeners know more than expected, that is not a real problem - some slides can be skipped. But what if they know less than expected? It is impossible to make new slides during the presentation, so either the problem has to be ignored or one has to search for the missing information during the presentation.

Another hardship with the usual presentations is the research itself. Finding information is nowadays very easy: one goes to Google or another search engine, enters a query and looks through the first results. But it is not always so simple, because one does not always know what one is searching for and where it can be found. In addition, one often receives a lot of results, and the precision is too low to find the relevant information quickly.

In our project, we want to address all these problems and difficulties. Our aim is to develop a system that makes the research easier and more efficient, and that also increases the quality and changeability of the presentation.

To enhance the research itself, the program should increase the precision of the search by searching for concepts instead of keywords. If done properly, all the results of the search are relevant for the topic and can be used in some way. This reduces the time spent on evaluating each of the search results. In addition, the texts on the resulting pages can be summarized, further reducing the time needed to evaluate the results. Another possibility is to do the usual web search and the image search at once, because e.g. tables can be found in both formats.

To make the system usable during the presentation itself, the search query should be constructed not solely from keyboard input, but also directly from spoken text, in a way that lets the presenter choose the currently wanted mode of communication. The results have to be displayed in a way that makes the important parts directly visible, while other things are suppressed.
The user of the system has to be able to decide which contents (texts, images etc.) are shown and which are not.

To meet all these requirements, the system has to consist of the following steps:

1. Recognize the spoken text
2. Retrieve keywords from the text
3. Create a powerful search query
4. Filter and summarize the results
5. Visualize the results in a proper way

These steps are further described in the following chapters.

2 Web Retrieval

2.1 Architectural Overview

The spoken words of the presentation are recognized and translated by a speech recognition engine. The output of the speech recognition is fed to a keyword extractor to create the keywords of the text. These are used by a query construction module to produce an internet search query. The picture results of the search are directly transferred to the visualization module.
The text results are summarized by a summarizer module and then also displayed by the visualization module. The workflow is shown in figure 1.

Figure 1: workflow

2.2 Speech Recognition

The user of the information retrieval system has several ways of interacting with the system, but the most important one is speech. It is used to obtain a starting point as well as to define the wanted content more precisely. Thus, our application needs a well-working speech recognition engine to "understand" as much of the spoken text as possible. It would be interesting to write our own speech recognition engine, but the effort for this is too high for our project. So the main task here is to choose a feasible speech recognition engine and to use it in a proper way.

There are several speech recognition engines available, with different strengths and weaknesses. For example, "Dragon NaturallySpeaking" is an established commercial software product that works quite well. But being commercial, it is very expensive and thus not the right solution for a research project. Besides the one mentioned, there are several other commercial products that include a speech recognition engine. Another group of engines are open-source tools spread across the web; unfortunately, the tools we tested turned out not to have the efficiency and quality that we would need for the project. So we decided to use a product of a third group: engines built into operating systems. These engines are likely to have high quality and good performance, and they do not have to be bought separately, because they are included in the operating system that everyone has to buy anyway.

After some tests and some problems with the usage of some of the mentioned software products, we chose the Windows Speech Recognition API (SAPI 5.1) [9], [10] for our project. The Windows speech module offers different kinds of functionality, including a dictation mode (besides others such as voice commands and text to speech). In this dictation mode, the spoken words are simply turned into plain text, which can be used by other programs via the programming interface [8].
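To make the dictation setup concrete, the following minimal C# sketch wires a dictation grammar to the default microphone and prints each recognized utterance. It uses the managed System.Speech wrapper (which sits on top of the Windows speech engine) rather than the raw SAPI 5.1 interfaces available in 2009, so it illustrates the idea rather than reproducing the system's actual code; it requires a reference to the System.Speech assembly.

    using System;
    using System.Speech.Recognition; // managed wrapper around the Windows speech engine

    class DictationDemo
    {
        static void Main()
        {
            // Create a recognizer bound to the default microphone.
            using (var recognizer = new SpeechRecognitionEngine())
            {
                // Dictation mode: free-form text rather than a fixed command grammar.
                recognizer.LoadGrammar(new DictationGrammar());
                recognizer.SetInputToDefaultAudioDevice();

                // Each recognized utterance arrives as plain text, which the
                // pipeline can forward to the keyword extractor.
                recognizer.SpeechRecognized += (sender, e) =>
                    Console.WriteLine("Recognized: " + e.Result.Text);

                recognizer.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine(); // keep listening until Enter is pressed
            }
        }
    }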
2.3 Keyword Extraction

For the keyword generation from an input text, the open source Natural Language Processing toolkit SharpNLP is used [3]. SharpNLP is a collection of natural language processing tools written in C#. It provides the following NLP tools [4]:

• a sentence splitter (to identify sentence boundaries)
• a tokenizer (to find tokens or word segments)
• a part-of-speech tagger
• a chunker (to find non-recursive syntactic annotation such as noun phrase chunks)
• a parser
• a name finder (to find proper names and numeric amounts)
• a coreference tool (to perform coreference resolution)
• an interface to the WordNet lexical database [14]

This toolkit turned out to be tricky to set up, but subsequently offers good performance and useful features. To generate keywords, the following features are used:

• Sentence splitting: splitting the text based on sentence punctuation and additional rules (it will not split "and Mr. Smith said" into two sentences), using a model trained on English data.
• Tokenization: resolving the sentences into independent tokens, based on maximum entropy modeling [5].
• POS tagging (part-of-speech tagging): associating tokens with corresponding tags denominating what grammatical part of a sentence they constitute, based on a model trained on English data from the Wall Street Journal and the Brown corpus [7].
• POS filtering: keeping relevant word categories (nouns and foreign words) based on the POS tags (NN, NNS, NNP, NNPS, FW).
• Stemming [13]: reducing the different morphological versions of a word to their common word stem, based on the Porter stemmer algorithm [2].

The relevance rating of the keywords is calculated based on word count, actuality and word length. A sketch of this pipeline is given below.
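The following sketch illustrates the extraction pipeline just described: POS filtering on the listed tags, stemming, and a count-plus-length relevance score. The IPosTagger and IStemmer interfaces are hypothetical stand-ins for the SharpNLP tagger and the Porter stemmer, and the scoring weights are illustrative; the paper's actual rating additionally considers actuality, which is omitted here.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical stand-ins for the SharpNLP tagger [3] and Porter stemmer [2].
    interface IPosTagger { string[] Tag(string[] tokens); }
    interface IStemmer { string Stem(string word); }

    class KeywordExtractor
    {
        // POS tags kept by the filter, as listed in section 2.3.
        static readonly HashSet<string> KeptTags =
            new HashSet<string> { "NN", "NNS", "NNP", "NNPS", "FW" };

        readonly IPosTagger tagger;
        readonly IStemmer stemmer;

        public KeywordExtractor(IPosTagger tagger, IStemmer stemmer)
        {
            this.tagger = tagger;
            this.stemmer = stemmer;
        }

        // Returns the top-n stems scored by frequency and word length
        // (a simplified version of the paper's relevance rating).
        public IEnumerable<string> Extract(string[] tokens, int n)
        {
            string[] tags = tagger.Tag(tokens);
            var scores = new Dictionary<string, double>();

            for (int i = 0; i < tokens.Length; i++)
            {
                if (!KeptTags.Contains(tags[i])) continue;       // POS filtering
                string stem = stemmer.Stem(tokens[i].ToLower()); // stemming
                scores.TryGetValue(stem, out double s);
                scores[stem] = s + 1.0 + 0.1 * tokens[i].Length; // count + length bonus
            }

            return scores.OrderByDescending(kv => kv.Value)
                         .Take(n)
                         .Select(kv => kv.Key);
        }
    }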
2.4 Query Construction

To construct the query used by the system, several steps are performed:

1. each keyword is mapped to its context sentence;
2. using the keyword and its context, each keyword is matched with a particular Cyc concept;
3. the query is expanded using the keywords and concepts;
4. the query is submitted to the Google search engine.

A more detailed description of concept matching and expansion can be found in Appendix A.

2.4.1 WordNet Similarity Measurement

In order to be able to compare two words or sentences, a semantic similarity measurement was needed. The WordNet similarity measurement was used [11].

2.4.2 Cyc Concepts Matching

Cyc [12] is a very large Knowledge Base (KB) which tries to gather a formalized representation of a vast quantity of fundamental human knowledge: facts, rules, etc. The Cyc KB contains thousands of different assertions and can be used in many different ways. For the system described in this paper, a particular subset of Cyc has been used: concepts and different relations.

In Cyc, every concept is linked to additional knowledge, such as:

• a display name (readable name of the concept)
• a comment (short description)
• general concepts, for example: HumanAdult is linked to AdultAnimal and HomoSapiens
• specific concepts, for example: HumanAdult is linked to Professional-Adult, SoftwareEngineer, ...
• aliases, for example: for HumanAdult there are "adult homo sapiens", "grownup", ...

Unfortunately, the whole Cyc KB was unavailable to the public when our research was done, therefore the CycFoundation.org [6] REST APIs have been used in order to interact with Cyc. More about the actual REST API implementation, as well as the explanation of the different algorithms and specific results, can be found in Appendix A.

Since Cyc contains semantic knowledge about concepts, not words, a word-to-concept matching algorithm was created. The algorithm is built in such a way that it operates only at the semantic level, therefore it uses the WordNet similarity measurement [11] (also described in section 2.4.1) to compute a similarity score between two sentences. The algorithm takes a keyword and its corresponding sentence; then, using the CycFoundation APIs, the set of relevant Cyc concepts (aka constants) is retrieved. The retrieval returns only concepts containing the keyword in their name; for example, for the keyword "human", the set of concepts would contain "HumanAdult", "HumanActivity", "HumanBody". Next, for each item in the set of concepts, the similarity score with the keyword's context (sentence) is computed and the best one is used.

The following straightforward pseudo-code illustrates the approach (more about the matching types in A.3):

    Input:  Keyword, its context (sentence) and a matching type
    Output: Cyc concept matched (BestConstant)

    Constants    ⇐ GetConstants(Keyword)
    BestConstant ⇐ new CycConstant()
    BestDistance ⇐ ∞
    foreach Constant in Constants do
        Distance ⇐ ∞
        if MatchingType = DisplayNameMatching then
            dK ⇐ GetDistance(Keyword, Constant.DisplayName)
            dC ⇐ GetDistance(Context, Constant.DisplayName)
            Distance ⇐ (dK + dC) / 2
        end
        if MatchingType = CommentMatching then
            dCK      ⇐ 0
            Comment  ⇐ GetComment(Constant)
            Keywords ⇐ GetKeywords(Comment)
            foreach CK in Keywords do
                dCK ⇐ dCK + GetDistance(Keyword, CK)
            end
            dK ⇐ GetDistance(Keyword, Constant.DisplayName)
            dC ⇐ GetDistance(Context, Constant.DisplayName)
            Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
        end
        if Distance < BestDistance then
            BestDistance ⇐ Distance
            BestConstant ⇐ Constant
        end
    end
2.4.3 Query Expansion

After the keywords and their respective contexts are matched to particular Cyc concepts, the actual query expansion can be done. One can think of many different possible query expansions using the additional Cyc knowledge. In our research we have chosen several expansion methods, most of them quite straightforward. After some experiments, one particular structured query expansion was chosen to be used in the system: Or-Concept.

The algorithm constructs a structured query, which can be described by the following formula, where K is the set of keywords, k_i ∈ K, and c(k) is a function which gets the matched concept for a keyword (using a matching algorithm):

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, c(k_i))    (1)

The following pseudo-code illustrates the approach used in the system to construct the structured query:

    Input:  Keywords tagged with Cyc concepts
    Output: Set of expanded keywords for Google

    foreach TaggedWord in Query do
        Pattern      ⇐ "( {0} OR {1} ) + "
        PatternEmpty ⇐ "{0} + "
        Keyword      ⇐ TaggedWord.Word
        BestGeneral  ⇐ ""
        BestDistance ⇐ ∞
        if TaggedWord.Concept ≠ null then
            GeneralConcepts ⇐ GetGenerals(TaggedWord.Concept)
            foreach General in GeneralConcepts do
                Distance ⇐ GetDistance(TaggedWord.Word, General.DisplayName)
                if Distance < BestDistance then
                    BestGeneral  ⇐ General.DisplayName
                    BestDistance ⇐ Distance
                end
            end
        end
        if BestGeneral.Length = 0 then
            Pattern ⇐ PatternEmpty
        end
        Expanded.Add(String.Format(Pattern, Keyword, BestGeneral))
    end

2.4.4 Query for Google Search

In order to retrieve the real data, we need to choose a search engine as well as a corpus. For several reasons, we quickly arrived at the decision to use the Google web search for our application. First of all, Google is the most popular search engine on the web. This is a great advantage, because an application programming interface (API) is required in order to make the interaction between our program and the search engine possible, and the bigger and more popular a search engine is, the more likely it is that a good API can be found for it. This turned out to be only partially true: Google provides a lot of programming interfaces, but most of them are specialized for a specific context. So it is not easy to find the best-fitting programming interface for an application, in addition to the "usual" problems one may encounter during the implementation.

Another reason to choose the Google web search is the fact that it can easily be restricted to one or a set of web pages. This can be done by simply adding the operator "site:", followed by the desired web page, to the keywords one wants to search for. As a result, the search only returns results from the indicated pages. As we do not want to find arbitrary results, but results from specific, serious pages, we really want to use this feature of the Google web search. If one does a web search (with the Google engine or elsewhere), a lot of results are found on very different pages and types of pages. This is of course important, since that is what most people want when they search the web. But with the information retrieval system, the aim is different: we do not want to display content from any website that contains the keywords, but only from websites with trustworthy content. So a white list of the sites that will be searched is the best way of avoiding useless results. Also, because the results of a web search contain lots of HTML tags and similar things that are (for our purpose) useless, some knowledge about the structure of the found results is needed. By limiting the search to defined web pages, we can use our information about these pages to parse the results, depending on the website they were found on. In a first approach, we limited the search to articles on en.wikipedia.org. Wikipedia is a very well-known and well-accepted page that contains useful information about a lot of domains.
Retrieving data from only one site may seem too limited, but it is enough to set up a working and useful system. In addition, the system can easily be extended by adding other sites. A sketch of the resulting query construction is shown below.
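As an illustration of how the Or-Concept pattern (section 2.4.3) and the site restriction combine into a final query string, consider the following sketch. The term layout mirrors the expanded-query examples in appendix A.5.1; the helper names and the hard-coded white list are assumptions for illustration, not the paper's code.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class QueryBuilder
    {
        // Builds one "( keyword OR concept )" term per keyword (the Or-Concept
        // pattern), falling back to the bare keyword when no concept matched.
        public static string BuildOrConceptQuery(
            IEnumerable<(string Keyword, string Concept)> taggedWords,
            string site) // e.g. "en.wikipedia.org", the white-listed source
        {
            var terms = taggedWords.Select(t =>
                string.IsNullOrEmpty(t.Concept)
                    ? t.Keyword
                    : $"( {t.Keyword} OR {t.Concept} )");

            return string.Join(" + ", terms) + " + site:" + site;
        }

        static void Main()
        {
            var tagged = new[]
            {
                ("species",   "species immunity"),
                ("evolution", "biological evolution"),
                ("change",    "") // no concept matched: keep the bare keyword
            };

            // Prints: ( species OR species immunity ) + ( evolution OR
            // biological evolution ) + change + site:en.wikipedia.org
            Console.WriteLine(BuildOrConceptQuery(tagged, "en.wikipedia.org"));
        }
    }

Keeping the white-listed site a parameter makes it easy to extend the system with further trusted sites, as suggested above.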
2.5 Results Filtering

2.5.1 Retrieving Unstructured Web Pages

The search on the web engine returns a list of URLs as its result. Each of the pages can be downloaded as text, but what is downloaded is an HTML-formatted web page. There is some work left to do to obtain only the real content.

First, we want to get rid of the HTML tags and any other formatting data. This task has to be performed by applications of very different types; for example, every web browser needs functionality that distinguishes between content that will be displayed and content that will not. It is usually best to use existing, tested and working packages, which can be found in web browsers. The Microsoft Internet Explorer performs this task using a dynamic link library (DLL) called "mshtml". This library can easily be used to parse a complex HTML file into plain text that contains only the parts that would be displayed when one visits the web page with the Internet Explorer (or another browser). This does the main part of this step.

Many web pages do not only contain the "real" information, but also a lot of other things, such as navigation links, menus, advertisements, and so on. These are displayed by the browser and thus seen by the user. If the user of the information retrieval system chooses to display one of these pages, he may want to see this content. But because the texts from the results are summarized in the next step, we have to remove these elements from the text: they do not carry any information that is relevant to the page itself. The format of these menus etc. depends on the web page where the result was found; results on Wikipedia are formatted differently from results on other lexicographical web pages. Thus, it is useful to handle the results depending on the web page they were found on. As the web search in this approach has been limited to only one site, namely en.wikipedia.org, we have to consider only one page format in order to remove "waste" like links and menus.

An analysis of the structure revealed an easy way of parsing the Wikipedia articles. All articles on Wikipedia start with the article itself - besides some HTML tags and a lot of scripts. The menus and navigation links follow after it. That means that we can simply keep the first part of the page and skip everything that follows. The boundary between article and links is clearly marked by the HTML tag '<div class="printfooter">', which is contained in every article on Wikipedia. So we can simply search the results from Wikipedia for this tag and remove everything that follows it. It should be mentioned that this cut targets the input of the tag-stripping done by the Microsoft dynamic link library and has to happen before that stripping step, because otherwise the HTML tag used as the boundary could no longer be found. A simplified sketch of this cleanup is shown below.
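A simplified version of this cleanup might look as follows. The printfooter cut matches the boundary described above; the regex-based tag stripping is only a rough stand-in for the mshtml library that the system actually uses.

    using System;
    using System.Text.RegularExpressions;

    class WikipediaCleaner
    {
        // Everything after this marker is menus, navigation and footer material.
        const string Footer = "<div class=\"printfooter\">";

        public static string ExtractArticleText(string html)
        {
            // Step 1: cut at the printfooter boundary while the markup is
            // still intact (this must happen before the tags are stripped).
            int cut = html.IndexOf(Footer, StringComparison.OrdinalIgnoreCase);
            if (cut >= 0)
                html = html.Substring(0, cut);

            // Step 2: remove scripts/styles, then strip the remaining tags.
            // The real system delegates this to the mshtml DLL; a regex is
            // only a rough stand-in for illustration.
            html = Regex.Replace(html, @"<(script|style)[\s\S]*?</\1>", " ",
                                 RegexOptions.IgnoreCase);
            html = Regex.Replace(html, @"<[^>]+>", " ");

            // Collapse the leftover whitespace into readable plain text.
            return Regex.Replace(html, @"\s+", " ").Trim();
        }
    }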
2.5.2 Summarizing the Web Content

A document contains important, but also a lot of irrelevant information. Additionally, the screen space to present the information is intentionally limited to allow a quick evaluation by the user. Because of this, the documents are summarized in order to present only the important information to the user. For this we use the Open Text Summarizer [1]. This tool does not abstract the document in a natural way, because it does not rephrase the text: it just produces a condensed version of the original by keeping only the important sentences. However, it shows good results for non-fictional text and can be used with unformatted and HTML-formatted text. It has received favourable mention in several academic publications, and it is at least as good as commercial tools such as Copernic and Subject Search Summarizer.

The Open Text Summarizer parses the text and applies Porter stemming; an XML file with the parsing and stemming rules is used for this. It calculates the term frequency for each word and stores it in a list. After this, a stop word filter is applied, which removes all redundant common words from the list using a stop word dictionary that is also stored in the XML file. After sorting the list by frequency, the keywords are determined: the most frequently occurring words. The original sentences are then scored based on these keywords; a sentence that holds many important words (the keywords) is given a high grade. The result is a text with only the highest-scored sentences. The limiting factor can be set to either a percentage of the original text or a number of sentences. The sketch below illustrates this frequency-based scoring.
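The following sketch condenses the scoring scheme just described into code: term frequencies are counted, stop words removed, sentences scored by the frequency of the words they contain, and the top fraction kept. It is a toy version for illustration; OTS itself loads its stop word dictionary and stemming rules from an XML file and stems before counting.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class FrequencySummarizer
    {
        // A toy stop word list; OTS loads a full dictionary from its XML rules.
        static readonly HashSet<string> StopWords =
            new HashSet<string> { "the", "a", "of", "and", "to", "in", "is" };

        // Keeps the top fraction of sentences by keyword-frequency score.
        public static string Summarize(string text, double keepRatio)
        {
            string[] sentences = text.Split('.', '!', '?')
                                     .Where(s => s.Trim().Length > 0)
                                     .ToArray();

            // Term frequencies over non-stop words (stemming omitted here).
            var freq = new Dictionary<string, int>();
            foreach (string w in Tokens(text))
            {
                freq.TryGetValue(w, out int n);
                freq[w] = n + 1;
            }

            // Score each sentence by the summed frequency of its words,
            // then keep the highest-scored ones in their original order.
            var kept = sentences
                .Select((s, i) => (Index: i, Text: s,
                                   Score: Tokens(s).Sum(w =>
                                       freq.TryGetValue(w, out int n) ? n : 0)))
                .OrderByDescending(x => x.Score)
                .Take(Math.Max(1, (int)(sentences.Length * keepRatio)))
                .OrderBy(x => x.Index)
                .Select(x => x.Text.Trim());

            return string.Join(". ", kept) + ".";
        }

        static IEnumerable<string> Tokens(string s) =>
            s.ToLower().Split(' ', ',', ';', ':', '.', '!', '?', '\n', '\t')
             .Where(w => w.Length > 0 && !StopWords.Contains(w));
    }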
2.6 Visualization

Visualization is probably one of the most important parts of the system. It consists of two main parts:

1. The "presenter view": the view that the presenter sees on his computer during the presentation. The presenter should be able to see the dataflow, manipulate it, and give simple commands such as "show this image" or "show that article". The presenter also needs a preview of what is shown to the public.

2. The "public view": the simplified view that shows only the needed information and mirrors the lecturer's interactions with the presenter view's pre-rendering zone.

Figure 2: Presenter's view layout structure

For the graphical user interface (GUI), several crucial specifications were defined, most importantly:

• rendering and screen mirroring should be done in real time;
• the layout of the GUI should be as simple as possible, presenting the retrieved data in the simplest and fastest way for the presenter;
• it should also be interactive: the images and articles should be manipulable in real time.

Figure 3: Real-time parallelization workflow

In order to achieve a smooth experience, several things must be done at the same time. As figure 3 shows, the two main components - rendering and information extraction - are completely separated and run in several threads. Moreover, the speech recognition part and the Google search are also separated, and all searches are performed asynchronously. The system therefore leverages current multi-core architectures to achieve a significant speedup (a sketch of the background/UI thread hand-off is given at the end of this section).

In order to present the data in the simplest way, several layout prototypes have been tested. Finally, the one shown in figure 2 and in screenshots 4 and 6 proved to be a good way of presenting the information. The system supports rendering to several screens, and the actual mirroring and the development of the user interfaces have been achieved using the Microsoft Windows Presentation Foundation (WPF) technology.

WPF is especially designed for rich user interface development and is built on top of the .NET Framework. It uses a markup language known as XAML to provide a clear separation between the code and the design definition, which also greatly helped in the parallelization process. It also features 3D and video/audio rendering capabilities and animation storyboards (similar to Flash). Our system can therefore be extended to perform video search or to do 3D-enhanced presentations. The rendering engine takes advantage of modern hardware (dedicated GPUs, ...).

Figure 4: Screenshot of the GUI; the rendering view is split automatically between several images and an article

Figure 5: Screenshot of the GUI, the public view

One of the features of our system is the dynamic layout management of the render view. As shown in figures 4 and 6, the rendering zone automatically adjusts itself depending on the quantity of content presented. With these capabilities, the system makes use of the whole screen without leaving much empty space.

Figure 6: Screenshot of the GUI; the rendering view shows only images, since no articles have been selected by the lecturer
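The following is a minimal sketch of the background/UI hand-off mentioned above, assuming a search callback and a render callback: the retrieval work runs on a worker thread while results are marshalled to the WPF dispatcher thread for rendering. Task.Run stands in for the thread-pool scheduling a 2009 implementation would have used; the class and parameter names are illustrative.

    using System;
    using System.Threading.Tasks;
    using System.Windows.Threading;

    class SearchCoordinator
    {
        readonly Dispatcher uiDispatcher; // WPF UI thread dispatcher

        public SearchCoordinator(Dispatcher uiDispatcher)
        {
            this.uiDispatcher = uiDispatcher;
        }

        // Runs the (slow) retrieval pipeline on a worker thread and marshals
        // each result back to the UI thread, so rendering never blocks.
        public void SearchAsync(string query, Func<string, string[]> search,
                                Action<string> render)
        {
            Task.Run(() =>
            {
                foreach (string result in search(query)) // network-bound work
                {
                    // Only the dispatcher thread may touch WPF controls.
                    uiDispatcher.BeginInvoke((Action)(() => render(result)));
                }
            });
        }
    }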
3 Tests and Results

3.1 Speech Recognition

In the context of this project, it is important that the speech recognition engine has a high recognition quality and works fast. For this reason, these two aspects have been evaluated.

After a two-hour training phase, the engine recognized about 70% of the spoken words correctly. The recognition rate can be improved by further training. To achieve a good recognition rate, it is necessary to speak loudly and articulately and to pronounce the words always in the same way. Additionally, punctuation marks are not inserted automatically but have to be spoken explicitly. This, however, is not a natural way to speak at a presentation.

After speaking some sentences, a break of a few seconds has to be made to initiate the recognition process for these sentences. The recognition process also needs a few seconds, depending on the number of words. The tests showed a recognition rate of about four words per second. Again, these frequent breaks of a few seconds are not a natural way to speak at a presentation.
3.2 Keyword Extraction

The keyword extraction module has been tested with different feature sets and numbers of words. The computing time and the quality of the extracted keywords have been evaluated; the results are shown in figures 7 and 8. The tests have been made with the complete feature set (explained in chapter 2.3) and with the stemmer disabled. Due to the limitations of the speech recognition engine (no punctuation marks unless pronounced explicitly), the keyword extraction has also been evaluated on texts without punctuation marks.

The results show that the quality of the keyword extraction without the stemmer is generally lower than with the complete feature set. Especially when the keywords are used in the query for the internet search, the stemmer shows advantages: it avoids, for example, searching for both the singular and the plural of the same keyword, and the quality of the relevance ranking of the keywords is also improved. The results further show that the gain in computation time from disregarding the stemmer is minimal: with 1200 words, the gain is only 120 ms. For the relatively short sentences usually provided by the speech recognition engine, the gain is even smaller. The tests also showed that for text without punctuation marks, the quality is nearly the same as for the same text with punctuation marks.

Figure 7: time results for an extraction of ten keywords (with different numbers of words)

Figure 8: quality of results for an extraction of ten keywords (with different numbers of words)

3.3 Query Construction and Expansion

As figure 9 shows, the Or-Concept query expansion algorithm yields a significant improvement in precision. In that figure, a 10-word query is compared to its expanded counterpart, and at N = 5 one can see an improvement of almost a factor of 3. Cyc-enhanced queries give better results, but should be evaluated further (more about the different experiments and query enhancements in Appendix A). For reference, the precision-at-N measure used in these comparisons is defined below.
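Since the paper does not define it explicitly, precision at N is the standard information retrieval measure

    P@N = \frac{\left|\{\text{relevant documents among the top } N \text{ results}\}\right|}{N}

so the reported factor-of-3 improvement at N = 5 means roughly three times as many of the first five results were relevant after expansion.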
3.4 Results Filtering

The summarizer module has been tested with and without the stemmer, for different summarizing percentages of the original unformatted text. The quality of the results represents their usability, which has been evaluated from the meaningfulness of the summary and its length; a shorter length was considered more usable because it allows a quicker evaluation by the user. Figure 10 shows that, in these tests, the best-quality results with the stemmer were achieved for a 15% summary. Without the stemmer, the quality was usually a little lower. The calculation time with and without the stemmer for different summarizer percentages is shown in figure 11; the difference between the configurations is minimal. When comparing unformatted text and HTML-formatted text (with a similar amount of content), the tests show that the HTML-formatted text can take significantly longer to process.

Figure 9: Precision at N, 10 word query with no keyword stemming

Figure 10: quality of results for different summary percentages for a text with 1200 content words

Figure 11: Calculation times for different summary percentages for a text with 1200 content words

4 Discussion

As explained in the preceding chapters, the system works in its main parts. It is clearly possible to enrich web search, as well as the presentation of contents, with different tools related to text mining, information retrieval, and others. But many problems also turned up that had either to be solved or to be worked around.
The speech recognition module did work, but still had some problems, which lie in the implementation of the software used. The speech recognition itself works fine, but the chosen engine has problems e.g. with recognizing accents, breaks and the like. These often result in an incorrectly recognized word or sentence. So either another, better engine has to be found, or another solution has to be devised in order to meet the requirements.

The keyword extraction from the spoken text works well and very efficiently. Still, this step can be enhanced. For example, the user may want to influence the chosen keywords more directly than he does by speaking. So it would be a useful feature if the user could decide manually which keywords are good and important and which are not relevant. The keywords provided by the system could then be viewed as suggestions that the user can accept or decline. Another nice feature, related to the first one, is a self-learning algorithm that tries to predict the user's decision.

Another part that works fine but can still be improved is the summarization of the found texts. It produces good summaries of the found web pages without consuming too much computation time. But the result could perhaps be enhanced by using features such as coreference resolution, leading to a better estimation of which referents are important and which are not. Before putting a big effort into this, however, it should be checked whether it is worth the effort.

5 Conclusions

In this paper, the design and implementation of an Interactive Web Retrieval and Visualization System have been discussed. The system has proven robust and satisfies the real-time constraints of a live presentation. Tests and results have shown that both keyword extraction and query expansion return satisfying results while keeping good precision. The experiments have also shown that several parts of such a system still need improvement: speech recognition needs to be enhanced and properly pre-trained, unstructured web data should be properly cleared of all noise, and summarization can be improved to create more concise articles. Overall, such systems can be built with modern hardware and parallelization, and, hopefully, more of them will be seen as commercial products in the near future.
References

[1] Open Text Summarizer. http://libots.sourceforge.net.

[2] Porter stemmer. http://tartarus.org/martin/PorterStemmer/.

[3] SharpNLP - open source natural language processing tools. http://www.codeplex.com/sharpnlp.

[4] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 60. Cambridge University Press, 2006.

[5] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 318. Cambridge University Press, 2006.

[6] Cyc Foundation. http://www.cycfoundation.org/.

[7] W.N. Francis and H. Kucera. Brown corpus manual.

[8] M. Harrington. Giving computers a voice. http://blogs.msdn.com/coding4fun/archive/2006/10/31/909044.aspx.

[9] J. Moskowitz. Speech recognition with Windows XP. http://www.microsoft.com/windowsxp/using/setup/expert/moskowitz_02september23.mspx.

[10] J. Moskowitz. Windows speech recognition. http://www.microsoft.com/windows/windows-vista/features/speech-recognition.aspx.

[11] T. Ngoc Dao and T. Simpson. Measuring similarity between sentences. CodeProject article.

[12] OpenCyc Project. http://www.cyc.com/opencyc.

[13] J. C. Scholtes. Text mining: preprocessing techniques, part 1.

[14] Princeton University. WordNet, a lexical database for the English language. http://wordnet.princeton.edu/.

A Appendix: Query Expansion using Cyc

A.1 WordNet Similarity Measurement

In order to be able to compare two words or sentences, a semantic similarity measurement was needed; for this, the WordNet similarity measurement was used. The following steps are performed to compute the semantic similarity between two sentences [11]:

• each sentence is partitioned into a list of tokens and the stop words are removed;
• the words are stemmed;
• part-of-speech tagging is performed;
• the most appropriate sense for every word in a sentence is found (word sense disambiguation). To find the most appropriate sense of a word, the original Lesk algorithm was used and expanded with the hypernym, hyponym, meronym and troponym relations from WordNet. The possible senses are scored with a new scoring mechanism based on Zipf's law, and the sense with the highest score is chosen;
• the similarity of the sentences is computed based on the similarity of the pairs of words. For this, a semantic similarity relative matrix is created, consisting of pairs of word senses and holding the semantic similarity between the most appropriate senses of the words. The Hungarian method is used to get the semantic similarity between the sentences; its match results are combined into a single similarity value for the two sentences. The matching average is used to compute the semantic similarity between two word senses. This similarity is computed by dividing the sum of the similarity values of all match candidates of both sentences by the total number of tokens.

A.2 CycFoundation.org REST API

As mentioned in section 2.4.2, the system uses REST APIs to access the CycFoundation.org Cyc KB. REST stands for Representational State Transfer, a way to build a service-oriented architecture based on HTTP and XML, generally using the GET or POST methods of the HTTP protocol.

The CycFoundation web services expose only a subset of Cyc's capabilities, therefore the API implemented in our system is rather small. It contains the following queries:

• GetConstants(Keyword) - performs a search for Cyc concepts for a keyword
• GetComment(Concept) - returns the comment for a particular concept
• GetCanonicalPrettyString(Concept) - returns a simplified name for a concept
• GetDenotation(Concept) - returns the denotation for a particular concept
• GetGenerals(Concept) - returns the set of general concepts for a particular concept
• GetSpecifics(Concept) - returns the set of specific concepts for a particular concept
• GetInstances(Concept) - returns the set of instances (concepts) for a particular concept
• GetIsA(Concept) - returns the set of "is a" concepts for a particular concept
• GetAliases(Concept) - returns the set of aliases (words or phrases) for a particular concept
• GetSiblings(Concept) - returns the set of sibling concepts for a particular concept

A sketch of such a wrapper is shown below.
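A thin client for these calls could look like the following sketch. The base URL, query-string layout and XML tag names are placeholders (the real CycFoundation.org endpoints are not reproduced here); only the general GET-then-parse-XML shape of the wrapper is meant to be illustrative.

    using System;
    using System.Net;
    using System.Xml;

    // Minimal REST client sketch. BaseUrl and the query-string layout are
    // hypothetical stand-ins for the actual CycFoundation.org endpoints.
    class CycClient
    {
        const string BaseUrl = "http://example.org/cycfoundation/api"; // hypothetical

        // Issues a GET request and returns the text content of the matching
        // XML elements, e.g. GetValues("getConstants", "keyword", "human").
        public static string[] GetValues(string method, string param, string value)
        {
            string url = $"{BaseUrl}/{method}?{param}={Uri.EscapeDataString(value)}";

            using (var client = new WebClient())
            {
                var doc = new XmlDocument();
                doc.LoadXml(client.DownloadString(url)); // HTTP GET, XML body

                var nodes = doc.GetElementsByTagName("result"); // assumed tag name
                var results = new string[nodes.Count];
                for (int i = 0; i < nodes.Count; i++)
                    results[i] = nodes[i].InnerText;
                return results;
            }
        }
    }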
A.3 Cyc Concepts Matching

The general algorithm described in section 2.4.2 was derived from two algorithms: display name matching and comment matching. These algorithms have been tested separately.

A.3.1 DisplayName Matching Algorithm

The display name used in Cyc is the shortest description of a concept; for example, the display name for the "HumanAdult" concept is simply "a human adult". The matching algorithm therefore computes the distance between the keyword, its context and this description, as follows:

    Input:  Keyword, its context (sentence)
    Output: Cyc concept matched (BestConstant)

    Constants    ⇐ GetConstants(Keyword)
    BestConstant ⇐ new CycConstant()
    BestDistance ⇐ ∞
    foreach Constant in Constants do
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC) / 2
        if Distance < BestDistance then
            BestDistance ⇐ Distance
            BestConstant ⇐ Constant
        end
    end

This algorithm has proven to be quite efficient, since less computation needs to be done, but on the other hand it tends to give rather poor results and would need to be enhanced for proper use.

A.3.2 Comment Matching Algorithm

The comment matching algorithm uses an additional piece of knowledge about a Cyc concept, called the comment. It is a long description which tends to be as specific as possible. For example, again for the "HumanAdult" concept, the comment is:

    "A specialization of Person, and an instance of HumanTypeByLifeStageType. Each instance of this collection is a person old enough to participate as an independent, mature member of society. In most modern Western contexts it is assumed that anyone over 18 is an adult. However, in many cultures, adulthood occurs when one reaches puberty. Adulthood is contiguousAfter (q.v.) childhood. Notable specializations of this collection include AdultMaleHuman, AdultFemaleHuman, MiddleAgedHuman and OldHuman."

The pseudo-code implementation of this algorithm, where the GetComment method is an actual REST API call, is as follows:
    Input:  Keyword, its context (sentence)
    Output: Cyc concept matched (BestConstant)

    Constants    ⇐ GetConstants(Keyword)
    BestConstant ⇐ new CycConstant()
    BestDistance ⇐ ∞
    foreach Constant in Constants do
        dCK      ⇐ 0
        Comment  ⇐ GetComment(Constant)
        Keywords ⇐ GetKeywords(Comment)
        foreach CK in Keywords do
            dCK ⇐ dCK + GetDistance(Keyword, CK)
        end
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
        if Distance < BestDistance then
            BestDistance ⇐ Distance
            BestConstant ⇐ Constant
        end
    end

A.3.3 Experiments with both algorithms

After running several experiments (Fig. 13) comparing the comment matching and display name matching algorithms, we found that comment matching gives significantly better results, but it can still be enhanced to get even better matching. This may be done by using general/specific concepts in the distance calculation formula, or by learning weights.

A.4 Query Expansion Patterns

In our research, several different query expansion algorithms have been implemented, all with a quite straightforward implementation, as explained in section 2.4.3. The algorithms, which can be described by the following formulas, where K is the set of keywords and k_i ∈ K, have been implemented and evaluated.

Or-Concept expansion, where c(k) is the concept linked with the keyword:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, c(k_i))    (2)

Or-Aliases expansion, where a_j(k) are the aliases linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, \bigvee_{j=1}^{m} a_j(k_i))    (3)

Or-Most-Relevant-Alias expansion, where a(k) is the most relevant alias linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, a(k_i))    (4)

Or-Most-Relevant-General expansion, where g(k) is the most relevant general concept linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, g(k_i))    (5)

Or-Most-Relevant-Specific expansion, where s(k) is the most relevant specific concept linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, s(k_i))    (6)

Or-Is-A-Concept expansion, where isa(k) is the most relevant "is a" concept linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, isa(k_i))    (7)
Or-Most-Relevant-AGS expansion, combining the most relevant alias a(k), the most relevant general concept g(k) and the most relevant specific concept s(k) linked with the keyword's concept:

    Q(K, n) = \bigwedge_{i=1}^{n} \max(k_i, a(k_i), g(k_i), s(k_i))    (8)

A.5 Results and Discussion

A.5.1 Google Search on Wikipedia KB

Our system has been tested and is able to generate queries for the Google search engine; however, it is very difficult to evaluate such results, therefore this section of the paper only gives an example of the results. Using as input an introductory text from the Stanford Encyclopedia on the topic of 'evolution', the system derived several keywords:

• "species", matched to "speciesImmunity"
• "theory", matched to "theoryOfBeliefSystem"
• "evolution", matched to "Evolution"
• "change", matched to "changesSlot"
• "term", matched to "termExternalIDString"

Next, several queries have been constructed and restricted to the Wikipedia KB:

• Google non-expanded query: "species theory evolution change term site:en.wikipedia.org"

• Google Or-Concepts query: "( species OR species immunity ) + ( theory OR Theory Of Belief System ) + ( evolution OR biological evolution ) + ( change OR Changes Slot ) + ( term OR Term External ID String ) + site:en.wikipedia.org"

• Google Or-Aliases query: "species + theory + ( evolution OR (biologically will have evolved OR biologically had evolved OR biologically will evolve OR biologically has evolved OR biologically have evolved OR biologically evolving OR biologically evolves OR biologically evolved OR biologically evolve OR most evolutionary OR more evolutionary OR evolutionary OR evolution) ) + change + term + site:en.wikipedia.org"

• Google Or-Most-Relevant-General query: "species + theory + ( evolution OR "development" ) + change + term + site:en.wikipedia.org"

• ...

Figure 12: Precision at N, results of Or-Concept expanded query with and without stemming compared

After analyzing the results, we compared several of the result sets returned by Google. For the normal, non-expanded query the results are quite satisfying:

* Evolution
* Punctuated equilibrium
* Evolution as theory and fact
* Macroevolution
* History of evolutionary thought
* On the Origin of Species
* Species
* Hopeful Monster

However, the source article mentioned Charles Darwin, and his name was not derived among the keywords, so no direct Wikipedia link to his theory of natural selection was found in the non-expanded results. After a deeper analysis, it was actually found with the expanded query (Or-Most-Relevant-General):

* Evolution
* Punctuated equilibrium
* Macroevolution
* Evolution as theory and fact
* Charles Darwin
* On the Origin of Species
* Hopeful Monster
* Natural selection

Figure 13: Precision at N, results of Or-Concept expanded query using different matching

A.5.2 Lemur Search in AP Corpus

In order to evaluate the query expansion methods, a structured query generation for the Lemur search engine has been implemented as well. The queries have been automatically derived from the narratives of topics 101 to 115 of the AP Corpus, and a batch evaluation of those queries has been performed. Several different sets of results have been produced by the IREval function of Lemur, among them:

• 5-10 word queries with no keyword stemming;
• 5-10 word queries with keyword stemming;
• 5-10 word queries using name matching.

Most of the experiments have been performed with the comment matching algorithm and the whole set of query expansion patterns. Some experiments have been performed in order to compare the display name and comment matching algorithms.

The keyword stemming part was especially tested in order to suppress plurals, since the keywords derived from the AP Corpus topics usually kept them. So, for example, "changes" was stemmed to "change" and therefore gave a larger set of possible concepts to match. During our experiments (figure 12) we found that using keyword stemming in those algorithms significantly decreases the precision for larger queries. One possible explanation is that the matching algorithms make more errors, since the set of possible concepts is larger.

Figure 14: Precision at N, expansions of a 5 word query compared (part 1)
Figure 15: Precision at N, expansions of a 5 word query compared (part 2)

In figures 14 and 15 one can see that only the Or-Concept algorithm gives better precision at N = 5 and above. The Or-Most-Relevant-Alias and Or-Most-Relevant-General patterns give a precision improvement only at N > 20.

Figure 16: Precision at N, expansions of a 10 word query compared (part 1)

Figure 17: Precision at N, expansions of a 10 word query compared (part 2)

Finally, it is in figures 16 and 17 that one can really see the improvement made by stemming and especially by the Or-Concept query expansion. While the 5-word query expansion (at N = 5) with this particular algorithm gives an improvement of almost 50 percent, with the 10-word query the improvement is almost of the order of 200 percent (a significant increase in precision).

The results speak for themselves; we think that Cyc-enabled query expansion is definitely the way to go, but better patterns still need to be built. In our experiments, out of 7 different patterns only one actually produced a good precision increase: the Or-Concept pattern.