Revisiting evolutionary information filtering

Nikolaos Nanas, Centre for Research and Technology Thessaly, GREECE
Stefanos Kodovas, University of Thessaly, GREECE
Manolis Vavalis, University of Thessaly, GREECE

Outline
 Adaptive Information Filtering – brief introduction
 Evolutionary Information Filtering – review
 Diversity & dimensionality – theoretical issues
 Experimental evaluation
• Methodology – a test-bed
• Results – not a success
• Discussion – interesting observations
 Conclusions and future work

Information Overload is still around

Adaptive Information Filtering in the case of textual information

Adaptive Information Filtering (AIF)
 challenging problem with no established solution
 complex and dynamic
• multiple and changing user interests
• changing information environment
 crucial issues for successful AIF
• profile representation
• profile adaptation

Evolutionary Information Filtering with the Vector Space Model
Profile adaptation through evolution of user’s profiles.

Evolutionary Information Filtering
• “A Review of Evolutionary and Immune-Inspired Information Filtering”, Natural Computing, 2009
• A common vector space with as many dimensions as the number of unique keywords
• A population of profiles that collectively represent the user’s interests
• Both profiles and documents are represented as (weighted) vectors in this space
• Trigonometric measures of similarity for comparing profile vectors to document vectors
• Fitness function based on (explicit or implicit) user feedback
• reward profiles that assigned a high relevance score to relevant documents and vice versa
• fitness is updated proportional to user feedback
• average score of relevant documents
• ratio of successful evaluations
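The scoring and feedback steps above can be sketched as follows. This is a minimal illustration, not the method of any specific system from the review: cosine similarity stands in for the "trigonometric measures", and `update_fitness` is a hypothetical rule that adds the score for a relevant document and subtracts it for a non-relevant one. The keyword weights are made up.

```python
import math

def cosine(profile, document):
    """Cosine of the angle between two sparse keyword vectors (dicts)."""
    dot = sum(w * document.get(k, 0.0) for k, w in profile.items())
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_p == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_p * norm_d)

def update_fitness(fitness, score, relevant):
    """Hypothetical feedback rule: reward a high score on a relevant
    document, penalise a high score on a non-relevant one."""
    return fitness + score if relevant else fitness - score

# Illustrative (made-up) weighted keyword vectors.
profile = {"oil": 0.8, "crude": 0.6}
doc_relevant = {"oil": 0.5, "crude": 0.5, "barrel": 0.2}
doc_other = {"coffee": 0.9, "export": 0.4}

s1 = cosine(profile, doc_relevant)   # shares keywords with the profile
s2 = cosine(profile, doc_other)      # no keyword in common
fitness = update_fitness(0.0, s1, relevant=True)
fitness = update_fitness(fitness, s2, relevant=False)
```

Sparse dictionaries are used here because most keyword weights in a large vocabulary are zero; dense vectors would work equally well.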

Evolutionary Information Filtering
 profile initialisation is not random
 selection
• fixed percentage of best individuals
• variable percentage
• roulette wheel
 crossover
• single-point, two-point, three-point
• variable percentage
• roulette wheel
 mutation
• keyword replacement
• random weight modification
 steady-state replacement
• offspring typically replace less fit individuals
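Two of the operators listed above, roulette-wheel selection and keyword-replacement mutation, can be sketched as below. The function names, the toy population and the choice to keep the old weight when a keyword is replaced are assumptions made for illustration only.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its
    (non-negative) fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off

def replace_keyword(profile, vocabulary, rng):
    """Keyword-replacement mutation: swap one keyword of the profile for
    an unused one. Keeping the old weight is an assumption."""
    unused = [k for k in vocabulary if k not in profile]
    if not profile or not unused:
        return dict(profile)
    old = rng.choice(sorted(profile))
    new = rng.choice(unused)
    mutated = dict(profile)
    mutated[new] = mutated.pop(old)
    return mutated

rng = random.Random(0)
population = [{"oil": 0.8}, {"gold": 0.5}, {"wheat": 0.3}]
fitnesses = [3.0, 1.0, 0.0]
parent = roulette_select(population, fitnesses, rng)
child = replace_keyword(parent, ["oil", "gold", "wheat", "corn"], rng)
```

With these fitness values the first profile is drawn about three times as often as the second, and the zero-fitness profile is never drawn.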

Diversity Issues
 AIF is not a classic optimisation problem
• online learning problem
• reminiscent of Multimodal Dynamic Optimisation (MDO)
 Traditional GAs suffer in the case of MDO due to diversity loss.
 Four types of remedies:
1. adjust mutation rate when changes are observed
2. spread the population
3. memory of previous generations
4. multiple subpopulations
• in “Multimodal Dynamic Optimisation: from Evolutionary Algorithms to Artificial Immune Systems”, 2007
• intrinsic diversity problems due to:
• selection based on relative fitness
• no developmental process
• fixed population size

Dimensionality Issues
• A vector space with a large number of dimensions (keywords) is required for successful AIF
• In a multi-dimensional space:
• the volume increases exponentially with the number of dimensions
• distance based measures become meaningless as points become equidistant
• the discriminatory power of pair-wise distances is significantly affected
• scalar metrics cannot differentiate between vectors with distributed and concentrated differences
• in a multi-dimensional keyword space the ability of GAs to achieve profile adaptation is affected because:
• the number of possible weighted keyword combinations increases exponentially with the number of dimensions
• crossover and mutation cannot randomly produce the right combination of weighted keywords

Experimental Evaluation: Dataset
 Reuters-21578
• 21578 news stories that appeared in Reuters newswire in 1987
• documents are ordered according to publication date
• 135 topic categories
• experiments concentrate on the 23 topics with at least 100 relevant documents
 document pre-processing
• stop word removal
• stemming with Porter’s algorithm
• weighting with Term Frequency Inverse Document Frequency (TFIDF)
 words with large average TFIDF are selected to build the keyword space
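The weighting and keyword selection steps can be sketched as follows. The slide does not state which TFIDF variant was used, so the formula here (raw term count times log(N / document frequency)) is an assumption, as are the function names and the toy documents. Stop word removal and Porter stemming are assumed to have been applied already.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(documents):
    """documents: list of token lists (already filtered and stemmed).
    Returns one sparse {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({t: c * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return vectors

def top_keywords(vectors, how_many):
    """Rank terms by average TFIDF over all documents, keep the best."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    average = {t: total / len(vectors) for t, total in totals.items()}
    ranked = sorted(average, key=lambda t: (-average[t], t))
    return ranked[:how_many]

docs = [["oil", "price", "rise"],
        ["oil", "crude", "oil"],
        ["coffee", "price", "export"]]
vectors = tfidf_vectors(docs)
keywords = top_keywords(vectors, 5)
```

In this toy collection "price" is the term that drops out of the top five: it occurs in two of the three documents, so its inverse document frequency is low, and unlike "oil" it is never repeated within a document.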
topic code      size
earn            3987
acq             2448
money-fx         801
crude            634
grain            628
trade            552
interest         513
wheat            306
ship             305
corn             254
dlr              217
oilseed          192
money-supply     190
sugar            184
gnp              163
coffee           145
veg-oil          137
gold             135
nat-gas          130
soybean          120
bop              116
livestock        114
cpi              112

Experimental Evaluation: Baseline Experiment

Baseline Results
• as the number of extracted words increases, the AUP (Average Uninterpolated Precision) values increase
• for a small number of extracted keywords the results are biased towards topics with a large number of relevant documents
• the best results are achieved when all extracted keywords are used
• if we wish to represent a range of topics then a multidimensional space is required
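For readers unfamiliar with the metric, the sketch below computes AUP for a ranked list, assuming AUP here denotes the usual average uninterpolated precision: the mean of the precision values measured at the rank of each relevant document. It also assumes every relevant document appears somewhere in the ranking. The function name is introduced for illustration.

```python
def average_uninterpolated_precision(ranked_relevance):
    """ranked_relevance: booleans in ranked order, best-scored document
    first. Returns the mean precision at each relevant document's rank."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# All relevant documents ranked first: precision is 1 at every hit.
perfect = average_uninterpolated_precision([True, True, False, False])
# Relevant documents at ranks 1 and 3: precisions 1/1 and 2/3.
mixed = average_uninterpolated_precision([True, False, True, False])
```

Here `perfect` is 1.0 and `mixed` is (1 + 2/3) / 2, roughly 0.83, which shows how the metric penalises a relevant document that is pushed down the ranking.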

Experimental Evaluation: Evolutionary Experiments
 a vector space comprising 31298 keywords
 The basic Genetic Algorithm:
• with a population of 100 profiles
• each profile is a weighted keyword vector (randomly initialised)
• the same random initial population is used in all experiments
• documents are evaluated in order using the inner product
• new fitness = old fitness + relevance score
• the 25% fittest profiles are selected for reproduction
• single-point crossover
• mutation through random weight modification
• the offspring replace the 25% worst profiles
 two further variations of the basic GA
• GA_init: initialisation using the first 100 relevant documents per topic.
• GA_init + learning: a Memetic Algorithm (MA) that uses Rocchio’s learning algorithm
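One generation of the basic GA described above can be sketched as follows. This is a reading of the slide, not the authors' code: the mutation rate, the dense list representation, the use of relevant documents only for scoring, and the rule that an offspring inherits the mean fitness of its parents are all assumptions introduced here.

```python
import random

def inner_product(profile, document):
    return sum(w * d for w, d in zip(profile, document))

def single_point_crossover(a, b, rng):
    point = rng.randrange(1, len(a))  # cut somewhere inside the vector
    return a[:point] + b[point:]

def mutate(profile, rng, rate=0.05):
    """Random weight modification; the rate is an assumed value."""
    return [rng.random() if rng.random() < rate else w for w in profile]

def run_generation(population, fitness, documents, rng, fraction=0.25):
    # new fitness = old fitness + relevance score (inner product)
    for doc in documents:
        for i, profile in enumerate(population):
            fitness[i] += inner_product(profile, doc)
    order = sorted(range(len(population)),
                   key=lambda i: fitness[i], reverse=True)
    n = max(1, int(len(population) * fraction))
    parents, worst = order[:n], order[-n:]
    # Offspring of the fittest 25% replace the worst 25% (steady state).
    for slot in worst:
        a, b = rng.sample(parents, 2) if n >= 2 else (parents[0], parents[0])
        child = single_point_crossover(population[a], population[b], rng)
        population[slot] = mutate(child, rng)
        # Assumption: a child starts with the mean fitness of its parents.
        fitness[slot] = (fitness[a] + fitness[b]) / 2

rng = random.Random(1)
dims = 6
population = [[rng.random() for _ in range(dims)] for _ in range(8)]
fitness = [0.0] * len(population)
relevant_docs = [[1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]]
run_generation(population, fitness, relevant_docs, rng)
```

Nothing in this loop moves a keyword weight towards the documents the user liked. Weights only change by recombination or random modification, which is the missing mechanism that the learning variant adds.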

Comparative Results: accuracy
• y-axis: best AUP achieved in 50 generations (bias)
• baseline results are included
• additional results for ranking by date
Findings:
• the GA performs worse than the baseline
• marginal improvements for non-random initialisation
• significant improvement when learning is introduced
• the MA is only better for some topics with small size

Comparative Results: learning
• y-axis: average AUP over all topics after each generation
• x-axis: number of generations
• embedded figure focuses on GA and GA_init
Findings:
• GA does not essentially improve
• better initial performance and learning rate for non-random initialisation (GA_init)
• much steeper learning curve when learning is introduced (GA_init + learning).

Conclusions
 The basic GA fails to learn the topic of interest.
• the right combination of keyword weights cannot be randomly produced.
• the GA is lacking a mechanism for appropriately updating keyword weights.
• performance depends on the weighted keywords that initialisation produced.
 When the GA is initialised based on relevant documents
• then the initial set of weighted keywords produces better filtering results
 The introduction of learning allows for further improvements in the initial keyword weights.
• still worse than the baseline experiment despite the 50 generations
• this is possibly due to the negative effect of the genetic operations

Discussion
 Our experimental results do not agree with the promising results reported in the literature
• we did not re-implement an existing approach, but adopted existing techniques
• AIF is a complex problem that cannot be easily tackled with weighted keywords in a multi-dimensional space
• comparative experiments between GAs and other machine learning algorithms have been missing from AIF
 large differences observed between the GA and the baseline algorithm
• despite the biased comparison in favour of the GA
• more fundamental alternatives which are not based on vector representations
• the choice of representation should facilitate the learning task
• external remedies like those adopted for MDO are not practical
 we wish to revive the research community's interest in AIF
• biologically inspired solutions are well suited to the problem
• appropriate experimental methodologies that reflect the complexity and dynamics of AIF are required
