This document summarizes a study that experimentally evaluated the use of evolutionary algorithms for adaptive information filtering. The researchers tested a basic genetic algorithm that uses a vector space model to represent user profiles and documents, and found that it performed worse than a baseline approach and struggled with the high-dimensional representation. Initializing profiles from relevant documents and adding learning improved performance, but the results still fell short of the baseline. The researchers concluded that the genetic algorithm was not well suited to the complex, dynamic problem of adaptive information filtering within a high-dimensional vector space.
Revisiting evolutionary information filtering
1. Revisiting Evolutionary Information Filtering
Nikolaos Nanas, Centre for Research and Technology Thessaly, GREECE
Stefanos Kodovas, University of Thessaly, GREECE
Manolis Vavalis, University of Thessaly, GREECE
2. Outline
Adaptive Information Filtering – brief introduction
Evolutionary Information Filtering – review
Diversity & dimensionality – theoretical issues
Experimental evaluation
• Methodology – a test-bed
• Results – not a success
• Discussion – interesting observations
Conclusions and future work
5. Adaptive Information Filtering (AIF)
challenging problem with no established solution
complex and dynamic
• multiple and changing user interests
• changing information environment
crucial issues for successful AIF
• profile representation
• profile adaptation
7. Evolutionary Information Filtering
• “A Review of Evolutionary and Immune-Inspired Information Filtering”, Natural Computing, 2009
• A common vector space with as many dimensions as the number of unique keywords
• A population of profiles that collectively represent the user’s interests
• Both profiles and documents are represented as (weighted) vectors in this space
• Trigonometric measures of similarity for comparing profile vectors to document vectors
• Fitness function based on (explicit or implicit) user feedback
• reward profiles that assigned a high relevance score to relevant documents and vice versa
• fitness is updated proportional to user feedback
• average score of relevant documents
• ratio of successful evaluations
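As an illustration of the vector-space comparison described on this slide, here is a minimal sketch of cosine similarity between a profile and a document, both stored as sparse keyword-to-weight dictionaries. The function names, the example keywords and the additive reward/penalty rule in `update_fitness` are assumptions made for illustration; the reviewed systems use several different fitness schemes.

```python
import math

def cosine_similarity(profile, document):
    """Cosine of the angle between two sparse keyword vectors (dict: keyword -> weight)."""
    # Iterate over the smaller vector when computing the dot product.
    if len(profile) > len(document):
        profile, document = document, profile
    dot = sum(w * document.get(k, 0.0) for k, w in profile.items())
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_p == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_p * norm_d)

def update_fitness(fitness, score, relevant):
    """Assumed rule: reward the score on relevant documents, penalise it otherwise."""
    return fitness + score if relevant else fitness - score

# Hypothetical profile and document in a three-keyword space.
profile = {"oil": 0.8, "price": 0.6}
doc = {"oil": 0.4, "price": 0.3, "opec": 0.5}
score = cosine_similarity(profile, doc)
```

Because the cosine depends only on direction, a long and a short document with the same keyword proportions receive the same score.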
8. Evolutionary Information Filtering
profile initialisation is not random
selection
• fixed percentage of best individuals
• variable percentage
• roulette wheel
crossover
• single-point, two-point, three-point
• variable percentage
• roulette wheel
mutation
• keyword replacement
• random weight modification
steady-state replacement
• offspring typically replace less fit individuals
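The operators listed on this slide can be sketched as follows. This is a minimal illustration, not the implementation of any reviewed system: profiles are assumed to be plain lists of keyword weights, and "random weight modification" is assumed to mean replacing a weight with a new uniform random value.

```python
import random

def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its non-negative fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)  # no fitness signal: fall back to uniform choice
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]  # guards against floating-point rounding at the upper end

def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length weight vectors at a random cut point."""
    point = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate_weights(profile, rate, rng):
    """Replace each weight with a new random value in [0, 1) with probability `rate`."""
    return [rng.random() if rng.random() < rate else w for w in profile]
```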
9. Diversity Issues
AIF is not a classic optimisation problem
• online learning problem
• reminiscent of Multimodal Dynamic Optimisation (MDO)
Traditional GAs suffer in the case of MDO due to diversity loss.
Four types of remedies:
1. adjust mutation rate when changes are observed
2. spread the population
3. memory of previous generations
4. multiple subpopulations
• in “Multimodal Dynamic Optimisation: from Evolutionary Algorithms to Artificial Immune Systems”, 2007
• intrinsic diversity problems due to:
• selection based on relative fitness
• no developmental process
• fixed population size
10. Dimensionality Issues
• A vector space with a large number of dimensions (keywords) is required for successful AIF
• In a multi-dimensional space:
• the volume increases exponentially with the number of dimensions
• distance based measures become meaningless as points become equidistant
• the discriminatory power of pair-wise distances is significantly affected
• scalar metrics can not differentiate between vectors with distributed and concentrated differences
• in a multi-dimensional keyword space the ability of GAs to achieve profile adaptation is affected because:
• the number of possible weighted keyword combinations increases exponentially with the number of dimensions
• crossover and mutation cannot randomly produce the right combination of weighted keywords
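The claim that points become nearly equidistant in high dimensions can be checked with a small experiment. The sketch below assumes uniformly random points in the unit hypercube, which is a simplification (real keyword vectors are sparse), and measures the relative contrast between the farthest and nearest neighbour of a query point; the function name and parameters are illustrative.

```python
import math
import random

def relative_contrast(dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

low = relative_contrast(2)      # few dimensions: neighbours are clearly separated
high = relative_contrast(1000)  # many dimensions: distances concentrate
```

Under these assumptions the contrast shrinks sharply as the number of dimensions grows, which is the effect that weakens the discriminatory power of pair-wise distances.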
11. Experimental Evaluation: Dataset
Reuters-21578
• 21578 news stories that appeared in Reuters newswire in 1987
• documents are ordered according to publication date
• 135 topic categories
• experiments concentrate on the 23 topics with at least 100 relevant documents
document pre-processing
• stop word removal
• stemming with Porter’s algorithm
• weighting with Term Frequency Inverse Document Frequency (TFIDF)
words with large average TFIDF are selected to build the keyword space
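The weighting and keyword-selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not specify which TF-IDF variant was used, so the sketch takes relative term frequency times the natural log of inverse document frequency, assumes the documents are non-empty token lists that have already been stop-word filtered and stemmed, and uses hypothetical function names and toy documents.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (assumed TF-IDF variant)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def top_keywords(vectors, k):
    """Rank terms by TF-IDF averaged over all documents and keep the k highest."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    averages = {term: total / len(vectors) for term, total in totals.items()}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(averages, key=lambda t: (-averages[t], t))[:k]

# Toy corpus standing in for pre-processed news stories.
docs = [["oil", "price", "rise"], ["oil", "opec", "meet"], ["wheat", "price", "fall"]]
vectors = tfidf_vectors(docs)
keywords = top_keywords(vectors, 5)
```

The parameter `k` plays the role of the number of extracted keywords, which determines the dimensionality of the keyword space.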
topic code      size
earn            3987
acq             2448
money-fx        801
crude           634
grain           628
trade           552
interest        513
wheat           306
ship            305
corn            254
dlr             217
oilseed         192
money-supply    190
sugar           184
gnp             163
coffee          145
veg-oil         137
gold            135
nat-gas         130
soybean         120
bop             116
livestock       114
cpi             112
13. Baseline Results
• as the number of extracted words increases the AUP values increase
• for a small number of extracted keywords the results are biased towards topics with a large number of relevant documents
• the best results are achieved when all extracted keywords are used
• if we wish to represent a range of topics then a multidimensional space is required
14. Experimental Evaluation: Evolutionary Experiments
a vector space comprising 31298 keywords
The basic Genetic Algorithm:
• with a population of 100 profiles
• each profile is a weighted keyword vector (randomly initialised)
• the same random initial population is used in all experiments
• documents are evaluated in order using the inner product
• new fitness = old fitness + relevance score
• the 25% fittest profiles are selected for reproduction
• single-point crossover
• mutation through random weight modification
• the offspring replace the 25% worst profiles
two further variations of the basic GA
• GA_init: initialisation using the first 100 relevant documents per topic.
• GA_init + learning: a memetic algorithm (MA) that uses Rocchio’s learning algorithm
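One generation of the basic GA described above can be sketched as follows. This is a hedged illustration, not the authors' code: profiles are assumed to be dense lists of weights, `documents` is assumed to hold the vectors of the documents being scored, the mutation rate is a made-up parameter, and giving each child the mean fitness of its parents is an assumption because the slides do not say how offspring fitness is initialised.

```python
import random

def inner_product(profile, document):
    return sum(p * d for p, d in zip(profile, document))

def run_generation(population, fitness, documents, rng,
                   fraction=0.25, mutation_rate=0.01):
    """Evaluate, select the fittest 25%, recombine, mutate, replace the worst 25%."""
    # Evaluation: new fitness = old fitness + relevance score.
    for i, profile in enumerate(population):
        for document in documents:
            fitness[i] += inner_product(profile, document)
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    n_select = max(2, int(len(population) * fraction))
    parents, worst = ranked[:n_select], ranked[-n_select:]
    # Build all offspring before replacing anything, so parents stay intact.
    offspring = []
    for _ in worst:
        a, b = rng.sample(parents, 2)
        point = rng.randint(1, len(population[a]) - 1)  # single-point crossover
        child = population[a][:point] + population[b][point:]
        child = [rng.random() if rng.random() < mutation_rate else w for w in child]
        offspring.append((child, (fitness[a] + fitness[b]) / 2))  # assumed fitness
    for slot, (child, child_fitness) in zip(worst, offspring):
        population[slot] = child
        fitness[slot] = child_fitness
    return population, fitness
```

Nothing in this loop adjusts an individual weight in response to feedback; weights only change through recombination and random mutation.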
15. Comparative Results: accuracy
• y-axis: best AUP achieved in 50 generations (bias)
• baseline results are included
• additional results for ranking by date
Findings:
• the GA performs worse than the baseline
• marginal improvements for non-random initialisation
• significant improvement when learning is introduced
• the MA is only better for some topics with small size
16. Comparative Results: learning
• y-axis: average AUP over all topics after each generation
• x-axis: number of generations
• embedded figure focuses on GA and GA_init
Findings:
• GA does not essentially improve
• better initial performance and learning rate for non-random initialisation (GA_init)
• much steeper learning curve when learning is introduced (GA_init + learning).
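The learning step in GA_init + learning is identified on the slides only as Rocchio's algorithm. For readers unfamiliar with it, here is a minimal sketch of the classic Rocchio update; the coefficient values, the clipping of negative weights and the function name are assumptions, not details taken from the study.

```python
def rocchio_update(profile, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move a profile towards the centroid of relevant documents and away
    from the centroid of non-relevant ones (coefficients are assumed values)."""
    dims = len(profile)

    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(doc[i] for doc in docs) / len(docs) for i in range(dims)]

    pos, neg = centroid(relevant), centroid(non_relevant)
    updated = [alpha * p + beta * r - gamma * n
               for p, r, n in zip(profile, pos, neg)]
    # Clipping negative weights to zero is a common convention, assumed here.
    return [max(0.0, w) for w in updated]
```

Unlike crossover and mutation, this update changes each keyword weight in a direction determined by user feedback.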
17. Conclusions
The basic GA fails to learn the topic of interest.
• the right combination of keyword weights can not be randomly produced.
• the GA is lacking a mechanism for appropriately updating keyword weights.
• performance depends on the weighted keywords that initialisation produced.
When the GA is initialised based on relevant documents
• then the initial set of weighted keywords produces better filtering results
The introduction of learning allows for further improvements in the initial keyword weights.
• still worse than the baseline experiment despite the 50 generations
• this is possibly due to the negative effect of the genetic operations
18. Discussion
Our experimental results do not agree with the promising results reported in the literature
• we did not re-implement an existing approach, but adopted existing techniques
• AIF is a complex problem that cannot be easily tackled with weighted keywords in a multi-dimensional space
• comparative experiments between GAs and other machine learning algorithms have been missing from AIF
large differences observed between the GA and the baseline algorithm
• despite the biased comparison in favour of the GA
• more fundamental alternatives which are not based on vector representations
• the choice of representation should facilitate the learning task
• external remedies like those adopted for MDO are not practical
we wish to reanimate the interest of the research community on AIF
• biologically inspired solutions are well suited to the problem
• appropriate experimental methodologies that reflect the complexity and dynamics of AIF are required
Editor's notes
The Web is nowadays a network of transmitters and receivers of information. Various Web channels, including email, newsgroups and forums, social networks and technologies as simple as Really Simple Syndication (RSS), contribute to the large amount of information [...] 80s, information overload is still a central issue.
Why do I like this kind of picture, music or news? Research interest has declined since 2000, for good reasons. Nowadays there is high demand, in particular from WWW businesses.