3. Road Map:
• What is IR?
• Why & how does it work?
• Evaluation Techniques
• Global & Local Methods
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Relevance Feedback
4. Rocchio Algorithm
5. Linear Classifiers
6. Naïve Bayes Text Classification
• Questions & Discussion
4. What is IR? Why & How?
• Information Retrieval (IR): finding the information needed to satisfy a user's request.
• Why?
Because data comes in many different formats.
• How?
Stop list
Stemming
Inverse document frequency
Word counts
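The four steps listed under "How?" can be sketched in a few lines of Python. This is only an illustrative sketch: the stop list, the toy documents, and the helper `crude_stem` (a rough stand-in for a real stemmer such as Porter's) are assumptions, not part of the original slides.

```python
import math
import re
from collections import Counter

# Hypothetical stop list, chosen only for illustration.
STOP_WORDS = {"the", "a", "is", "of", "and", "in", "to"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def word_counts(docs):
    """One term-frequency Counter per document."""
    return [Counter(tokenize(d)) for d in docs]

def inverse_document_frequency(counts):
    """idf(t) = log(N / df(t)), where df(t) = number of documents containing t."""
    n_docs = len(counts)
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return {t: math.log(n_docs / d) for t, d in df.items()}

# Toy corpus (assumed data).
docs = [
    "The cat is chasing the mice",
    "Cats and dogs",
    "Searching the index of documents",
]
counts = word_counts(docs)
idf = inverse_document_frequency(counts)
```

Terms that occur in many documents (here "cat") receive a lower IDF than terms that occur in only one (here "dog"), which is why IDF is used to down-weight common words.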
5. What is IR? Why & How?
IR is generally used in three scenarios:
1. Web search
2. Personal IR (text classification)
3. Enterprise level
6. Evaluation Techniques
• Why?
• How?
Relevant & non-relevant documents
Precision and recall
P = #(relevant items retrieved) / #(retrieved items)
R = #(relevant items retrieved) / #(relevant items)
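As a worked example of the two ratios, the following sketch computes precision and recall for a single query. The document IDs and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Assumed toy data: 4 documents retrieved, 3 documents actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, so P = 2/4, and two of the three relevant documents were found, so R = 2/3.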
8. Local Methods
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Feedback
1. Relevance Feedback
Feedback given by the user about the relevance of the
documents in the initial set of results.
2. Probabilistic Relevance Feedback
PRF is implemented by building a classifier.
3. Indirect Relevance Feedback
Feedback is gathered without explicit user intervention:
1. From user actions.
2. From user histories or logs.
10. Rocchio Algorithm
Incorporates the relevance feedback
mechanism into the vector space model.
Also uses:
Cosine similarity function
Euclidean distance
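A minimal sketch of the Rocchio query update in the vector space model, together with the two similarity measures named above. The update used here is the commonly cited form q_new = alpha * q + beta * centroid(relevant) - gamma * centroid(non-relevant); the weights 1.0, 0.75, and 0.15 and the toy vectors are assumptions, not values from the slides.

```python
import math

def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    dim = len(query)
    rel_c = centroid(relevant, dim)
    non_c = centroid(non_relevant, dim)
    updated = [alpha * query[i] + beta * rel_c[i] - gamma * non_c[i] for i in range(dim)]
    # Negative term weights are usually clipped to zero.
    return [max(0.0, w) for w in updated]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Assumed toy term space: [jaguar, car, cat]
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 2.0, 0.0]]   # documents the user marked relevant
non_relevant = [[1.0, 0.0, 2.0]]                # documents the user marked non-relevant
new_query = rocchio(query, relevant, non_relevant)
```

After the update, the query is closer (by cosine similarity) to the documents marked relevant and further from the one marked non-relevant.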
12. Outcome
• Relevance feedback plays an important
role in understanding the user's requirements.
• The Rocchio algorithm is not the best-performing
method, but it is a practical choice because of its
simplicity and good results.
• Relevance feedback is of significant importance
for content-based systems.
13. Classification Problems
• Given:
– A document d
– A fixed set of categories:
Sports, Informatics, Literature, Medical, Entertainment
– A training set of documents each labeled
with its class
• Determine:
– A learning method or algorithm that
enables us to learn a classifier
– The category of a test document dT
14. Classification Techniques
• Manual (a.k.a. Knowledge Engineering)
– Typically rule-based expert systems
• Machine Learning
– Naïve Bayes (Probabilistic)
– Decision Trees (Decision Structures)
– Support Vector Machines (Linear Classification)
16. Naïve Bayes document classification example
• Probabilistic
– Prior vs Posterior
• Bernoulli Model
– Feature vector with binary elements
• Multinomial Model
– Integers representing frequency of
words
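The difference between the two document models can be shown on a single tokenized document. The vocabulary and tokens below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

def bernoulli_vector(tokens, vocabulary):
    """Binary features: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def multinomial_vector(tokens, vocabulary):
    """Integer features: how many times each term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Assumed toy vocabulary and document.
vocabulary = ["ball", "goal", "match", "virus"]
tokens = "goal goal match ball goal".split()

b_vec = bernoulli_vector(tokens, vocabulary)    # [1, 1, 1, 0]
m_vec = multinomial_vector(tokens, vocabulary)  # [1, 3, 1, 0]
```

The Bernoulli vector records only presence or absence, so the three occurrences of "goal" count once; the multinomial vector keeps the frequency.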
19. Naïve Bayes classification
• Very fast learning and testing
– Why?
• Low storage requirements
• Very good in domains with many
equally important features
• More robust to irrelevant features
than many learning methods
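One way to see why learning and testing are fast: training a multinomial Naïve Bayes classifier amounts to counting, and classifying is a sum of logarithms. The sketch below uses add-one (Laplace) smoothing; the smoothing choice, the training data, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """Single pass over the data: count documents per class and terms per class."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_docs = sum(class_doc_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Add-one smoothing: every vocabulary term gets one extra count.
        total = sum(term_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / total) for t in vocabulary
        }
    return log_prior, log_likelihood, vocabulary

def classify(tokens, model):
    """Pick the class with the highest log posterior; unseen terms are ignored."""
    log_prior, log_likelihood, vocabulary = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in vocabulary
        )
    return max(scores, key=scores.get)

# Assumed toy training set.
training = [
    ("goal match ball".split(), "Sports"),
    ("match team goal".split(), "Sports"),
    ("virus vaccine patient".split(), "Medical"),
]
model = train_multinomial_nb(training)
predicted = classify("goal team unknownword".split(), model)
```

The model stores only one prior per class and one smoothed count per (class, term) pair, which is also why the storage requirements are low.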
20. Linear Classification
• Documents as labeled vectors
• Documents in the same class form a
contiguous region of space
• Documents from different classes don’t
overlap (much)
• Learning a classifier: build surfaces to
delineate classes in the space
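As one possible illustration of "building a surface", the sketch below learns a separating hyperplane w · x + b = 0 with the perceptron rule. The perceptron is only one of several linear learners (it is not named in the slides), and the two-feature toy data are assumed.

```python
def predict(weights, bias, x):
    """Which side of the hyperplane w . x + b = 0 does x fall on?"""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: nudge the hyperplane toward each misclassified example."""
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if predict(weights, bias, x) != y:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # every training document is on the correct side
            break
    return weights, bias

# Assumed features: [count of "goal", count of "virus"]; +1 = Sports, -1 = Medical.
samples = [
    ([3.0, 0.0], 1),
    ([2.0, 1.0], 1),
    ([0.0, 2.0], -1),
    ([1.0, 3.0], -1),
]
weights, bias = train_perceptron(samples)
```

Because the two toy classes do not overlap, the loop stops as soon as all four training vectors are classified correctly; with overlapping classes it would simply run for the full number of epochs.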
31. Bottom Line
• Which classifier do I use for a given document
classification problem?
Answer: It depends.
How much training data is available?
How simple/complex is the problem?
How noisy is the data?
How stable is the problem over time?
For an unstable problem, it's better to use a
simple and robust classifier.