Techniques of information retrieval

Tariq Hassan & Sabahat

Road Map:
• What is IR?
• Why & how does it work?
• Evaluation Techniques
• Global & Local Methods
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Relevance Feedback
4. Rocchio Algorithm
5. Linear Classifiers
6. Naïve Bayes Text Classification
• Questions & Discussion

What is IR? Why & How?
• IR: finding the information needed to satisfy a user's information need.
• Why?
Because data comes in many different formats.
• How?
Stop list
Stemming
Inverse document frequency (IDF)
Word counts
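
The four steps listed above can be sketched as follows. This is a minimal illustration, not a production pipeline: the stop list, the toy suffix-stripping stemmer (`crude_stem`), and the sample sentences are all assumptions made for the example, and IDF is taken as log(N / df).

```python
import math
import re
from collections import Counter

# Hypothetical stop list for illustration; real systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

def crude_stem(word):
    """Toy stemmer: strips a few common suffixes (this is NOT Porter's algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def inverse_document_frequency(docs):
    """idf(t) = log(N / df(t)), where df(t) is the number of documents containing t."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

docs = [preprocess(t) for t in (
    "The cats are sleeping in the garden",
    "A cat sleeps",
    "Searching the web",
)]
word_counts = [Counter(d) for d in docs]  # per-document word counts
idf = inverse_document_frequency(docs)    # rarer terms get larger weights
```

Terms that occur in few documents (here "search") receive a higher IDF than terms shared by several documents (here "cat").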

What is IR? Why & How?
IR is generally used in three scenarios:
1. Web search
2. Personal IR (text classification)
3. Enterprise-level search

Evaluation Techniques
• Why?
• How?
Relevant & non-relevant documents
Precision and recall
P = #(relevant items retrieved) / #(retrieved items)
R = #(relevant items retrieved) / #(relevant items)
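
As a small worked example of the two ratios, the sketch below computes precision and recall from a set of retrieved document IDs and a set of relevant document IDs. The function name and the document IDs are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three documents are actually relevant,
# and two of the relevant ones appear in the result list.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

Here P = 2/4 and R = 2/3: half of what was returned is relevant, and two thirds of the relevant documents were found.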

Methods:
1. Global Methods
Reformulate the query
2. Local Methods
Adjust the query relative to the initial results returned for it

Local Methods
1. Relevance Feedback
Feedback given by the user about the relevance of the
documents in the initial set of results.
2. Probabilistic Relevance Feedback
Implemented by building a classifier.
3. Indirect Relevance Feedback
Without user intervention:
– By using user actions.
– By using user histories or logs.

Conclusion:
Relevance Feedback
Assumption:
The user has enough initial knowledge to pose a first query
Issues:
Misspellings
Cross-language retrieval
Vocabulary mismatch

Rocchio Algorithm
Incorporates the relevance feedback
mechanism into the vector space model.
Also uses:
Cosine similarity function
Euclidean distance
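
A minimal sketch of the standard Rocchio update, q_m = α·q_0 + β·centroid(relevant) − γ·centroid(non-relevant), with negative weights clipped to zero. The weights α = 1.0, β = 0.75, γ = 0.15 and the three-term vectors are assumptions chosen for illustration, not values taken from the slides.

```python
import math

def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    dim = len(query)
    c_rel = centroid(relevant, dim)
    c_non = centroid(non_relevant, dim)
    updated = [alpha * query[i] + beta * c_rel[i] - gamma * c_non[i]
               for i in range(dim)]
    return [max(0.0, w) for w in updated]  # negative term weights are dropped

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]   # marked relevant by the user
non_relevant = [[0.0, 0.0, 1.0]]                # marked non-relevant
new_query = rocchio(query, relevant, non_relevant)
```

After the update, the query is closer (by cosine similarity) to the documents the user marked as relevant than the original query was.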

Outcome
• Relevance feedback plays an important
role in understanding the user's requirements.
• The Rocchio algorithm is not the best method, but it is
a good practical option because of its
simplicity and good results.
• These techniques have significant importance
for content-based systems.

Classification Problems
• Given:
– A document d
– A fixed set of categories:
Sports, Informatics, Literature, Medical, Entertainment
– A training set of documents each labeled
with its class
• Determine:
– A learning method or algorithm that
enables us to learn a classifier
– The category of a test document dT

Classification Techniques
• Manual (a.k.a. Knowledge Engineering)
– Typically, rule-based expert systems
• Machine Learning
– Naïve Bayes (Probabilistic)
– Decision Trees (Decision Structures)
– Support Vector Machines (Linear Classification)

Document Representation
• Binary Representation
• Frequency Representation
• TF*IDF Representation
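
The three representations can be compared on a toy corpus. This is a sketch under stated assumptions: the documents and vocabulary are invented, and TF*IDF is taken as raw term frequency times log(N / df), which is only one of several common weighting variants.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus.
docs = [
    ["ball", "goal", "ball"],
    ["goal", "team"],
    ["code", "team", "code", "code"],
]
vocab = sorted({t for d in docs for t in d})  # ['ball', 'code', 'goal', 'team']

def binary_vector(doc):
    """1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if t in present else 0 for t in vocab]

def frequency_vector(doc):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]

def tfidf_vector(doc):
    """Term frequency weighted by log(N / df), so common terms count for less."""
    counts = Counter(doc)
    n_docs = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    return [counts[t] * math.log(n_docs / df[t]) for t in vocab]

binary = binary_vector(docs[0])
frequency = frequency_vector(docs[0])
tfidf = tfidf_vector(docs[0])
```

For the first document, "ball" appears twice and only in that document, so it dominates the TF*IDF vector, while "goal" is down-weighted because it also occurs elsewhere.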

Naïve Bayes document
classification example
• Probabilistic
– Prior vs Posterior
• Bernoulli Model
– Feature vector with binary elements
• Multinomial Model
– Feature vector of integers representing
word frequencies
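
A minimal sketch of the multinomial model, assuming add-one (Laplace) smoothing and log probabilities to avoid underflow. The training documents, the labels, and the function names are invented for illustration; the Bernoulli variant is not shown.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """Estimate log priors and smoothed log likelihoods from (tokens, label) pairs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_doc_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Add-one smoothing: every vocabulary term gets one extra count.
        total = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {t: math.log((term_counts[c][t] + 1) / total)
                             for t in vocab}
    return log_prior, log_likelihood

def classify(tokens, log_prior, log_likelihood):
    """Pick the class with the highest posterior score; unseen terms are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
    return max(scores, key=scores.get)

training = [
    (["goal", "match", "team"], "Sports"),
    (["team", "wins", "match"], "Sports"),
    (["code", "compiler", "bug"], "Informatics"),
]
log_prior, log_likelihood = train_multinomial_nb(training)
label_a = classify(["match", "goal"], log_prior, log_likelihood)
label_b = classify(["compiler", "bug"], log_prior, log_likelihood)
```

The prior comes from the share of training documents per class; the posterior score adds the per-term evidence from the test document.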

Naïve Bayes classification
• Very fast learning and testing
– Why?
• Low storage requirements
• Very good in domains with many
equally important features
• More robust to irrelevant features
than many learning methods

Linear Classification
• Documents as labeled vectors
• Documents in the same class form a
contiguous region of space
• Documents from different classes don’t
overlap (much)
• Learning a classifier: build surfaces to
delineate classes in the space

Support Vector Machines
• Find a linear hyperplane (decision boundary) that
will separate the data

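• One possible solution: boundary B1
• Another possible solution: boundary B2
• Other solutions are possible as well
• Which one is better, B1 or B2?
• How do you define "better"?
• Find the hyperplane that maximizes the margin
[Figure: decision boundaries B1 and B2 with their margin boundaries (b11, b12) and (b21, b22); the training points lying on the margin boundaries are the support vectors.]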

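Decision boundary B1: $\mathbf{w} \cdot \mathbf{x} + b = 0$
Margin boundaries b11 and b12: $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$$

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$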
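
To make the hyperplane and margin concrete, the sketch below evaluates a fixed linear decision function and its margin width 2 / ||w||. It does not train an SVM: the weight vector `w`, the offset `b`, and the four labeled points are assumptions chosen by hand so that two points lie exactly on the margin boundaries.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Class +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the hyperplanes w . x + b = 1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hand-picked hyperplane and labeled points (illustrative only).
w, b = [1.0, 1.0], -3.0
points = [([1.0, 1.0], -1), ([2.0, 2.0], 1), ([3.0, 3.0], 1), ([0.0, 0.0], -1)]

# Support vectors are the points that satisfy y * (w . x + b) == 1.
support_vectors = [x for x, y in points
                   if abs(y * decision_value(w, b, x) - 1.0) < 1e-9]
width = margin_width(w)
```

With these values the points (1, 1) and (2, 2) sit on the two margin boundaries, and the margin width equals the distance between them.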


Bottom Line
• Which classifier do I use for a given document
classification problem?
– Answer: it depends
– How much training data is available?
– How simple or complex is the problem?
– How noisy is the data?
– How stable is the problem over time?
– For an unstable problem, it's better to use a
simple and robust classifier.
