SlideShare une entreprise Scribd logo
1  sur  42
1Jarrar © 2014
Dr. Mustafa Jarrar
Sina Institute, University of Birzeit
mjarrar@birzeit.edu
www.jarrar.info
Mustafa Jarrar: Lecture Notes on Information Retrieval
University of Birzeit, Palestine
2014
Introduction to
Information Retrieval
2Jarrar © 2014
Watch this lecture and download the slides from
http://jarrar-courses.blogspot.com/2011/11/artificial-intelligence-fall-2011.html
3Jarrar © 2014
Outline
• Part 1: Information Retrieval Basics
• Part 2: Term-document incidence matrices
• Part 3: Inverted Index
• Part 4: Query processing with inverted index
• Part 5: Phrase queries and positional indexes
Keywords: Natural Language Processing, Information Retrieval,​ Precision,​ Recall,​ Inverted Index ,
Positional Indexes, Query processing, Merge Algorithm,​ B​ iwords,​ P​ hrase queries
‫احالسوبية‬ ‫اللسانيات‬,‫املعلومات‬ ‫استرجاع‬,‫الدقة‬,‫استعالم‬,‫لغوية‬ ‫تطبيقات‬,‫الطبيعية‬ ‫للغات‬ ‫اآللية‬ ‫املعاجلة‬
Acknowledgement: This lecture is largely based on Chris Manning online course on NLP,
which can be accessed at http://www.youtube.com/watch?v=s3kKlUBa3b0, [1]
Jarrar © 2014 4
Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
– These days we frequently think first of web search, but there
are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval
Jarrar © 2014 5
Unstructured (text) vs. structured (database) data in
the mid-nineties
Jarrar © 2014 6
Unstructured (text) vs. structured (database) data
later
Jarrar © 2014 7
Basic Assumptions of Information Retrieval
Collection: A set of documents
– Assume it is a static collection (but in other scenarios we may
need to add and delete documents).
Goal: Retrieve documents with information that is
relevant to the user’s information need and helps the
user complete a task.
Jarrar © 2014 8
Classic Search Model
how trap mice alive
Collection
User task
Info need
Query
Results
Search
engine
Query
refinement
Get rid of mice in a
politically correct way
Info about removing mice
without killing them
Misconception?
Misformulation?
Search
Jarrar © 2014 9
How good are the retrieved docs?
Precision : Fraction of retrieved docs that are relevant to the user’s information
need (‫استرجاعة‬ ‫تم‬ ‫ما‬ ‫بين‬ ‫من‬ ‫الصحيح‬ ‫.)نسبة‬
Recall : Fraction of relevant docs in collection that are retrieved
‫الصحيح‬ ‫جميع‬ ‫بين‬ ‫من‬ ‫المسترجع‬ ‫الصحيح‬ ‫نسبة‬
In this figure the relevant items are to the left of the straight line while the
retrieved items are within the oval. The red regions represent errors. On the
left these are the relevant items not retrieved (false negatives), while on the
right they are the retrieved items that are not relevant (false positives).
Are these measures are always useful as in case of huge collections?
Ranking becomes more important sometimes.
1411
48
P= 8/12 = 66%, R=8/14 = 57%
10Jarrar © 2014
Introduction to Information Retrieval
• Part 1: Information Retrieval Basics
• Part 2: Term-document incidence matrices
• Part 3: Inverted Index
• Part 4: Query processing with inverted index
• Part 5: Phrase queries and positional indexes
Jarrar © 2014 11
Unstructured Data in 1620
Which plays of Shakespeare contain the words Brutus AND Caesar
but NOT Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar,
then strip out lines containing Calpurnia?
Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) not feasible
– Ranked retrieval (best documents to return)
Jarrar © 2014 12
Term-Document Incidence Matrices
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
1 if play contains
word, 0 otherwise
Brutus AND Caesar BUT NOT Calpurnia
Jarrar © 2014 13
Incidence Vectors
So we have a 0/1 vector for each term.
To answer query: take the vectors for Brutus, Caesar and Calpurnia
(complemented)  bitwise AND.
– 110100 AND
– 110111 AND
– 101111 =
– 100100
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
Jarrar © 2014 14
Answers to Query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.
Jarrar © 2014 15
With Bigger Collections!
Consider N = 1 million documents, each with about 1000 words.
Avg 6 bytes/word including spaces/punctuation
= 6GB of data in the documents.
Say there are M = 500K distinct terms among these.
Jarrar © 2014 16
We Can’t Build this Matrix!
500K x 1M matrix has half-a-trillion 0’s and 1’s.
But it has no more than one billion 1’s. ( =1000 words x 1 M doc.)
 matrix is extremely sparse.
What’s a better representation?
– We only record the 1 positions.
17Jarrar © 2014
Introduction to Information Retrieval
• Part 1: Information Retrieval Basics
• Part 2: Term-document incidence matrices
• Part 3: Inverted Index
• Part 4: Query processing with inverted index
• Part 5: Phrase queries and positional indexes
Jarrar © 2014 18
What is an Inverted Index?
For each term t, we must store a list of all documents that contain t.
– Identify each doc by a docID, a document serial number
Can we used fixed-size arrays for this?
What happens if the word Caesar is added
to document 14?
Brutus
Calpurnia
Caesar 1 2 4 5 6 16 57 132
1 2 4 11 31 45 173
2 31
174
54 101
Jarrar © 2014 19
Inverted index
We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
Dictionary Postings
Sorted by docID (more later on why).
Posting
Brutus
Calpurnia
Caesar 1 2 4 5 6 16 57 132
1 2 4 11 31 45 173
2 31
174
54 101
Jarrar © 2014 20
Inverted Index Construction
Tokenizer
Token stream Friends Romans Countrymen
Linguistic
modules
Modified tokens friend roman countryman
Indexer
Inverted index
friend
roman
countryman
2 4
2
13 16
1
Documents to
be indexed
Friends, Romans, countrymen.
Jarrar © 2014 21
Initial Stages of Text Processing
Tokenization
– Cut character sequence into word tokens (=what is a word!)
• Deal with “John’s”, a state-of-the-art solution
Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
Stemming
– We may wish have different forms of a root to match
• authorize, authorization
Stop words
– We may omit very common words (or not)
• the, a, to, of, over, between, his, him
Jarrar © 2014 22
Indexer Steps: Token Sequence
Sequence of (Modified token, Document ID) pairs.
I did enact Julius
Caesar I was killed
i’ the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Jarrar © 2014 23
Indexer Steps: Sort
Sort by terms
– And then docID
Core indexing step
Jarrar © 2014 24
Indexer Steps: Dictionary & Postings
Multiple term entries in a single
document are merged.
Split into Dictionary and Postings
Doc. frequency information is
added.
Why frequency?
Jarrar © 2014 25
Where do we pay in storage?
Pointers
Terms
and
counts
IR system implementation
• How do we index
efficiently?
• How much storage do
we need?
Lists of
docIDs
26Jarrar © 2014
Introduction to Information Retrieval
• Part 1: Information Retrieval Basics
• Part 2: Term-document incidence matrices
• Part 3: Inverted Index
• Part 4: Query processing with inverted index
• Part 5: Phrase queries and positional indexes
Jarrar © 2014 27
Query processing: AND
Consider processing the query:
Brutus AND Caesar
– Locate Brutus in the Dictionary;
• Retrieve its postings.
– Locate Caesar in the Dictionary;
• Retrieve its postings.
– “Merge” the two postings (intersect the document sets):
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
Brutus
Caesar
Jarrar © 2014 28
The Merge
Walk through the two postings simultaneously, in time
linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
Brutus
Caesar
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
2 8
Jarrar © 2014 29
Intersecting two postings lists (a “merge” algorithm)
30Jarrar © 2014
Introduction to Information Retrieval
• Part 1: Information Retrieval Basics
• Part 2: Term-document incidence matrices
• Part 3: Inverted Index
• Part 4: Query processing with inverted index
• Part 5: Phrase queries and positional indexes
Jarrar © 2014 31
Phrase Queries
We want to be able to answer queries such as “stanford university”
as a phrase
Thus the sentence “I went to university at Stanford” is not a match.
– The concept of phrase queries has proven easily understood by
users; one of the few “advanced search” ideas that works
For this, it no longer suffices to store only
<term : docs> entries
Jarrar © 2014 32
First Attempt: Biword Indexes
Index every consecutive pair of terms in the text as a phrase
For example the text “Friends, Romans, Countrymen” would
generate the biwords
– friends romans
– romans countrymen
Each of these biwords is now a dictionary term
Two-word phrase query-processing is now immediate.
Jarrar © 2014 33
Longer Phrase Queries
• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean
query on biwords:
• stanford university AND university palo AND palo alto
• Without the docs, we cannot verify that the docs matching the
above Boolean query do contain the phrase.
Can have false positives!
Jarrar © 2014 34
Issues for Biword Indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
– Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all
biwords) but can be part of a compound strategy
 This is not practical/standard solution!
Jarrar © 2014 35
Solution 2: Positional Indexes
In the postings, store, for each term the position(s) in which tokens
of it appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Jarrar © 2014 36
Example: Positional index
For phrase queries, we use a merge algorithm recursively at the
document level.
But we now need to deal with more than just equality.
Example: if we look for “Birzeit University” then if Birzeit in 87
then University should be in 88.
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1,2,4,5
could contain “to be
or not to be”?
Jarrar © 2014 37
Processing Phrase Queries
Extract inverted index entries for each distinct term: to, be, or, not.
Merge their doc:position lists to enumerate all positions with “to
be or not to be”.
– to:
•2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
– be:
•1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Same general method for proximity searches
Jarrar © 2014 38
Positional Index Size
A positional index expands postings storage substantially
– Even though indices can be compressed.
Nevertheless, a positional index is now standardly used because of
the power and usefulness of phrase and proximity queries …
whether used explicitly or implicitly in a ranking retrieval system.
Jarrar © 2014 39
Rules of thumb
• A positional index is 2–4 as large as a non-positional index
• Positional index size 35–50% of volume of original text
– Caveat: all of this holds for “English-like” languages
Jarrar © 2014 40
Combination Schemes (both Solutions)
These two approaches can be profitably combined
– For particular phrases (“Michael Jackson”, “Britney Spears”) it is
inefficient to keep on merging positional postings lists
• Even more so for phrases like “The Who”
Williams et al. (2004) evaluate a more sophisticated mixed
indexing scheme
– A typical web query mixture was executed in ¼ of the time of using
just a positional index
– It required 26% more space than having a positional index alone
Jarrar © 2014 41
Project (Search Engine)
Building an Arabic Search Engine based on a positional inverted index. A
corpus of at least 10 documents containing at least 10000 words should
be used to test this engine.
The engine should support single, multiple word, as phrase queries.
The inverted index should take into account Arabic stemming and stop
words.
The results page will list the titles of all relevant documents in a clickable
manner.
This project is a continuation of the previous project, thus students are
expected to also allow users to “auto complete” their search queries.
Deadline: 9/10/2014
Jarrar © 2014 42
References
[1] Dan Jurafsky: From Languages to Information notes
http://web.stanford.edu/class/cs124

Contenu connexe

Tendances

Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesSho Takase
 
Knowledge based agents
Knowledge based agentsKnowledge based agents
Knowledge based agentsMegha Sharma
 
Fuzzy Logic Ppt
Fuzzy Logic PptFuzzy Logic Ppt
Fuzzy Logic Pptrafi
 
Rethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast TrainingRethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast TrainingSho Takase
 
Predicate calculus
Predicate calculusPredicate calculus
Predicate calculusRajendran
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDaisuke BEKKI
 
Fuzzy logic and fuzzy time series edited
Fuzzy logic and fuzzy time series   editedFuzzy logic and fuzzy time series   edited
Fuzzy logic and fuzzy time series editedProf Dr S.M.Aqil Burney
 
Composing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type SemanticsComposing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type SemanticsDaisuke BEKKI
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesJinYeong Bak
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Hitomi Yanaka
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)Danushka Bollegala
 
An enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical dataAn enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical dataeSAT Publishing House
 
Fuzzy logic and application in AI
Fuzzy logic and application in AIFuzzy logic and application in AI
Fuzzy logic and application in AIIldar Nurgaliev
 
An enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical dataAn enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical dataeSAT Journals
 

Tendances (20)

Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic Rules
 
Knowledge based agents
Knowledge based agentsKnowledge based agents
Knowledge based agents
 
Fuzzy Logic Ppt
Fuzzy Logic PptFuzzy Logic Ppt
Fuzzy Logic Ppt
 
Fuzzy hypersoft sets and its weightage operator for decision making
Fuzzy hypersoft sets and its weightage operator for decision makingFuzzy hypersoft sets and its weightage operator for decision making
Fuzzy hypersoft sets and its weightage operator for decision making
 
Rethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast TrainingRethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast Training
 
AI applications in education, Pascal Zoleko, Flexudy
AI applications in education, Pascal Zoleko, FlexudyAI applications in education, Pascal Zoleko, Flexudy
AI applications in education, Pascal Zoleko, Flexudy
 
Classical Sets & fuzzy sets
Classical Sets & fuzzy setsClassical Sets & fuzzy sets
Classical Sets & fuzzy sets
 
Predicate calculus
Predicate calculusPredicate calculus
Predicate calculus
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDependent Types in Natural Language Semantics
Dependent Types in Natural Language Semantics
 
A study on fundamentals of refined intuitionistic fuzzy set with some properties
A study on fundamentals of refined intuitionistic fuzzy set with some propertiesA study on fundamentals of refined intuitionistic fuzzy set with some properties
A study on fundamentals of refined intuitionistic fuzzy set with some properties
 
Fuzzy logic and fuzzy time series edited
Fuzzy logic and fuzzy time series   editedFuzzy logic and fuzzy time series   edited
Fuzzy logic and fuzzy time series edited
 
Composing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type SemanticsComposing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type Semantics
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Chapter 5 (final)
Chapter 5 (final)Chapter 5 (final)
Chapter 5 (final)
 
深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)
 
An enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical dataAn enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical data
 
Fuzzy logic and application in AI
Fuzzy logic and application in AIFuzzy logic and application in AI
Fuzzy logic and application in AI
 
An enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical dataAn enhanced fuzzy rough set based clustering algorithm for categorical data
An enhanced fuzzy rough set based clustering algorithm for categorical data
 

En vedette

Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Introduction to Artificial Intelligence
Introduction to Artificial Intelligence Introduction to Artificial Intelligence
Introduction to Artificial Intelligence Mustafa Jarrar
 
Jarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingJarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingMustafa Jarrar
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehHadi Mohammadzadeh
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Julián Urbano
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process ImplementationMustafa Jarrar
 
Tutorial: Text Analytics for Security
Tutorial: Text Analytics for SecurityTutorial: Text Analytics for Security
Tutorial: Text Analytics for SecurityTao Xie
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
كورس اساسيات استرجاع المعلومات المحاضره 1
كورس اساسيات استرجاع المعلومات المحاضره 1 كورس اساسيات استرجاع المعلومات المحاضره 1
كورس اساسيات استرجاع المعلومات المحاضره 1 Hind Altwirqi
 
Information Retrieval Models Part I
Information Retrieval Models Part IInformation Retrieval Models Part I
Information Retrieval Models Part IIngo Frommholz
 
تطور نظم إسترجاع المعلومات
تطور نظم إسترجاع المعلوماتتطور نظم إسترجاع المعلومات
تطور نظم إسترجاع المعلوماتAhmed Al-ajamy
 
نظم استرجاع المعلومات باللغة العربية
نظم استرجاع المعلومات باللغة العربيةنظم استرجاع المعلومات باللغة العربية
نظم استرجاع المعلومات باللغة العربيةBeni-Suef University
 
العوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلوماتالعوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلوماتالدكتور طلال ناظم الزهيري
 
محاضرتي الاولى
محاضرتي الاولىمحاضرتي الاولى
محاضرتي الاولىAmany Megahed
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationArjen de Vries
 
التعريف بنظم استرجاع المعلومات
التعريف بنظم استرجاع المعلوماتالتعريف بنظم استرجاع المعلومات
التعريف بنظم استرجاع المعلوماتHuda Farhan
 
Jarrar: Un-informed Search
Jarrar: Un-informed SearchJarrar: Un-informed Search
Jarrar: Un-informed SearchMustafa Jarrar
 
Information retrieval system!
Information retrieval system!Information retrieval system!
Information retrieval system!Jane Garay
 
نظم استرجاع المعلومات
نظم استرجاع المعلوماتنظم استرجاع المعلومات
نظم استرجاع المعلوماتBeni-Suef University
 

En vedette (20)

Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Introduction to Artificial Intelligence
Introduction to Artificial Intelligence Introduction to Artificial Intelligence
Introduction to Artificial Intelligence
 
Jarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language ProcessingJarrar: Introduction to Natural Language Processing
Jarrar: Introduction to Natural Language Processing
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Business Process Implementation
Business Process ImplementationBusiness Process Implementation
Business Process Implementation
 
Tutorial: Text Analytics for Security
Tutorial: Text Analytics for SecurityTutorial: Text Analytics for Security
Tutorial: Text Analytics for Security
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Jarrar: Games
Jarrar: GamesJarrar: Games
Jarrar: Games
 
كورس اساسيات استرجاع المعلومات المحاضره 1
كورس اساسيات استرجاع المعلومات المحاضره 1 كورس اساسيات استرجاع المعلومات المحاضره 1
كورس اساسيات استرجاع المعلومات المحاضره 1
 
Information Retrieval Models Part I
Information Retrieval Models Part IInformation Retrieval Models Part I
Information Retrieval Models Part I
 
تطور نظم إسترجاع المعلومات
تطور نظم إسترجاع المعلوماتتطور نظم إسترجاع المعلومات
تطور نظم إسترجاع المعلومات
 
نظم استرجاع المعلومات باللغة العربية
نظم استرجاع المعلومات باللغة العربيةنظم استرجاع المعلومات باللغة العربية
نظم استرجاع المعلومات باللغة العربية
 
العوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلوماتالعوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
 
محاضرتي الاولى
محاضرتي الاولىمحاضرتي الاولى
محاضرتي الاولى
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and Recommendation
 
التعريف بنظم استرجاع المعلومات
التعريف بنظم استرجاع المعلوماتالتعريف بنظم استرجاع المعلومات
التعريف بنظم استرجاع المعلومات
 
Jarrar: Un-informed Search
Jarrar: Un-informed SearchJarrar: Un-informed Search
Jarrar: Un-informed Search
 
Information retrieval system!
Information retrieval system!Information retrieval system!
Information retrieval system!
 
نظم استرجاع المعلومات
نظم استرجاع المعلوماتنظم استرجاع المعلومات
نظم استرجاع المعلومات
 

Similaire à IR Lecture Notes on Information Retrieval Basics

lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.pptIshaXogaha
 
introduction into IR
introduction into IRintroduction into IR
introduction into IRssusere3b1a2
 
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
 
Data Science Course In Pune
Data Science Course In Pune Data Science Course In Pune
Data Science Course In Pune APT
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 
Data Science Course Pune
Data Science Course PuneData Science Course Pune
Data Science Course PuneAPT
 
Data science course pdf
Data science course pdfData science course pdf
Data science course pdfAPT
 
data science course in pune
data science course in punedata science course in pune
data science course in punedevipatnala1
 
data science certification
data science certificationdata science certification
data science certificationdevipatnala1
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 
Data Science course in Pune
Data Science course in PuneData Science course in Pune
Data Science course in Puneashvisingh
 

Similaire à IR Lecture Notes on Information Retrieval Basics (20)

lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
introduction into IR
introduction into IRintroduction into IR
introduction into IR
 
Basics of IR: Web Information Systems class
Basics of IR: Web Information Systems class Basics of IR: Web Information Systems class
Basics of IR: Web Information Systems class
 
lectura
lecturalectura
lectura
 
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
Data Science Course In Pune
Data Science Course In Pune Data Science Course In Pune
Data Science Course In Pune
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 
Data Science Course Pune
Data Science Course PuneData Science Course Pune
Data Science Course Pune
 
Data science course pdf
Data science course pdfData science course pdf
Data science course pdf
 
data science certification
data science certificationdata science certification
data science certification
 
data science course in pune
data science course in punedata science course in pune
data science course in pune
 
Data mining
Data miningData mining
Data mining
 
data science certification
data science certificationdata science certification
data science certification
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 
Data science course in Pune
Data science course in PuneData science course in Pune
Data science course in Pune
 
Data Science course in Pune
Data Science course in PuneData Science course in Pune
Data Science course in Pune
 
Data science certification
Data science certificationData science certification
Data science certification
 
Data science course in pune
Data science course in puneData science course in pune
Data science course in pune
 

Plus de Mustafa Jarrar

Clustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisClustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisMustafa Jarrar
 
Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal OntologyMustafa Jarrar
 
Discrete Mathematics Course Outline
Discrete Mathematics Course OutlineDiscrete Mathematics Course Outline
Discrete Mathematics Course OutlineMustafa Jarrar
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineeringMustafa Jarrar
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsMustafa Jarrar
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs Mustafa Jarrar
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process ManagementMustafa Jarrar
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology Mustafa Jarrar
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesMustafa Jarrar
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORMMustafa Jarrar
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineMustafa Jarrar
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesMustafa Jarrar
 
Presentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalPresentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalMustafa Jarrar
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsMustafa Jarrar
 
Habash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingHabash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingMustafa Jarrar
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsMustafa Jarrar
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Mustafa Jarrar
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql ProjectMustafa Jarrar
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringMustafa Jarrar
 
Jarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesJarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesMustafa Jarrar
 

Plus de Mustafa Jarrar (20)

Clustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment AnalysisClustering Arabic Tweets for Sentiment Analysis
Clustering Arabic Tweets for Sentiment Analysis
 
Classifying Processes and Basic Formal Ontology
Classifying Processes  and Basic Formal OntologyClassifying Processes  and Basic Formal Ontology
Classifying Processes and Basic Formal Ontology
 
Discrete Mathematics Course Outline
Discrete Mathematics Course OutlineDiscrete Mathematics Course Outline
Discrete Mathematics Course Outline
 
Business Process Design and Re-engineering
Business Process Design and Re-engineeringBusiness Process Design and Re-engineering
Business Process Design and Re-engineering
 
BPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical ConstructsBPMN 2.0 Analytical Constructs
BPMN 2.0 Analytical Constructs
 
BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs  BPMN 2.0 Descriptive Constructs
BPMN 2.0 Descriptive Constructs
 
Introduction to Business Process Management
Introduction to Business Process ManagementIntroduction to Business Process Management
Introduction to Business Process Management
 
Customer Complaint Ontology
Customer Complaint Ontology Customer Complaint Ontology
Customer Complaint Ontology
 
Subset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion RulesSubset, Equality, and Exclusion Rules
Subset, Equality, and Exclusion Rules
 
Schema Modularization in ORM
Schema Modularization in ORMSchema Modularization in ORM
Schema Modularization in ORM
 
On Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in PalestineOn Computer Science Trends and Priorities in Palestine
On Computer Science Trends and Priorities in Palestine
 
Lessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online CoursesLessons from Class Recording & Publishing of Eight Online Courses
Lessons from Class Recording & Publishing of Eight Online Courses
 
Presentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-finalPresentation curras paper-emnlp2014-final
Presentation curras paper-emnlp2014-final
 
Jarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 CallsJarrar: Future Internet in Horizon 2020 Calls
Jarrar: Future Internet in Horizon 2020 Calls
 
Habash: Arabic Natural Language Processing
Habash: Arabic Natural Language ProcessingHabash: Arabic Natural Language Processing
Habash: Arabic Natural Language Processing
 
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 ProposalsRiestra: How to Design and engineer Competitive Horizon 2020 Proposals
Riestra: How to Design and engineer Competitive Horizon 2020 Proposals
 
Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020Bouquet: SIERA Workshop on The Pillars of Horizon2020
Bouquet: SIERA Workshop on The Pillars of Horizon2020
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql Project
 
Jarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology EngineeringJarrar: Logical Foundation of Ontology Engineering
Jarrar: Logical Foundation of Ontology Engineering
 
Jarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing OntologiesJarrar: Stepwise Methodologies for Developing Ontologies
Jarrar: Stepwise Methodologies for Developing Ontologies
 

Dernier

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Dernier (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

IR Lecture Notes on Information Retrieval Basics

  • 1. 1Jarrar © 2014 Dr. Mustafa Jarrar Sina Institute, University of Birzeit mjarrar@birzeit.edu www.jarrar.info Mustafa Jarrar: Lecture Notes on Information Retrieval University of Birzeit, Palestine 2014 Introduction to Information Retrieval
  • 2. 2Jarrar © 2014 Watch this lecture and download the slides from http://jarrar-courses.blogspot.com/2011/11/artificial-intelligence-fall-2011.html
  • 3. 3Jarrar © 2014 Outline • Part 1: Information Retrieval Basics • Part 2: Term-document incidence matrices • Part 3: Inverted Index • Part 4: Query processing with inverted index • Part 5: Phrase queries and positional indexes Keywords: Natural Language Processing, Information Retrieval,​ Precision,​ Recall,​ Inverted Index , Positional Indexes, Query processing, Merge Algorithm,​ B​ iwords,​ P​ hrase queries ‫احالسوبية‬ ‫اللسانيات‬,‫املعلومات‬ ‫استرجاع‬,‫الدقة‬,‫استعالم‬,‫لغوية‬ ‫تطبيقات‬,‫الطبيعية‬ ‫للغات‬ ‫اآللية‬ ‫املعاجلة‬ Acknowledgement: This lecture is largely based on Chris Manning online course on NLP, which can be accessed at http://www.youtube.com/watch?v=s3kKlUBa3b0, [1]
  • 4. Jarrar © 2014 4 Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). – These days we frequently think first of web search, but there are many other cases: • E-mail search • Searching your laptop • Corporate knowledge bases • Legal information retrieval
  • 5. Jarrar © 2014 5 Unstructured (text) vs. structured (database) data in the mid-nineties
  • 6. Jarrar © 2014 6 Unstructured (text) vs. structured (database) data later
  • 7. Jarrar © 2014 7 Basic Assumptions of Information Retrieval Collection: A set of documents – Assume it is a static collection (but in other scenarios we may need to add and delete documents). Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.
  • 8. Jarrar © 2014 8 Classic Search Model how trap mice alive Collection User task Info need Query Results Search engine Query refinement Get rid of mice in a politically correct way Info about removing mice without killing them Misconception? Misformulation? Search
  • 9. Jarrar © 2014 9 How good are the retrieved docs? Precision : Fraction of retrieved docs that are relevant to the user’s information need (‫استرجاعة‬ ‫تم‬ ‫ما‬ ‫بين‬ ‫من‬ ‫الصحيح‬ ‫.)نسبة‬ Recall : Fraction of relevant docs in collection that are retrieved ‫الصحيح‬ ‫جميع‬ ‫بين‬ ‫من‬ ‫المسترجع‬ ‫الصحيح‬ ‫نسبة‬ In this figure the relevant items are to the left of the straight line while the retrieved items are within the oval. The red regions represent errors. On the left these are the relevant items not retrieved (false negatives), while on the right they are the retrieved items that are not relevant (false positives). Are these measures are always useful as in case of huge collections? Ranking becomes more important sometimes. 1411 48 P= 8/12 = 66%, R=8/14 = 57%
  • 10. 10Jarrar © 2014 Introduction to Information Retrieval • Part 1: Information Retrieval Basics • Part 2: Term-document incidence matrices • Part 3: Inverted Index • Part 4: Query processing with inverted index • Part 5: Phrase queries and positional indexes
  • 11. Jarrar © 2014 11 Unstructured Data in 1620 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? Why is that not the answer? – Slow (for large corpora) – NOT Calpurnia is non-trivial – Other operations (e.g., find the word Romans near countrymen) not feasible – Ranked retrieval (best documents to return)
  • 12. Jarrar © 2014 12 Term-Document Incidence Matrices Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 1 if play contains word, 0 otherwise Brutus AND Caesar BUT NOT Calpurnia
  • 13. Jarrar © 2014 13 Incidence Vectors So we have a 0/1 vector for each term. To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND. – 110100 AND – 110111 AND – 101111 = – 100100 Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0
  • 14. Jarrar © 2014 14 Answers to Query Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
  • 15. Jarrar © 2014 15 With Bigger Collections! Consider N = 1 million documents, each with about 1000 words. Avg 6 bytes/word including spaces/punctuation = 6GB of data in the documents. Say there are M = 500K distinct terms among these.
  • 16. Jarrar © 2014 16 We Can’t Build this Matrix! 500K x 1M matrix has half-a-trillion 0’s and 1’s. But it has no more than one billion 1’s. ( =1000 words x 1 M doc.)  matrix is extremely sparse. What’s a better representation? – We only record the 1 positions.
  • 17. 17Jarrar © 2014 Introduction to Information Retrieval • Part 1: Information Retrieval Basics • Part 2: Term-document incidence matrices • Part 3: Inverted Index • Part 4: Query processing with inverted index • Part 5: Phrase queries and positional indexes
  • 18. Jarrar © 2014 18 What is an Inverted Index? For each term t, we must store a list of all documents that contain t. – Identify each doc by a docID, a document serial number Can we used fixed-size arrays for this? What happens if the word Caesar is added to document 14? Brutus Calpurnia Caesar 1 2 4 5 6 16 57 132 1 2 4 11 31 45 173 2 31 174 54 101
  • 19. Jarrar © 2014 19 Inverted index We need variable-size postings lists – On disk, a continuous run of postings is normal and best – In memory, can use linked lists or variable length arrays • Some tradeoffs in size/ease of insertion Dictionary Postings Sorted by docID (more later on why). Posting Brutus Calpurnia Caesar 1 2 4 5 6 16 57 132 1 2 4 11 31 45 173 2 31 174 54 101
  • 20. Jarrar © 2014 20 Inverted Index Construction Tokenizer Token stream Friends Romans Countrymen Linguistic modules Modified tokens friend roman countryman Indexer Inverted index friend roman countryman 2 4 2 13 16 1 Documents to be indexed Friends, Romans, countrymen.
  • 21. Jarrar © 2014 21 Initial Stages of Text Processing Tokenization – Cut character sequence into word tokens (=what is a word!) • Deal with “John’s”, a state-of-the-art solution Normalization – Map text and query term to same form • You want U.S.A. and USA to match Stemming – We may wish have different forms of a root to match • authorize, authorization Stop words – We may omit very common words (or not) • the, a, to, of, over, between, his, him
  • 22. Jarrar © 2014 22 Indexer Steps: Token Sequence Sequence of (Modified token, Document ID) pairs. I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me. Doc 1 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Doc 2
  • 23. Jarrar © 2014 23 Indexer Steps: Sort Sort by terms – And then docID Core indexing step
  • 24. Jarrar © 2014 24 Indexer Steps: Dictionary & Postings Multiple term entries in a single document are merged. Split into Dictionary and Postings Doc. frequency information is added. Why frequency?
  • 25. Jarrar © 2014 25 Where do we pay in storage? Pointers Terms and counts IR system implementation • How do we index efficiently? • How much storage do we need? Lists of docIDs
  • 26. 26Jarrar © 2014 Introduction to Information Retrieval • Part 1: Information Retrieval Basics • Part 2: Term-document incidence matrices • Part 3: Inverted Index • Part 4: Query processing with inverted index • Part 5: Phrase queries and positional indexes
  • 27. Jarrar © 2014 27 Query processing: AND Consider processing the query: Brutus AND Caesar – Locate Brutus in the Dictionary; • Retrieve its postings. – Locate Caesar in the Dictionary; • Retrieve its postings. – “Merge” the two postings (intersect the document sets): 128 34 2 4 8 16 32 64 1 2 3 5 8 13 21 Brutus Caesar
  • 28. Jarrar © 2014 28 The Merge Walk through the two postings simultaneously, in time linear in the total number of postings entries 34 1282 4 8 16 32 64 1 2 3 5 8 13 21 Brutus Caesar If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID. 2 8
  • 29. Jarrar © 2014 29 Intersecting two postings lists (a “merge” algorithm)
  • 30. 30Jarrar © 2014 Introduction to Information Retrieval • Part 1: Information Retrieval Basics • Part 2: Term-document incidence matrices • Part 3: Inverted Index • Part 4: Query processing with inverted index • Part 5: Phrase queries and positional indexes
  • 31. Jarrar © 2014 31 Phrase Queries We want to be able to answer queries such as “stanford university” as a phrase Thus the sentence “I went to university at Stanford” is not a match. – The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works For this, it no longer suffices to store only <term : docs> entries
  • 32. Jarrar © 2014 32 First Attempt: Biword Indexes Index every consecutive pair of terms in the text as a phrase For example the text “Friends, Romans, Countrymen” would generate the biwords – friends romans – romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate.
  • 33. Jarrar © 2014 33 Longer Phrase Queries • Longer phrases can be processed by breaking them down • stanford university palo alto can be broken into the Boolean query on biwords: • stanford university AND university palo AND palo alto • Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase. Can have false positives!
  • 34. Jarrar © 2014 34 Issues for Biword Indexes • False positives, as noted before • Index blowup due to bigger dictionary – Infeasible for more than biwords, big even for them • Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy  This is not practical/standard solution!
  • 35. Jarrar © 2014 35 Solution 2: Positional Indexes In the postings, store, for each term the position(s) in which tokens of it appear: <term, number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.>
  • 36. Jarrar © 2014 36 Example: Positional index For phrase queries, we use a merge algorithm recursively at the document level. But we now need to deal with more than just equality. Example: if we look for “Birzeit University” then if Birzeit in 87 then University should be in 88. <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”?
  • 37. Jarrar © 2014 37 Processing Phrase Queries Extract inverted index entries for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions with “to be or not to be”. – to: •2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ... – be: •1:17,19; 4:17,191,291,430,434; 5:14,19,101; ... Same general method for proximity searches
  • 38. Jarrar © 2014 38 Positional Index Size A positional index expands postings storage substantially – Even though indices can be compressed. Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.
  • 39. Jarrar © 2014 39 Rules of thumb • A positional index is 2–4 as large as a non-positional index • Positional index size 35–50% of volume of original text – Caveat: all of this holds for “English-like” languages
  • 40. Jarrar © 2014 40 Combination Schemes (both Solutions) These two approaches can be profitably combined – For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists • Even more so for phrases like “The Who” Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme – A typical web query mixture was executed in ¼ of the time of using just a positional index – It required 26% more space than having a positional index alone
  • 41. Jarrar © 2014 41 Project (Search Engine) Building an Arabic Search Engine based on a positional inverted index. A corpus of at least 10 documents containing at least 10000 words should be used to test this engine. The engine should support single, multiple word, as phrase queries. The inverted index should take into account Arabic stemming and stop words. The results page will list the titles of all relevant documents in a clickable manner. This project is a continuation of the previous project, thus students are expected to also allow users to “auto complete” their search queries. Deadline: 9/10/2014
  • 42. Jarrar © 2014 42 References [1] Dan Jurafsky: From Languages to Information notes http://web.stanford.edu/class/cs124

Notes de l'éditeur

  1. Grep is line-oriented; IR is document oriented.
  2. Wikimedia commons picture of Shake
  3. Linked lists generally preferred to arrays Dynamic space allocation Insertion of terms into documents easy Space overhead of pointers