 Definition: an inverted file is a word-oriented
mechanism for indexing a text collection in
order to speed up the searching task.
 Structure of inverted file:
◦ Vocabulary: the set of all distinct words in the
text
◦ Occurrences: lists containing all information
necessary for each word of the vocabulary (text
position, frequency, documents where the word
appears, etc.)
 An inverted file index is a list of the terms that appear in the
document collection (called a lexicon or vocabulary). For each term
in the lexicon, it stores a list of pointers to all occurrences of
that term in the document collection. This list is called an
inverted list.
 The granularity of an index determines how precisely it
records the location of each word
◦ Coarse-grained index requires less storage and more
query processing to eliminate false matches
◦ Word-level index enables queries involving adjacency
and proximity, but has higher space requirements
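To make the vocabulary and inverted lists concrete, here is a minimal sketch of a word-level index. The function names, the tokenization rule (lowercase runs of letters), and the use of 0-based word offsets as pointers are all assumptions made for illustration; they are not taken from the slides.

```python
import re
from collections import defaultdict

def build_word_level_index(text):
    """Map each distinct word to the list of its word offsets (0-based)."""
    index = defaultdict(list)
    for offset, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index[word].append(offset)
    return dict(index)

def adjacent(index, first, second):
    """Return the offsets where `first` is immediately followed by `second`."""
    following = set(index.get(second, []))
    return [p for p in index.get(first, []) if p + 1 in following]

sample = "The garden has many flowers. The flowers are beautiful"
index = build_word_level_index(sample)
print(sorted(index))                       # the vocabulary
print(index["flowers"])                    # inverted list for "flowers"
print(adjacent(index, "the", "flowers"))   # adjacency query
```

The adjacency query works only because each pointer identifies an individual word; a coarser index would have to scan the text to answer it.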
[Figure: the vocabulary holds the indexed terms and their number of occurrences; each entry points to its occurrence list in the posting file. The vocabulary could be a tree-like structure.]
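 Text:
That house has a garden. The garden has many flowers. The flowers are
beautiful
Word positions, as given on the slide: That 1, house 6, has 12, a 16,
garden 18, The 25, garden 29, has 36, many 40, flowers 45, The 54,
flowers 58, are 66, beautiful 70
 Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6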
 The prior example supports Boolean queries.
 To go beyond Boolean queries, the index also needs the document
frequency and the term frequency.
Vocabulary entry    Posting file entry
k, d_k              doc_1, f_1k, doc_2, f_2k, …
d_k : document frequency of term k
doc_i : i-th document that contains term k
f_ik : term frequency of term k in document i
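The entry layout above can be sketched as follows. This is only an illustration: `build_postings`, the whitespace tokenization, and the three toy documents are assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

def build_postings(docs):
    """docs maps doc id -> text; returns term -> (d_k, [(doc_i, f_ik), ...])."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Count how often each term occurs in this document (f_ik).
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            postings[term].append((doc_id, freq))
    # d_k is the number of documents in the term's posting list.
    return {term: (len(plist), plist) for term, plist in postings.items()}

docs = {1: "garden house garden", 2: "flowers garden", 3: "house"}
entries = build_postings(docs)
for term in sorted(entries):
    print(term, entries[term])
```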
 The space required for the vocabulary is rather
small. According to Heaps’ law the vocabulary
grows as O(n^β), where n is the size of the text
and β is a constant between 0.4 and 0.6 in practice
◦ TREC-2: 1 GB text, 5 MB lexicon
 On the other hand, the occurrences demand
much more space. Since each word appearing
in the text is referenced once in that structure,
the extra space is O(n)
 To reduce space requirements, a technique
called block addressing is used
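As a rough illustration of the Heaps' law growth mentioned above, the sketch below evaluates V = K · n^β. The constant `K = 10.0` is a hypothetical value chosen only to make the arithmetic visible; the slides give no value for it, and real collections need it fitted empirically.

```python
def heaps_vocabulary_size(n, k, beta):
    """Estimate the number of distinct words in a text of n words."""
    return k * n ** beta

K = 10.0     # hypothetical constant, chosen only for illustration
BETA = 0.5   # inside the 0.4-0.6 range quoted above

small = heaps_vocabulary_size(1_000_000, K, BETA)
large = heaps_vocabulary_size(4_000_000, K, BETA)
print(small, large, large / small)
```

With β = 0.5, a text four times larger yields a vocabulary only about twice as large, while the occurrences grow linearly with the text.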
 The text is divided into blocks
 The occurrences point to the blocks where the
word appears
 Advantages:
◦ there are fewer blocks than positions, so pointers are smaller
◦ all the occurrences of a word inside a single block
are collapsed to one reference
 Disadvantages:
◦ an online search over the qualifying blocks is needed
if exact positions are required
 Text (divided into Block 1, Block 2, Block 3, Block 4):
That house has a garden. The garden has many flowers. The flowers are
beautiful
 Inverted file:
Vocabulary    Occurrences
beautiful     4
flowers       3
garden        2
house         1
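Block addressing can be sketched as below. The slide does not say where the block boundaries fall, so the choice of four words per block is an assumption, as are the function names and the stripping of punctuation from the example text.

```python
def build_block_index(words, words_per_block):
    """Map each word to the sorted list of 1-based blocks that contain it."""
    index = {}
    for offset, word in enumerate(words):
        block = offset // words_per_block + 1
        blocks = index.setdefault(word, [])
        # Several occurrences inside one block collapse to a single pointer.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return index

def find_positions(words, index, term, words_per_block):
    """Recover exact word offsets by scanning only the qualifying blocks."""
    hits = []
    for block in index.get(term, []):
        start = (block - 1) * words_per_block
        end = min(start + words_per_block, len(words))
        hits.extend(i for i in range(start, end) if words[i] == term)
    return hits

words = ("that house has a garden the garden has many flowers "
         "the flowers are beautiful").split()
block_index = build_block_index(words, 4)
for term in ("beautiful", "flowers", "garden", "house"):
    print(term, block_index[term])
print(find_positions(words, block_index, "garden", 4))
```

The second function shows the cost named under the disadvantages: exact positions are not stored, so the text of each qualifying block has to be scanned at query time.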
 How big are inverted files?
◦ Index size as a percentage of the original collection size
 in each pair, the left value is with stopwords removed and
the right value is with stopwords indexed
 Block addressing requires the text to be available in order to
locate terms within blocks.

Index                   Small collection   Medium collection   Large collection
                        (1 MB)             (200 MB)            (2 GB)
Addressing words        45%     73%        36%     64%         35%     63%
Addressing 64K blocks   27%     41%        18%     32%         5%      9%
Addressing 256 blocks   18%     25%        1.7%    2.4%        0.5%    0.7%
 The search algorithm on an inverted
index follows three steps:
1. Vocabulary search: the words present in
the query are located in the vocabulary
2. Retrieval of occurrences: the lists of
occurrences of all query words found are
retrieved
3. Manipulation of occurrences: the
occurrences are processed to solve the
query
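The three steps above can be sketched for a conjunctive (AND) query over document-level lists. The toy index, the function names, and the choice of a merge-style intersection are assumptions for illustration; other query types would process the lists differently in step 3.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(index, query):
    # Step 1: vocabulary search; a missing word makes the AND query empty.
    terms = query.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Step 2: retrieve the occurrence lists, shortest first.
    lists = sorted((index[term] for term in terms), key=len)
    # Step 3: manipulate the occurrences to solve the query.
    result = list(lists[0])
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result

index = {"garden": [1, 2, 5], "flowers": [2, 3, 5], "house": [1, 5]}
print(search_and(index, "garden flowers"))
print(search_and(index, "garden flowers house"))
```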
 Searching an inverted file starts with the vocabulary
◦ store the vocabulary in a separate file
 Structures used to store the vocabulary
include
◦ Hashing: O(1) lookup, but does not support range
queries
◦ Tries: O(c) lookup, where c is the length of the word
◦ B-trees: O(log v) lookup, where v is the vocabulary size
 An alternative is simply to store the words in
lexicographical order
◦ cheaper in space and very competitive, with O(log
v) lookup cost by binary search
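A minimal sketch of the sorted-vocabulary alternative, using binary search from Python's standard `bisect` module. The helper names are hypothetical, and the prefix query relies on the assumption that no vocabulary word contains the sentinel character `"\uffff"`.

```python
from bisect import bisect_left

def lookup(vocabulary, word):
    """Return the index of `word` in the sorted vocabulary, or -1 if absent."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def prefix_range(vocabulary, prefix):
    """Return all words starting with `prefix` (a simple range query)."""
    lo = bisect_left(vocabulary, prefix)
    hi = bisect_left(vocabulary, prefix + "\uffff")  # assumed upper sentinel
    return vocabulary[lo:hi]

vocabulary = ["beautiful", "flowers", "garden", "house"]
print(lookup(vocabulary, "garden"))
print(lookup(vocabulary, "tree"))
print(prefix_range(vocabulary, "g"))
```

Unlike a hash table, the sorted array answers range and prefix queries with two binary searches, which is one reason it stays competitive.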
 The whole vocabulary is kept in a suitable data
structure that stores, for each word, a list of
its occurrences
 Each word of each text in the corpus is
read and searched for in the vocabulary
 If it is not found, it is added to the
vocabulary with an empty list of occurrences
 The new position is added to the end of the
word's list of occurrences
 Once the text is exhausted, the vocabulary is
written to disk with the lists of occurrences.
 Two files are created:
◦ in the first file, each list of word occurrences is
stored contiguously
◦ in the second file, the vocabulary is stored in
lexicographical order and, for each word, a pointer
to its list in the first file is also included. This allows
the vocabulary to be kept in memory at search time
 The overall process is O(n) worst-case time
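The construction and the two-file layout can be sketched as follows. This is an illustration under assumptions: a Python dict stands in for the vocabulary structure, the "files" are in-memory values rather than files on disk, and the fixed 4-byte little-endian encoding is an arbitrary choice.

```python
import struct
from collections import defaultdict

def build_index(words):
    """Single pass over the text: append each position to its word's list."""
    occurrences = defaultdict(list)   # new words start with an empty list
    for position, word in enumerate(words):
        occurrences[word].append(position)
    return dict(occurrences)

def serialize(occurrences):
    """Return (vocabulary, postings): lists stored contiguously, words sorted."""
    postings = bytearray()
    vocabulary = []
    for word in sorted(occurrences):
        positions = occurrences[word]
        # The vocabulary keeps a pointer (byte offset) into the postings data.
        vocabulary.append((word, len(postings), len(positions)))
        postings += struct.pack(f"<{len(positions)}I", *positions)
    return vocabulary, bytes(postings)

def read_list(postings, offset, count):
    """Follow a vocabulary pointer and decode one occurrence list."""
    return list(struct.unpack_from(f"<{count}I", postings, offset))

words = ("that house has a garden the garden has many flowers "
         "the flowers are beautiful").split()
occurrences = build_index(words)
vocabulary, postings = serialize(occurrences)
for word, offset, count in vocabulary:
    print(word, offset, read_list(postings, offset, count))
```

Because the vocabulary holds only words and pointers, it is small enough to load fully at search time while the lists stay in the larger file.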
 An option is to use the previous algorithm until
the main memory is exhausted. When no
more memory is available, the partial index I_i
obtained up to now is written to disk and
erased from main memory before continuing
with the rest of the text
 Once the text is exhausted, a number of
partial indices I_i exist on disk
 The partial indices are merged to obtain the
final index
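[Figure: binary merging of eight partial indices]
level 3 (final index):   I 1...8
level 2:                 I 1...4   I 5...8
level 1:                 I 1...2   I 3...4   I 5...6   I 7...8
initial dumps:           I 1   I 2   I 3   I 4   I 5   I 6   I 7   I 8
The merges are numbered 1 to 7 in the figure: 1, 2, 4, and 5 at level 1; 3 and 6 at level 2; 7 at level 3.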
 The total time to generate the partial indices is
O(n)
 The number of partial indices is O(n/M), where M is
the amount of main memory available
 Merging the O(n/M) partial indices requires
log2(n/M) merging levels
 The total cost of this algorithm is O(n log(n/M))
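The dump-and-merge scheme can be sketched as below. It is a simplification under stated assumptions: `memory_limit` counts words rather than bytes, the partial indices are Python dicts kept in memory instead of files on disk, and all names are hypothetical.

```python
from collections import defaultdict

def partial_indices(words, memory_limit):
    """Build one partial index per `memory_limit` words of text."""
    dumps = []
    for start in range(0, len(words), memory_limit):
        index = defaultdict(list)
        for position in range(start, min(start + memory_limit, len(words))):
            index[words[position]].append(position)
        dumps.append(dict(index))   # stands in for writing I_i to disk
    return dumps

def merge_two(left, right):
    """Merge two partial indices; `left` covers earlier text than `right`."""
    merged = {word: list(plist) for word, plist in left.items()}
    for word, plist in right.items():
        merged.setdefault(word, []).extend(plist)
    return merged

def merge_all(dumps):
    """Merge pairwise, level by level; return (final index, number of levels)."""
    levels = 0
    while len(dumps) > 1:
        dumps = [merge_two(dumps[i], dumps[i + 1]) if i + 1 < len(dumps)
                 else dumps[i]
                 for i in range(0, len(dumps), 2)]
        levels += 1
    return (dumps[0] if dumps else {}), levels

words = ("that house has a garden the garden has many flowers "
         "the flowers are beautiful").split()
dumps = partial_indices(words, 2)
final, levels = merge_all(dumps)
print(len(dumps), levels)
print(final["garden"], final["the"])
```

Each level touches every occurrence once, so the work per level is linear in the text size and the number of levels grows with the logarithm of the number of dumps.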
 Inverted files are used to index text
 The indices are appropriate when the
text collection is large and semi-static
 If the text collection is volatile, online
searching is the only option
 Some techniques combine online and
indexed searching
 Vocabulary List
◦ Text preprocessing modules
 lexical analysis, stemming, stopwords
 Occurrences of Vocabulary Terms
◦ Inverted index creation
 term frequency in documents, document frequency
 Retrieval and Ranking Algorithm
 Query and Ranking Interfaces
 Browsing/Visualization Interface

Contenu connexe

Tendances

Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsVaibhav Khanna
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMSai Kumar Ale
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean modelVaibhav Khanna
 
Information retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic modelsInformation retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic modelsVaibhav Khanna
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxShivaVemula2
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrievalNanthini Dominique
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3alaa223
 
The vector space model
The vector space modelThe vector space model
The vector space modelpkgosh
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information RetrievalDishant Ailawadi
 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsMounia Lalmas-Roelleke
 
Distributed Query Processing
Distributed Query ProcessingDistributed Query Processing
Distributed Query ProcessingMythili Kannan
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval modelbaradhimarch81
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notesBAIRAVI T
 

Tendances (20)

Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEM
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Automatic indexing
Automatic indexingAutomatic indexing
Automatic indexing
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean model
 
Information retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic modelsInformation retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic models
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & Models
 
Distributed Query Processing
Distributed Query ProcessingDistributed Query Processing
Distributed Query Processing
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
 
Term weighting
Term weightingTerm weighting
Term weighting
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
 
Vector space model in information retrieval
Vector space model in information retrievalVector space model in information retrieval
Vector space model in information retrieval
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 

En vedette

An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted indexweedge
 
The search engine index
The search engine indexThe search engine index
The search engine indexCJ Jenkins
 
Using Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index ExplosionUsing Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index ExplosionLucidworks (Archived)
 
The Role of Enterprise Integration in Digital Transformation
The Role of Enterprise Integration in Digital TransformationThe Role of Enterprise Integration in Digital Transformation
The Role of Enterprise Integration in Digital TransformationKasun Indrasiri
 
Web technology: Web search
Web technology: Web searchWeb technology: Web search
Web technology: Web searchVictor de Boer
 
Product quantization for nearest neighbor search-report
Product quantization for nearest neighbor search-reportProduct quantization for nearest neighbor search-report
Product quantization for nearest neighbor search-reportLostmarble
 
Privacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataPrivacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataIGEEKS TECHNOLOGIES
 
Information seeking
Information seekingInformation seeking
Information seekingJohan Koren
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesItamar
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneJosiane Gamgo
 
Public key Cryptography & RSA
Public key Cryptography & RSAPublic key Cryptography & RSA
Public key Cryptography & RSAAmit Debnath
 
Information searching & retrieving techniques khalid
Information searching & retrieving techniques khalidInformation searching & retrieving techniques khalid
Information searching & retrieving techniques khalidKhalid Mahmood
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scriptingTony Fabeen
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 

En vedette (20)

An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted index
 
The search engine index
The search engine indexThe search engine index
The search engine index
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Using Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index ExplosionUsing Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index Explosion
 
The Role of Enterprise Integration in Digital Transformation
The Role of Enterprise Integration in Digital TransformationThe Role of Enterprise Integration in Digital Transformation
The Role of Enterprise Integration in Digital Transformation
 
Web technology: Web search
Web technology: Web searchWeb technology: Web search
Web technology: Web search
 
Product quantization for nearest neighbor search-report
Product quantization for nearest neighbor search-reportProduct quantization for nearest neighbor search-report
Product quantization for nearest neighbor search-report
 
Signature files
Signature filesSignature files
Signature files
 
Privacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud dataPrivacy preserving multi-keyword ranked search over encrypted cloud data
Privacy preserving multi-keyword ranked search over encrypted cloud data
 
Information seeking
Information seekingInformation seeking
Information seeking
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Solr
SolrSolr
Solr
 
Public key Cryptography & RSA
Public key Cryptography & RSAPublic key Cryptography & RSA
Public key Cryptography & RSA
 
Information searching & retrieving techniques khalid
Information searching & retrieving techniques khalidInformation searching & retrieving techniques khalid
Information searching & retrieving techniques khalid
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scripting
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Index types
Index typesIndex types
Index types
 

Similaire à Inverted index

Chapter 3 Indexing.pdf
Chapter 3 Indexing.pdfChapter 3 Indexing.pdf
Chapter 3 Indexing.pdfHabtamu100
 
Chapter 3 Indexing Structure.pdf
Chapter 3 Indexing Structure.pdfChapter 3 Indexing Structure.pdf
Chapter 3 Indexing Structure.pdfJemalNesre1
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfHabtamu100
 
Chapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrievalChapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrievalcaptainmactavish1996
 
Chapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfChapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfJemalNesre1
 
File Types in Data Structure
File Types in Data StructureFile Types in Data Structure
File Types in Data StructureProf Ansari
 
Ch 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingCh 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingZainab Almugbel
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionaryEditor IJMTER
 
Information_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdfInformation_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdflekhacce
 
lecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptxlecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptxpeter1097
 
Index Structures.pptx
Index Structures.pptxIndex Structures.pptx
Index Structures.pptxMBablu1
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrievalmghgk
 

Similaire à Inverted index (20)

Chapter 3 Indexing.pdf
Chapter 3 Indexing.pdfChapter 3 Indexing.pdf
Chapter 3 Indexing.pdf
 
Chapter 3 Indexing Structure.pdf
Chapter 3 Indexing Structure.pdfChapter 3 Indexing Structure.pdf
Chapter 3 Indexing Structure.pdf
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdf
 
Ir 03
Ir   03Ir   03
Ir 03
 
Chapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrievalChapter 2: Text Operation in information stroage and retrieval
Chapter 2: Text Operation in information stroage and retrieval
 
Search pitb
Search pitbSearch pitb
Search pitb
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Chapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfChapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdf
 
File Types in Data Structure
File Types in Data StructureFile Types in Data Structure
File Types in Data Structure
 
Ch 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingCh 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashing
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
 
Information_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdfInformation_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdf
 
Chapter13
Chapter13Chapter13
Chapter13
 
lecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptxlecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptx
 
Index Structures.pptx
Index Structures.pptxIndex Structures.pptx
Index Structures.pptx
 
Hashing
HashingHashing
Hashing
 
Data storage and indexing
Data storage and indexingData storage and indexing
Data storage and indexing
 
3_Indexing.ppt
3_Indexing.ppt3_Indexing.ppt
3_Indexing.ppt
 
G0361034038
G0361034038G0361034038
G0361034038
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
 

Dernier

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniquesugginaramesh
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 

Dernier (20)

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniques
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 

Inverted index

  • 1.
  • 2.  Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.  Structure of inverted file: ◦ Vocabulary: is the set of all distinct words in the text ◦ Occurrences: lists containing all information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)
  • 3.  Inverted file index is list of terms that appear in the document collection (called a lexicon or vocabulary) and for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.  Granularity of an index determines the accuracy of representation of the location of the word ◦ Coarse-grained index requires less storage and more query processing to eliminate false matches ◦ Word-level index enables queries involving adjacency and proximity, but has higher space requirements
  • 5. 5  Text:  Inverted file 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful beautiful flowers garden house 70 45, 58 18, 29 6 Vocabulary Occurrences
  • 6.  Prior example allows for boolean queries.  Need the document frequency and term frequency. Vocabulary entry Posting file entry k dk doc1 f1k doc2 f2k … dk : document frequency of term k doci : i-th document that contains term k fik : term frequency of term k in document i
  • 7. The space required for the vocabulary is rather small. According to Heaps' law, the vocabulary grows as O(n^β), where n is the size of the text and β is a constant between 0.4 and 0.6 in practice. Example: TREC-2 has 1 GB of text and a 5 MB lexicon. The occurrences, on the other hand, demand much more space: since each word appearing in the text is referenced once in that structure, the extra space is O(n). To reduce space requirements, a technique called block addressing is used.
  • 8. The text is divided into blocks, and the occurrences point to the blocks where the word appears. Advantages: there are fewer pointers than positions, and all occurrences of a word inside a single block collapse to one reference. Disadvantage: if exact positions are required, an online search over the qualifying blocks is needed.
  • 9. Text, divided into Block 1 to Block 4: "That house has a garden. The garden has many flowers. The flowers are beautiful". Inverted file (vocabulary → block occurrences): beautiful → 4; flowers → 3; garden → 2; house → 1.
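Block addressing can be sketched in Python as follows. The slide does not state where the block boundaries fall, so the sketch assumes blocks of four words each, a choice that happens to reproduce the table above. The function name `build_block_index` is hypothetical.

```python
import re
from collections import defaultdict


def build_block_index(text, words_per_block):
    """Map each lowercased word to the sorted list of 1-based blocks containing it."""
    index = defaultdict(list)
    words = re.findall(r"[A-Za-z]+", text)
    for i, word in enumerate(words):
        block = i // words_per_block + 1
        blocks = index[word.lower()]
        # Repeated occurrences inside one block collapse to a single reference.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return dict(index)


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
block_index = build_block_index(text, words_per_block=4)
for word in ("beautiful", "flowers", "garden", "house"):
    print(word, block_index[word])
```

Both occurrences of "garden" fall in block 2 and both occurrences of "flowers" fall in block 3, so each word needs only one pointer.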
  • 10. How big are inverted files, in relation to the original collection size? In each cell, the left value is with stopwords removed and the right value is with stopwords indexed. Blocks require the text to be available in order to locate terms within blocks.
       Index                   Small collection (1 MB)   Medium collection (200 MB)   Large collection (2 GB)
       Addressing words        45% / 73%                 36% / 64%                    35% / 63%
       Addressing 64K blocks   27% / 41%                 18% / 32%                    5% / 9%
       Addressing 256 blocks   18% / 25%                 1.7% / 2.4%                  0.5% / 0.7%
  • 11. The search algorithm on an inverted index follows three steps: 1. Vocabulary search: the words present in the query are located in the vocabulary. 2. Retrieval of occurrences: the lists of occurrences of all query words found are retrieved. 3. Manipulation of occurrences: the occurrences are processed to solve the query.
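The three steps can be sketched in Python for a conjunctive (AND) Boolean query. This is an illustration under assumptions: the index maps each term to a sorted list of document identifiers, and the names `search_and` and `intersect` as well as the sample data are hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted lists of document ids with a linear merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_and(index, query):
    """Return the ids of documents that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # 1. Vocabulary search: locate each query word in the vocabulary.
    if any(term not in index for term in terms):
        return []
    # 2. Retrieval of occurrences: fetch the list of each word found.
    lists = [index[term] for term in terms]
    # 3. Manipulation of occurrences: intersect the lists to solve the query.
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


doc_index = {"garden": [1, 2, 5], "flowers": [2, 3, 5], "house": [1]}
print(search_and(doc_index, "garden flowers"))
```

Other query types change only the third step, for example a union of the lists for OR queries, or a comparison of positions for phrase and proximity queries.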
  • 12. Searching an inverted file starts with the vocabulary, which is stored in a separate file. Structures used to store the vocabulary include: hashing, with O(1) lookup but no support for range queries; tries, with O(c) lookup, where c is the length of the word; and B-trees, with O(log v) lookup, where v is the vocabulary size. An alternative is simply to store the words in lexicographical order, which is cheaper in space and very competitive, with O(log v) cost.
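The lexicographically ordered alternative can be sketched in Python with the standard `bisect` module, which performs binary search on a sorted list. The function names `lookup` and `range_query` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def lookup(vocabulary, word):
    """Binary search in a sorted vocabulary: O(log v). Returns the index or None."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return i
    return None


def range_query(vocabulary, low, high):
    """All words w with low <= w <= high, which a hash table cannot answer directly."""
    return vocabulary[bisect_left(vocabulary, low):bisect_right(vocabulary, high)]


vocabulary = ["beautiful", "flowers", "garden", "house"]
print(lookup(vocabulary, "garden"))
print(range_query(vocabulary, "f", "h"))
```

The sorted array stores nothing besides the words themselves, which is why it is cheap in space, while the range query shows what the ordering buys over hashing.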
  • 13. The whole vocabulary is kept in a suitable data structure that stores, for each word, a list of its occurrences. Each word of each text in the corpus is read and searched for in the vocabulary. If it is not found, it is added to the vocabulary with an empty list of occurrences. The new position is then appended to the end of the word's list of occurrences.
  • 14. Once the text is exhausted, the vocabulary is written to disk together with the lists of occurrences. Two files are created: in the first file, each list of word occurrences is stored contiguously; in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time. The overall process takes O(n) worst-case time.
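The two-file layout can be sketched in Python as follows. Several details are assumptions made for illustration: an in-memory byte string stands in for the first file, each occurrence is stored as a 4-byte little-endian unsigned integer, and a vocabulary entry is a (word, byte offset, count) tuple. The names `serialize` and `read_occurrences` are hypothetical.

```python
import struct


def serialize(index):
    """index: word -> list of positions. Returns (vocabulary, postings_bytes)."""
    postings = bytearray()
    vocabulary = []  # (word, byte offset into postings, number of occurrences)
    for word in sorted(index):  # lexicographical order
        positions = index[word]
        vocabulary.append((word, len(postings), len(positions)))
        # Each list is written contiguously, right after the previous one.
        postings += struct.pack(f"<{len(positions)}I", *positions)
    return vocabulary, bytes(postings)


def read_occurrences(entry, postings):
    """Follow the pointer of one vocabulary entry into the postings bytes."""
    _, offset, count = entry
    return list(struct.unpack_from(f"<{count}I", postings, offset))


index = {"house": [6], "garden": [18, 29], "flowers": [45, 58], "beautiful": [70]}
vocabulary, postings = serialize(index)
print(vocabulary)
print(read_occurrences(vocabulary[2], postings))
```

Only the small vocabulary needs to stay in memory at search time; a list of occurrences is read from the larger structure only when its word appears in a query.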
  • 15. One option is to use the previous algorithm until main memory is exhausted. When no more memory is available, the partial index I_i obtained so far is written to disk and erased from main memory before continuing with the rest of the text. Once the text is exhausted, a number of partial indices I_i exist on disk. The partial indices are then merged to obtain the final index.
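  • 16. Figure: merging the partial indices in a binary fashion. Initial dumps: I1, I2, I3, I4, I5, I6, I7, I8. Level 1: I1..2, I3..4, I5..6, I7..8. Level 2: I1..4, I5..8. Level 3: I1..8 (final index). The merges are numbered 1 to 7 in the order they are performed: I1..2 (1), I3..4 (2), I1..4 (3), I5..6 (4), I7..8 (5), I5..8 (6), I1..8 (7).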
  • 17. The total time to generate the partial indices is O(n). The number of partial indices is O(n/M), where M is the amount of main memory available. Merging the O(n/M) partial indices requires log2(n/M) merging levels. The total cost of this algorithm is O(n log(n/M)).
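The dump-and-merge scheme can be sketched in Python. This is a simplified illustration, not the slides' exact procedure: the memory limit M is modelled as a maximum number of buffered occurrences (`max_entries`), partial indices are kept in a list rather than written to disk, and merging proceeds level by level. All function names are hypothetical.

```python
from collections import defaultdict


def build_partial_indices(words, max_entries):
    """Dump a partial index whenever max_entries occurrences are buffered."""
    partials, current, buffered = [], defaultdict(list), 0
    for position, word in enumerate(words, start=1):
        current[word].append(position)
        buffered += 1
        if buffered == max_entries:  # "memory" exhausted: dump and start afresh
            partials.append(dict(current))
            current, buffered = defaultdict(list), 0
    if buffered:
        partials.append(dict(current))
    return partials


def merge_two(a, b):
    """Merge two partial indices; b covers later text, so its lists are appended."""
    merged = {word: list(positions) for word, positions in a.items()}
    for word, positions in b.items():
        merged.setdefault(word, []).extend(positions)
    return merged


def merge_all(partials):
    """Merge adjacent pairs level by level; returns (final_index, levels)."""
    levels = 0
    while len(partials) > 1:
        partials = [
            merge_two(partials[i], partials[i + 1]) if i + 1 < len(partials)
            else partials[i]
            for i in range(0, len(partials), 2)
        ]
        levels += 1
    return (partials[0] if partials else {}), levels


words = ("that house has a garden the garden has many flowers "
         "the flowers are beautiful").split()
partials = build_partial_indices(words, max_entries=2)
final_index, levels = merge_all(partials)
print(len(partials), levels, final_index["garden"])
```

Each level reads and writes every occurrence once, which is O(n) work per level, and the number of levels grows with the logarithm of the number of partial indices. That is where the O(n log(n/M)) total comes from.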
  • 18. Inverted files are used to index text. The indices are appropriate when the text collection is large and semi-static. If the text collection is volatile, online searching is the only option. Some techniques combine online and indexed searching.
  • 19. Vocabulary list: text preprocessing modules (lexical analysis, stemming, stopwords). Occurrences of vocabulary terms: inverted index creation (term frequency in documents, document frequency). Retrieval and ranking algorithm. Query and ranking interfaces. Browsing/visualization interface.