• Definition: an inverted file is a word-oriented
mechanism for indexing a text collection in
order to speed up the searching task.
• Structure of an inverted file:
◦ Vocabulary: the set of all distinct words in the
text
◦ Occurrences: lists containing all information
necessary for each word of the vocabulary (text
position, frequency, documents where the word
appears, etc.)
• An inverted file index is a list of the terms that appear in the
document collection (called a lexicon or vocabulary). For each
term in the lexicon, it stores a list of pointers to all
occurrences of that term in the document collection. This
list is called an inverted list.
• The granularity of an index determines how accurately it
represents the location of a word
◦ A coarse-grained index requires less storage but more
query processing to eliminate false matches
◦ A word-level index enables queries involving adjacency
and proximity, but has higher space requirements
[Figure: structure of an inverted file. The vocabulary holds the indexed
terms and their number of occurrences; the posting file holds the
occurrence lists. The vocabulary could be a tree-like structure.]
• Text (the number above each word is the position at which it starts):
  1    6     12  16 18      25  29     36  40   45       54  58      66  70
  That house has a  garden. The garden has many flowers. The flowers are beautiful
• Inverted file:
  Vocabulary   Occurrences
  beautiful    70
  flowers      45, 58
  garden       18, 29
  house        6
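
A minimal Python sketch of how such a word-level index could be built. The stopword list is an assumption (the slide only shows that these words are absent from the vocabulary), and `build_positional_index` is a hypothetical helper name. The sketch records 1-based character offsets, so some values can differ by one from the figure depending on how punctuation and spaces are counted.

```python
import re
from collections import defaultdict

# Assumed stopword list: the slide's vocabulary omits these words,
# but the source does not give the exact list.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_positional_index(text):
    """Map each non-stopword to the 1-based character offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in STOPWORDS:
            index[word].append(match.start() + 1)
    return dict(index)

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_positional_index(text)
for word in sorted(index):
    print(word, index[word])
```

Keeping every position is what makes this a word-level index: adjacency and proximity queries can be answered from the lists alone.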
• The prior example supports Boolean queries.
• To go beyond Boolean queries, the index also needs the document
frequency and the term frequency.
  Vocabulary entry    Posting file entry
  k  d_k              doc_1 f_1k  doc_2 f_2k  …
d_k: document frequency of term k
doc_i: i-th document that contains term k
f_ik: term frequency of term k in document i
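
The entry layout above can be sketched in Python as follows. This is an illustration under simple assumptions: documents are whitespace-tokenized, document ids start at 1, and `build_frequency_index` is a hypothetical name.

```python
from collections import Counter, defaultdict

def build_frequency_index(docs):
    """Return {term: (document_frequency, [(doc_id, term_frequency), ...])}."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        counts = Counter(text.lower().split())
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    # The document frequency d_k is simply the length of the posting list.
    return {term: (len(plist), plist) for term, plist in postings.items()}

docs = ["garden flowers garden", "house garden", "flowers flowers flowers"]
freq_index = build_frequency_index(docs)
print(freq_index["garden"])   # (2, [(1, 2), (2, 1)])
print(freq_index["flowers"])  # (2, [(1, 1), (3, 3)])
```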
• The space required for the vocabulary is rather
small. According to Heaps’ law, the vocabulary
grows as O(n^β), where n is the size of the text and
β is a constant between 0.4 and 0.6 in practice
◦ TREC-2: 1 GB text, 5 MB lexicon
• The occurrences, on the other hand, demand
much more space. Since each word appearing
in the text is referenced once in that structure,
the extra space is O(n)
• To reduce space requirements, a technique
called block addressing is used
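
To make the Heaps' law growth rate on this slide concrete, here is a small sketch. The constant `k` and the choice β = 0.5 are assumptions for illustration only; real values depend on the collection.

```python
def heaps_vocabulary_size(n, k=1.0, beta=0.5):
    """Estimate the vocabulary size V = k * n**beta for a text of size n."""
    return k * n ** beta

small = heaps_vocabulary_size(1_000_000)
large = heaps_vocabulary_size(100_000_000)
# With beta = 0.5, 100 times more text gives only about 10 times more vocabulary.
print(large / small)
```

This sublinear growth is why the vocabulary stays small while the occurrences, which grow as O(n), dominate the space.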
• The text is divided into blocks
• The occurrences point to the blocks where the
word appears
• Advantages:
◦ there are fewer pointers, since blocks are fewer than positions
◦ all the occurrences of a word inside a single block
are collapsed to one reference
• Disadvantages:
◦ the qualifying blocks must be searched online if exact
positions are required
• Text (divided into four blocks):
  Block 1   Block 2   Block 3   Block 4
  That house has a garden. The garden has many flowers. The flowers are
  beautiful
• Inverted file:
  Vocabulary   Occurrences (block numbers)
  beautiful    4
  flowers      3
  garden       2
  house        1
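
A sketch of block addressing in Python. The block boundaries and the stopword list are assumptions chosen to be consistent with the occurrences shown in the example; the source does not spell them out. `find_in_blocks` illustrates the extra online scan needed when exact positions are required.

```python
import re
from collections import defaultdict

# Assumptions: block boundaries and stopwords are picked to match the
# example's occurrence column; the slide does not state them.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}
BLOCKS = [
    "That house has",
    "a garden. The garden",
    "has many flowers. The flowers",
    "are beautiful",
]

def build_block_index(blocks):
    """Map each non-stopword to the sorted block numbers that contain it."""
    index = defaultdict(set)
    for block_no, block in enumerate(blocks, start=1):
        for word in re.findall(r"[A-Za-z]+", block.lower()):
            if word not in STOPWORDS:
                index[word].add(block_no)  # repeats inside a block collapse
    return {word: sorted(nums) for word, nums in index.items()}

def find_in_blocks(word, blocks, block_index):
    """Scan only the qualifying blocks; return (block, 0-based offset) pairs."""
    hits = []
    for block_no in block_index.get(word, []):
        block = blocks[block_no - 1].lower()
        for m in re.finditer(r"\b" + re.escape(word) + r"\b", block):
            hits.append((block_no, m.start()))
    return hits

block_index = build_block_index(BLOCKS)
print(block_index["garden"])                          # [2]
print(find_in_blocks("garden", BLOCKS, block_index))  # [(2, 2), (2, 14)]
```

Both occurrences of "garden" cost a single pointer in the index, at the price of scanning block 2 to find where exactly they are.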
• How big are inverted files?
◦ Index size as a percentage of the original collection size
  ▪ in each pair of columns, the left one removes stopwords while the
    right one indexes them
  ▪ block addressing requires the text to be available in order to
    locate terms within blocks

  Index                   Small collection   Medium collection   Large collection
                          (1 MB)             (200 MB)            (2 GB)
  Addressing words        45%    73%         36%    64%          35%    63%
  Addressing 64K blocks   27%    41%         18%    32%          5%     9%
  Addressing 256 blocks   18%    25%         1.7%   2.4%         0.5%   0.7%
• The search algorithm on an inverted
index follows three steps:
1. Vocabulary search: the words present in
the query are located in the vocabulary
2. Retrieval of occurrences: the lists of the
occurrences of all query words found are
retrieved
3. Manipulation of occurrences: the
occurrences are processed to solve the
query
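
The three steps can be sketched for a conjunctive (AND) query as follows. A plain `dict` stands in for the vocabulary, and the document-level lists are made-up sample data, not taken from the slides.

```python
def search_all(query, index):
    """Return the sorted document ids that contain every query word."""
    # Step 1: vocabulary search (a dict lookup stands in for the vocabulary).
    terms = query.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Step 2: retrieve the occurrence list of every query word.
    lists = [index[term] for term in terms]
    # Step 3: manipulate the occurrences; here, intersect them for AND.
    lists.sort(key=len)  # start from the shortest list
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)
    return sorted(result)

index = {"garden": [1, 2, 4], "flowers": [1, 3, 4], "house": [2]}
print(search_all("garden flowers", index))  # [1, 4]
print(search_all("garden house", index))    # [2]
print(search_all("garden castle", index))   # []
```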
• Searching an inverted file starts with the vocabulary
◦ the vocabulary is stored in a separate file
• Structures used to store the vocabulary
include
◦ Hashing: O(1) lookup, does not support range
queries
◦ Tries: O(c) lookup, where c is the length of the word
◦ B-trees: O(log v) lookup, where v is the vocabulary size
• An alternative is simply storing the words in
lexicographical order
◦ cheaper in space and very competitive, with O(log v)
cost using binary search
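
A sketch of the sorted-vocabulary alternative using Python's standard `bisect` module. The `prefix_range` helper is hypothetical; it shows the kind of range query a hash table cannot answer, and it assumes the words contain no character at or above U+FFFF.

```python
from bisect import bisect_left

vocabulary = ["beautiful", "flowers", "garden", "house"]  # kept sorted

def lookup(word):
    """Binary search: return the word's slot in the vocabulary, or -1."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def prefix_range(prefix):
    """Return all words starting with prefix, using two binary searches."""
    lo = bisect_left(vocabulary, prefix)
    # Assumption: "\uffff" sorts after any character used in the vocabulary.
    hi = bisect_left(vocabulary, prefix + "\uffff")
    return vocabulary[lo:hi]

print(lookup("garden"))   # 2
print(lookup("tree"))     # -1
print(prefix_range("f"))  # ['flowers']
```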
• The whole vocabulary is kept in a suitable data
structure that stores, for each word, a list of
its occurrences
• Each word of each text in the corpus is
read and searched for in the vocabulary
• If it is not found, it is added to the
vocabulary with an empty list of occurrences
• The new position is then appended to the
word's list of occurrences
• Once the text is exhausted, the vocabulary is
written to disk with the lists of occurrences.
• Two files are created:
◦ in the first file, each list of word occurrences is
stored contiguously
◦ in the second file, the vocabulary is stored in
lexicographical order and, for each word, a pointer
to its list in the first file is also included. This allows
the vocabulary to be kept in memory at search time
• The overall process is O(n) worst-case time
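
A sketch of the two-file layout. The binary encoding (little-endian 32-bit integers via `struct`) and the `(word, byte_offset, count)` vocabulary entry are assumptions for illustration; an in-memory `io.BytesIO` stands in for the occurrences file.

```python
import io
import struct

def write_index(index, postings_file):
    """Write each occurrence list contiguously; return the sorted vocabulary.

    Each vocabulary entry is (word, byte_offset, count), that is, a pointer
    into the postings file plus the length of the list.
    """
    vocabulary = []
    for word in sorted(index):
        positions = index[word]
        vocabulary.append((word, postings_file.tell(), len(positions)))
        postings_file.write(struct.pack(f"<{len(positions)}I", *positions))
    return vocabulary

def read_occurrences(entry, postings_file):
    """Follow a vocabulary pointer and decode the list it points to."""
    _, offset, count = entry
    postings_file.seek(offset)
    return list(struct.unpack(f"<{count}I", postings_file.read(4 * count)))

index = {"house": [6], "garden": [18, 29], "flowers": [45, 58], "beautiful": [70]}
postings = io.BytesIO()               # stands in for the first file
vocab = write_index(index, postings)  # would be saved as the second file
print(vocab[1])                              # ('flowers', 4, 2)
print(read_occurrences(vocab[1], postings))  # [45, 58]
```

Only `vocab` needs to stay in memory at search time; each lookup then costs one seek into the occurrences file.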
• An option is to use the previous algorithm until
main memory is exhausted. When no more
memory is available, the partial index I_i
obtained so far is written to disk and
erased from main memory before continuing
with the rest of the text
• Once the text is exhausted, a number of
partial indices I_i exist on disk
• The partial indices are merged to obtain the
final index
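[Figure: binary merging of eight partial indices; the bracketed numbers
1 to 7 give the order in which the merges are performed.]
  Level 3 (final index):   I_1..8 [7]
  Level 2:                 I_1..4 [3]   I_5..8 [6]
  Level 1:                 I_1..2 [1]   I_3..4 [2]   I_5..6 [4]   I_7..8 [5]
  Initial dumps:           I_1  I_2  I_3  I_4  I_5  I_6  I_7  I_8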
• The total time to generate the partial indices is
O(n)
• The number of partial indices is O(n/M), where M is the
amount of main memory available
• Merging the O(n/M) partial indices requires
log2(n/M) merging levels
• The total cost of this algorithm is O(n log(n/M))
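
A sketch of level-by-level pairwise merging. It works on in-memory dictionaries purely for illustration (a real implementation would stream the partial indices from disk), and it assumes the partial indices are given in text order, so concatenating lists keeps positions sorted.

```python
def merge_two(a, b):
    """Merge two partial indices; b covers text that comes after a."""
    merged = {word: list(occ) for word, occ in a.items()}
    for word, occ in b.items():
        merged.setdefault(word, []).extend(occ)
    return merged

def merge_all(partials):
    """Merge pairwise, level by level; return (final_index, levels)."""
    levels = 0
    while len(partials) > 1:
        nxt = [merge_two(partials[i], partials[i + 1])
               for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:  # odd one out moves up unchanged
            nxt.append(partials[-1])
        partials = nxt
        levels += 1
    return (partials[0] if partials else {}), levels

partials = [{"w": [i]} for i in range(1, 9)]  # eight toy partial indices
final, levels = merge_all(partials)
print(levels)      # 3, i.e. log2(8)
print(final["w"])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each level touches every occurrence once, which is where the O(n) work per level in the cost above comes from.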
• Inverted files are used to index text
• The indices are appropriate when the
text collection is large and semi-static
• If the text collection is volatile, online
searching is the only option
• Some techniques combine online and
indexed searching
• Vocabulary List
◦ Text preprocessing modules
  ▪ lexical analysis, stemming, stopwords
• Occurrences of Vocabulary Terms
◦ Inverted index creation
  ▪ term frequency in documents, document frequency
• Retrieval and Ranking Algorithm
• Query and Ranking Interfaces
• Browsing/Visualization Interface
