TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
Inverted index
1.
2. Definition: an inverted file is a word-oriented
mechanism for indexing a text collection in
order to speed up the searching task.
Structure of inverted file:
◦ Vocabulary: is the set of all distinct words in the
text
◦ Occurrences: lists containing all information
necessary for each word of the vocabulary (text
position, frequency, documents where the word
appears, etc.)
3. Inverted file index is list of terms that appear in the
document collection (called a lexicon or vocabulary) and
for each term in the lexicon, stores a list of pointers to all
occurrences of that term in the document collection. This
list is called an inverted list.
Granularity of an index determines the accuracy of
representation of the location of the word
◦ Coarse-grained index requires less storage and more
query processing to eliminate false matches
◦ Word-level index enables queries involving adjacency
and proximity, but has higher space requirements
5. 5
Text:
Inverted file
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are
beautiful
beautiful
flowers
garden
house
70
45, 58
18, 29
6
Vocabulary Occurrences
6. Prior example allows for boolean
queries.
Need the document frequency and term
frequency.
Vocabulary entry Posting file entry
k dk doc1 f1k doc2 f2k …
dk : document frequency of term k
doci : i-th document that contains term k
fik : term frequency of term k in document i
7. The space required for the vocabulary is rather
small. According to Heaps’ law the vocabulary
grows as O(nβ
), where β is a constant between
0.4 and 0.6 in practice
◦ TREC-2: 1 GB text, 5 MB lexicon
On the other hand, the occurrences demand
much more space. Since each word appearing
in the text is referenced once in that structure,
the extra space is O(n)
To reduce space requirements, a technique
called block addressing is used
8. The text is divided in blocks
The occurrences point to the blocks where the
word appears
Advantages:
◦ the number of pointers is smaller than positions
◦ all the occurrences of a word inside a single block
are collapsed to one reference
Disadvantages:
◦ online search over the qualifying blocks if exact
positions are required
9. Text:
Inverted file
beautiful
flowers
garden
house
4
3
2
1
Vocabulary Occurrences
Block 1 Block 2 Block 3 Block 4
That house has a garden. The garden has many flowers. The flowers are
beautiful
10. How big are inverted files?
◦ In relation to original collection size
right column indexes stopwords while left removes
stopwords
Blocks require text to be available for location of
terms within blocks.
45%
27%
18%
73%
41%
25%
36%
18%
1.7%
64%
32%
2.4%
35%
5%
0.5%
63%
9%
0.7%
Addressing words
Addressing 256 blocks
Addressing 64K blocks
Index Small collection
(1Mb)
Medium collection
(200Mb)
Large collection
(2Gb)
11. The search algorithm on an inverted
index follows three steps:
1. Vocabulary search: the words present in
the query are located in the vocabulary
2. Retrieval occurrences: the lists of the
occurrences of all query words found are
retrieved
3. Manipulation of occurrences: the
occurrences are processed to solve the
query
12. Searching inverted files starts with vocabulary
◦ store the vocabulary in a separate file
Structures used to store the vocabulary
include
◦ Hashing : O (1) lookup, does not support range
queries
◦ Tries : O (c) lookup, c = length (word)
◦ B-trees : O (log v) lookup
An alternative is simply storing the words in
lexicographical order
◦ cheaper in space and very competitive with O(log
v) cost
13. All the vocabulary is kept in a suitable data
structure storing for each word and a list of
its occurrences
Each word of each text in the corpus is
read and searched for in the vocabulary
If it is not found, it is added to the
vocabulary with a empty list of occurrences
The new position is added to the end of its
list of occurrences for the word
14. Once the text is exhausted the vocabulary is
written to disk with the list of occurrences.
Two files are created:
◦ in the first file, each list of word occurrences is
stored contiguously
◦ in the second file, the vocabulary is stored in
lexicographical order and, for each word, a pointer
to its list in the first file is also included. This allows
the vocabulary to be kept in memory at search time
The overall process is O(n) worst-case time
15. An option is to use the previous algorithm until
the main memory is exhausted. When no
more memory is available, the partial index Ii
obtained up to now is written to disk and
erased the main memory before continuing
with the rest of the text
Once the text is exhausted, a number of
partial indices Ii exist on disk
The partial indices are merged to obtain the
final index
16. I 1...8
I 1...4 I 5...8
I 1...2 I 3...4 I 5...6 I 7...8
I 1 I 2 I 3 I 4 I 5 I 6 I 7 I 8
1 2 4 5
3 6
7
final index
initial dumps
level 1
level 2
level 3
17. The total time to generate partial indices is
O(n)
The number of partial indices is O(n/M)
To merge the O(n/M) partial indices are
necessary log2(n/M) merging levels
The total cost of this algorithm is O(n log(n/M))
18. Inverted files are used to index text
The indices are appropriate when the
text collection is large and semi-static
If the text collection is volatile online
searching is the only option
Some techniques combine online and
indexed searching
19. Vocabulary List
◦ Text preprocessing modules
lexical analysis, stemming, stopwords
Occurrences of Vocabulary Terms
◦ Inverted index creation
term frequency in documents, document frequency
Retrieval and Ranking Algorithm
Query and Ranking Interfaces
Browsing/Visualization Interface