Quasi Succinct Indices (WSDM’13)
Author: Sebastiano Vigna
Slides by: Han Jiang

Agenda
Related work
Representation of monotone sequences
  Practical example
  Theoretical estimation
Implementation details
  Index structure
  Miscellaneous
Experiments
Discussions

Related work
Why index compression:
  Saves disk space
  Reduces transfer overhead between disk and memory
  [Index compression is good, especially for random access, CIKM’07]
Two tricks at the basis of index compression:
  Instantaneous codes (or prefix codes), e.g. variable byte
  Gap encoding, e.g. [1, 3, 9] → [1, 2, 6]
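
The gap-encoding example above can be sketched in a few lines of Python. The function names `to_gaps` and `from_gaps` are made up for this illustration; the first value is kept as-is, which is one common convention.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace each value by its difference from the previous one (first value kept)."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    return list(accumulate(gaps))


gaps = to_gaps([1, 3, 9])
print(gaps)             # [1, 2, 6]
print(from_gaps(gaps))  # [1, 3, 9]
```

Small gaps are what make the instantaneous codes listed next worthwhile, since those codes spend fewer bits on small numbers.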

Related work +
Popular approaches:
  Variable Byte (VB, previously used in Lucene)
  Gamma/Delta encoding (at most 2 * theoretical lower bound)
  Golomb code (near the theoretical lower bound)
  PForDelta (block encoding, efficient and cache friendly)
  Unary: 8 → 000,000,001 (the most naive code, but efficient when combined with others; we’ll see this again)
  …
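
As a rough illustration of two of the codes listed above, here is a minimal sketch of unary and variable-byte coding. The helper names are hypothetical, and the VB byte layout shown (big-endian, high bit set on the last byte) is only one of several conventions in use; it is not claimed to be Lucene's exact layout.

```python
def unary(x):
    """x zeros followed by a terminating one, matching the slide's example for 8."""
    return "0" * x + "1"


def variable_byte(x):
    """7 payload bits per byte; in this sketch the high bit marks the final byte."""
    out = []
    while True:
        out.append(x & 0x7F)
        x >>= 7
        if x == 0:
            break
    out.reverse()
    out[-1] |= 0x80  # flag the last byte (assumed convention)
    return bytes(out)


print(unary(8))                 # 000000001
print(len(variable_byte(127)))  # 1
print(len(variable_byte(128)))  # 2
```

Under this layout every value below 128 costs a full byte, so five small gaps cost 40 bits no matter how small they are.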

Representation of monotone sequences
List = { 5, 8, 8, 15, 32 }
Binary (high | low): 1|01  10|00  10|00  11|11  1000|00
High parts: 1, 2, 2, 3, 8
d-gap of the high parts: 1, 1, 0, 1, 5
Unary code of the gaps (High): 01 01 1 01 000001
Low: 01 00 00 11 00
Total bits: 23 bits
Gamma: 23 bits
Delta: 22 bits
VB: 40 bits

Representation of monotone sequences +
Assume u is the upper bound of this list (e.g. u = 36) and n is its length
Then the lower width l is ⌊log2(u/n)⌋ (e.g. l = ⌊log2(36/5)⌋ = 2)
List = { 5, 8, 8, 15, 32 }
Binary (high | low): 1|01  10|00  10|00  11|11  1000|00
High: 01 01 1 01 000001
Low: 01 00 00 11 00
How do we decide where to split high and low bits?
Why don’t we apply d-gap before encoding?
We’ll leave these as implementation details
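
The high/low split can be sketched as follows. This is a minimal illustration using bit strings rather than packed words, assuming l = ⌊log2(u/n)⌋ as in the example; `qs_encode` and `qs_decode` are hypothetical names, not functions from the paper or from any library.

```python
from math import floor, log2


def qs_encode(values, u):
    """Split each value into l low bits and a unary-coded gap of its high part."""
    n = len(values)
    l = max(0, floor(log2(u / n)))
    low_mask = (1 << l) - 1
    high_bits, low_bits = [], []
    prev_high = 0
    for x in values:
        high = x >> l
        # gap between consecutive high parts: that many zeros, then a stop '1'
        high_bits.append("0" * (high - prev_high) + "1")
        low_bits.append(format(x & low_mask, "0{}b".format(l)) if l else "")
        prev_high = high
    return l, "".join(high_bits), "".join(low_bits)


def qs_decode(l, high, low):
    """Rebuild the values: zeros seen so far give the high part, low bits are fixed width."""
    values, zeros, i = [], 0, 0
    for bit in high:
        if bit == "0":
            zeros += 1
        else:
            low_part = int(low[i * l:(i + 1) * l], 2) if l else 0
            values.append((zeros << l) | low_part)
            i += 1
    return values


l, high, low = qs_encode([5, 8, 8, 15, 32], u=36)
print(l, high, low)              # 2 0101101000001 0100001100
print(len(high) + len(low))      # 23
print(qs_decode(l, high, low))   # [5, 8, 8, 15, 32]
```

Note that the d-gap is taken only on the high parts, and only implicitly through the unary code; the low bits are stored verbatim.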

Theoretical estimation
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
Binary (high | low): 1|01  10|00  10|00  11|11  1000|00
High: 01 01 1 01 000001
Low: 01 00 00 11 00
For n values, we need:
  n*l bits for the lower parts;
  n bits for the stop ‘1’s in the unary codes
  But what about the non-stop ‘0’s?
Note that we only unary-encode the higher bits:
  each ‘0’ increases the value by 2^l,
  so this increment can happen only a bounded number of times (q), because no value exceeds u.
That bounds the ‘0’ part, and adding the three parts gives the total size.
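
The counting argument above can be written out as a worked bound. This is a sketch of the standard size estimate for this kind of high/low split, assuming l = ⌊log2(u/n)⌋ and u ≥ n; the notation is chosen here and may differ from the formulas on the original slide.

```latex
\begin{align*}
\text{lower bits} &= n\,l, \qquad l = \lfloor \log_2 (u/n) \rfloor \\
\text{stop bits}  &= n \\
\text{zeros}      &\le \lfloor u / 2^{l} \rfloor \\
\text{total}      &\le n\,l + n + \lfloor u / 2^{l} \rfloor
                   \;\le\; n \left( 2 + \lceil \log_2 (u/n) \rceil \right)
\end{align*}
```

For the running example (n = 5, u = 36, l = 2) the middle expression gives 10 + 5 + 9 = 24 bits, against the 23 bits actually used.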

Theoretical estimation +
So what?
Let’s compare against the lower bound of the ‘best’ format:
  the information-theoretical lower bound for a non-strict monotone list of n elements within the interval [0, u] (its ≈ approximation can also be read as >)
and the upper bound for Quasi-succinct encoding.
It is proved that QS achieves a ‘quasi’ optimal result: “less than half a bit per element away”.
That’s why it’s called ‘quasi’ succinct…
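
For reference, the two quantities being compared can be stated explicitly. The count below is the usual stars-and-bars argument for non-decreasing sequences; it is given here as a hedged reconstruction and may not match the exact form shown on the slide.

```latex
\[
\#\{\, 0 \le x_0 \le x_1 \le \dots \le x_{n-1} \le u \,\} = \binom{u+n}{n},
\qquad
\text{lower bound} = \left\lceil \log_2 \binom{u+n}{n} \right\rceil \text{ bits}
\]
\[
\text{QS upper bound} \le n \left( 2 + \left\lceil \log_2 \frac{u}{n} \right\rceil \right) \text{ bits}
\]
```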

Short conclusion
No assumption on the distribution of document gaps
  Document reordering won’t affect index size much
General
  Works for sequences whether monotonic or not (e.g. frequencies, via prefix sums)
  Unary code is enough
  And we’ll see it works well for skipping
Simple
  A few unary reads and bit shifts

Index structure (no skipping)
Given a bound b, advance to the first x_i such that x_i >= b
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
High: 01 01 1 01 000001
Low: 01 00 00 11 00
It is easy to see that x_i must come after ⌊b / 2^l⌋ zeros in the high bits.
So, walking along the high bits, when we reach bit position p and have already passed j zeros, we must be in the middle of the code for x_(p-j).
This is why we don’t need d-gap on the original list: the unary high bits act as a ‘skip table’ with skip interval 2^l
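
The scan described above can be sketched as follows, again on bit strings for readability. `advance` is a hypothetical name; the sketch assumes the high bits store gaps as zeros followed by a stop ‘1’, and that the low bits are l bits per element in list order.

```python
def advance(l, high, low, b):
    """Return (index, value) of the first element >= b, or None if there is none."""
    target_high = b >> l
    zeros, i = 0, 0
    for bit in high:
        if bit == "0":
            zeros += 1
            continue
        # a stop '1': element i has high part == zeros
        if zeros >= target_high:
            # only now do the low bits need to be read
            low_part = int(low[i * l:(i + 1) * l], 2) if l else 0
            value = (zeros << l) | low_part
            if value >= b:
                return i, value
        i += 1
    return None


HIGH, LOW = "0101101000001", "0100001100"
print(advance(2, HIGH, LOW, 9))   # (3, 15)
print(advance(2, HIGH, LOW, 22))  # (4, 32)
print(advance(2, HIGH, LOW, 33))  # None
```

Elements whose high part is still below b >> l are passed without touching the low bits at all, which is the ‘skip table’ effect mentioned on the slide.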

Index structure + (with skipping)
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
High: 01 01 1 01 000001
Low: 01 00 00 11 00
The skipper can be surprisingly simple…
Note that, when scanning the higher bits table:
  p = current bit location
  i = number of ‘1’s we have read, telling us we’re reading X_i
  j = number of ‘0’s we have read, telling us the value of the higher bits is j
  i + j = p
So the skipper only needs to store the location p for every q unary codes (the value then follows as j = p - i = p - q)

Index structure ++ (example)
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
High: 0101101 | 000001 (the skip pointer lands after the first 4 unary codes)
Low: 01 00 00 11 00
Skip interval = 4, next pos = 7
Value before next skip = (pos - interval) * 2^l = 3 * 4 = 12
Advance target = 22
12 < 22, so we can skip, and then walk three more ‘0’ bits to reach 6 * 4 = 24 > 22
Complete the current unary code, then read the lower bits: the result is X4 = 32
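
The skipping example can be reproduced with a small sketch. The names `build_skips` and `advance_with_skips` are invented for this illustration, and the skip test compares high parts (zeros before the pointer versus b >> l), which is a slightly more conservative condition than the value comparison written on the slide.

```python
def build_skips(high, q):
    """Bit position just after every q-th '1' in the high-bits string."""
    skips, ones = [], 0
    for pos, bit in enumerate(high):
        if bit == "1":
            ones += 1
            if ones % q == 0:
                skips.append(pos + 1)
    return skips


def advance_with_skips(l, high, low, skips, q, b):
    """Like a plain scan, but jump over blocks of q elements that are all < b."""
    target_high = b >> l
    start, i = 0, 0
    for k, pos in enumerate(skips, start=1):
        # zeros before the pointer = high part of the last skipped element
        if pos - k * q < target_high:
            start, i = pos, k * q
        else:
            break
    zeros = start - i  # i + j = p
    for bit in high[start:]:
        if bit == "0":
            zeros += 1
            continue
        if zeros >= target_high:
            low_part = int(low[i * l:(i + 1) * l], 2) if l else 0
            value = (zeros << l) | low_part
            if value >= b:
                return i, value
        i += 1
    return None


HIGH, LOW = "0101101000001", "0100001100"
skips = build_skips(HIGH, q=4)
print(skips)                                          # [7]
print(advance_with_skips(2, HIGH, LOW, skips, 4, 22))  # (4, 32)
```

Each skip entry is just a bit position; the element index and the high value at that point both follow from i + j = p, so nothing else has to be stored.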

Index structure +++ (conceptual layout)
Size of each section:
  Metadata section: records n (number of elements), u (value upper bound), etc.
  Skip table: p*w bits (p: skip interval, w: data width)
  Lower bits: n*l bits (l: estimated width)
  Upper bits: size unknown without metadata, so put in the last section
For doc ids, the sequence is strictly monotonic
For doc freqs, the sequence is the ‘prefix sum of freq’, i.e. s_i = f_0 + f_1 + … + f_i
For positions, the format is a little different, and we’ll leave this for now
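
The prefix-sum trick for frequencies can be illustrated briefly. The function names are hypothetical; the point is only that a non-monotone list becomes monotone once summed, and that each original entry is recovered as a difference of neighbours.

```python
from itertools import accumulate


def freqs_to_monotone(freqs):
    """Prefix sums turn arbitrary positive counts into a monotone sequence."""
    return list(accumulate(freqs))


def freq_at(prefix_sums, i):
    """Recover the i-th frequency as the difference of two adjacent prefix sums."""
    return prefix_sums[i] - (prefix_sums[i - 1] if i > 0 else 0)


sums = freqs_to_monotone([3, 1, 4, 1])
print(sums)              # [3, 4, 8, 9]
print(freq_at(sums, 2))  # 4
```

The last prefix sum is the total term frequency, which also serves as the upper bound u when this sequence is encoded.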

Index structure ++++ (for dense sequence)
However, the format above is not efficient when the sequence is very dense…
Here we’ll encode the sequence as a bit sequence instead, where bit k is set when some X_i == k
List = { X0=1, X1=2, X2=3, X3=5, X4=7 }
Bits: 0 1 1 1 0 1 0 1 0
This only works for a ‘strictly monotone sequence’
A skipper entry is set for every q positions, and stores the number of ‘1’s before that position.
We’ll cut over to this format when n > u/3
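
The dense format can be sketched with a plain list of bits and a rank-style skipper. All names are hypothetical, and u = 8 is assumed for the example so that the bitmap has 9 positions.

```python
def to_bitmap(values, u):
    """Bit k is 1 exactly when k occurs in the (strictly increasing) list."""
    bits = [0] * (u + 1)
    for x in values:
        bits[x] = 1
    return bits


def build_rank_skips(bits, q):
    """For every position that is a multiple of q, the number of ones before it."""
    skips, ones = [], 0
    for pos, bit in enumerate(bits):
        if pos % q == 0:
            skips.append(ones)
        ones += bit
    return skips


def advance_dense(bits, skips, q, b):
    """Return (index, value) of the first element >= b, or None."""
    if b >= len(bits):
        return None
    block = b // q
    # index of the next element = ones before position b
    i = skips[block] + sum(bits[block * q:b])
    for pos in range(b, len(bits)):
        if bits[pos]:
            return i, pos
    return None


bits = to_bitmap([1, 2, 3, 5, 7], u=8)
skips = build_rank_skips(bits, q=4)
print(bits)                               # [0, 1, 1, 1, 0, 1, 0, 1, 0]
print(skips)                              # [0, 3, 5]
print(advance_dense(bits, skips, 4, 4))   # (3, 5)
```

Here the value is the bit position itself, so the skipper only has to supply the element index, which is needed to look up the matching frequency or positions.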

Miscellaneous (design of position list)
For a term t, all its position lists are stored as one sequence.
The length of this sequence is total_term_freq, and its upper bound is computed over all documents.
To retrieve the positions of document i, we need:
  Sum of freq over the previous documents
  Sum of p over the previous documents
  (also within the current document, if we need more frequent skips)
These will be stored in the skipper for the position list
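
One way to picture the single position sequence is sketched below. This is an assumption-laden illustration, not the paper's exact layout: each document's positions are shifted by a running base equal to the sum of (last position + 1) over the previous documents, and the helper names are invented here.

```python
def flatten_positions(per_doc_positions):
    """Concatenate per-document position lists into one increasing sequence."""
    sequence, starts, bases = [], [], []
    base = 0
    for positions in per_doc_positions:
        starts.append(len(sequence))   # sum of freqs of previous documents
        bases.append(base)             # assumed "sum of p" of previous documents
        sequence.extend(base + p for p in positions)
        base += positions[-1] + 1
    return sequence, starts, bases


def positions_of(sequence, starts, bases, freqs, i):
    """Recover document i's positions from its start index and base offset."""
    s = starts[i]
    return [v - bases[i] for v in sequence[s:s + freqs[i]]]


docs = [[0, 4], [2], [1, 3, 9]]
freqs = [len(d) for d in docs]
sequence, starts, bases = flatten_positions(docs)
print(sequence)                                        # [0, 4, 7, 9, 11, 17]
print(starts, bases)                                   # [0, 2, 3] [0, 5, 8]
print(positions_of(sequence, starts, bases, freqs, 2))  # [1, 3, 9]
```

The two per-document numbers used here correspond to the two sums listed on the slide: one locates the slice, the other removes the offset.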

Miscellaneous + (reuse logic)
High: 01 01 1 01 000001
To read past 4 values, we need unary decoding
To read past 4 ‘zero’s, we simply need ‘negated unary decoding’
Another view of the same higher bits:
High: 0 10 110 10 0 0 0 0 1
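
The reuse idea can be shown with a single helper that is parameterised on the bit it counts. `skip_symbol` is a hypothetical name; a real implementation would work on machine words rather than on a string.

```python
def skip_symbol(bits, pos, k, symbol):
    """Advance from pos past k occurrences of symbol; return the position after the k-th."""
    seen = 0
    while seen < k:
        if bits[pos] == symbol:
            seen += 1
        pos += 1
    return pos


HIGH = "0101101000001"
print(skip_symbol(HIGH, 0, 4, "1"))  # 7  (past 4 values: unary decoding)
print(skip_symbol(HIGH, 0, 4, "0"))  # 8  (past 4 zeros: negated unary decoding)
```

Counting ones moves by element index, counting zeros moves by high value, and the same loop serves both once the target bit is swapped.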

Experiments
Five competitors:
  Lucene 3.6 (VB) [sigh, not the latest version]
  MG4J (gamma/delta) [an old version written by the author]
  Zettair (VB)
  Kamikaze (PForDelta)
  Optimized PForDelta implementation in C
Four datasets with different statistics:
  TREC GOV2 (25M documents)
  .uk dataset (132M documents)
  Mimir index (1M documents)
  Tweet data (13M documents)
Aside from the whole HTML index, the title field is also extracted as another test group
To keep the tests fair between competitors, the input data is a pre-parsed stream of UTF-8 text documents.

Experiments + (compression)

Experiments ++ (speed)
Design of queries:
  150 queries from the Terabyte track (04~06), run as
    Conjunctive query
    Phrasal query
    Proximity query (query words must appear within a window of 16)
    Term scanning query (pure test)
Design of task:
  All engines are set up to return exactly one result
  The QS format is implemented in both Java and C++ for a fair test
  Since both Lucene and MG4J interleave doc ids and freqs, pure boolean queries are slowed down by reading unused freq data; QS* is a modified version that makes the test fair

Experiments +++ (speed)

Experiments ++++ (examples from old paper)
  Almost pure unary reads
  Without skipping
  With heavy skipping
  Heavy position addressing
Hmm… however, note that Lucene doesn’t have a skip table for position lists…

Discussion
A DocIdSet with this representation is already implemented in Lucene (https://issues.apache.org/jira/browse/LUCENE-5084)
  We’ll see a performance comparison soon!
Drawbacks?
  It might take more time during index construction:
    Many statistics are needed for encoding (upper bound, total_term_frq, etc.)
    It is possible to first store a postings list with VB in memory, then translate it to QS
To be digested…
  “storing positions with PForDelta codes is known to give a compression rate close to that provided by VB coding”?
Thank You!

Quasi succinct indices

  • 1. Quasi Succinct IndicesQuasi Succinct Indices ((WSDM’13)WSDM’13) Author:Author: Sebastiano VignaSebastiano Vigna Slides By:Slides By: Han JiangHan Jiang
  • 2. AgendaAgenda Related workRelated work Representation of monotone sequencesRepresentation of monotone sequences Practical examplePractical example Theoretical estimationTheoretical estimation Implementation detailsImplementation details Index structureIndex structure MiscellaneousMiscellaneous ExperimentsExperiments DiscussionsDiscussions
  • 3. Related workRelated work Why index compression:Why index compression: Saves disk spaceSaves disk space Reduce overhead between disk & memoryReduce overhead between disk & memory [Index compression is good, especially for random access, CIKM’07] Two tricks at the basis of index compression:Two tricks at the basis of index compression: Instantaneous codes (or prefix codes)Instantaneous codes (or prefix codes) e.g. Variable byte Gap encodingGap encoding e.g. [1, 3, 9]e.g. [1, 3, 9]  [1, 2, 6][1, 2, 6]
  • 4. Related work +Related work + Popular approaches:Popular approaches: Variable BytesVariable Bytes (VB, previously used in Lucene) Gamma/Delta encodingGamma/Delta encoding (at most 2*Theoretical lower bound) Golomb codeGolomb code (near theoretical lower bound) PForDeltaPForDelta (block encoding, efficient and cache friendly) Unary: 8Unary: 8  000,000,001000,000,001 (stupidest, but efficient when combined with others, we’ll see this again) ……
  • 5. AgendaAgenda Related work √Related work √ Representation of monotone sequencesRepresentation of monotone sequences Practical examplePractical example Theoretical estimationTheoretical estimation Implementation detailsImplementation details Index structureIndex structure MiscellaneousMiscellaneous ExperimentsExperiments DiscussionsDiscussions
  • 6. Representation of monotone sequencesRepresentation of monotone sequences 5 88 15 32 1 01 0010 0010 1111 1000 00 List = { } 00110001 008321 2 101 01 01 000001 5101 1 d-gap unary Total bits: 23 bitsTotal bits: 23 bits Gamma: 23 bitsGamma: 23 bits Delta: 22 bitsDelta: 22 bits VB: 40 bitsVB: 40 bits
  • 7. Assume uu is the upper bound of this list (e.g. u=36) Then lower width l is: (e.g. l=log(36/5)=2) 5 88 15 32 1 01 0010 0010 1111 1000 00 List = { } 101 01 01 000001 00110001 00High: Low: Representation of monotone sequences +Representation of monotone sequences + How to decide when splitting high/low bits? Why don’t we operate d-gap before encoding? We’ll leave it as implementation details
  • 8. X0=5 1 01 0010 0010 1111 1000 00 List = { } Theoretical estimationTheoretical estimation 101 01 01 000001 00110001 00High:High: Low: For each value, we need: n*L bits for lower part; n bits for stop ‘1’ in unary code But non-stop ‘0’s ? X1=8 X2=8 X3=15 X4=32 Note that we only unary encode higher bits, For each ‘0’, the value increases 2^l This increment will only happen q times: So the upper bound for this part is: Then in total:
  • 9. Theoretical estimation +Theoretical estimation + So what?So what? Let’s see the lower bound with ‘best’ format :Let’s see the lower bound with ‘best’ format : Upper bound for Quasi-succinct encoding:Upper bound for Quasi-succinct encoding: And it is proved that QS can achieve a ‘quasi’ optimalAnd it is proved that QS can achieve a ‘quasi’ optimal resultresult : “: “ less than half a bit per element away”.less than half a bit per element away”. That’s why it’s called ‘quasi’ succinct…That’s why it’s called ‘quasi’ succinct… The information-theoretical lower bound for a non-strict monotoneThe information-theoretical lower bound for a non-strict monotone list of n elements, within interval [0,u]: (thelist of n elements, within interval [0,u]: (the ≈ cancan also be replaced byalso be replaced by >))
  • 10. Short conclusion. No assumption about the distribution of document gaps: document reordering won't affect index size much. General: works for sequences whether monotonic or not. Unary code is enough, and we'll see it works well for skipping. Simple: a few unary reads and bit shifts.
  • 11. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details; Index structure; Miscellaneous; Experiments; Discussions
  • 12. Index structure (no skipping). Given a bound b, advance to x_i so that x_i >= b. List = {X0=5, X1=8, X2=8, X3=15, X4=32}. High: 01 01 1 01 000001. Low: 01 00 00 11 00. It is easy to see that x_i must come after floor(b/2^l) zeros. So, walking along the high-bits list, when we reach bit position p and have already passed j zeros, we must be in the middle of the unary code of x_(p-j). This is why we don't need d-gaps on the original list: the unary high bits act as a 'skip table' with skip interval 2^l.
  • 13. Index structure + (with skipping). List = {X0=5, X1=8, X2=8, X3=15, X4=32}. High: 01 01 1 01 000001. Low: 01 00 00 11 00. The skipper can be surprisingly simple… Note that, when scanning the higher-bits table: p = current bit location; i = number of '1's we have read, telling us we're reading X_i; j = number of '0's we have read, telling us the value of the higher bits is j; and i + j = p. So the skipper only needs to store the location p for every q unary codes (and the value j = p - i = p - q).
  • 14. Index structure ++ (example). List = {X0=5, X1=8, X2=8, X3=15, X4=32}. High: 01 01 1 01 000001. Low: 01 00 00 11 00. Skip interval = 4, next pos = 7. Value before next skip = (pos - interval) * 2^l = 3 * 4 = 12. The advance target is 22 > 12, so we can skip, and should walk three more bits to get 24 > 22. Then we complete the current unary code and read the lower bits: the result is X4 = 32.
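The advance operation with a skip table can be sketched as follows. This is a simplified illustration, not the paper's or Lucene's implementation: `next_geq` is a hypothetical name, the high bits are a Python string, and each skip entry is assumed to be a pair (bit position, number of '1's before it). A skip is taken only when the high part of the last skipped element is smaller than the high part of the target, which guarantees that no skipped element can be >= the target.

```python
def next_geq(high, low, l, b, skips=()):
    """Return (index, value) of the first element >= b, or None if there is none.

    high  -- unary-coded high bits as a string of '0'/'1'
    low   -- list of low-bit integers, one per element
    skips -- assumed skip entries (pos, ones): bit position right after `ones` '1's
    """
    i = p = 0
    for pos, ones in skips:
        zeros = pos - ones            # j = p - i: high part of the last skipped element
        if zeros < (b >> l):          # every skipped element is smaller than b
            p, i = pos, ones
        else:
            break
    j = p - i                         # zeros read so far
    while p < len(high):
        if high[p] == "0":
            j += 1                    # each '0' adds 2^l to the value
        else:
            value = (j << l) | low[i]
            if value >= b:
                return i, value
            i += 1
        p += 1
    return None

high = "0101101000001"
low = [0b01, 0b00, 0b00, 0b11, 0b00]
print(next_geq(high, low, 2, 22, skips=[(7, 4)]))  # (4, 32)
print(next_geq(high, low, 2, 8))                   # (1, 8)
```

With target 22 the single skip entry at position 7 is used, the scan passes the remaining zeros, and the final '1' yields X4 = 32, as on the slide.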
  • 15. Index structure +++ (conceptual layout). Size of each section: Metadata section: records n (number of elements), u (value upper bound), etc. Skip table: p*w bits (p: skip interval, w: data width). Lower bits: n*l bits (l: estimated width). Upper bits: size unknown without the metadata, so put in the last section. For doc ids, the sequence is strictly monotonic. For doc freqs, the sequence is the 'prefix sum of freq', i.e. f_0, f_0 + f_1, f_0 + f_1 + f_2, … For positions, the format is a little different, and we'll leave this for now.
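The prefix-sum trick for doc freqs is easy to show in code. The sketch below uses made-up frequencies and hypothetical function names; it only demonstrates that an arbitrary list of positive counts becomes a monotone sequence and can be recovered by differencing.

```python
from itertools import accumulate

def to_monotone(freqs):
    """Turn per-document frequencies into a monotone sequence of prefix sums."""
    return list(accumulate(freqs))

def from_monotone(prefix):
    """Recover the original frequencies by differencing consecutive prefix sums."""
    return [b - a for a, b in zip([0] + prefix[:-1], prefix)]

freqs = [3, 1, 4, 1, 5]            # illustrative values, not from the slides
prefix = to_monotone(freqs)
print(prefix)                      # [3, 4, 8, 9, 14]
print(from_monotone(prefix))       # [3, 1, 4, 1, 5]
```

The last prefix sum is the total term frequency, which serves as the upper bound u when this sequence is encoded.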
  • 16. Index structure ++++ (for dense sequence). However, the format is not efficient when the sequence is very dense… Here we encode the sequence as a bit vector instead, where bit k is set when X_i = k for some i. List = {X0=1, X1=2, X2=3, X3=5, X4=7} gives the bits 0 1 1 1 0 1 0 1 0 (bit 0 first). This only works for a strictly monotone sequence. A skipper is set for every q positions and stores the number of '1's before that position. We cut over to this format when n > u/3.
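A minimal sketch of the dense format, assuming a plain Python list as the bit vector and a rank table with one entry per q positions. All names are hypothetical, and u = 8 and q = 4 are illustrative choices for the slide's list.

```python
def use_dense(n, u):
    """Cutover rule from the slide: switch to the bit vector when n > u/3."""
    return n > u / 3

def dense_encode(values, u, q):
    """Return (bits, ranks): bits[k] == 1 iff k is in the list;
    ranks[t] is the number of '1's before position t*q (the skipper)."""
    bits = [0] * (u + 1)
    for x in values:
        bits[x] = 1
    ranks = [sum(bits[:k]) for k in range(0, u + 1, q)]
    return bits, ranks

def dense_next_geq(bits, ranks, q, b):
    """Return (index, value) of the first element >= b, or None."""
    if b >= len(bits):
        return None
    block = b // q
    i = ranks[block] + sum(bits[block * q:b])   # number of elements smaller than b
    for k in range(b, len(bits)):
        if bits[k]:
            return i, k
    return None

bits, ranks = dense_encode([1, 2, 3, 5, 7], u=8, q=4)
print(bits)                                  # [0, 1, 1, 1, 0, 1, 0, 1, 0]
print(ranks)                                 # [0, 3, 5]
print(dense_next_geq(bits, ranks, 4, 4))     # (3, 5)
print(use_dense(5, 8))                       # True
```

The rank table is what lets the reader report the element index i without counting '1's from the start of the vector.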
  • 17. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details; Index structure √; Miscellaneous; Experiments; Discussions
  • 18. Miscellaneous (design of position list). For a term t, all its position lists are stored as one sequence: [formula not preserved]. The length of this sequence is total_term_freq, and the upper bound is: [formula not preserved]. To recover the positions of document i, we need: the sum of frq over previous documents; the sum of p over previous documents (and also within the current document, if we need more frequent skips). These will be stored in the skipper for the position list.
  • 19. Miscellaneous + (reuse logic). High: 01 01 1 01 000001. To read past 4 values, we need unary decoding. Another aspect of the higher bits, the same bit string grouped by its '0's: High: 0 10 110 10 0 0 0 0 1. To read past 4 zeros, we simply need 'negated unary decoding'.
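The two readings of the same high-bits string can be reproduced with one routine that takes the stop bit as a parameter. This is a sketch with a hypothetical helper name; a real decoder would work on machine words rather than strings.

```python
def split_codes(bits, stop):
    """Group a bit string into codes that each end with the given stop bit.

    stop='1' gives ordinary unary codes; stop='0' gives 'negated' unary codes.
    Any trailing bits without a stop bit are returned as a final partial group.
    """
    codes, cur = [], ""
    for bit in bits:
        cur += bit
        if bit == stop:
            codes.append(cur)
            cur = ""
    if cur:
        codes.append(cur)
    return codes

high = "0101101000001"
print(" ".join(split_codes(high, "1")))  # 01 01 1 01 000001
print(" ".join(split_codes(high, "0")))  # 0 10 110 10 0 0 0 0 1
```

Skipping four values means consuming four groups in the first reading; skipping four zeros means consuming four groups in the second, so the same decoding logic serves both.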
  • 20. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details √; Index structure √; Miscellaneous √; Experiments; Discussions
  • 21. Experiments. Five competitors: Lucene 3.6 (VB) [sigh, not the latest version]; MG4J (gamma/delta) [an old version written by the author]; Zettair (VB); Kamikaze (PForDelta); an optimized PForDelta implementation in C. Four datasets with different statistics: TREC GOV2 (25M documents); .uk dataset (132M documents); Mimir index (1M documents); Tweet data (13M documents). Aside from the whole HTML index, the title field is also extracted as another test group. To make sure the tests are fair between competitors, the input data is a pre-parsed stream of UTF-8 text documents.
  • 23. Experiments ++ (speed). Design of queries: 150 queries from the Terabyte track (04~06), run as: Conjunctive Query; Phrasal Query; Proximity Query (query words must appear within a window of 16); Term Scanning Query (pure test). Design of task: all engines are set up to return exactly one result. The QS format is implemented in both Java and C++ for a fair test. Since both Lucene and MG4J interleave doc ids and freqs, pure boolean queries are slowed down by reading unused freq data; QS* is a modified version that makes the test fair.
  • 25. Experiments ++++ (examples from old paper). Figure annotations: almost pure unary reads; without skipping; with heavy skipping; heavy position addressing. Hmm… however, note that Lucene doesn't have a skip table for the position list…
  • 26. Discussion. A DocIdSet with this representation is already implemented in Lucene (https://issues.apache.org/jira/browse/LUCENE-5084). We'll see a performance comparison soon! Drawbacks? It might take more time during index construction: many statistics are needed for encoding (upper bound, total_term_frq, etc.). It is possible to first store a postings list with VB in memory and then translate it to QS. To be digested… "storing positions with PForDelta codes is known to give a compression rate close to that provided by VB coding"?

Editor's notes

  1. Introduction?
  2. And of course, IPC
  3. Consider a basket holding the u+1 numbers 0..u. Each time we pick one, we put it back into the basket, so the number of possible combinations of n picks is C(u+n, n). This is also the number of solutions of X0 + X1 + … + Xu = n (Xi >= 0). When the sequence is strictly monotonic, the lower bound is Z ≈ n*log(u/n), so QS achieves an index size of Z + O(n).
  4. Later, when we discuss the position list, we'll mention why doc freq is encoded like this.
  5. That's why we need to encode the frq list as a monotone sequence.