Compressing column-oriented indexes

                        Daniel Lemire

  http://www.professeurs.uqam.ca/pages/lemire.daniel.htm
            blog: http://www.daniel-lemire.com/

Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc).




                   November 19, 2009




                   Daniel Lemire   Compressing column-oriented indexes
Row Stores




    name, date, age, sex, salary
    name, date, age, sex, salary
    name, date, age, sex, salary
    name, date, age, sex, salary
    name, date, age, sex, salary

    Dominant paradigm
    Transactional: Quick append and delete




Column Stores



    name   date   age   sex   salary

    Goes back to StatCan in the seventies [Turner et al., 1979]
    Made fashionable again in Data Warehousing by
    Stonebraker [Stonebraker et al., 2005]
    New: Oracle Exadata hybrid columnar compression
    Favors run-length encoding (compression)




Main column-oriented indexes




     (1) Bitmap indexes [O’Neil, 1989]
     (2) Projection indexes [O’Neil and Quass, 1997]
     Both are compressible.




Bitmap indexes



  SELECT * FROM T WHERE x=a AND y=b;

  Bitmap indexes have a long history. (1972 at IBM.)
  Long history with DW & OLAP. (Sybase IQ since mid 1990s.)
  Main competition: B-trees.

  Above, compute
   {r | r is the row id of a row where x = a} ∩
   {r | r is the row id of a row where y = b}
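
  The intersection above can be sketched in a few lines of Python. The toy table `T` and the helper name `row_ids` are assumptions made up for illustration; a real bitmap index would store the row-id sets as bitmaps rather than Python sets.

```python
# Toy table; the rows and column values are illustrative assumptions.
T = [
    {"x": "a", "y": "b"},
    {"x": "a", "y": "c"},
    {"x": "d", "y": "b"},
    {"x": "a", "y": "b"},
]

def row_ids(table, column, value):
    """Return the set of row ids r where table[r][column] == value."""
    return {r for r, row in enumerate(table) if row[column] == value}

# SELECT * FROM T WHERE x=a AND y=b  ->  intersect the two row-id sets.
matches = row_ids(T, "x", "a") & row_ids(T, "y", "b")
result = [T[r] for r in sorted(matches)]
```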




Bitmaps and fast AND/OR operations


     Computing the union of two sets of integers between 1 and 64
     (e.g., row ids of a trivial table). . .
     E.g., {1, 5, 8} ∪ {1, 3, 5}?
     Can be done in one operation by a CPU:
     BitwiseOR( 10001001, 10101000)
     Extend to sets from 1..N using N/64 operations.
     To compute [a0 , . . . , aN−1 ] ∨ [b0 , b1 , . . . , bN−1 ] :
     a0 , . . . , a63 BitwiseOR b0 , . . . , b63 ;
     a64 , . . . , a127 BitwiseOR b64 , . . . , b127 ;
     a128 , . . . , a191 BitwiseOR b128 , . . . , b191 ;
     ...
     It is a form of vectorization.
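
     A minimal sketch of the word-by-word union, assuming bitmaps are stored as Python lists of 64-bit words. The helper names `to_words`, `bitwise_or` and `to_set` are hypothetical, and bit positions inside a word are an arbitrary choice of this sketch.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def to_words(values, n):
    """Pack a set of integers in 1..n into a list of 64-bit words."""
    words = [0] * ((n + WORD_BITS - 1) // WORD_BITS)
    for v in values:
        i = v - 1                      # 0-based bit position
        words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words

def bitwise_or(a, b):
    """Union of two equal-length bitmaps, one OR per 64-bit word."""
    return [(x | y) & MASK for x, y in zip(a, b)]

def to_set(words):
    """Unpack a list of words back into a set of integers (1-based)."""
    return {w * WORD_BITS + i + 1
            for w, word in enumerate(words)
            for i in range(WORD_BITS) if (word >> i) & 1}

# The example from the slide: {1, 5, 8} ∪ {1, 3, 5} with a single word.
union = to_set(bitwise_or(to_words({1, 5, 8}, 64), to_words({1, 3, 5}, 64)))
```

     For sets drawn from 1..N the same `bitwise_or` call performs about N/64 word operations.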


Common applications of the bitmaps




     The Java language has had a bitmap class since the
     beginning: java.util.BitSet. (Sun’s implementation is based
     on 64-bit words.)
     Search engines use bitmaps to filter queries, e.g. Apache
     Lucene




Bitmap compression



     A column with n rows and L distinct values ⇒ nL bits
     E.g., $n = 10^6$, $L = 10^4$ → 10 Gbits
     Uncompressed bitmaps are often impractical
     Moreover, bitmaps often contain long streams of zeroes. . .
     Logical operations over these zeroes are a waste of CPU cycles.

     column    index bitmaps (n rows, L bitmaps)
       x       x=1   x=2   x=3
       1        1     0     0
       3        0     0     1
       1        1     0     0
       2        0     1     0
      ...      ...   ...   ...
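
     As an illustration, an uncompressed bitmap index over the example column can be built as follows. This is only a sketch: the function name `build_bitmap_index` is hypothetical and bitmaps are plain Python lists of 0/1 for readability.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap (list of 0/1), one bit per row."""
    n = len(column)
    index = {}
    for r, value in enumerate(column):
        if value not in index:
            index[value] = [0] * n     # new bitmap, all zeroes
        index[value][r] = 1            # row r holds this value
    return index

column = [1, 3, 1, 2]                  # the example column x
index = build_bitmap_index(column)
uncompressed_bits = len(column) * len(index)   # n * L
```

     With n = 10**6 and L = 10**4 the same product gives 10**10 bits, the 10 Gbits quoted above.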




How to compress bitmaps?




     Must handle long streams of zeroes efficiently ⇒
     Run-length encoding? (RLE)
     Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, . . .
     So just encode the run lengths, e.g.,
     0001111100010111 →
      3, 5, 3, 1,1,3
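
     A minimal sketch of this encoding, assuming the bitmap is given as a string of '0' and '1' characters. By convention in this sketch the first run counts zeroes, so a bitmap starting with 1 gets a leading run of length 0.

```python
from itertools import groupby

def rle_encode(bits):
    """Lengths of alternating runs, starting with a (possibly empty) run of 0s."""
    lengths = [len(list(group)) for _, group in groupby(bits)]
    if bits.startswith("1"):
        lengths.insert(0, 0)
    return lengths

def rle_decode(lengths):
    """Rebuild the bit string: even positions are runs of 0s, odd are runs of 1s."""
    return "".join(("0" if i % 2 == 0 else "1") * n
                   for i, n in enumerate(lengths))

runs = rle_encode("0001111100010111")
```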




Compressing better with delta codes


      RLE can make things worse. E.g., with 8-bit counters,
      11 may become 000000101.
      How many bits to use for the counters?
      Universal codes such as delta codes use no more than c log x
      bits to represent value x.
      Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is
      4, etc.
      Delta codes build on Gamma codes. Encoding has two steps, starting from
      $x = 2^N + (x \bmod 2^N)$.
          Write N − 1 as gamma code;
          write $x \bmod 2^N$ as an N-bit number.
      E.g. $17 = 2^4 + 1$, 0010001
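
      The two steps can be sketched as follows. This follows the simplified recipe on this slide, not the standard Elias gamma and delta codes, which differ in detail. Assumptions of the sketch: the prefix writes k ≥ 1 as k − 1 zeroes followed by a one (so 3 is 001, as above), the remainder is written on N bits (which is what the 17 example requires), and only x ≥ 4 is covered. The function names are hypothetical.

```python
def prefix_code(k):
    """Write k >= 1 as k-1 zeros followed by a one (so 3 -> '001')."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return "0" * (k - 1) + "1"

def encode(x):
    """Encode x >= 4 as prefix_code(N - 1) followed by x mod 2^N on N bits."""
    if x < 4:
        raise ValueError("this sketch only covers x >= 4")
    n = x.bit_length() - 1             # x = 2^n + (x mod 2^n)
    remainder = x - (1 << n)
    return prefix_code(n - 1) + format(remainder, "0{}b".format(n))

def decode(code):
    """Invert encode(): read the prefix, then n remainder bits."""
    zeros = code.index("1")            # number of leading zeros
    n = zeros + 2                      # the prefix encodes n - 1 = zeros + 1
    remainder = int(code[zeros + 1:zeros + 1 + n], 2)
    return (1 << n) + remainder

code_17 = encode(17)
```

      Each code uses (N − 1) + N bits, that is about 2 log2 x bits, in line with the c log x bound.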



RLE with delta codes is pretty good




  In some (weak) sense, RLE compression with delta codes is
  optimal!
  Theorem
  A bitmap index over an N-value column of length n, compressed
  with RLE and delta codes, uses O(n log N) bits.




Byte/Word-aligned RLE



      RLE variants can focus on runs that align with machine-word
      boundaries.
      Trade compression for speed.
      That is what Oracle is doing.
      Variants: BBC (byte aligned), WAH
      Our EWAH (known to Wu as WBC) extends Wu et al.’s
      word-aligned hybrid (WAH).
  0101000000000000 000. . . 000 000. . . 000 0011111111111100 . . .
  ⇒ dirty word,   run of 2 “clean 0” words,       dirty word. . .
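
  The grouping into dirty words and clean runs can be sketched as below. This only illustrates the idea with 16-bit words, as in the example; it is not the actual on-disk format of BBC, WAH or EWAH, and the function name and the tuple layout are assumptions.

```python
def word_aligned_runs(bits, word_size=16):
    """Split a bit string into words and group them into
    ('dirty', word), ('clean0', count) and ('clean1', count) items."""
    if len(bits) % word_size:
        raise ValueError("length must be a multiple of word_size")
    out = []
    for i in range(0, len(bits), word_size):
        word = bits[i:i + word_size]
        if word == "0" * word_size:
            kind = "clean0"
        elif word == "1" * word_size:
            kind = "clean1"
        else:
            out.append(("dirty", word))        # mixed word, stored verbatim
            continue
        if out and out[-1][0] == kind:
            out[-1] = (kind, out[-1][1] + 1)   # extend the current clean run
        else:
            out.append((kind, 1))
    return out

example = "0101000000000000" + "0" * 32 + "0011111111111100"
items = word_aligned_runs(example)
```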




What are bitmap indexes for?




     Construction time is proportional to index size. (Data is
     written sequentially on disk.)
     Implementation scales to millions of bitmaps.
     Myth: bitmap indexes are for low cardinality columns.
              the Bitmap index is the conclusive choice for data
              warehouse design for columns with high or low
              cardinality [Zaker et al., 2008].




What about other compression types?




     With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in
     time O(|B1 | + |B2 |).
     Hence, with RLE, compression saves both storage and CPU
     cycles!
     Not always true with other techniques such as Huffman,
     LZ77, Arithmetic Coding, . . .
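
     To illustrate why the cost depends on the compressed sizes only, here is a sketch of OR computed directly on run-length encoded bitmaps. It assumes each bitmap is a list of (bit, run_length) pairs with positive lengths and that both bitmaps cover the same number of bits; the name `rle_or` is hypothetical.

```python
def rle_or(a, b):
    """OR two bitmaps given as lists of (bit, run_length) pairs.

    Every iteration finishes at least one input run, so the running
    time is O(len(a) + len(b)), independent of the uncompressed length.
    """
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0        # bits left in the current run of a
    rem_b = b[0][1] if b else 0        # bits left in the current run of b
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        bit = a[i][0] | b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)   # merge with previous run
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

b1 = [(0, 3), (1, 5), (0, 8)]          # 0001111100000000
b2 = [(0, 6), (1, 4), (0, 6)]          # 0000001111000000
merged = rle_or(b1, b2)
```

     Replacing `|` with `&` in the loop gives the AND of the two bitmaps at the same cost.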




What happens when you have many bitmaps?




     Consider $B_1 \lor B_2 \lor \ldots \lor B_N$.
     First compute the first two: $B_1 \lor B_2$ in time $O(|B_1| + |B_2|)$.
     $|B_3 \lor B_4|$ is in $O(|B_3| + |B_4|)$.
     Thus $(B_1 \lor B_2) \lor (B_3 \lor B_4)$ takes $O(2 \sum_i |B_i|)$. . .
     Total is in $O(\sum_{i=1}^{N} |B_i| \log N)$, can be
     generalized [Lemire et al., 2009].
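
     The pairwise strategy can be sketched as follows. Python integers stand in for compressed bitmaps purely for brevity, and the name `or_many` is hypothetical; with RLE-compressed inputs each round costs at most the sum of the input sizes, and there are about log2 N rounds.

```python
def or_many(bitmaps):
    """OR a list of bitmaps by combining neighbours pairwise, round by round."""
    if not bitmaps:
        return 0
    level = list(bitmaps)
    while len(level) > 1:
        # Combine (B1, B2), (B3, B4), ... in this round.
        nxt = [level[k] | level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:             # odd one out moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

combined = or_many([0b0001, 0b0100, 0b1000, 0b0001, 0b0010])
```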




How do 64-bit words compare to 32-bit words?




     We implemented EWAH using 16-bit, 32-bit and 64-bit words;
     Only 32-bit and 64-bit are efficient;
     64-bit indexes are nearly twice as large;
     64-bit indexes are between 5% and 40% faster (despite higher
     I/O costs).




Open Source Software?




     Lemur Bitmap Index C++ Library:
     http://code.google.com/p/lemurbitmapindex/.
     JavaEWAH: A compressed alternative to the Java BitSet class
     http://code.google.com/p/javaewah/.




Projection indexes




   SELECT sum(number*price) FROM T;

   Simply write out the values sequentially.
   Ideal for low selectivity queries on few columns.
   Compressible with RLE.
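
   A small sketch of the idea, assuming each projection index is just a Python list holding one column in row order. The data values and the helper name `rle` are made up for illustration.

```python
# Two projection indexes: each column is simply its values in row order.
number = [2, 2, 2, 5, 5, 1]            # illustrative data
price = [10, 10, 10, 3, 3, 7]

# SELECT sum(number*price) FROM T;  scans only the two columns it needs.
total = sum(n * p for n, p in zip(number, price))

def rle(column):
    """Compress a column into [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1           # same value: extend the current run
        else:
            runs.append([v, 1])
    return runs

compressed_number = rle(number)
```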




Improving compression by sorting the table




      RLE schemes are order-sensitive:
      they compress sorted tables better;
      But finding the best row ordering is
      NP-hard [Lemire et al., 2009].
      So we sort:
          lexicographically
          with Gray codes
          Hilbert, . . .
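
      The effect of sorting can be checked by counting runs of identical values in each column before and after a lexicographic sort. The tiny table and the helper name `column_runs` are assumptions for illustration.

```python
def column_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    if not rows:
        return 0
    total = 0
    for col in zip(*rows):
        # One run to start with, plus one for every change of value.
        total += 1 + sum(1 for prev, cur in zip(col, col[1:]) if prev != cur)
    return total

rows = [("b", 1), ("a", 2), ("b", 2), ("a", 1), ("b", 1), ("a", 2)]
before = column_runs(rows)
after = column_runs(sorted(rows))      # lexicographic row order
```

      On this table the run count drops from 10 to 6 after sorting.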




How many ways to sort? (1)



    Lexicographic row sorting
    is
        fast, even for very
        large tables.
        easy: sort is a Unix
        staple.
    Substantial index-size
    reductions (often 2.5
    times, benefits grow with
    table size)




How many ways to sort? (2)

    Gray codes are lists of
    tuples with successive
    (Hamming) distance of
    1 [Knuth, 2005,
    § 7.2.1.1].
    Reflected Gray Code order
    is
        sometimes slightly
        better than
        lexicographical. . .
        . . . but benefit goes as
        ≈ 1/N with column
        cardinality N
        poorly supported by
        existing software.
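
    For binary columns only, reflected Gray code order can be obtained with an ordinary sort and a custom key, as sketched below. The key function name is hypothetical, and columns with more than two values would need a mixed-radix variant that is not shown here.

```python
from itertools import product

def gray_rank_key(bits):
    """Sort key: position of a bit tuple in reflected Gray code order.

    The i-th key bit is the XOR of the first i input bits (the usual
    Gray-to-binary conversion), so comparing keys lexicographically
    orders tuples as the reflected Gray code does.
    """
    key, parity = [], 0
    for b in bits:
        parity ^= b
        key.append(parity)
    return tuple(key)

ordered = sorted(product((0, 1), repeat=3), key=gray_rank_key)
```

    Successive tuples in `ordered` differ in exactly one position.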


How many ways to sort? (3)




    Reflected Gray Code order
    is not the only Gray code.
    Knuth also presents
    Modular Gray-code.
    But alternatives to
    reflected are never better?




How many ways to sort? (4)




                                            Can also try esoteric
                                            orders.
                                            Hilbert Index
                                            [Hamilton and Rau-Chaplin, 2007]
                                            Gives very bad results for
                                            column-oriented indexes.




Modelling the size of an index




      Any formal result?
      Tricky: There are many variations on RLE.
      Use: number of runs of identical value in a column




Recursive orders


  Lexicographical, reflected Gray code and modular Gray code
  belong to a larger class:
  Definition
  A recursive order over c-tuples is such that it generates a recursive
  order over (c − 1)-tuples. All orders over 1-tuples are recursive.

    This is a recursive order:                     This is not recursive:
            1 0 0                                        1 0 0
            1 0 1                                        0 1 1
            0 1 1                                        1 0 1




When sorting, column order matters




  Question
  Given a phone directory, to minimize the number of runs, should
  we sort by first or last names?




When sorting, column order matters



      c columns
      any recursive order
      in practice, column order is very significant (factor of two or
      more)

  Proposition
  The number of column runs varies by a factor of ≈ c under the
  permutation of the columns.




But column reordering fails to buy optimality




  For some tables. . .
  Lemma
  No recursive order minimizes the number of runs—even after
  reordering the columns.

      Open problem: how far from optimality?




Best column order?


  We almost have this result [Lemire and Kaser, in preparation]:
      any recursive order
      order the columns by increasing cardinality (small to
      LARGE)

  Proposition

  The expected number of runs is minimized.

      Truth is complicated.
      Assume uniformly distributed tables.
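
      A quick experiment on the simplest uniform case, a complete table containing every combination of two columns, illustrates the heuristic. The cardinalities 2 and 100 and the helper names are arbitrary choices for this sketch; real tables are rarely this regular.

```python
from itertools import product

def column_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    total = 0
    for col in zip(*rows):
        total += 1 + sum(1 for prev, cur in zip(col, col[1:]) if prev != cur)
    return total

def runs_after_sort(rows, column_order):
    """Reorder the columns, sort rows lexicographically, count runs."""
    permuted = sorted(tuple(row[c] for c in column_order) for row in rows)
    return column_runs(permuted)

# Complete table: every combination of a 2-value and a 100-value column.
rows = list(product(range(2), range(100)))
small_first = runs_after_sort(rows, (0, 1))   # cardinality 2, then 100
large_first = runs_after_sort(rows, (1, 0))   # cardinality 100, then 2
```

      Here putting the low-cardinality column first gives 202 runs against 300 for the opposite order.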




What about non-uniform or dependent columns?




     Real columns have skewed distributions [Missaoui et al., 2007]
     and they are statistically dependent.
     It can impact column ordering in unpredictable ways.




Take away messages




     Column stores are good because of RLE and sorting;
     Lexicographical sort with right column order is good;
     More exotic sorting (such as Hilbert) might be bad.




Future direction?




      Need better mathematical modelling of skewed and
      dependent columns;
      New column-oriented indexes?
      Better ways to sort?




Questions?




                             ?




Hamilton, C. H. and Rau-Chaplin, A. (2007).
Compact Hilbert indices: Space-filling curves for domains with
unequal side lengths.
Information Processing Letters, 105(5):155–163.
Knuth, D. E. (2005).
The Art of Computer Programming, volume 4, chapter fascicle
2.
Addison Wesley.
Lemire, D. and Kaser, O.
Reordering columns for smaller indexes.
in preparation, available from
http://arxiv.org/abs/0909.1346.
Lemire, D., Kaser, O., and Aouiche, K. (2009).
Sorting improves word-aligned bitmap indexes.
to appear in Data & Knowledge Engineering, preprint available
from http://arxiv.org/abs/0901.3751.

Missaoui, R., Goutte, C., Choupo, A. K., and Boujenoui, A.
(2007).
A probabilistic model for data cube compression and query
approximation.
In DOLAP, pages 33–40.
O’Neil, P. and Quass, D. (1997).
Improved query performance with variant indexes.
In SIGMOD ’97, pages 38–49.
O’Neil, P. E. (1989).
Model 204 architecture and performance.
In 2nd International Workshop on High Performance
Transaction Systems, pages 40–59.
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X.,
Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S.,
O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S.
(2005).
C-store: a column-oriented DBMS.
In VLDB’05, pages 553–564.
Turner, M. J., Hammond, R., and Cotton, P. (1979).
A DBMS for large statistical databases.
In VLDB’79, pages 319–327.
Wu, K., Otoo, E. J., and Shoshani, A. (2006).
Optimizing bitmap indices with efficient compression.
ACM Transactions on Database Systems, 31(1):1–38.
Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008).
An adequate design for large data warehouse systems: Bitmap
index versus b-tree index.
IJCC, 2(2).





 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Recently uploaded (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Compressing column-oriented indexes

  • 1. Compressing column-oriented indexes Daniel Lemire http://www.professeurs.uqam.ca/pages/lemire.daniel.htm blog: http://www.daniel-lemire.com/ Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc). November 19, 2009 Daniel Lemire Compressing column-oriented indexes
  • 2. Row Stores: records are stored row by row, each as (name, date, age, sex, salary). Dominant paradigm. Transactional: quick append and delete.
  • 3. Column Stores: each column (name, date, age, sex, salary) is stored separately. Goes back to StatCan in the seventies [Turner et al., 1979]. Made fashionable again in Data Warehousing by Stonebraker [Stonebraker et al., 2005]. New: Oracle Exadata hybrid columnar compression. Favors run-length encoding (compression).
  • 4. Main column-oriented indexes: (1) Bitmap indexes [O’Neil, 1989]; (2) Projection indexes [O’Neil and Quass, 1997]. Both are compressible.
  • 5. Bitmap indexes. Bitmap indexes have a long history (1972 at IBM), including a long history with DW & OLAP (Sybase IQ since the mid 1990s). Main competition: B-trees. Example query: SELECT * FROM T WHERE x=a AND y=b; To answer it, compute {r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}.
  • 6. Bitmaps and fast AND/OR operations. Computing the union of two sets of integers between 1 and 64 (e.g., row ids in a trivial table), such as {1, 5, 8} ∪ {1, 3, 5}, can be done in one operation by a CPU: BitwiseOR(10001001, 10101000). Extend to sets from 1..N using N/64 operations. To compute [a0, ..., aN−1] ∨ [b0, b1, ..., bN−1]: a0, ..., a63 BitwiseOR b0, ..., b63; a64, ..., a127 BitwiseOR b64, ..., b127; a128, ..., a191 BitwiseOR b128, ..., b191; ... It is a form of vectorization.
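The word-by-word union described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`to_words`, `union`, `to_set`) are hypothetical, positions are numbered from 0 rather than 1, and the extra value 100 is added to the slide's second set so that the bitmap spans two 64-bit words.

```python
WORD_BITS = 64  # word size assumed for this sketch

def to_words(values, n_bits):
    """Pack a set of integers in [0, n_bits) into a list of 64-bit words."""
    words = [0] * ((n_bits + WORD_BITS - 1) // WORD_BITS)
    for v in values:
        words[v // WORD_BITS] |= 1 << (v % WORD_BITS)
    return words

def union(a_words, b_words):
    """Union of two equally sized bitmaps: one bitwise OR per word."""
    return [a | b for a, b in zip(a_words, b_words)]

def to_set(words):
    """Recover the set of positions whose bit is 1."""
    return {i * WORD_BITS + j
            for i, w in enumerate(words)
            for j in range(WORD_BITS)
            if (w >> j) & 1}

a = to_words({1, 5, 8}, 128)
b = to_words({1, 3, 5, 100}, 128)  # 100 is a hypothetical extra row id
result = to_set(union(a, b))
```

With 128 possible row ids, the union costs two OR operations regardless of how many ids are actually present.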
  • 7. Common applications of bitmaps. The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.) Search engines use bitmaps to filter queries, e.g., Apache Lucene.
  • 8. Bitmap compression. A column with n rows and L distinct values ⇒ nL bits. E.g., n = 10^6, L = 10^4 → 10 Gbits. Uncompressed bitmaps are often impractical. Moreover, bitmaps often contain long streams of zeroes, and logical operations over these zeroes are a waste of CPU cycles. Example of a column x (n rows) and its bitmap index (L bitmaps):
      x | x=1 x=2 x=3
      1 |  1   0   0
      3 |  0   0   1
      1 |  1   0   0
      2 |  0   1   0
      ...
  • 9. How to compress bitmaps? Must handle long streams of zeroes efficiently ⇒ run-length encoding (RLE)? A bitmap is a run of 0s, a run of 1s, a run of 0s, a run of 1s, ... So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1, 1, 3.
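The run-length example on this slide can be reproduced with a short Python sketch. The function names are hypothetical, the bitmap is represented as a string of '0' and '1' characters purely for readability, and the decoder assumes that the value of the first bit is stored alongside the run lengths.

```python
from itertools import groupby

def run_lengths(bits):
    """Lengths of the maximal runs of identical bits, in order."""
    return [len(list(group)) for _, group in groupby(bits)]

def decode(first_bit, lengths):
    """Rebuild the bit string from the first bit and the run lengths."""
    out = []
    bit = first_bit
    for n in lengths:
        out.append(bit * n)
        bit = "1" if bit == "0" else "0"  # runs alternate between 0s and 1s
    return "".join(out)

bitmap = "0001111100010111"
lengths = run_lengths(bitmap)
restored = decode(bitmap[0], lengths)
```

Here `lengths` is `[3, 5, 3, 1, 1, 3]`, matching the slide, and `restored` equals the original bitmap.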
  • 10. Compressing better with delta codes. RLE can make things worse: e.g., with 8-bit counters, 11 may become 000000101. How many bits to use for the counters? Universal codes such as delta codes use no more than c log x bits to represent the value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes in two steps. Write x = 2^N + (x mod 2^N). Write N − 1 as a gamma code; write x mod 2^N as an N-bit number. E.g., 17 = 2^4 + 1 becomes 0010001.
  • 11. RLE with delta codes is pretty good. In some (weak) sense, RLE compression with delta codes is optimal! Theorem: A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits.
  • 12. Byte/Word-aligned RLE. RLE variants can focus on runs that align with machine-word boundaries, trading compression for speed. That is what Oracle is doing. Variants: BBC (byte-aligned), WAH. Our EWAH extends Wu et al.’s word-aligned hybrid (was known to Wu as WBC). E.g., 0101000000000000 000...000 000...000 0011111111111100 ... ⇒ dirty word, run of 2 “clean 0” words, dirty word, ...
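To make the clean/dirty distinction concrete, here is a minimal Python sketch that groups machine words into runs of clean words and individual dirty words. It is not the EWAH, WAH, or BBC format: the token representation and the function name `word_aligned_runs` are assumptions made for illustration, and 16-bit words are used only to match the slide's example.

```python
def word_aligned_runs(words, word_bits=16):
    """Group words into ('clean0', count), ('clean1', count) and ('dirty', word) tokens."""
    all_ones = (1 << word_bits) - 1
    tokens = []
    for w in words:
        if w == 0:
            kind = "clean0"      # word made only of 0s
        elif w == all_ones:
            kind = "clean1"      # word made only of 1s
        else:
            tokens.append(("dirty", w))  # mixed word, stored verbatim
            continue
        if tokens and tokens[-1][0] == kind:
            tokens[-1] = (kind, tokens[-1][1] + 1)  # extend the current clean run
        else:
            tokens.append((kind, 1))
    return tokens

words = [0b0101000000000000, 0, 0, 0b0011111111111100]
tokens = word_aligned_runs(words)
```

For the slide's example, `tokens` holds a dirty word, a run of 2 clean 0 words, and another dirty word. Because runs never split a word, logical operations can proceed one word (or one run of words) at a time.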
  • 13. What are bitmap indexes for? Construction time is proportional to index size. (Data is written sequentially on disk.) Implementation scales to millions of bitmaps. Myth: bitmap indexes are for low-cardinality columns. “The Bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality” [Zaker et al., 2008].
  • 14. What about other compression types? With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in time O(|B1| + |B2|). Hence, with RLE, compression saves both storage and CPU cycles! Not always true with other techniques such as Huffman, LZ77, arithmetic coding, ...
  • 15. What happens when you have many bitmaps? Consider B1 ∨ B2 ∨ ... ∨ BN. First compute the first two: B1 ∨ B2 in time O(|B1| + |B2|). |B3 ∨ B4| is in O(|B3| + |B4|). Thus (B1 ∨ B2) ∨ (B3 ∨ B4) takes O(2 Σ_i |B_i|)... Total is in O(Σ_{i=1}^{N} |B_i| log N), can be generalized [Lemire et al., 2009].
  • 16. How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words. Only 32-bit and 64-bit are efficient; 64-bit indexes are nearly twice as large; 64-bit indexes are between 5% and 40% faster (despite higher I/O costs).
  • 17. Open Source Software? Lemur Bitmap Index C++ Library: http://code.google.com/p/lemurbitmapindex/. JavaEWAH: A compressed alternative to the Java BitSet class: http://code.google.com/p/javaewah/.
  • 18. Projection indexes. Simply write out the values sequentially. Ideal for low selectivity queries on few columns, e.g., SELECT sum(number*price) FROM T; Compressible with RLE.
  • 19. Improving compression by sorting the table. RLE schemes are order-sensitive: they compress sorted tables better. But finding the best row ordering is NP-hard [Lemire et al., 2009]. So we sort: lexicographically, with Gray codes, with Hilbert, ...
  • 20. How many ways to sort? (1) Lexicographic row sorting is fast, even for very large tables; easy: sort is a Unix staple; gives substantial index-size reductions (often 2.5 times, benefits grow with table size).
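A small Python sketch can show why lexicographic sorting helps run-length encoding: it counts the runs of identical values in every column before and after sorting the rows. The six-row table is made up for illustration, and `column_runs` is a hypothetical helper, not part of any library mentioned in the slides.

```python
from itertools import groupby

def column_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    return sum(len(list(groupby(column))) for column in zip(*rows))

# Hypothetical two-column table in arrival order.
rows = [("b", 2), ("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 1)]

before = column_runs(rows)          # runs in the unsorted table
after = column_runs(sorted(rows))   # runs after lexicographic row sorting
```

On this toy table the run count drops from 10 to 6. Fewer runs means fewer counters to store under any RLE variant.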
  • 21. How many ways to sort? (2) Gray codes are lists of tuples with successive (Hamming) distance of 1 [Knuth, 2005, § 7.2.1.1]. Reflected Gray code order is sometimes slightly better than lexicographical... but the benefit goes as ≈ 1/N with column cardinality N, and it is poorly supported by existing software.
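As a rough illustration of reflected Gray-code order, the following Python sketch enumerates all tuples over given column cardinalities so that successive tuples differ in exactly one position. The generator name `reflected_gray` and the use of integer-coded column values are assumptions of this sketch. Sorting an actual table in this order would additionally require a comparison function, which is not shown.

```python
def reflected_gray(cardinalities):
    """Yield every tuple over the given cardinalities in reflected Gray-code order."""
    if not cardinalities:
        yield ()
        return
    first, rest = cardinalities[0], cardinalities[1:]
    for value in range(first):
        tails = list(reflected_gray(rest))
        if value % 2 == 1:
            tails.reverse()  # reflect the order of the remaining columns
        for tail in tails:
            yield (value,) + tail

order = list(reflected_gray([2, 3, 2]))
binary = list(reflected_gray([2, 2]))
```

For two binary columns, `binary` is `[(0, 0), (0, 1), (1, 1), (1, 0)]`, whereas lexicographic order would end with `(1, 0), (1, 1)`.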
  • 22. How many ways to sort? (3) Reflected Gray Code order is not the only Gray code. Knuth also presents Modular Gray-code. But alternatives to reflected are never better?
  • 23. How many ways to sort? (4) Can also try esoteric orders. Hilbert Index [Hamilton and Rau-Chaplin, 2007]: gives very bad results for column-oriented indexes.
  • 24. Modelling the size of an index. Any formal result? Tricky: there are many variations on RLE. Use: number of runs of identical value in a column.
  • 25. Recursive orders. Lexicographical, reflected Gray code and modular Gray code belong to a larger class. Definition: A recursive order over c-tuples is such that it generates a recursive order over (c − 1)-tuples. All orders over 1-tuples are recursive. This is a recursive order: (1, 0, 0), (1, 0, 1), (0, 1, 1). This is not recursive: (1, 0, 0), (0, 1, 1), (1, 0, 1).
  • 26. When sorting, column order matters. Question: Given a phone directory, to minimize the number of runs, should we sort by first or last names?
  • 27. When sorting, column order matters. Setting: c columns, any recursive order. In practice, column order is very significant (factor of two or more). Proposition: The number of column runs varies by a factor of ≈ c under the permutation of the columns.
  • 28. But column reordering fails to buy optimality. For some tables... Lemma: No recursive order minimizes the number of runs, even after reordering the columns. Open problem: how far from optimality?
  • 29. Best column order? We almost have this result [Lemire and Kaser, in preparation]: for any recursive order, order the columns by increasing cardinality (small to LARGE). Proposition: The expected number of runs is minimized. Truth is complicated. Assume uniformly distributed tables.
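The effect of column order can be checked on a toy case with a Python sketch. The table below is a hypothetical complete table (every combination of a 10-value column and a 2-value column appears exactly once), which is a very special instance of a uniformly distributed table. The helper names are assumptions, and the sketch says nothing about skewed or dependent columns.

```python
from itertools import groupby, product

def column_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    return sum(len(list(groupby(column))) for column in zip(*rows))

def reorder_by_cardinality(rows):
    """Permute the columns so that cardinality increases from left to right."""
    n_cols = len(rows[0])
    order = sorted(range(n_cols), key=lambda c: len({row[c] for row in rows}))
    return [tuple(row[c] for c in order) for row in rows]

# Columns as given: (10 distinct values, 2 distinct values).
table = list(product(range(10), range(2)))

large_first = column_runs(sorted(table))
small_first = column_runs(sorted(reorder_by_cardinality(table)))
```

After lexicographic sorting, the table with the large-cardinality column first has 30 column runs, while putting the small-cardinality column first gives 22.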
  • 30. What about non-uniform or dependent columns? Real columns have skewed distributions [Missaoui et al., 2007] and they are statistically dependent. It can impact column ordering in unpredictable ways.
  • 31. Take away messages. Column stores are good because of RLE and sorting; lexicographical sort with the right column order is good; more exotic sorting (such as Hilbert) might be bad.
  • 32. Future direction? Need better mathematical modelling of skewed and dependent columns. New column-oriented indexes? Better ways to sort?
  • 33. Questions?
  • 34. References. Hamilton, C. H. and Rau-Chaplin, A. (2007). Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163. Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley. Lemire, D. and Kaser, O. Reordering columns for smaller indexes. In preparation, available from http://arxiv.org/abs/0909.1346. Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. To appear in Data & Knowledge Engineering, preprint available from http://arxiv.org/abs/0901.3751.
  • 35. References (continued). Missaoui, R., Goutte, C., Choupo, A. K., and Boujenoui, A. (2007). A probabilistic model for data cube compression and query approximation. In DOLAP, pages 33–40. O’Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD ’97, pages 38–49. O’Neil, P. E. (1989). Model 204 architecture and performance. In 2nd International Workshop on High Performance Transaction Systems, pages 40–59. Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005). C-store: a column-oriented DBMS. In VLDB’05, pages 553–564.
  • 36. References (continued). Turner, M. J., Hammond, R., and Cotton, P. (1979). A DBMS for large statistical databases. In VLDB’79, pages 319–327. Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38. Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008). An adequate design for large data warehouse systems: Bitmap index versus b-tree index. IJCC, 2(2).