SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
Heavy Committing:
Flexible Indexing in Lucene 4.0
                   Uwe Schindler
 SD DataSolutions GmbH, uschindler@sd-datasolutions.de
My Background
• I am committer and PMC member of Apache Lucene and
  Solr. My main focus is on development of Lucene Java.
• Implemented fast numerical search and maintaining the new
  attribute-based text analysis API. Well known as Generics
  and Sophisticated Backwards Compatibility Policeman.
• Working as consultant and software architect for SD
  DataSolutions GmbH in Bremen, Germany. The main task is
  maintaining PANGAEA (Publishing Network for Geoscientific
  & Environmental Data) where I implemented the portal's
  geo-spatial retrieval functions with Lucene Java.
• Talks about Lucene at various international conferences like
  ApacheCon EU/US, Lucene Revolution, Lucene Eurocon,
  Berlin Buzzwords and various local meetups.
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           4
Lucene up to version 3
• Lucene started > 10 years ago
   • Lucene‟s vInt format is old and not as friendly as new
     compression algorithms to CPU‟s optimizers (exists since
     Lucene 1.0)


• It‟s hard to add additional statistics for scoring to the
  index
   • IR researchers don‟t use Lucene to try out new algorithms


• Small changes to index format are often huge patches
  covering tons of files
   • Example from days of Lucene 2.4: final omit-TFaP patch of
     LUCENE-1340 is ~70 KiB covering ~25 files

                                                                 5
Why Flexible Indexing?
• Many new compression approaches ready to
  be implemented for Lucene
• Separate index encoding from terms/postings
  enumeration API
• Make innovations to the postings format easier

 Targets to make Lucene extensible even on
              the lowest level



                                                   6
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           7
Quick Overview
• Will be Apache Lucene ≥ 4.0 only!

• Allows to
  • store new information into the index
  • change the way existing information is stored
• Under heavy development
  • almost stable API, may break suddenly
  • lots of classes still in refactoring
• Replaces a lot of existing classes and
  interfaces → Lucene 4.0 will not be backwards
  compatible (API wise)
                                                    8
Quick Overview

•   Pre-4.0 indexes are still readable, but
    < 3.0 is no longer supported

•   Index upgrade tool will be available as of
    Lucene 3.2
    •   Supports upgrade of pre-3.0 indexes in-place →
        two-step migration to 4.0 (LUCENE-3082)




                                                         9
Architecture




               10
New 4-dimensional
 Enumeration-API




                    11
Flex - Enum API Properties                                  (1)



•   Replaces the TermEnum / TermDocs /
    TermPositions

•   Unified iterator-like behavior: no longer strange
    do…while vs. while

•   Decoupling of term from field
    •   IndexReader returns enum of fields, each field has separate
        TermsEnum
    •   No interning of field names needed anymore (will be removed soon)


•   Improved RAM efficiency:
    •   efficient re-usage of byte buffer with the BytesRef class


                                                                            12
Flex - Enum API Properties                         (2)


• Terms are arbitrary byte slices, no 16bit UTF-16 chars
  anymore
• Term now using byte[] instead of char[]
   • compact representation of numeric terms
• Can be any encoding
   • Default is UTF-8 → changes sort order for supplementary
     characters
   • Compact representation with low memory footprint for western
     languages
   • Support for e.g. BOCU1 (LUCENE-1799) possible, but some
     queries rely on UTF-8 (especially MultiTermQuery)
• Indexer consumes BytesRefAttribute, which
  converts the good old char[] to UTF-8

                                                                    13
Flex - Enum API Properties                   (3)



•   All Flex Enums make use of AttributeSource
    •   Custom Attribute deserialization (planned)
    •   E.g., BoostAttribute for FuzzyQuery


•   TermsEnum supports seeking (enables fast
    automaton queries like FuzzyQuery)

•   DocsEnum / DocsAndPositionsEnum now
    support separate skipDocs property: Custom
    control of excluded documents (default
    implementation by IndexReader returns deleted
    docs)
                                                           14
FieldCache
• FieldCache consumes the flex APIs
• Terms (previously strings) field cache more
  RAM efficient, low GC load
  • Used with SortField.STRING
• Shared byte[] blocks instead of separate String
  instances
  • Terms remain as byte[] in few big byte[] slices
• PackedInts for ords & addresses (LUCENE-
  1990)
• RAM reduction ~ 40-60%

                                                      15
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           16
Extending Flex with Codecs

•   A Codec represents the primary Flex-API
    extension point
•   Directly passed to SegmentReader to
    decode the index format
•   Provides implementations of the enum
    classes to SegmentReader
•   Provides writer for index files to
    IndexWriter

                                              17
Architecture




               18
Components of a Codec
• Codec provides read/write for one segment
  • Unique name (String)
  • FieldsConsumer (for writing)
  • FieldsProducer is 4D enum API + close
• CodecProvider creates Codec instance
  • Passed to IndexWriter/IndexReader
• You can override merging
• Reusable building blocks
  • Terms dict + index, postings


                                              19
Build-In Codecs: PreFlex

•   Pre-Flex-Indexes will be “read-only” in
    Lucene 4.0
•   all additions to indexes will be done with
    CodecProvider„s default codec


Warning: PreFlexCodec needs to
reorder terms in term index on-the-fly:
„surrogates dance“
→ use index upgrader, LUCENE-3082


                                                 20
Default: StandardCodec             (1)




Default codec was moved out of
o.a.l.index:

  • lives now in its own package
    o.a.l.index.codecs.standard
  • is the default implementation
  • similar to the pre-flex index format
  • requires far less ram to load term index


                                               21
Default: StandardCodec                          (2)



• Terms index:
  VariableGapTermsIndexWriter/Reader
  • uses finite state transducers from new FST package
  • removes useless suffixes                Talk yesterday morning:
                                      “Finite State Automata in Lucene”
  • reusable by other codecs                    by Dawid Weiss

• Terms dict: BlockTermsWriter/Reader
  • more efficient metadata decode
  • faster TermsEnum
  • reusable by other codecs
• Postings: StandardPostingsWriter/Reader
  • Similar to pre-flex postings
                                                                      22
Build-In Codecs: PulsingCodec                (1)



• Inlines low doc-freq terms into terms dict
• Saves extra seek to get the postings
• Excellent match for primary key fields, but also
  “normal” field (Zipf‟s law)
• Wraps any other codec
• Likely default codec will use Pulsing

           See Mike McCandless’ blog:
            http://s.apache.org/XEw


                                                     23
Build-In Codecs: PulsingCodec   (2)




                                      24
Build-In Codecs: SimpleText
• All postings stored in _X.pst text file
• Read / write
• Not performant
   • Do not use in production!
• Fully functional
   • Passes all Lucene/Solr unit tests
     (slowly...)
• Useful / fun for debugging

      See Mike McCandless’ blog:
        http://s.apache.org/eh
                                            25
Other Codecs shipped with Lucene /
         Lucene Contrib

• PerFieldCodecWrapper (core)
  • Can be used to delegate specific fields to different
    codecs (e.g., use pulsing for primary key)


• AppendingCodec (contrib/misc)
  • Never rewinds a file pointer during write
  • Useful for Hadoop




                                                           26
Heavy Development: Block Codecs
• Separate branch for developing block codecs:
  • http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings
• All codecs use default terms index, but different
  encodings of postings
• Support for recent compression developments
  • FOR, PFORDelta, Simple64, BulkVInt
• Framework for blocks of encoded ints (using
  SepCodec)
  • Easy to plugin a new compression by providing a
    encoder/decoder between a block int[] ↔
    IndexOutput/IndexInput

                                                                       27
Block Codecs: Use Case
• Smaller index & better compression for certain
  index contents by replacing vInt
  • Frequent terms (stop words) have low doc deltas
  • Short fields (city names) have small position values
• TermQuery
  • High doc frequency terms
• MultiTermQuery
  • Lots of terms (possibly with low docFreq) in order
    request doc postings



                                                           28
Block Codecs: Disadvantages
• Crazy API for consumers, e.g. queries
• Removes encapsulation of DocsEnum
  • Consumer needs to do loading of blocks
  • Consumer needs to do doc delta
    decoding
• Lot„s of queries may need duplicate
  code
  • Impl for codecs that support „bulk
    postings“
  • Classical DocsEnum impl


                                             29
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           30
Flexible Indexing - Current State
• All tests pass with a pool of random codecs
  • If you find a bug, post the test output to mailing list!
  • Test output contains random seed to reproduce


            Talk yesterday evening:
“Using Lucene's Test Framework” by Robert Muir

• many more tests and documentation is needed
• community feedback is highly appreciated


                                                               31
Flexible Indexing - TODO
• Serialize custom attributes to the index
• Convert all remaining queries to use internally
  BytesRef-terms
• Remove interning of field names
• Die, Term (& Token) class, die! ☺
• Make block-/bulk-postings API more user-
  friendly and merge to SVN trunk
• Merge DocValues branch
      (Talk yesterday: “DocValues & Column Stride Fields in Lucene” by Simon Willnauer)

• More heavy refactoring!

                                                                                          32
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           33
Summary
• New 4D postings enum APIs

• Pluggable codec lets you customize
  index format
  • Many codecs already available
• State-of-the-art postings formats are in
  progress

• Innovation is easier!
  • Exciting time for Lucene...
• Sizable performance gains, RAM/GC
                                             34
We need YOU!

 • Download nightly builds or check
   out SVN trunk
 • Run tests, run tests, run tests,…
 • Use in your own developments

 • Report back experience with
   developing own codecs



                                       35
Contact
                       Uwe Schindler
                 uschindler@sd-datasolutions.de
                       http://www.thetaphi.de
                               @thetaph1

Many thanks to Mike McCandless for slides & help on preparing this talk!




      SD DataSolutions GmbH
            Wätjenstr. 49
      28213 Bremen, Germany
         +49 421 40889785-0
   http://www.sd-datasolutions.de

                                                                           36

Contenu connexe

Tendances

Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
rcmuir
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
Swapnil & Patil
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 

Tendances (20)

Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learned
 
Introduction to apache lucene
Introduction to apache luceneIntroduction to apache lucene
Introduction to apache lucene
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Lucene
LuceneLucene
Lucene
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 

En vedette

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
Josiane Gamgo
 
Spanish bombss
Spanish bombssSpanish bombss
Spanish bombss
tanica
 
The mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketThe mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the market
Paul Williamson
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrella
guest986e5ae
 

En vedette (20)

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル
 
C:\Fakepath\6620millardmodule3b
C:\Fakepath\6620millardmodule3bC:\Fakepath\6620millardmodule3b
C:\Fakepath\6620millardmodule3b
 
Web Design Course Overview
Web Design Course OverviewWeb Design Course Overview
Web Design Course Overview
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Updated: Marketing your Technology
Updated: Marketing your TechnologyUpdated: Marketing your Technology
Updated: Marketing your Technology
 
Spanish bombss
Spanish bombssSpanish bombss
Spanish bombss
 
The mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketThe mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the market
 
Solr lucene search revolution
Solr lucene search revolutionSolr lucene search revolution
Solr lucene search revolution
 
Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4
 
Presentation
PresentationPresentation
Presentation
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Coterie 9 11
Coterie 9 11Coterie 9 11
Coterie 9 11
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrella
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 
Speed Up Web 2012
Speed Up Web 2012Speed Up Web 2012
Speed Up Web 2012
 
Ob12 01st
Ob12 01stOb12 01st
Ob12 01st
 
All Data Big and Small
All Data Big and SmallAll Data Big and Small
All Data Big and Small
 

Similaire à Flexible Indexing in Lucene 4.0

Similaire à Flexible Indexing in Lucene 4.0 (20)

Fun with flexible indexing
Fun with flexible indexingFun with flexible indexing
Fun with flexible indexing
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
 
Kafka overview v0.1
Kafka overview v0.1Kafka overview v0.1
Kafka overview v0.1
 
Containers and HPC
Containers and HPCContainers and HPC
Containers and HPC
 
What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?
 
assem.ppt
assem.pptassem.ppt
assem.ppt
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
 
Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...
 
Managing ejabberd Platforms with Docker - ejabberd Workshop #1
Managing ejabberd Platforms with Docker - ejabberd Workshop #1Managing ejabberd Platforms with Docker - ejabberd Workshop #1
Managing ejabberd Platforms with Docker - ejabberd Workshop #1
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
 
Serverless design with Fn project
Serverless design with Fn projectServerless design with Fn project
Serverless design with Fn project
 
.NET Core Blimey! (dotnetsheff Jan 2016)
.NET Core Blimey! (dotnetsheff Jan 2016).NET Core Blimey! (dotnetsheff Jan 2016)
.NET Core Blimey! (dotnetsheff Jan 2016)
 
ASP.NET Core Demos
ASP.NET Core DemosASP.NET Core Demos
ASP.NET Core Demos
 
What's New in ASP.NET Core 2.0
What's New in ASP.NET Core 2.0What's New in ASP.NET Core 2.0
What's New in ASP.NET Core 2.0
 
Getting started with Docker
Getting started with DockerGetting started with Docker
Getting started with Docker
 
Solr 4
Solr 4Solr 4
Solr 4
 
TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.
 
Docker and the K computer
Docker and the K computerDocker and the K computer
Docker and the K computer
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 

Plus de Lucidworks (Archived)

Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Lucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Lucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Lucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Lucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
Lucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Lucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Lucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Lucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
Lucidworks (Archived)
 

Plus de Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Flexible Indexing in Lucene 4.0

  • 1. Heavy Committing: Flexible Indexing in Lucene 4.0 Uwe Schindler SD DataSolutions GmbH, uschindler@sd-datasolutions.de
  • 2. My Background • I am committer and PMC member of Apache Lucene and Solr. My main focus is on development of Lucene Java. • Implemented fast numerical search and maintaining the new attribute-based text analysis API. Well known as Generics and Sophisticated Backwards Compatibility Policeman. • Working as consultant and software architect for SD DataSolutions GmbH in Bremen, Germany. The main task is maintaining PANGAEA (Publishing Network for Geoscientific & Environmental Data) where I implemented the portal's geo-spatial retrieval functions with Lucene Java. • Talks about Lucene at various international conferences like ApacheCon EU/US, Lucene Revolution, Lucene Eurocon, Berlin Buzzwords and various local meetups.
  • 3. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 4
  • 4. Lucene up to version 3 • Lucene started > 10 years ago • Lucene‟s vInt format is old and not as friendly as new compression algorithms to CPU‟s optimizers (exists since Lucene 1.0) • It‟s hard to add additional statistics for scoring to the index • IR researchers don‟t use Lucene to try out new algorithms • Small changes to index format are often huge patches covering tons of files • Example from days of Lucene 2.4: final omit-TFaP patch of LUCENE-1340 is ~70 KiB covering ~25 files 5
  • 5. Why Flexible Indexing? • Many new compression approaches ready to be implemented for Lucene • Separate index encoding from terms/postings enumeration API • Make innovations to the postings format easier Targets to make Lucene extensible even on the lowest level 6
  • 6. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 7
  • 7. Quick Overview • Will be Apache Lucene ≥ 4.0 only! • Allows to • store new information into the index • change the way existing information is stored • Under heavy development • almost stable API, may break suddenly • lots of classes still in refactoring • Replaces a lot of existing classes and interfaces → Lucene 4.0 will not be backwards compatible (API wise) 8
  • 8. Quick Overview • Pre-4.0 indexes are still readable, but < 3.0 is no longer supported • Index upgrade tool will be available as of Lucene 3.2 • Supports upgrade of pre-3.0 indexes in-place → two-step migration to 4.0 (LUCENE-3082) 9
  • 11. Flex - Enum API Properties (1) • Replaces the TermEnum / TermDocs / TermPositions • Unified iterator-like behavior: no longer strange do…while vs. while • Decoupling of term from field • IndexReader returns enum of fields, each field has separate TermsEnum • No interning of field names needed anymore (will be removed soon) • Improved RAM efficiency: • efficient re-usage of byte buffer with the BytesRef class 12
  • 12. Flex - Enum API Properties (2) • Terms are arbitrary byte slices, no 16bit UTF-16 chars anymore • Term now using byte[] instead of char[] • compact representation of numeric terms • Can be any encoding • Default is UTF-8 → changes sort order for supplementary characters • Compact representation with low memory footprint for western languages • Support for e.g. BOCU1 (LUCENE-1799) possible, but some queries rely on UTF-8 (especially MultiTermQuery) • Indexer consumes BytesRefAttribute, which converts the good old char[] to UTF-8 13
  • 13. Flex - Enum API Properties (3) • All Flex Enums make use of AttributeSource • Custom Attribute deserialization (planned) • E.g., BoostAttribute for FuzzyQuery • TermsEnum supports seeking (enables fast automaton queries like FuzzyQuery) • DocsEnum / DocsAndPositionsEnum now support separate skipDocs property: Custom control of excluded documents (default implementation by IndexReader returns deleted docs) 14
  • 14. FieldCache • FieldCache consumes the flex APIs • Terms (previously strings) field cache more RAM efficient, low GC load • Used with SortField.STRING • Shared byte[] blocks instead of separate String instances • Terms remain as byte[] in few big byte[] slices • PackedInts for ords & addresses (LUCENE- 1990) • RAM reduction ~ 40-60% 15
  • 15. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 16
  • 16. Extending Flex with Codecs • A Codec represents the primary Flex-API extension point • Directly passed to SegmentReader to decode the index format • Provides implementations of the enum classes to SegmentReader • Provides writer for index files to IndexWriter 17
  • 18. Components of a Codec • Codec provides read/write for one segment • Unique name (String) • FieldsConsumer (for writing) • FieldsProducer is 4D enum API + close • CodecProvider creates Codec instance • Passed to IndexWriter/IndexReader • You can override merging • Reusable building blocks • Terms dict + index, postings 19
  • 19. Build-In Codecs: PreFlex • Pre-Flex-Indexes will be “read-only” in Lucene 4.0 • all additions to indexes will be done with CodecProvider„s default codec Warning: PreFlexCodec needs to reorder terms in term index on-the-fly: „surrogates dance“ → use index upgrader, LUCENE-3082 20
  • 20. Default: StandardCodec (1) Default codec was moved out of o.a.l.index: • lives now in its own package o.a.l.index.codecs.standard • is the default implementation • similar to the pre-flex index format • requires far less ram to load term index 21
  • 21. Default: StandardCodec (2) • Terms index: VariableGapTermsIndexWriter/Reader • uses finite state transducers from new FST package • removes useless suffixes Talk yesterday morning: “Finite State Automata in Lucene” • reusable by other codecs by Dawid Weiss • Terms dict: BlockTermsWriter/Reader • more efficient metadata decode • faster TermsEnum • reusable by other codecs • Postings: StandardPostingsWriter/Reader • Similar to pre-flex postings 22
  • 22. Build-In Codecs: PulsingCodec (1) • Inlines low doc-freq terms into terms dict • Saves extra seek to get the postings • Excellent match for primary key fields, but also “normal” field (Zipf‟s law) • Wraps any other codec • Likely default codec will use Pulsing See Mike McCandless’ blog: http://s.apache.org/XEw 23
  • 24. Build-In Codecs: SimpleText • All postings stored in _X.pst text file • Read / write • Not performant • Do not use in production! • Fully functional • Passes all Lucene/Solr unit tests (slowly...) • Useful / fun for debugging See Mike McCandless’ blog: http://s.apache.org/eh 25
  • 25. Other Codecs shipped with Lucene / Lucene Contrib • PerFieldCodecWrapper (core) • Can be used to delegate specific fields to different codecs (e.g., use pulsing for primary key) • AppendingCodec (contrib/misc) • Never rewinds a file pointer during write • Useful for Hadoop 26
  • 26. Heavy Development: Block Codecs • Separate branch for developing block codecs: • http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings • All codecs use default terms index, but different encodings of postings • Support for recent compression developments • FOR, PFORDelta, Simple64, BulkVInt • Framework for blocks of encoded ints (using SepCodec) • Easy to plugin a new compression by providing a encoder/decoder between a block int[] ↔ IndexOutput/IndexInput 27
  • 27. Block Codecs: Use Case • Smaller index & better compression for certain index contents by replacing vInt • Frequent terms (stop words) have low doc deltas • Short fields (city names) have small position values • TermQuery • High doc frequency terms • MultiTermQuery • Lots of terms (possibly with low docFreq) in order request doc postings 28
  • 28. Block Codecs: Disadvantages • Crazy API for consumers, e.g. queries • Removes encapsulation of DocsEnum • Consumer needs to do loading of blocks • Consumer needs to do doc delta decoding • Lot„s of queries may need duplicate code • Impl for codecs that support „bulk postings“ • Classical DocsEnum impl 29
  • 29. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 30
  • 30. Flexible Indexing - Current State • All tests pass with a pool of random codecs • If you find a bug, post the test output to mailing list! • Test output contains random seed to reproduce Talk yesterday evening: “Using Lucene's Test Framework” by Robert Muir • many more tests and documentation is needed • community feedback is highly appreciated 31
  • 31. Flexible Indexing - TODO • Serialize custom attributes to the index • Convert all remaining queries to use internally BytesRef-terms • Remove interning of field names • Die, Term (& Token) class, die! ☺ • Make block-/bulk-postings API more user- friendly and merge to SVN trunk • Merge DocValues branch (Talk yesterday: “DocValues & Column Stride Fields in Lucene” by Simon Willnauer) • More heavy refactoring! 32
  • 32. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 33
  • 33. Summary • New 4D postings enum APIs • Pluggable codec lets you customize index format • Many codecs already available • State-of-the-art postings formats are in progress • Innovation is easier! • Exciting time for Lucene... • Sizable performance gains, RAM/GC 34
  • 34. We need YOU! • Download nightly builds or check out SVN trunk • Run tests, run tests, run tests,… • Use in your own developments • Report back experience with developing own codecs 35
  • 35. Contact Uwe Schindler uschindler@sd-datasolutions.de http://www.thetaphi.de @thetaph1 Many thanks to Mike McCandless for slides & help on preparing this talk! SD DataSolutions GmbH Wätjenstr. 49 28213 Bremen, Germany +49 421 40889785-0 http://www.sd-datasolutions.de 36