SlideShare une entreprise Scribd logo
1  sur  25
Télécharger pour lire hors ligne
Finite-State Queries in Lucene
             Robert Muir
          rmuir@apache.org
Agenda
• Introduction to Lucene
• Improving inexact matching:
  – Background
  – Regular Expression, Wildcard, Fuzzy Queries
• Additional use cases:
  – Language support: expansion versus
    stemming
  – Improved spellchecking
• Other ongoing developments in Lucene
Introduction to Lucene
• Open Source Search Engine Library
  – Not just Java, ported to other languages too.
  – Commercial support via several companies
• Just the library
  – Embed for your own uses, e.g. Eclipse
  – For a search server, see Solr
  – For web search + crawler, see Nutch
• Website: http://lucene.apache.org
Inverted Index
• Like a TreeMap<Terms, Documents>
• Before 2.9, queries only operate on one
  “subtree”.
  – For example, Regex and Fuzzy exhaustively
    evaluate all terms, unless you give them a
    “constant prefix”.
• TermQuery is just a special case, looks at
  one leaf node.
Lucene 2.9: Fast Numeric Ranges
• Indexes at different levels of precision.
• Enumerates multiple subtrees.
  – But typically this is a small number: e.g. 15
• Query APIs improved to support this.

                   4                       5                           6



            42           44            52                  63                    64



      421    423   445   446   448   521       522   632   633   634       641   642   644
Automaton Queries
• Only explore subtrees that can lead to an
  accept state of some finite state machine.
• AutomatonQuery traverses the term
  dictionary and the state machine in parallel
Another way to think of it
• Index as a state machine that recognizes Terms
  and transduces matching Documents.
• AutomatonQuery represents a user’s search
  need as a FSM.
• The intersection of the two emits search results.
Query API improvements
• Automata might need to do many seeks
  around the term dictionary.
  – Depends on what is in term dictionary
  – Depends on state machine structure
• MultiTermQuery API further improved
  – Easier and more efficient to skip around.
  – Explicitly supports seeking.
Regex, Wildcard, Fuzzy
• Without constant prefix, exhaustive
  – Regex: (http|ftp)://foo.com
  – Wildcard: ?oo?ar
  – Fuzzy: foobar~
• Re-implemented as automata queries
  – Just parsers that produce a DFA
  – Improved performance and scalability
  – (http|ftp)://foo.com examines 2 terms.
Additional/Advanced Use Cases
Stemming
• Stemmers work at index and query time
  – walked, walking -> walk
  – Can increase retrieval effectiveness
• Some problems
  – Mistakes: international -> intern
  – Must determine language of documents
  – Multilingual cases can get messy
  – Tuning is difficult: must re-index
  – Unfriendly: wildcards on stemmed terms…
Expansion instead
• Don’t remove data at index time
  – Expand the query instead.
  – Single field now works well for all queries:
    exact match, wildcard, expanded, etc.
• Simplifies search configuration
  – Tuning relevance is easier, no re-indexing.
  – No need to worry about language ID for docs.
  – Multilingual case is much simpler.
Automata expansion
• Natural fit for morphology
• Use set intersection operators
  – Minus to subtract exact match case
  – Union to search multiple languages
• Efficient operation
  – Doesn’t explode for languages with complex
    morphology
Experimental results
• 125k docs English test collection
   • Results are for TD queries
• Inverted the “S-Stemmer”
   • 6 declarative rewrite rules to regex
• Competitive with traditional stemming.
                  No      Porter    S-Stem    Automaton
               Stemming                        S-Stem
      MAP       0.4575    0.5069    0.5029     0.4979
      MRR       0.8070    0.7862    0.7587     0.8220
     # Terms   336,675    280,061   305,710    336,675
TODO:
• Support expansion models, too in Lucene.
• Language-specific resources
  – lucene-hunspell could provide these
• Language-independent tokenization
  – Unicode rules go a long way.
• Scoring that doesn’t need stopwords
  – For now, use CommonGrams!
Spellchecking




• Lucene spellchecker builds a separate
  index to find correction candidates
• Perhaps our fuzzy enumeration is now fast
  enough for small edit distances (e.g. 1,2)
  to just use the index directly.
• Could simplify configurations, especially
  distributed ones.
Ongoing developments in Lucene
Community
• Merging Lucene and Solr development
  – Still two separate released “products”!!!
  – Share mailing list and code repository
  – Solr dev code in sync with Lucene dev code
• Benefits to both Lucene and Solr users
  – Lucene features exposed to Solr faster
  – Solr features available to Lucene users
Indexing
• Flexible Indexing
  – Customize the format of the index
  – Decreased RAM usage
  – Faster IndexReader open and faster seeking
• Future
  – Serialize custom attributes to the index
  – More RAM savings
  – Improved index compression
  – Faster Near-Real-Time performance
Text Analysis
• Improved Unicode Support
  – Unicode 4/Supplementary support
  – ICU integration (to support Unicode 5.2)
• Improved Language Support
  – More languages
  – Faster indexing performance
  – Easier packaging and ease-of-use
• Improvements to Analysis API
Relevance and Scoring
• Improved relevance benchmarking
  – Open Relevance Project
  – Quality Benchmarking Package
• Future: More flexible scoring
  – How to support BM25, DFR, more vector
    space?
  – Some packages/patches available, but
     • Additional index statistics needed to be simple
Other
• Improvements to Spatial support
  – See Chris Male’s talk here:
    http://vimeo.com/10204365
  – Progress in both Lucene and Solr
• Steps towards a more modular
  architecture
  – Smaller Lucene core
  – Separate modules with more functionality
Questions?
Backup slides
Finite State Queries In Lucene

Contenu connexe

Tendances

Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...DataWorks Summit
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기NHN FORWARD
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in NetflixDanny Yuan
 
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.jsHeeJung Hwang
 
Apache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBApache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBMongoDB
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDBMongoDB
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Install Redis on Oracle Linux
Install Redis on Oracle LinuxInstall Redis on Oracle Linux
Install Redis on Oracle LinuxJohan Louwers
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLCloudera, Inc.
 
Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring SANG WON PARK
 
Towards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN ClustersTowards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN ClustersDataWorks Summit
 

Tendances (20)

Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
ELK Stack
ELK StackELK Stack
ELK Stack
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
 
Apache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBApache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDB
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Install Redis on Oracle Linux
Install Redis on Oracle LinuxInstall Redis on Oracle Linux
Install Redis on Oracle Linux
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
 
Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring
 
Towards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN ClustersTowards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN Clusters
 

En vedette

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Lucandra
LucandraLucandra
Lucandraotisg
 
Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)otisg
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...Lucidworks
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartLucidworks
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 

En vedette (20)

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Lucene
LuceneLucene
Lucene
 
Lucandra
LucandraLucandra
Lucandra
 
Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 

Similaire à Finite State Queries In Lucene

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCampGokulD
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.NetDean Thrasher
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and AbstractionsMetosin Oy
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmaplucenerevolution
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road maplucenerevolution
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudAnshum Gupta
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL DatabasesEmanuel Calvo
 
Real world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsReal world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsMasoud Kalali
 
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...TAUS - The Language Data Network
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteLucidworks
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and TricksErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 

Similaire à Finite State Queries In Lucene (20)

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and Abstractions
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Lucene 101
Lucene 101Lucene 101
Lucene 101
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
 
Real world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsReal world RESTful service development problems and solutions
Real world RESTful service development problems and solutions
 
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
 
Solr 4
Solr 4Solr 4
Solr 4
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and Tricks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Fun with flexible indexing
Fun with flexible indexingFun with flexible indexing
Fun with flexible indexing
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 

Finite State Queries In Lucene

  • 1. Finite-State Queries in Lucene Robert Muir rmuir@apache.org
  • 2. Agenda • Introduction to Lucene • Improving inexact matching: – Background – Regular Expression, Wildcard, Fuzzy Queries • Additional use cases: – Language support: expansion versus stemming – Improved spellchecking • Other ongoing developments in Lucene
  • 3. Introduction to Lucene • Open Source Search Engine Library – Not just Java, ported to other languages too. – Commercial support via several companies • Just the library – Embed for your own uses, e.g. Eclipse – For a search server, see Solr – For web search + crawler, see Nutch • Website: http://lucene.apache.org
  • 4. Inverted Index • Like a TreeMap<Terms, Documents> • Before 2.9, queries only operate on one “subtree”. – For example, Regex and Fuzzy exhaustively evaluate all terms, unless you give them a “constant prefix”. • TermQuery is just a special case, looks at one leaf node.
  • 5. Lucene 2.9: Fast Numeric Ranges • Indexes at different levels of precision. • Enumerates multiple subtrees. – But typically this is a small number: e.g. 15 • Query APIs improved to support this. 4 5 6 42 44 52 63 64 421 423 445 446 448 521 522 632 633 634 641 642 644
  • 6. Automaton Queries • Only explore subtrees that can lead to an accept state of some finite state machine. • AutomatonQuery traverses the term dictionary and the state machine in parallel
  • 7. Another way to think of it • Index as a state machine that recognizes Terms and transduces matching Documents. • AutomatonQuery represents a user’s search need as a FSM. • The intersection of the two emits search results.
  • 8. Query API improvements • Automata might need to do many seeks around the term dictionary. – Depends on what is in term dictionary – Depends on state machine structure • MultiTermQuery API further improved – Easier and more efficient to skip around. – Explicitly supports seeking.
  • 9. Regex, Wildcard, Fuzzy • Without constant prefix, exhaustive – Regex: (http|ftp)://foo.com – Wildcard: ?oo?ar – Fuzzy: foobar~ • Re-implemented as automata queries – Just parsers that produce a DFA – Improved performance and scalability – (http|ftp)://foo.com examines 2 terms.
  • 11. Stemming • Stemmers work at index and query time – walked, walking -> walk – Can increase retrieval effectiveness • Some problems – Mistakes: international -> intern – Must determine language of documents – Multilingual cases can get messy – Tuning is difficult: must re-index – Unfriendly: wildcards on stemmed terms…
  • 12. Expansion instead • Don’t remove data at index time – Expand the query instead. – Single field now works well for all queries: exact match, wildcard, expanded, etc. • Simplifies search configuration – Tuning relevance is easier, no re-indexing. – No need to worry about language ID for docs. – Multilingual case is much simpler.
  • 13. Automata expansion • Natural fit for morphology • Use set intersection operators – Minus to subtract exact match case – Union to search multiple languages • Efficient operation – Doesn’t explode for languages with complex morphology
  • 14. Experimental results • 125k docs English test collection • Results are for TD queries • Inverted the “S-Stemmer” • 6 declarative rewrite rules to regex • Competitive with traditional stemming. No Porter S-Stem Automaton Stemming S-Stem MAP 0.4575 0.5069 0.5029 0.4979 MRR 0.8070 0.7862 0.7587 0.8220 # Terms 336,675 280,061 305,710 336,675
  • 15. TODO: • Support expansion models, too in Lucene. • Language-specific resources – lucene-hunspell could provide these • Language-independent tokenization – Unicode rules go a long way. • Scoring that doesn’t need stopwords – For now, use CommonGrams!
  • 16. Spellchecking • Lucene spellchecker builds a separate index to find correction candidates • Perhaps our fuzzy enumeration is now fast enough for small edit distances (e.g. 1,2) to just use the index directly. • Could simplify configurations, especially distributed ones.
  • 18. Community • Merging Lucene and Solr development – Still two separate released “products”!!! – Share mailing list and code repository – Solr dev code in sync with Lucene dev code • Benefits to both Lucene and Solr users – Lucene features exposed to Solr faster – Solr features available to Lucene users
  • 19. Indexing • Flexible Indexing – Customize the format of the index – Decreased RAM usage – Faster IndexReader open and faster seeking • Future – Serialize custom attributes to the index – More RAM savings – Improved index compression – Faster Near-Real-Time performance
  • 20. Text Analysis • Improved Unicode Support – Unicode 4/Supplementary support – ICU integration (to support Unicode 5.2) • Improved Language Support – More languages – Faster indexing performance – Easier packaging and ease-of-use • Improvements to Analysis API
  • 21. Relevance and Scoring • Improved relevance benchmarking – Open Relevance Project – Quality Benchmarking Package • Future: More flexible scoring – How to support BM25, DFR, more vector space? – Some packages/patches available, but • Additional index statistics needed to be simple
  • 22. Other • Improvements to Spatial support – See Chris Male’s talk here: http://vimeo.com/10204365 – Progress in both Lucene and Solr • Steps towards a more modular architecture – Smaller Lucene core – Separate modules with more functionality