SlideShare une entreprise Scribd logo
1  sur  50
Télécharger pour lire hors ligne
Full Text Search
David LeBer
Align Software Inc.
What is full text search?
How?

•   Wild card database queries

•   Database implementations

•   Third party search engines

•   Text indexing libraries
Wild Card Queries

SELECT FROM 'SOME_TABLE' WHERE 'SOME_COLUMN' LIKE '%Some String%'
Wild Card Queries



•   Easy
Wild Card Queries


•   Slow

•   Hard to optimize

•   Difficult to rank
Database Implementations


•   MySQL FULLTEXT index and MATCH queries

•   PostgreSQL tsvector & tsquery
Database Implementations



•   Fairly Easy
Database Implementations

•   Database specific SQL

•   May include additional limitations
    (i.e: MySQL - MyISAM tables only)

•   Functionality define by the DB engine
Third Party Search Engines



•   Google indexing / searching of your content
Third Party Search Engines


•   Easy

•   Matches user expectations
Third Party Search Engines


•   Content must be available for indexing

•   Loss of control

•   Enhances the Google hegemony
Text Indexing Library



•   Lucene
Text Indexing Library

•   Complete control

•   Database independent

•   Flexible search behaviour

•   Ranked results
Text Indexing Library


•   Adds complexity

•   Additional query language

•   Parallel index
Lucene Overview

•   Open Source - part of the Apache Project

•   Very flexible

•   Wickedly fast

•   Index based
Lucene : Installing


•   Add the Lucene jars to your classpath

•   Use ERIndexing
Lucene : Tasks


•   Indexing

•   Searching
Indexing
What is Indexing?
Indexing : Steps


•   Conversion (to plain text)

•   Analysis (clean and convert the text to tokens)

•   Index (save the tokens to the index)
Indexing : Parts


•   Index - either file or memory based

•   Document - represents a unique object added to the index

•   Field - identifies a chunk of data in the document
Indexing : Classes

•   IndexWriter

•   Directory

•   Analyzer

•   Document

•   Field
Creating an Index

URL indexDirectoryURL = ... // assume exists
File indexFile = new File(indexDirectoryURL.getPath());
FSDirectory indexDirectory = FSDirectory.open(indexFile);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
IndexWriter indexWriter = new IndexWriter(index, analyzer, true,
                                IndexWriter.MaxFieldLength.UNLIMITED);
Indexing : Field Parameters


•   Stored or not

•   Analyzed or not, with and without norms

•   Include position, offset, and term frequency
Indexing : Analyzers

•   SimpleAnalyzer

•   StopAnalyzer

•   StandardAnalyzer

•   ...
Adding a Document

String value = ... // assume exists
Document doc = new Document();
Field docField = new Field("title", value,
                            Field.Store.YES, Field.Index.ANALYZED);
doc.add(docField);
...
indexWriter.addDocument(doc);
Indexing : Fun with indexes



•   Multiple Access
Searching
What is Searching
Searching : Steps

•   Clean the user input

•   Create a Query

•   Query the Index

•   Return the results
Searching : Search Classes
•   IndexReader

•   IndexSearcher

•   Query

•   QueryParser

•   TopDocs/ScoreDocs

•   Document
Searching : QueryTypes
•   TermQuery

•   RangeQuery

•   PrefixQuery

•   BooleanQuery

•   PhraseQuery

•   WildCardQuery

•   FuzzyQuery
Searching : QueryParser
•   'webobjects' - contains an exact match - TermQuery

•   'webobjects apple', 'webobjects OR apple' - an OR Query

•   +webobjects +apple / webobjects AND apple - an AND Query

•   title:webobjects - Contains the term in title field

•   title:webobjects -subject:iTunes / title:webobjects AND NOT
    subject:iTunes

•   (webobjects OR apple) AND iTunes
Searching : QueryParser

•   title:"apple webobjects" - Phrase Query

•   title:"apple webobjects"~5 - slop of 5

•   webobj* - Prefix Query

•   webobjicts~ - Fuzzy Query

•   lastmodified:[1/1/10 TO 1/1/11] - Range Query
Performing a Search

Query q = ... // assume exists
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Using a QueryParser

QueryParser queryParser = new QueryParser(Version.LUCENE_2.9,
                                          "content", analyzer);
Query query = queryParser.parse(queryString);
Demo
Scoring
“The more times a query term appears in a
document relative to the number of times the term
 appears in all the documents in the collection, the
   more relevant that document is to the query”
Boost

•   While Indexing

    •   Document

    •   Field

•   While Searching

    •   Query
Luke
Demo
ERIndexing
ERIndexing : Strengths

•   Hides some of the complexity of integrating Lucene with WO

•   Offers lots of utility and helper methods

•   Speaks WebObjects collection classes

•   Simplifies index creation
ERIndexing : Weaknesses


•   Hides some of the complexity of integrating Lucene with WO

•   Not fully baked

•   Auto indexing may be dangerous
Demo
Beyond Lucene


•   Solr

•   Compass

•   ElasticSearch
Q&A
Lucene: http://lucene.apache.org
Luke: http://code.google.com/p/luke/
Solr: http://lucene.apache.org/solr/
Compass: http://www.compass-project.org/overview.html
ElasticSearch: http://www.elasticsearch.com/

Contenu connexe

Tendances

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!Alex Kursov
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Netgramana
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaPaolo Mottadelli
 
Full text search
Full text searchFull text search
Full text searchdeleteman
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tikaJukka Zitting
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?gagravarr
 

Tendances (19)

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Apache Lucene Basics
Apache Lucene BasicsApache Lucene Basics
Apache Lucene Basics
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Azure search
Azure searchAzure search
Azure search
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Full text search
Full text searchFull text search
Full text search
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 

Similaire à Full Text Search with Lucene

Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenchesIsmail Mayat
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.NetDean Thrasher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Ferret A Ruby Search Engine
Ferret A Ruby Search EngineFerret A Ruby Search Engine
Ferret A Ruby Search Engineelliando dias
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)Kira
 
Test driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBTest driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBAndrew Siemer
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced editionAlexander Tokarev
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHPMike Lively
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 

Similaire à Full Text Search with Lucene (20)

Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Ferret A Ruby Search Engine
Ferret A Ruby Search EngineFerret A Ruby Search Engine
Ferret A Ruby Search Engine
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Test driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBTest driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDB
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced edition
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHP
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 

Plus de WO Community

In memory OLAP engine
In memory OLAP engineIn memory OLAP engine
In memory OLAP engineWO Community
 
Using Nagios to monitor your WO systems
Using Nagios to monitor your WO systemsUsing Nagios to monitor your WO systems
Using Nagios to monitor your WO systemsWO Community
 
Build and deployment
Build and deploymentBuild and deployment
Build and deploymentWO Community
 
Reenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWSReenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWSWO Community
 
Chaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real WorldChaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real WorldWO Community
 
D2W Stateful Controllers
D2W Stateful ControllersD2W Stateful Controllers
D2W Stateful ControllersWO Community
 
Deploying WO on Windows
Deploying WO on WindowsDeploying WO on Windows
Deploying WO on WindowsWO Community
 
Unit Testing with WOUnit
Unit Testing with WOUnitUnit Testing with WOUnit
Unit Testing with WOUnitWO Community
 
Apache Cayenne for WO Devs
Apache Cayenne for WO DevsApache Cayenne for WO Devs
Apache Cayenne for WO DevsWO Community
 
Advanced Apache Cayenne
Advanced Apache CayenneAdvanced Apache Cayenne
Advanced Apache CayenneWO Community
 
Migrating existing Projects to Wonder
Migrating existing Projects to WonderMigrating existing Projects to Wonder
Migrating existing Projects to WonderWO Community
 
iOS for ERREST - alternative version
iOS for ERREST - alternative versioniOS for ERREST - alternative version
iOS for ERREST - alternative versionWO Community
 
"Framework Principal" pattern
"Framework Principal" pattern"Framework Principal" pattern
"Framework Principal" patternWO Community
 
Filtering data with D2W
Filtering data with D2W Filtering data with D2W
Filtering data with D2W WO Community
 
Localizing your apps for multibyte languages
Localizing your apps for multibyte languagesLocalizing your apps for multibyte languages
Localizing your apps for multibyte languagesWO Community
 

Plus de WO Community (20)

KAAccessControl
KAAccessControlKAAccessControl
KAAccessControl
 
In memory OLAP engine
In memory OLAP engineIn memory OLAP engine
In memory OLAP engine
 
Using Nagios to monitor your WO systems
Using Nagios to monitor your WO systemsUsing Nagios to monitor your WO systems
Using Nagios to monitor your WO systems
 
Build and deployment
Build and deploymentBuild and deployment
Build and deployment
 
High availability
High availabilityHigh availability
High availability
 
Reenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWSReenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWS
 
Chaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real WorldChaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real World
 
D2W Stateful Controllers
D2W Stateful ControllersD2W Stateful Controllers
D2W Stateful Controllers
 
Deploying WO on Windows
Deploying WO on WindowsDeploying WO on Windows
Deploying WO on Windows
 
Unit Testing with WOUnit
Unit Testing with WOUnitUnit Testing with WOUnit
Unit Testing with WOUnit
 
Life outside WO
Life outside WOLife outside WO
Life outside WO
 
Apache Cayenne for WO Devs
Apache Cayenne for WO DevsApache Cayenne for WO Devs
Apache Cayenne for WO Devs
 
Advanced Apache Cayenne
Advanced Apache CayenneAdvanced Apache Cayenne
Advanced Apache Cayenne
 
Migrating existing Projects to Wonder
Migrating existing Projects to WonderMigrating existing Projects to Wonder
Migrating existing Projects to Wonder
 
iOS for ERREST - alternative version
iOS for ERREST - alternative versioniOS for ERREST - alternative version
iOS for ERREST - alternative version
 
iOS for ERREST
iOS for ERRESTiOS for ERREST
iOS for ERREST
 
"Framework Principal" pattern
"Framework Principal" pattern"Framework Principal" pattern
"Framework Principal" pattern
 
Filtering data with D2W
Filtering data with D2W Filtering data with D2W
Filtering data with D2W
 
WOver
WOverWOver
WOver
 
Localizing your apps for multibyte languages
Localizing your apps for multibyte languagesLocalizing your apps for multibyte languages
Localizing your apps for multibyte languages
 

Dernier

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Dernier (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Full Text Search with Lucene

  • 1. Full Text Search David LeBer Align Software Inc.
  • 2. What is full text search?
  • 3.
  • 4. How? • Wild card database queries • Database implementations • Third party search engines • Text indexing libraries
  • 5. Wild Card Queries SELECT FROM 'SOME_TABLE' WHERE 'SOME_COLUMN' LIKE '%Some String%'
  • 7. Wild Card Queries • Slow • Hard to optimize • Difficult to rank
  • 8. Database Implementations • MySQL FULLTEXT index and MATCH queries • PostgreSQL tsvector & tsquery
  • 10. Database Implementations • Database specific SQL • May include additional limitations (i.e: MySQL - MyISAM tables only) • Functionality define by the DB engine
  • 11. Third Party Search Engines • Google indexing / searching of your content
  • 12. Third Party Search Engines • Easy • Matches user expectations
  • 13. Third Party Search Engines • Content must be available for indexing • Loss of control • Enhances the Google hegemony
  • 15. Text Indexing Library • Complete control • Database independent • Flexible search behaviour • Ranked results
  • 16. Text Indexing Library • Adds complexity • Additional query language • Parallel index
  • 17. Lucene Overview • Open Source - part of the Apache Project • Very flexible • Wickedly fast • Index based
  • 18. Lucene : Installing • Add the Lucene jars to your classpath • Use ERIndexing
  • 19. Lucene : Tasks • Indexing • Searching
  • 22. Indexing : Steps • Conversion (to plain text) • Analysis (clean and convert the text to tokens) • Index (save the tokens to the index)
  • 23. Indexing : Parts • Index - either file or memory based • Document - represents a unique object added to the index • Field - identifies a chunk of data in the document
  • 24. Indexing : Classes • IndexWriter • Directory • Analyzer • Document • Field
  • 25. Creating an Index URL indexDirectoryURL = ... // assume exists File indexFile = new File(indexDirectoryURL.getPath()); FSDirectory indexDirectory = FSDirectory.open(indexFile); StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT); IndexWriter indexWriter = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
  • 26. Indexing : Field Parameters • Stored or not • Analyzed or not, with and without norms • Include position, offset, and term frequency
  • 27. Indexing : Analyzers • SimpleAnalyzer • StopAnalyzer • StandardAnalyzer • ...
  • 28. Adding a Document String value = ... // assume exists Document doc = new Document(); Field docField = new Field("title", value, Field.Store.YES, Field.Index.ANALYZED); doc.add(docField); ... indexWriter.addDocument(doc);
  • 29. Indexing : Fun with indexes • Multiple Access
  • 32. Searching : Steps • Clean the user input • Create a Query • Query the Index • Return the results
  • 33. Searching : Search Classes • IndexReader • IndexSearcher • Query • QueryParser • TopDocs/ScoreDocs • Document
  • 34. Searching : QueryTypes • TermQuery • RangeQuery • PrefixQuery • BooleanQuery • PhraseQuery • WildCardQuery • FuzzyQuery
  • 35. Searching : QueryParser • 'webobjects' - contains an exact match - TermQuery • 'webobjects apple', 'webobjects OR apple' - an OR Query • +webobjects +apple / webobjects AND apple - an AND Query • title:webobjects - Contains the term in title field • title:webobjects -subject:iTunes / title:webobjects AND NOT subject:iTunes • (webobjects OR apple) AND iTunes
  • 36. Searching : QueryParser • title:"apple webobjects" - Phrase Query • title:"apple webobjects"~5 - slop of 5 • webobj* - Prefix Query • webobjicts~ - Fuzzy Query • lastmodified:[1/1/10 TO 1/1/11] - Range Query
  • 37. Performing a Search Query q = ... // assume exists IndexSearcher searcher = new IndexSearcher(index, true); TopScoreDocCollector collector = TopScoreDocCollector.create(10, true); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs;
  • 38. Using a QueryParser QueryParser queryParser = new QueryParser(Version.LUCENE_2.9, "content", analyzer); Query query = queryParser.parse(queryString);
  • 39. Demo
  • 41. “The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query”
  • 42. Boost • While Indexing • Document • Field • While Searching • Query
  • 43. Luke
  • 44. Demo
  • 46. ERIndexing : Strengths • Hides some of the complexity of integrating Lucene with WO • Offers lots of utility and helper methods • Speaks WebObjects collection classes • Simplifies index creation
  • 47. ERIndexing : Weaknesses • Hides some of the complexity of integrating Lucene with WO • Not fully baked • Auto indexing may be dangerous
  • 48. Demo
  • 49. Beyond Lucene • Solr • Compass • ElasticSearch
  • 50. Q&A Lucene: http://lucene.apache.org Luke: http://code.google.com/p/luke/ Solr: http://lucene.apache.org/solr/ Compass: http://www.compass-project.org/overview.html ElasticSearch: http://www.elasticsearch.com/