SlideShare une entreprise Scribd logo
1  sur  14
Télécharger pour lire hors ligne
The Lucene Search Engine
Kira Radinsky
Based on the material from: Thomas Paul and Steven J. Owens
What is Lucene?
• Doug Cutting’s grandmother’s middle name
• A open source set of Java Classses
– Search Engine/Document Classifier/Indexer
– Developed by Doug Cutting (1996)
• Xerox/Apple/Excite/Nutch/Yahoo/Cloudera
• Hadoop founder, Board of directors of the Apache Software
• Jakarta Apache Product. Strong open source
community support.
• High-performance, full-featured text search
engine library
• Easy to use yet powerful API
Use the Source, Luke
• Document
• Field
– Represents a section of a Document: name for the section + the actual data.
• Analyzer
– Abstract class (to provide interface)
– Document -> tokens (for later indexing)
– StandardAnalyzer class.
• IndexWriter
– Creates and maintains indexes.
• IndexSearcher
– Searches through an index.
• QueryParser
– Builds a parser that can search through an index.
• Query
– Abstract class that contains the search criteria created by the QueryParser.
• Hits
– Contains the Document objects that are returned by running the Query object against the index.
Indexing a Document
Document from an article
private Document createDocument(String article, String author,
String title, String topic,
String url,
Date dateWritten)
{
Document document = new Document();
document.add(Field.Text("author", author));
document.add(Field.Text("title", title));
document.add(Field.Text("topic", topic));
document.add(Field.UnIndexed("url", url));
document.add(Field.Keyword("date", dateWritten));
document.add(Field.UnStored("article", article));
return document;
}
The Field Object
Factory Method Tokenized Indexed Stored Use for
Field.Text(String name,
String value)
Yes Yes Yes
contents you want
stored
Field.Text(String name,
Reader value)
Yes Yes No
contents you don't
want stored
Field.Keyword(String
name, String value)
No Yes Yes
values you don't want
broken down
Field.UnIndexed(String
name, String value)
No No Yes
values you don't want
indexed
Field.UnStored(String
name, String value)
Yes Yes No
values you don't want
stored
Store a Document in the index
String indexDirectory = "lucene-index";
private void indexDocument(Document document)
throws Exception
{
Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(
indexDirectory,
analyzer, false
);
writer.addDocument(document);
writer.optimize();
writer.close();
}
Analyzers and Tokenizers
SimpleAnalyzer SimpleAnalyzer seems to just use a Tokenizer that converts all
of the input to lower case.
StopAnalyzer StopAnalyzer includes the lower-case filter, and also has a filter
that drops out any "stop words", words like articles (a, an, the,
etc) that occur so commonly in english that they might as well
be noise for searching purposes. StopAnalyzer comes with a
set of stop words, but you can instantiate it with your own
array of stop words.
StandardAnalyzer StandardAnalyzer does both lower-case and stop-word
filtering, and in addition tries to do some basic clean-up of
words, for example taking out apostrophes ( ' ) and removing
periods from acronyms (i.e. "T.L.A." becomes "TLA").
Lucene Sandbox Here you can find analyzers in your own language
Adding to an Index
public void indexArticle(
String article,
String author,
String title, String topic,
String url, Date dateWritten)
throws Exception
{
Document document = createDocument
(
article, author,
title, topic, url,
dateWritten
);
indexDocument(document);
}
Searching the Index
Searching
IndexSearcher is = new
IndexSearcher(indexDirectory);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("article",
analyzer);
Query query = parser.parse(searchCriteria);
Hits hits = is.search(query);
Extracting Document objects
for (int i=0; i<hits.length(); i++)
{
Document doc = hits.doc(i);
// display the articles that were
found to the user
}
Search Criteria
Supports several searches: AND OR and NOT,
fuzzy, proximity searches, wildcard searches, and
range searches
– author:Henry relativity AND "quantum physics“
– "string theory" NOT Einstein
– "Galileo Kepler"~5
– author:Johnson date:[01/01/2004 TO 01/31/2004]
Thread Safety
• Indexing and searching are not only thread safe,
but process safe. What this means is that:
– Multiple index searchers can read the lucene index
files at the same time.
– An index writer or reader can edit the lucene index
files while searches are ongoing
– Multiple index writers or readers can try to edit the
lucene index files at the same time (it's important for
the index writer/reader to be closed so it will release
the file lock).
• The query parser is not thread safe,
• The index writer however, is thread safe,

Contenu connexe

Tendances

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!Alex Kursov
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Netgramana
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaPaolo Mottadelli
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Full text search
Full text searchFull text search
Full text searchdeleteman
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tikaJukka Zitting
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?gagravarr
 

Tendances (19)

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Apache Lucene Basics
Apache Lucene BasicsApache Lucene Basics
Apache Lucene Basics
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Azure search
Azure searchAzure search
Azure search
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Full text search
Full text searchFull text search
Full text search
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 

Similaire à Tutorial 5 (lucene)

Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenchesIsmail Mayat
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to ElasticsearchClifford James
 
Elasticsearch python
Elasticsearch pythonElasticsearch python
Elasticsearch pythonvaliantval2
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearchJoey Wen
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1 GokulD
 
ElasticSearch Basics
ElasticSearch BasicsElasticSearch Basics
ElasticSearch BasicsAmresh Singh
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiRobert Calcavecchia
 
Apache lucene - full text search
Apache lucene - full text searchApache lucene - full text search
Apache lucene - full text searchMarcelo Cure
 
DIY Percolator
DIY PercolatorDIY Percolator
DIY Percolatorjdhok
 
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data ModelAnno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data ModelEmanuel Berndl
 
Amazon Elasticsearch and Databases
Amazon Elasticsearch and DatabasesAmazon Elasticsearch and Databases
Amazon Elasticsearch and DatabasesAmazon Web Services
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginnersNeil Baker
 

Similaire à Tutorial 5 (lucene) (20)

Fast track to lucene
Fast track to luceneFast track to lucene
Fast track to lucene
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
IR with lucene
IR with luceneIR with lucene
IR with lucene
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Lucene in Action
Lucene in ActionLucene in Action
Lucene in Action
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Elasticsearch python
Elasticsearch pythonElasticsearch python
Elasticsearch python
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
 
ElasticSearch Basics
ElasticSearch BasicsElasticSearch Basics
ElasticSearch Basics
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
 
Apache lucene - full text search
Apache lucene - full text searchApache lucene - full text search
Apache lucene - full text search
 
DIY Percolator
DIY PercolatorDIY Percolator
DIY Percolator
 
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data ModelAnno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
 
Amazon Elasticsearch and Databases
Amazon Elasticsearch and DatabasesAmazon Elasticsearch and Databases
Amazon Elasticsearch and Databases
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 

Plus de Kira

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Kira
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)Kira
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Kira
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Kira
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Kira
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Kira
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Kira
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Kira
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Kira
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Kira
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Kira
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Kira
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Kira
 

Plus de Kira (13)

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
 

Dernier

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Dernier (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Tutorial 5 (lucene)

  • 1. The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens
  • 2. What is Lucene? • Doug Cutting’s grandmother’s middle name • A open source set of Java Classses – Search Engine/Document Classifier/Indexer – Developed by Doug Cutting (1996) • Xerox/Apple/Excite/Nutch/Yahoo/Cloudera • Hadoop founder, Board of directors of the Apache Software • Jakarta Apache Product. Strong open source community support. • High-performance, full-featured text search engine library • Easy to use yet powerful API
  • 3. Use the Source, Luke • Document • Field – Represents a section of a Document: name for the section + the actual data. • Analyzer – Abstract class (to provide interface) – Document -> tokens (for later indexing) – StandardAnalyzer class. • IndexWriter – Creates and maintains indexes. • IndexSearcher – Searches through an index. • QueryParser – Builds a parser that can search through an index. • Query – Abstract class that contains the search criteria created by the QueryParser. • Hits – Contains the Document objects that are returned by running the Query object against the index.
  • 5. Document from an article private Document createDocument(String article, String author, String title, String topic, String url, Date dateWritten) { Document document = new Document(); document.add(Field.Text("author", author)); document.add(Field.Text("title", title)); document.add(Field.Text("topic", topic)); document.add(Field.UnIndexed("url", url)); document.add(Field.Keyword("date", dateWritten)); document.add(Field.UnStored("article", article)); return document; }
  • 6. The Field Object Factory Method Tokenized Indexed Stored Use for Field.Text(String name, String value) Yes Yes Yes contents you want stored Field.Text(String name, Reader value) Yes Yes No contents you don't want stored Field.Keyword(String name, String value) No Yes Yes values you don't want broken down Field.UnIndexed(String name, String value) No No Yes values you don't want indexed Field.UnStored(String name, String value) Yes Yes No values you don't want stored
  • 7. Store a Document in the index String indexDirectory = "lucene-index"; private void indexDocument(Document document) throws Exception { Analyzer analyzer = new StandardAnalyzer(); IndexWriter writer = new IndexWriter( indexDirectory, analyzer, false ); writer.addDocument(document); writer.optimize(); writer.close(); }
  • 8. Analyzers and Tokenizers SimpleAnalyzer SimpleAnalyzer seems to just use a Tokenizer that converts all of the input to lower case. StopAnalyzer StopAnalyzer includes the lower-case filter, and also has a filter that drops out any "stop words", words like articles (a, an, the, etc) that occur so commonly in english that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own array of stop words. StandardAnalyzer StandardAnalyzer does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (i.e. "T.L.A." becomes "TLA"). Lucene Sandbox Here you can find analyzers in your own language
  • 9. Adding to an Index public void indexArticle( String article, String author, String title, String topic, String url, Date dateWritten) throws Exception { Document document = createDocument ( article, author, title, topic, url, dateWritten ); indexDocument(document); }
  • 11. Searching IndexSearcher is = new IndexSearcher(indexDirectory); Analyzer analyzer = new StandardAnalyzer(); QueryParser parser = new QueryParser("article", analyzer); Query query = parser.parse(searchCriteria); Hits hits = is.search(query);
  • 12. Extracting Document objects for (int i=0; i<hits.length(); i++) { Document doc = hits.doc(i); // display the articles that were found to the user }
  • 13. Search Criteria Supports several searches: AND OR and NOT, fuzzy, proximity searches, wildcard searches, and range searches – author:Henry relativity AND "quantum physics“ – "string theory" NOT Einstein – "Galileo Kepler"~5 – author:Johnson date:[01/01/2004 TO 01/31/2004]
  • 14. Thread Safety • Indexing and searching are not only thread safe, but process safe. What this means is that: – Multiple index searchers can read the lucene index files at the same time. – An index writer or reader can edit the lucene index files while searches are ongoing – Multiple index writers or readers can try to edit the lucene index files at the same time (it's important for the index writer/reader to be closed so it will release the file lock). • The query parser is not thread safe, • The index writer however, is thread safe,