Lucene Intro
About Me

•   Cristian Vat

•   Java Developer / Geek / Enthusiast

•   Contact

    •   @deathy

    •   ... or TM JUG mailing list
About YOU


•   Heard about Lucene / Solr ?

•   Used Lucene / Solr ?
Databases / Text Search
Databases

•   Select/Search on (usually) exact values or
    ranges

•   Group/Summarize Results

•   Sort results by value(s) of certain result
    column(s)
Text Search

•   Search for individual words/tokens

•   Search long text documents

•   More language-aware

•   “Sorting” by Relevance of results by default
IR Quick Intro
IR = Information Retrieval
IR Quick Intro

•   Doc 1: “I did enact Julius Caesar: I was killed i’
    the Capitol; Brutus killed me.”

•   Doc 2: “So let it be with Caesar. The noble
    Brutus hath told you Caesar was ambitious:”
IR Quick Intro

•   Index

    •   “I” -> Doc 1

    •   “Caesar” -> Doc 1, Doc 2

    •   “enact” -> Doc 1

    •   “noble” -> Doc 2
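
The term-to-document mapping above can be sketched as a small inverted index. This is only an illustrative sketch in plain Java, not Lucene's implementation: the class name `InvertedIndexSketch` and the naive lowercase-and-split tokenization are assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class for illustration; it does not use or mirror the Lucene API.
final class InvertedIndexSketch {
    // The two example documents from the slide (ids 1 and 2).
    static final List<String> DOCS = List.of(
        "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
        "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:");

    // Maps each lowercased term to the sorted ids of the documents containing it.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1;
            // Naive tokenization (an assumption): lowercase, split on anything that is not a-z.
            for (String term : docs.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(DOCS);
        for (String term : List.of("i", "caesar", "enact", "noble")) {
            System.out.println(term + " -> " + index.get(term));
        }
    }
}
```

Looking up a term is then a single map access, which is why searching an index is much cheaper than scanning every document.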
IR Quick Intro
•   Search

    •   caesar

    •   c?es*

    •   caesar AND noble

    •   “Julius Caesar”

    •   Caesar NOT Brutus
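
A rough sketch of how some of these query types could be evaluated against posting lists, using plain Java sets. The class `BooleanSearchSketch` and its hand-built postings are assumptions for illustration and are not how Lucene executes queries. Phrase queries such as “Julius Caesar” are left out, since they also need term positions.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical class for illustration; it does not use or mirror the Lucene API.
final class BooleanSearchSketch {
    // Hand-built postings for a few terms of the two example documents.
    static final Map<String, Set<Integer>> POSTINGS = Map.of(
        "caesar", Set.of(1, 2),
        "brutus", Set.of(1, 2),
        "julius", Set.of(1),
        "noble", Set.of(2));

    // Single term: the posting list for the lowercased term, or an empty set.
    static Set<Integer> term(String t) {
        return new TreeSet<>(POSTINGS.getOrDefault(t.toLowerCase(Locale.ROOT), Set.of()));
    }

    // a AND b: documents present in both posting lists.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a NOT b: documents in a that are absent from b.
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Wildcard: '?' matches one character, '*' matches any run of characters.
    // The result is the union of the postings of every matching indexed term.
    static Set<Integer> wildcard(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '?') regex.append('.');
            else if (c == '*') regex.append(".*");
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        Pattern compiled = Pattern.compile(regex.toString());
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : POSTINGS.entrySet()) {
            if (compiled.matcher(entry.getKey()).matches()) result.addAll(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("caesar AND noble   -> " + and(term("caesar"), term("noble")));
        System.out.println("Caesar NOT Brutus  -> " + not(term("Caesar"), term("Brutus")));
        System.out.println("c?es*              -> " + wildcard("c?es*"));
    }
}
```

With the two example documents, `Caesar NOT Brutus` comes back empty because both documents mention Brutus.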
Lucene Ecosystem




             ...and many more
Lucene

•   IR Library

•   Just API for Indexing/Searching

•   No GUI

•   No parsers for different file formats
Lucene

•   Fast

•   Thread-Safe/Multi-Threaded indexing and
    searching

•   No dependencies! (not even a logging
    framework)
Solr

•   Search Server / Layer over Lucene

•   Provides REST-like HTTP (JSON/XML) API

•   Client libraries in Java, PHP, Python, Ruby,
    Perl, .NET, ...
Solr

•   More structured indexes

•   Replication / Distribution, Master-Slave, etc.

•   Faceted Search / Filtering

•   Indexing of rich document types (via Tika)
Tika

•   “Content Analysis Toolkit”

•   Text and Metadata extraction from various
    rich document types

•   Used by Solr for indexing rich document
    types
Lucene (in more detail)
Lucene Index Structure


•   Index = One or more Documents

•   Document = one or more Fields with values

•   NO Schema/Structure restrictions
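
To make the "no schema" point concrete, here is a minimal plain-Java model of an index whose documents are just bags of named fields. It is a sketch under stated assumptions: `SchemaFreeIndexSketch`, its methods, and the field names used in `main` are all hypothetical and are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical class for illustration; it does not use or mirror the Lucene API.
final class SchemaFreeIndexSketch {
    // Each document is a map of field name to value; no schema constrains the keys.
    private final List<Map<String, String>> documents = new ArrayList<>();

    // Accepts any set of fields and returns the new document's id.
    int add(Map<String, String> fields) {
        documents.add(Map.copyOf(fields));
        return documents.size() - 1;
    }

    // Ids of documents whose given field contains the word (case-insensitive).
    // Documents that lack the field are simply skipped.
    List<Integer> search(String field, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < documents.size(); id++) {
            String value = documents.get(id).get(field);
            if (value != null
                    && Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\W+")).contains(needle)) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SchemaFreeIndexSketch index = new SchemaFreeIndexSketch();
        // Made-up sample documents with different field sets.
        index.add(Map.of("title", "Lucene Intro", "author", "Cristian Vat"));
        index.add(Map.of("title", "Solr Notes", "year", "2011"));
        System.out.println(index.search("title", "lucene"));
        System.out.println(index.search("year", "2011"));
    }
}
```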
Adding documents
Lucene Search
Query Parser
•   AND, OR, NOT ( +/- )

    •   “apache AND lucene NOT solr” ( “+apache
        +lucene -solr” )

•   Range Queries

    •   year:[1994 TO 2011]

•   Wildcard/Fuzzy:

    •   “ap?che”, “apac*”, “appche~0.8”
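
The fuzzy query above can be pictured as an edit-distance check against indexed terms. The sketch below is plain Java, not Lucene code: the class `FuzzyMatchSketch` is hypothetical, and the normalization used (1 minus the edit distance divided by the shorter term's length) is an assumption chosen for illustration.

```java
// Hypothetical class for illustration; it does not use or mirror the Lucene API.
final class FuzzyMatchSketch {
    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [..1.0]: 1.0 means identical (normalization is an assumption).
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) return a.length() == b.length() ? 1.0 : 0.0;
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    // True when the indexed term is similar enough to the query term.
    static boolean matches(String queryTerm, String indexedTerm, double minSimilarity) {
        return similarity(queryTerm, indexedTerm) >= minSimilarity;
    }

    public static void main(String[] args) {
        System.out.println(matches("appche", "apache", 0.8)); // one substitution apart
        System.out.println(matches("appche", "apple", 0.8));  // two edits apart
    }
}
```

Under this assumption, "appche" is one substitution away from "apache", giving a similarity of about 0.83, which clears the 0.8 threshold.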
Sorting of Results


•   Default sort by Relevance

•   Possible to use custom sort fields
Relevance


•   Score is calculated for each document based
    on individual document/fields and the current
    search query
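
As a rough illustration of query-dependent scoring, the sketch below computes a simplified TF-IDF style score for one term. It is deliberately incomplete: boosts, length normalization, and multi-term queries are left out, and the class `TfIdfSketch` and its helper names are hypothetical, not Lucene's.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical class for illustration; it does not use or mirror the Lucene API.
final class TfIdfSketch {
    // The two example documents from the IR intro (indexes 0 and 1).
    static final List<String> CORPUS = List.of(
        "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
        "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:");

    // Naive tokenization (an assumption): lowercase, split on anything that is not a-z.
    static List<String> terms(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z]+"));
    }

    // How often the (already lowercased) term occurs in one document.
    static long frequency(String term, String doc) {
        return terms(doc).stream().filter(term::equals).count();
    }

    // Simplified score: sqrt(term frequency) * (1 + ln(numDocs / (docFreq + 1))).
    static double score(String term, int docIndex, List<String> corpus) {
        String t = term.toLowerCase(Locale.ROOT);
        long tf = frequency(t, corpus.get(docIndex));
        if (tf == 0) return 0.0;
        long df = corpus.stream().filter(d -> frequency(t, d) > 0).count();
        double idf = 1.0 + Math.log((double) corpus.size() / (df + 1));
        return Math.sqrt(tf) * idf;
    }

    public static void main(String[] args) {
        System.out.println("caesar, doc 1: " + score("caesar", 0, CORPUS));
        System.out.println("caesar, doc 2: " + score("caesar", 1, CORPUS));
        System.out.println("noble,  doc 2: " + score("noble", 1, CORPUS));
    }
}
```

Doc 2 mentions Caesar twice, so it scores higher than Doc 1 for the query `caesar`. A rarer term such as `noble` gets a larger per-occurrence weight than a term found in every document.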
For the nerds




http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
Analysis


•   From long continuous text to individual
    tokens/words used for indexing
Analysis


•   Text -> Tokenizer -> (TokenFilter)* -> Tokens
Tokenizer

•   Splits main text into words, by whitespace,
    punctuation, other rules

•   Text: “So, it has come to this!”

•   Tokens: [ “So”, “it”, “has”, “come”, “to”, “this” ]
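
A minimal sketch of such a tokenizer in plain Java, assuming the simplest possible rule: a token is any run of letters or digits. The class `TokenizerSketch` is hypothetical and is not one of Lucene's tokenizers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; it does not use or mirror the Lucene API.
final class TokenizerSketch {
    // Splits text into runs of letters/digits; whitespace and punctuation end a token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString()); // flush the last token
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("So, it has come to this!"));
        // prints [So, it, has, come, to, this]
    }
}
```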
Token Filters

•   Change existing tokens or add new ones

    •   Case-Folding

    •   Synonyms

    •   Stemming
Token Filters
•   Text: “The Pandorica was constructed to
    ensure the safety of the Alliance.”

•   Tokens: [“The”, “Pandorica”, “was”,
    “constructed”, “to”, “ensure”, “the”, “safety”,
    “of”, “the”, “Alliance” ]

•   Filtered: [ “pandorica”, “was”, “construct”,
    “to”, “ensure”, “safe”, “of”, “alliance” ]
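
The filter chain behind this example can be sketched as three small steps applied in order: case-folding, stop-word removal, and stemming. Everything here is an assumption made to reproduce the slide's output: the class `TokenFilterSketch`, the one-word stop list, and the toy suffix rules, which are not a real stemming algorithm and not Lucene's filters.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical class for illustration; it does not use or mirror the Lucene API.
final class TokenFilterSketch {
    // Assumed stop list, just large enough for the slide's example.
    static final Set<String> STOP_WORDS = Set.of("the");

    // Case-folding: every token is lowercased.
    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream().map(t -> t.toLowerCase(Locale.ROOT)).collect(Collectors.toList());
    }

    // Stop-word removal: drops tokens found in STOP_WORDS.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream().filter(t -> !STOP_WORDS.contains(t)).collect(Collectors.toList());
    }

    // Toy stemmer: strips a trailing "ed" or "ty" from longer tokens (not a real algorithm).
    static String stem(String token) {
        if (token.length() > 4 && (token.endsWith("ed") || token.endsWith("ty"))) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    // Applies the filters in order: case-folding, stop words, stemming.
    static List<String> analyze(List<String> tokens) {
        return removeStopWords(lowerCase(tokens)).stream()
            .map(TokenFilterSketch::stem)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "Pandorica", "was", "constructed", "to",
            "ensure", "the", "safety", "of", "the", "Alliance");
        System.out.println(analyze(tokens));
        // prints [pandorica, was, construct, to, ensure, safe, of, alliance]
    }
}
```

Order matters here: the stop list holds lowercase words, so case-folding has to run before stop-word removal for "The" to be dropped.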
Q&A
Questions?
Thanks

Editor's notes

  17. Tika: Office (Word, Excel, PowerPoint), OpenOffice, PDF, images (metadata), audio (ID3 for MP3 files), RTF, etc.
  19. Lucene Index Structure: similar to NoSQL databases. Not all documents need to contain the same fields.