The Once and Future History of Enterprise Search and Open Source
Marc Krellenstein, CTO, marc@lucidimagination.com
Evolving challenges in full text search
  • Finding something in a lot of content (recall, scalability)
      • IBM/STAIRS vs. US gov/Basis, BRS, Dialog, Verity
      • Lycos and Fast, Infoseek, Excite → AltaVista
      • Centralized search → distributed search (Fast, Google, Lucene/Solr)
  • Finding just the good stuff (precision)
      • SMART, Autonomy, Google, Lucene/Solr, authority scores, browsing/clustering/faceting, …
  • Finding it fast (performance)
      • Fast, Google, Lucene/Solr
  • Making it easy (simplicity)
      • Google, Lucene/Solr*
  • Deploying good search everywhere (all of the above plus price, flexibility)
      • Lucene/Solr
Google
  • Breakthrough in precision of Internet search
      • Popularity algorithm hides the bad stuff
      • Proved importance of understanding data & users
      • Set expectations for accuracy of enterprise search
  • Set a new standard for search performance
      • Sub-second (or near)
  • Proved value of good adaptive spell-checking
  • Demonstrated the power of distributed search for scale
  • Reinforced the importance of simplicity and a single search box
  • Proved the value of search
      • Search needs to be everywhere
But Google is not like most enterprise search applications
  • Google
      • Most data is bad, many good enough answers…task is to screen out the bad
      • Many privacy issues among users
      • No security issues
      • Many naïve users with little patience…speed is important
  • Enterprise search
      • Most or all the data may be good, often only one answer to a search need
      • Many security issues
      • Few or no privacy issues between users
      • Naïve and sophisticated users motivated by an organizational purpose
  • The best enterprise search tools will fit enterprise needs
Best practice recall and precision
  • Recall
      • Percent of all relevant documents (items) in the system that are returned
      • 50 good answers in system, 25 returned = 50% recall
  • Precision
      • Percent of documents returned that are relevant
      • 100 returned, 25 are relevant = 25% precision
  • Ideal is 100% recall and 100% precision: return all relevant documents and only those
  • 100% recall is easy – return all documents…but precision is then so low that the relevant ones can’t be found…precision is harder
  • Need adequate recall & enough precision for the task
      • That will vary by application (data & users)…
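The arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for the example, and the 10,000-document collection used for the "return everything" case is an assumed figure, not one from the talk.

```python
def recall(relevant_returned: int, relevant_total: int) -> float:
    """Fraction of all relevant documents in the system that were returned."""
    return relevant_returned / relevant_total


def precision(relevant_returned: int, returned_total: int) -> float:
    """Fraction of the returned documents that are relevant."""
    return relevant_returned / returned_total


# Numbers from the slide: 50 relevant documents exist,
# 100 documents are returned, 25 of those are relevant.
print(f"recall    = {recall(25, 50):.0%}")
print(f"precision = {precision(25, 100):.0%}")

# "100% recall is easy": return every document in the collection.
# ASSUMPTION: a collection of 10,000 documents (hypothetical size).
collection_size = 10_000
print(f"return-all recall    = {recall(50, 50):.0%}")
print(f"return-all precision = {precision(50, collection_size):.1%}")
```

With the assumed collection size, returning everything gives perfect recall but a precision of half a percent, which is the trade-off the slide describes.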
How to get good recall
  • Collect, index and search all the data
      • Check for missing or corrupt data
      • Index everything – stop words not usually needed today
      • Search everything…limit results by category AFTER the search (clustering/faceting)
  • Normalize the data
      • Convert to lower case, strip/handle special characters, stemming, …
  • Use spell-checking, synonyms to match users’ vocabulary with content
      • Adaptive spell-checking, application-specific synonyms
  • Light (or real) natural language processing for abstract concepts
      • ‘Recent documents on Asia’
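A minimal sketch of the normalization and synonym steps above, in Python. Everything here is an assumption made for illustration: the suffix-stripping function is a crude stand-in for a real stemmer, and the synonym table is a hypothetical application-specific list. It is not how Lucene/Solr analyzers are implemented.

```python
import re

# ASSUMPTION: a tiny, hypothetical application-specific synonym table.
SYNONYMS = {"laptop": {"notebook"}, "notebook": {"laptop"}}


def crude_stem(token: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Lower-case, drop special characters, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens]


def expand_query(text: str) -> set[str]:
    """Normalize the query and add synonyms so it can match the content's vocabulary."""
    terms = set(normalize(text))
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms


doc_terms = set(normalize("Notebook reviews"))
query_terms = expand_query("Laptops")
print(query_terms & doc_terms)  # non-empty only because of the synonym expansion
```

Applying the same normalization to both documents and queries is what lets "Laptops" match a document that only says "Notebook".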
How to get good precision
  • Term frequency (TF) – more occurrences of query terms is better
  • Inverse document frequency (IDF) – rarer query terms are more important
  • Phrase boost – query terms near each other is better
  • Field boost – where the query term is in doc matters (e.g., in ‘title’ better)
  • Length normalization – avoid penalizing short docs
  • Recency – all things being equal, recent is better
  • Authority – items linked to, clicked on or bought by others may be better
  • Implicit and explicit relevance feedback, more-like-this – expand query (queries usually underdetermined…intent??)
  • Clustering/faceting – when above fail or intent is not specific
  • Lots of data…Watson, Google Translate
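The first few factors above (TF, IDF, field boost, length normalization) can be combined into a toy scoring function. The sketch below is a simplified TF-IDF-style score written for illustration; it is not Lucene's actual formula, and the corpus, field names and boost values are all hypothetical.

```python
import math
from collections import Counter

# ASSUMPTION: a toy corpus; each document has a 'title' and a 'body' field.
DOCS = {
    "d1": {"title": "solr faceting",
           "body": "faceting groups search results by category"},
    "d2": {"title": "release notes",
           "body": "search search search performance notes and other notes about search"},
    "d3": {"title": "cooking", "body": "recipes for pasta and rice"},
}
FIELD_BOOST = {"title": 2.0, "body": 1.0}  # assumed weights: title matches count double


def idf(term: str) -> float:
    """Rarer terms (found in fewer documents) get a higher weight."""
    df = sum(1 for doc in DOCS.values()
             if any(term in text.split() for text in doc.values()))
    return math.log((len(DOCS) + 1) / (df + 1)) + 1.0


def score(query: str, doc: dict[str, str]) -> float:
    """Sum over fields and query terms of boost * tf * idf * length norm."""
    total = 0.0
    for field, text in doc.items():
        tokens = text.split()
        counts = Counter(tokens)
        length_norm = 1.0 / math.sqrt(len(tokens))  # shorter fields are not penalized
        for term in query.split():
            tf = math.sqrt(counts[term])  # more occurrences help, with damping
            total += FIELD_BOOST[field] * tf * idf(term) * length_norm
    return total


def rank(query: str) -> list[str]:
    """Document ids ordered from highest to lowest score."""
    return sorted(DOCS, key=lambda doc_id: score(query, DOCS[doc_id]), reverse=True)


print(rank("faceting"))
print(rank("search"))
```

Recency, authority and relevance feedback are left out; in practice they would enter as further multiplicative or additive terms in the score.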
The emergence of open source Lucene/Solr
  • Lucene
      • Built in late 90’s by Doug Cutting…Apache release 2001
      • State of the art Java library for indexing and ranking…many ports since
      • Contributed to open source to keep it going and reusable
      • Wide acceptance by 2005, mostly by technology organizations, products
  • Solr
      • Built in 2005 by Yonik Seeley to meet CNET needs for quicker-to-build applications and faceting…had to be open source…Apache release 2006
      • Lucene over HTTP, schema, cache management, replication, … and faceting
  • Open source as a development model, not a religion
  • 4,000+ sites – Apple, Cisco, EMC, HP, IBM, LinkedIn, MySpace, Netflix, Salesforce, Twitter, Gov, Wikipedia…
Current Lucene/Solr: strengths
  • Best practice segmented index (like Google, Fast)
  • Scalability via SolrCloud distributed search → billions of documents
  • Best practice, flexible ranking (term/field/doc boosts, function queries, custom scoring…)
  • Best overall query performance and complete query capabilities (unlimited Boolean operations, wildcards, find-similar, synonyms, spell-check…)
  • Multilingual, query filters, geo search, memory mapped indexes, near real-time search, advanced proximity operators…
  • Rapid innovation
  • Extensible architecture, complete control (open source)
  • No license fees (open source)
  • CORE TECHNOLOGY AS GOOD AS OR BETTER THAN ANY OTHER…AND OPEN SOURCE
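As a rough illustration of field boosts, query filters and faceting in one request, the sketch below builds a Solr select URL in Python. The request parameters used (q, defType, qf, fq, facet, facet.field, rows, wt) are standard Solr parameters; the host, the core name "docs" and the field names "title", "body" and "category" are hypothetical, and the URL is only constructed, not sent.

```python
from typing import Optional
from urllib.parse import urlencode

# ASSUMPTION: hypothetical host and core name.
SOLR_SELECT = "http://localhost:8983/solr/docs/select"


def build_query_url(user_query: str, category: Optional[str] = None) -> str:
    """Build a select URL with a title boost, facet counts and an optional filter."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),       # extended DisMax query parser
        ("qf", "title^2 body"),       # field boost: title matches weigh double (assumed fields)
        ("facet", "true"),
        ("facet.field", "category"),  # facet counts per category (assumed field)
        ("rows", "10"),
        ("wt", "json"),
    ]
    if category is not None:
        # Filter query: restrict results after the user picks a facet value.
        params.append(("fq", f"category:{category}"))
    return SOLR_SELECT + "?" + urlencode(params)


print(build_query_url("enterprise search"))
print(build_query_url("enterprise search", category="news"))
```

The second call mirrors the "search everything, then limit by category" pattern: the query itself is unchanged and the facet selection is added as a filter.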
Open source Lucene/Solr: weaknesses
  • Those typical of open source
      • No formal support
      • Limited access to training, consulting
      • Lack of stringent integrated QA
      • Pace of development and open source environment too complex for some (e.g., what version should I download? What patches? GUI?)
  • Others
      • Lucene/Solr development has tended to focus on core capabilities, so it is missing certain features for enterprise search (e.g., connectors, security, alerts, advanced query operations)
Addressing open source Lucene/Solr weaknesses
  • Lucene/Solr community
      • Apache Lucene/Solr community has a wealth of information on web sites, wikis and mailing lists
      • Community members usually respond quickly to questions
  • Consultants
      • May be especially helpful for systems integration or addressing gaps
  • Commercialization
      • Companies commercializing open source provide commercial support, certified versions, training and consulting…may fill in gaps or address ease of use
      • Examples: Red Hat, MySQL, Lucid Imagination
  • Internal resources – usually in combination with one or more of the above
Product strengths of top commercial competitors
  • Well established players tend to be full-featured
  • Some organizations have focused on a particular application or domain (e.g., ecommerce, publishing, legal, help desk)
  • Some competitors have focused on appliance-like simplicity
Weaknesses of top commercial competitors
  • Usually expensive, especially at scale
  • Platform or portability limitations
  • Limited transparency
  • Limited flexibility, especially for other than intended application or domain
  • Limited customization, especially for appliance-like products
  • Sometimes limited scalability
  • Technical debt and/or lack of rapid innovation
  • Customers are dependent on the company’s continued business success
Current competitive landscape
  • For the last 5 years commercial companies have felt increasing competition from Lucene/Solr because of the combination of its capability and price
      • Very hard to justify multi-million dollar deals given Lucene/Solr
      • Lucene/Solr sometimes wins on performance alone
  • Some competitors have responded with diversification
      • Re-invent themselves as a business intelligence or other kind of company
      • Produce search derivative applications
      • Focus on specific domains
  • Some have been acquired
  • But the need for good, affordable, flexible search remains
The competitive future
  • Basic search has become commoditized and widespread…but
      • Top commercial companies often have one or more key weaknesses
      • Existing search is often mediocre and too expensive or difficult to maintain, grow or customize/enhance
      • Producing best practice search is still hard (and search remains a hard problem…intent, context, NLP…)
  • Market strength and features of competitors will keep competitors going a while…but
      • Very hard to justify high prices, especially for large applications
      • Very hard to justify closed and proprietary technology
  • Lucene/Solr capabilities, performance, control, price and continued rapid innovation (and addressing weaknesses) will likely lead to its dominance
Resources
  • Lucene in Action, Second Edition, by Michael McCandless, Erik Hatcher and Otis Gospodnetic. Manning, 2010.
  • Solr 1.4 Enterprise Search Server, by David Smiley and Eric Pugh. Packt Publishing, 2009.
  • Solr reference guide: http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide
