Jazzed about Solr
                         People as a Search Problem




Thursday, May 26, 2011
About Me


                    • Building websites since 1996, Java since
                      1997
                    • Prior web search experience
                    • Building and scaling eHarmony
                      products since 2002



What is Jazzed


                    • Subscription Based
                      Dating Site
                    • Incubated by
                      eHarmony




What is Jazzed


                     • Create a profile
                     • Search for others
                     • View their photos
                     • Privately
                       Communicate


How is it different?


                    • Covers broader range of relationships
                    • Easy to get started
                    • Real profiles screened by machine and
                      humans
                    • Fast, effective search-oriented tools



Jazzed Stats

                    • Started Fall 2009
                    • Beta Summer 2010
                    • Launched October 2010
                    • 100,000s of Profiles
                    • 1,000s of Searches Daily


Jazzed Architecture



                    • Event-driven SOA
                    • REST, JSON, EIP, Not-only-SQL
                    • Technology incubation




Tech Stack


                    • Java 6, Spring 3, Jersey 1.1, JMS
                      (AMQP)
                    • RHEL 4, Oracle 11g, Voldemort 0.81,
                      Solr 1.4.1, NFS




Not Covered


                    • Distributed Search
                    • Caching Strategies
                    • Data Import
                    • Analyzers/Tokenizers



Why Lucene?

                    • Proven Solid IR library
                    • Prefer Open Source Solutions
                    • Not Only SQL
                    • Flexible Ranking
                    • Pluggable


Why Solr


                    • Performant, Extensible, RESTful Service
                    • Configuration, Schema, Multicores
                    • Admin Interface
                    • Replication, Backups, Monitoring



Open Source



                    • Strengthens Engineering Team
                    • Be a part of a great community
                    • Not Brochure-ware




Not Only SQL



                    • One solution does not fit all
                    • Prefer availability over consistency
                    • Horizontal Scaling over Vertical




Flexible Ranking

                    • Query Strategies
                         • Boolean Algebra
                         • Vector Space Analysis
                         • Hybrids
                    • Extensive Function Support
                    • Index and Query Boosting


...Oh My!


                    • Standard Plugins - Geospatial*,
                      Faceting, Spelling, MoreLikeThis
                    • Full Text with Highlighted Results
                    • Client agnostic



Inevitable Question

                    • “Does it scale?”
                    • Solr POC Benchmark
                         • 10 Million profiles
                         • >200 queries/sec, 90th percentile under 100ms
                         • Default tuning until 5 million profiles


Profile Service



                    • RESTful Hybrid Data Service
                    • Public, Private, Attributes
                    • Event Producer




Profiles

                    • Mostly structured
                    • Categories - Eye Color, Desired
                      Ethnicity
                    • Dates - Birthdate
                    • Numbers - Coordinates, Age Range
                    • Text - Name, Headline


Inverting People

                    • Stored as an inverted index
                    • Index is randomly accessed by term

                    Term          Document
                    MALE          1, 3, 5, 7, 9
                    FEMALE        2, 4, 6, 8, 10
                    HAIR_RED      8
                    HAIR_BLOND    1, 2, 5, 6
                    EYE_BLUE      1, 2, 3, 10
                    EYE_BROWN     4, 5, 6, 7, 8, 9
                    fun           1, 3, 7, 9
                    funny         2, 4, 6, 10
                    beach         1, 2, 3, 4, 5, 6, 7, 8
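
A minimal sketch of the idea, using three of the profiles from the table above. The function names (invert, lookup_all) are hypothetical and this is not how Lucene stores its index on disk; it only illustrates mapping each term to the ids of the documents that contain it.

```python
from collections import defaultdict

# Profiles 1, 2 and 8 from the example table, reduced to category terms.
profiles = {
    1: ["MALE", "HAIR_BLOND", "EYE_BLUE"],
    2: ["FEMALE", "HAIR_BLOND", "EYE_BLUE"],
    8: ["FEMALE", "HAIR_RED", "EYE_BROWN"],
}


def invert(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            index[term].append(doc_id)
    return dict(index)


def lookup_all(index, terms):
    """Ids of documents containing every term (posting-list intersection)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = invert(profiles)
```

Looking up a term is a single dictionary access, which is the "random access by term" the slide refers to; combining terms is a set operation on the posting lists.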


Schema Design


                    • Single “Table”
                    • One-to-many = multi-value fields
                    • Individual vs Composite Fields
                         • copyField and have both!



Field considerations


                    • Stored or not
                    • Indexed or not
                    • Multivalued - "desired" fields (e.g. Desired Ethnicity)
                    • Type



Solr Types Used
                    • tdate, tint, tfloat* (the ‘t’ is for Trie) - birthdate, loginAt
                    • text - all text
                    • string - id, non indexed text
                    • random - good for random sorts
                    • enum - for all enumerations


Data Duplication


                    • By function - numberPhotos &
                      hasPhotos
                    • By relationship - hiddenBy & hidden
                    • By analysis - name & text
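
A rough sketch of duplication by function and by analysis at index time. The input profile shape and the function name to_index_doc are assumptions made for illustration; only the field names numberPhotos, hasPhotos, name and text come from the slide.

```python
def to_index_doc(profile):
    """Build a flat index document, storing some facts twice on purpose."""
    photo_count = len(profile["photos"])
    return {
        "id": profile["id"],
        # By function: a count for sorting, a flag for cheap filtering.
        "numberPhotos": photo_count,
        "hasPhotos": photo_count > 0,
        # By analysis: name kept on its own and also folded into the
        # catch-all text field that is analyzed for full-text search.
        "name": profile["name"],
        "text": " ".join([profile["name"], profile["headline"]]),
    }
```

The trade is a slightly larger index for simpler, faster queries: filtering on hasPhotos avoids a range query on numberPhotos.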



Saving Profiles


                    • Updating is an in-memory operation
                    • No partial updates
                    • Commit means flush index changes
                    • Autocommit on maxDocs, maxTime or
                      both
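
The commit behaviour can be pictured with a toy model. This is not Solr code: the class AutoCommitBuffer, its tick method and the use of seconds for max_time are all assumptions for illustration (Solr itself fires time-based commits from its own timer). It only shows that added documents stay invisible to searches until a commit, and that a commit is triggered by whichever threshold is reached first.

```python
import time


class AutoCommitBuffer:
    """Toy model of autocommit: pending docs become searchable on commit."""

    def __init__(self, max_docs=None, max_time=None, clock=time.monotonic):
        self.max_docs = max_docs    # commit once this many docs are pending
        self.max_time = max_time    # commit once the oldest pending doc is this old (seconds)
        self.clock = clock
        self.pending = []           # added, not yet visible to searches
        self.searchable = []        # visible after a commit
        self._oldest = None

    def add(self, doc):
        if not self.pending:
            self._oldest = self.clock()
        self.pending.append(doc)
        self.tick()

    def tick(self):
        """Commit if either configured threshold has been reached."""
        if not self.pending:
            return
        too_many = self.max_docs is not None and len(self.pending) >= self.max_docs
        too_old = self.max_time is not None and self.clock() - self._oldest >= self.max_time
        if too_many or too_old:
            self.commit()

    def commit(self):
        self.searchable.extend(self.pending)
        self.pending.clear()
        self._oldest = None
```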



Why Also Voldemort


                    • Private profiles can not be stale
                    • Many fields not searchable or viewable
                      by others
                    • Isolate queries from fetch by id



Querying


                    • Superset of Lucene
                    • Efficient Range Queries
                    • Multiple Query Handlers
                         • Dismax, Boost, Geo



Recall vs Precision



                    • Focus on recall when corpus is small
                    • Precision once it is at critical mass
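
For reference, the two measures can be computed as below. The function name and the id lists are made up for illustration; "relevant" here means whatever set of profiles a human judge would accept for the search.

```python
def precision_recall(returned_ids, relevant_ids):
    """Precision: share of returned results that are relevant.
    Recall: share of relevant profiles that were returned."""
    returned, relevant = set(returned_ids), set(relevant_ids)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With a small corpus, a strict query can return almost nothing, so loosening it (higher recall, lower precision) keeps result pages full; once there are plenty of profiles, tightening it pays off.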




Boolean Queries


                    • Default operator set to AND
                    • +gender:FEMALE +seeking:MALE
                      +eyeColor:EYE_BLUE
                      +hairColor:(HAIR_RED OR HAIR_BLOND)
                    • Sort order is important



Hybrid Queries


                    • Default operator set to OR
                    • +gender:FEMALE +seeking:MALE
                      eyeColor:EYE_BLUE
                      hairColor:(HAIR_RED HAIR_BLOND)
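
A small sketch of how an application might assemble such a query string from search criteria. The helper names clause and build_query are hypothetical, alternatives are joined with an explicit OR so the result does not depend on the default operator, and values are assumed to be plain enumeration names that need no escaping.

```python
def clause(field, values):
    """Render field:VALUE, or field:(A OR B) when several values are allowed."""
    values = list(values)
    if len(values) == 1:
        return f"{field}:{values[0]}"
    return f"{field}:({' OR '.join(values)})"


def build_query(required, preferred=()):
    """Required clauses get a '+' prefix; preferred ones only affect ranking."""
    parts = [f"+{clause(field, values)}" for field, values in required]
    parts += [clause(field, values) for field, values in preferred]
    return " ".join(parts)


hybrid = build_query(
    required=[("gender", ["FEMALE"]), ("seeking", ["MALE"])],
    preferred=[("eyeColor", ["EYE_BLUE"]), ("hairColor", ["HAIR_RED", "HAIR_BLOND"])],
)
```

Passing every criterion as required gives the pure Boolean form; moving the "nice to have" criteria into preferred gives the hybrid form, where they influence ordering without excluding anyone.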




Why you’re lucky if you like redheads

                    • Inverse Document Frequency (IDF)
                    • Rarer is favored over more common
                    • More fields matched = higher ranking

                    Resulting order:
                    1. Blue eyed, redheads
                    2. Blue eyed, blonds
                    3. Redheads
                    4. Blonds
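
A simplified worked example of that ordering, using the document frequencies from the ten-profile example index earlier in the deck. The idf formula is the classic Lucene-style 1 + ln(N / (df + 1)); the score function is a deliberate simplification (a plain sum of idf values) and ignores term frequency, length norms and the other factors a real scorer applies.

```python
import math

# Document frequencies from the ten-profile example index.
doc_freq = {"HAIR_RED": 1, "HAIR_BLOND": 4, "EYE_BLUE": 4}
num_docs = 10


def idf(term):
    """Classic Lucene-style idf: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq[term] + 1))


def score(matched_terms):
    """Simplified score: sum of idf over the optional terms a profile matched."""
    return sum(idf(term) for term in matched_terms)


# Which optional clauses each kind of profile matches.
candidates = {
    "blue-eyed redhead": ["EYE_BLUE", "HAIR_RED"],
    "blue-eyed blond": ["EYE_BLUE", "HAIR_BLOND"],
    "redhead": ["HAIR_RED"],
    "blond": ["HAIR_BLOND"],
}
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

Because only one profile in ten has HAIR_RED, that term carries more weight than HAIR_BLOND, and matching two clauses beats matching one.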

Boosting



                    • Query time by importance
                         • eyeColor:EYE_BLUE^2
                           hairColor:HAIR_BLOND




Filter Fields

                    • Useful for roles and other lists
                    • -hidden:(2 4 6)
                    • -hiddenBy:1

                    id   hidden
                    1    2, 4, 6
                    2    1

                    id   hiddenBy
                    1    2
                    2    1
                    4    1
                    6    1
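
The hiddenBy table is the hidden table turned around. A sketch of deriving one from the other is below; the function names are hypothetical, and reading hidden as "profiles this user has hidden" is an interpretation of the slide.

```python
from collections import defaultdict

# The hidden table from the slide: profile id -> ids that profile has hidden.
hidden = {1: [2, 4, 6], 2: [1]}


def invert_hidden(hidden):
    """Derive hiddenBy: profile id -> ids of the profiles that have hidden it."""
    hidden_by = defaultdict(list)
    for owner in sorted(hidden):
        for target in hidden[owner]:
            hidden_by[target].append(owner)
    return dict(hidden_by)


def exclusion_clause(viewer_id):
    """Filter clause that drops every profile the viewer has hidden."""
    return f"-hiddenBy:{viewer_id}"


hidden_by = invert_hidden(hidden)
```

Storing the relationship in both directions keeps the filter a single short term no matter how many profiles a viewer has hidden.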

Date Math



                    • Simplifies query preprocessing
                    • +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR]
                         • Between 25 and 35 years old
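
A small sketch of generating that clause from an age range. The function name and parameters are hypothetical; only the format of the query string comes from the slide.

```python
def age_range_clause(min_age, max_age, field="birthDate"):
    """Solr date-math range matching people aged min_age..max_age inclusive."""
    # Someone is still max_age until the day before turning max_age + 1,
    # so the earliest birth date is one day after (today - (max_age + 1) years).
    oldest = f"NOW/DAY+1DAY-{max_age + 1}YEAR"
    # The latest birth date is exactly min_age years before today.
    youngest = f"NOW/DAY-{min_age}YEAR"
    return f"+{field}:[{oldest} TO {youngest}]"
```

Because Solr resolves NOW at query time, the application never has to compute calendar dates itself, which is the preprocessing the slide says is simplified.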



Distance Searching




                    • lat, lon, distance
                    • LocalSolr by Patrick O’Leary
                    • Additional overhead ~90ms per query
                    • Superseded in Solr 3.1
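
To make the lat, lon, distance inputs concrete, here is a generic great-circle (haversine) filter. It is not the implementation used by the plugin named above or by Solr 3.1; the function names, the profile dictionaries and the coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(profiles, lat, lon, distance_km):
    """Ids of profiles no farther than distance_km from the given point."""
    return [
        p["id"]
        for p in profiles
        if haversine_km(lat, lon, p["lat"], p["lon"]) <= distance_km
    ]
```

Computing this for every candidate is the kind of per-query work that adds latency, which is why a search engine typically narrows candidates with a cheap bounding filter first.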



Testing Queries



                    • Log queries and ids returned
                    • Version your search strategies
                    • Improve one thing at a time
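
One possible shape for such a log, sketched under assumptions: the record fields, the function names and the use of JSON are illustrative, not the format Jazzed used.

```python
import json


def log_record(strategy_version, query, returned_ids):
    """One log line: which strategy ran, the query it built, the ids it returned."""
    return json.dumps(
        {"strategy": strategy_version, "query": query, "ids": list(returned_ids)},
        sort_keys=True,
    )


def overlap(ids_a, ids_b):
    """Jaccard overlap between two result lists (1.0 means identical sets)."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

Replaying the same logged searches through two strategy versions and comparing the returned ids shows what a single change actually did to the results.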




Geo Service


                    • Read-mostly service
                    • Fields - Postal Code, Country,
                      State, Cities, Lat, Lon
                    • Usage - Registration
                      Validation, City Selection



Operations



                    • Servlet container and filesystem
                    • Jetty 6, 64-bit Java 6 JVM
                    • 8G Heap -XX:+UseCompressedOops




Operations


                    • Active/Passive
                    • Layer 7 Load balancing
                    • Nightly snapshots
                    • Eventually SolrCloud



Multicore


                    • Run multiple schemas on the same Solr instance
                    • Hot swappable for backwards
                      compatible changes
                    • private / public profiles



Security


                     • No security provided
                     • At minimum secure your UpdateHandler

                       <delete>
                         <query>*:*</query>
                       </delete>

                     • Separate Cores


Future

                    • Solr 3.1
                    • Mutual Matching
                    • Faceting / Guided Search
                    • Incorporating spelling
                    • Hierarchies, categories, better ranking
                      models


Faceting

                    • Returns counts
                      with query
                      results
                    • Efficient
                    • Guides the user
                      toward precision
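
What a facet count is can be shown in a few lines. This computes counts over an in-memory result list purely for illustration; Solr computes them inside the index, and the function name and sample documents here are made up.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many matching docs carry each value of a field."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        counts.update(v for v in values if v is not None)
    return counts.most_common()


results = [
    {"id": 1, "hairColor": "HAIR_BLOND"},
    {"id": 2, "hairColor": "HAIR_BLOND"},
    {"id": 8, "hairColor": "HAIR_RED"},
]
hair_facets = facet_counts(results, "hairColor")
```

Showing these counts next to the results tells the user how many profiles each further refinement would leave, which is how faceting steers a broad search toward a precise one.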


Thank you
                         jtuberville@eharmony.com
                            Twitter: @jtuberville




Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

  • 1. About Solr People as A Search Problem Thursday, May 26, 2011
  • 2. About Me • Building websites since 1996, Java since 1997 • Prior web search experience • Building and scaling eHarmony products since 2002
  • 3. What is Jazzed • Subscription Based Dating Site • Incubated by eHarmony
  • 4. What is Jazzed • Create a profile • Search for others • View their photos • Privately Communicate
  • 8. How is it different? • Covers broader range of relationships • Easy to get started • Real profiles screened by machine and humans • Fast, effective search oriented tools
  • 9. Jazzed Stats • Started Fall 2009 • Beta Summer 2010 • Launched October 2010 • 100,000s of Profiles • 1,000s of Searches Daily
  • 10. Jazzed Architecture • Event-driven SOA • REST, JSON, EIP, Not-only-SQL • Technology incubation
  • 11. Tech Stack • Java 6, Spring 3, Jersey 1.1, JMS (AMQP) • RHEL 4, Oracle 11g, Voldemort 0.81, Solr 1.4.1, NFS
  • 14. Not Covered • Distributed Search • Caching Strategies • Data Import • Analyzers/Tokenizers
  • 15. Why Lucene? • Proven Solid IR library • Prefer Open Source Solutions • Not Only SQL • Flexible Ranking • Pluggable
  • 16. Why Solr • Performant, Extensible, RESTful Service • Configuration, Schema, Multicores • Admin Interface • Replication, Backups, Monitoring
  • 17. Open Source • Strengthens Engineering Team • Be a part of a great community • Not Brochure-ware
  • 18. Not Only SQL • One solution does not fit all • Prefer availability over consistency • Horizontal Scaling over Vertical
  • 19. Flexible Ranking • Query Strategies • Boolean Algebra • Vector Space Analysis • Hybrids • Extensive Function Support • Index and Query Boosting
  • 20. ...Oh My! • Standard Plugins - Geospatial*, Faceting, Spelling, MoreLikeThis • Full Text with Highlighted Results • Client agnostic
  • 21. Inevitable Question • “Does it scale?” • Solr POC Benchmark • 10 Million profiles • >200 queries/sec under 100ms 90th • Default tuning until 5 million profiles
  • 22. Profile Service • RESTful Hybrid Data Service • Public, Private, Attributes • Event Producer
  • 23. Profiles • Mostly structured • Categories - Eye Color, Desired Ethnicity • Dates - Birthdate • Numbers - Coordinates, Age Range • Text - Name, Headline
  • 24. Inverting People • Stored as an inverted index • Index random accessed by term
      Term         Document
      MALE         1, 3, 5, 7, 9
      FEMALE       2, 4, 6, 8, 10
      HAIR_RED     8
      HAIR_BLOND   1, 2, 5, 6
      EYE_BLUE     1, 2, 3, 10
      EYE_BROWN    4, 5, 6, 7, 8, 9
      fun          1, 3, 7, 9
      funny        2, 4, 6, 10
      beach        1, 2, 3, 4, 5, 6, 7, 8
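The inverted index on the slide above can be sketched in a few lines of Python. This is only an illustration of the data structure, not how Lucene stores it on disk: the `profiles` dictionary holds three rows consistent with the slide's table, and `build_inverted_index` and `search_all` are hypothetical helper names.

```python
from collections import defaultdict

# Three sample profiles consistent with the slide's table: id -> terms.
profiles = {
    1: ["MALE", "HAIR_BLOND", "EYE_BLUE", "fun", "beach"],
    2: ["FEMALE", "HAIR_BLOND", "EYE_BLUE", "funny", "beach"],
    8: ["FEMALE", "HAIR_RED", "EYE_BROWN", "beach"],
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, terms):
    """Return ids of documents containing every term (a boolean AND)."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(profiles)
print(index["HAIR_BLOND"])                     # [1, 2]
print(search_all(index, ["FEMALE", "beach"]))  # [2, 8]
```

Looking up a term is a single dictionary access, which is the "random access by term" the slide refers to; an AND query is an intersection of postings lists.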
  • 25. Schema Design • Single “Table” • One-to-many = multi-value fields • Individual vs Composite Fields • copyField and have both!
  • 26. Field considerations • Stored or not • Indexed or not • Multivalued - desires fields • Type
  • 27. Solr Types Used • tdate, tint, tfloat* (the ‘t’ is for Trie) - birthdate, loginAt • text - all text • string - id, non indexed text • random - good for random sorts • enum - for all enumerations
  • 28. Data Duplication • By function - numberPhotos & hasPhotos • By relationship - hiddenBy & hidden • By analysis - name & text
  • 29. Saving Profiles • Updating is in memory operation • No partial updates • Commit means flush index changes • Autocommit on maxDocs, maxTime or both
  • 30. Why Also Voldemort • Private profiles can not be stale • Many fields not searchable or viewable by others • Isolate queries from fetch by id
  • 31. Querying • Superset of Lucene • Efficient Range Queries • Multiple Query Handlers • Dismax, Boost, Geo
  • 32. Recall vs Precision • Focus on recall when corpus is small • Precision once it is at critical mass
  • 33. Boolean Queries • Default operator set to AND • +gender:FEMALE +seeking:MALE +eyeColor:EYE_BLUE +hairColor:(HAIR_RED HAIR_BLOND) • Sort order is important
  • 34. Hybrid Queries • Default operator set to OR • +gender:FEMALE +seeking:MALE eyeColor:EYE_BLUE hairColor:(HAIR_RED HAIR_BLOND)
  • 35. Why you’re lucky if you like redheads • Inverse Document Frequency (IDF) • Rarer is favored over more common • More fields matched = higher ranking • Resulting order: 1. Blue eyed, redheads 2. Blue eyed, blonds 3. Redheads 4. Blonds
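A rough way to see why redheads rank first is to compute IDF by hand. The sketch below uses the IDF form from Lucene's classic similarity, `1 + ln(numDocs / (docFreq + 1))`, with document frequencies read off the sample index on the "Inverting People" slide. The scoring is deliberately simplified (a plain sum of IDF over matched optional clauses, ignoring term frequency, norms, and coordination), and the candidate profiles are hypothetical.

```python
import math

NUM_DOCS = 10
# Document frequencies taken from the 10-profile sample index.
DOC_FREQ = {"HAIR_RED": 1, "HAIR_BLOND": 4, "EYE_BLUE": 4}


def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: rarer terms get larger weights."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(matched_terms):
    """Simplified score: sum of IDF over the optional clauses a profile matches."""
    return sum(idf(DOC_FREQ[term], NUM_DOCS) for term in matched_terms)


# Hypothetical candidates and the optional clauses each one matches.
candidates = {
    "blond": ["HAIR_BLOND"],
    "redhead": ["HAIR_RED"],
    "blue-eyed blond": ["EYE_BLUE", "HAIR_BLOND"],
    "blue-eyed redhead": ["EYE_BLUE", "HAIR_RED"],
}

ranking = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
for name in ranking:
    print(f"{name}: {score(candidates[name]):.3f}")
```

Under these assumptions the order comes out as on the slide: matching two clauses beats matching one, and among equal match counts the rarer hair color wins.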
  • 36. Boosting • Query time by importance • eyeColor:EYE_BLUE^2 hairColor:HAIR_BLOND
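The boolean, hybrid, and boosted queries on the preceding slides differ only in which clauses carry a `+` prefix and which carry a `^boost`. A minimal sketch of assembling such a query string follows; `clause` and `build_query` are hypothetical helpers, not part of any Solr client, and the sketch does no escaping of special characters.

```python
def clause(field, values, boost=None):
    """Format one field clause, e.g. hairColor:(HAIR_RED HAIR_BLOND) or eyeColor:EYE_BLUE^2."""
    if isinstance(values, str):
        body = values
    else:
        body = "(" + " ".join(values) + ")"
    text = f"{field}:{body}"
    return f"{text}^{boost}" if boost is not None else text


def build_query(required, optional):
    """Prefix required clauses with '+'; leave optional clauses bare so they only affect ranking."""
    parts = ["+" + clause(*args) for args in required]
    parts += [clause(*args) for args in optional]
    return " ".join(parts)


query = build_query(
    required=[("gender", "FEMALE"), ("seeking", "MALE")],
    optional=[("eyeColor", "EYE_BLUE", 2), ("hairColor", ["HAIR_RED", "HAIR_BLOND"])],
)
print(query)
# +gender:FEMALE +seeking:MALE eyeColor:EYE_BLUE^2 hairColor:(HAIR_RED HAIR_BLOND)
```

Moving a tuple from `optional` to `required` turns the hybrid query back into a strict boolean one, which makes it easy to trade recall for precision as the corpus grows.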
  • 38. Filter Fields • Useful for roles and other lists • -hidden:(2 4 6) • -hiddenBy:1
      id   hidden
      1    2, 4, 6
      2    1

      id   hiddenBy
      1    2
      2    1
      4    1
      6    1
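One reading of the two tables on the Filter Fields slide is that `hiddenBy` is simply `hidden` inverted, so a searcher can be filtered with their own id alone. The sketch below assumes that reading; `invert` and `search_excluding_hidden_by` are hypothetical names and `all_ids` is a made-up corpus.

```python
# Sample data from the slide: profile id -> ids that profile has hidden.
hidden = {1: [2, 4, 6], 2: [1]}
all_ids = [1, 2, 3, 4, 5, 6]  # hypothetical corpus of profile ids


def invert(hidden_map):
    """Derive hiddenBy (id -> ids of the profiles that hid it) from hidden."""
    hidden_by = {}
    for owner, targets in hidden_map.items():
        for target in targets:
            hidden_by.setdefault(target, []).append(owner)
    return hidden_by


def search_excluding_hidden_by(viewer, hidden_by, ids):
    """Emulate -hiddenBy:<viewer>: drop profiles whose hiddenBy list contains the viewer."""
    return [i for i in ids if viewer not in hidden_by.get(i, [])]


hidden_by = invert(hidden)
print(hidden_by)                                          # {2: [1], 4: [1], 6: [1], 1: [2]}
print(search_excluding_hidden_by(1, hidden_by, all_ids))  # [1, 3, 5]
```

The denormalized `hiddenBy` field keeps the query short no matter how many profiles a viewer has hidden, at the cost of re-indexing each hidden profile when the list changes.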
  • 40. Date Math • Simplifies query preprocessing • +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR] • Between 25 and 35 years old
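To check that the date math range really selects ages 25 through 35, the same arithmetic can be done by hand. This is a sketch of the calculation only, with hypothetical helper names; the handling of February 29 (falling back to February 28) is an assumption of this example, not a statement about Solr's behavior.

```python
from datetime import date, timedelta


def minus_years(d, years):
    """Subtract whole years; Feb 29 falls back to Feb 28 in non-leap target years."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def birthdate_range(today, min_age, max_age):
    """Inclusive birthdate bounds for people aged min_age..max_age on `today`."""
    # Lower bound mirrors NOW/DAY+1DAY-36YEAR: the day after the (max_age + 1)th birthday cutoff.
    oldest = minus_years(today + timedelta(days=1), max_age + 1)
    # Upper bound mirrors NOW/DAY-25YEAR: born exactly min_age years ago today.
    youngest = minus_years(today, min_age)
    return oldest, youngest


low, high = birthdate_range(date(2011, 5, 26), 25, 35)
print(low, high)  # 1975-05-27 1986-05-26
```

Someone born on 1975-05-26 turned 36 on the day of the talk and falls just outside the range, while someone born on 1975-05-27 is still 35 and is included.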
  • 41. Distance Searching • lat, lon, distance • SolrLocal by Patrick O’Leary • Additional overhead ~90ms per query • Superseded in Solr 3.1
  • 42. Testing Queries • Log queries and ids returned • Version your search strategies • Improve one thing at a time
  • 43. Geo Service • Read-mostly service • Fields - Postal Code, Country, State, Cities, Lat, Lon • Usage - Registration Validation, City Selection
  • 44. Operations • Servlet container and filesystem • Jetty 6, 64-bit Java 6 JVM • 8G Heap -XX:+UseCompressedOops
  • 45. Operations • Active/Passive • Layer 7 Load balancing • Nightly snapshots • Eventually SolrCloud
  • 46. Multicore • Run multiple schemas on the same • Hot swappable for backwards compatible changes • private / public profiles
  • 47. Security • No security provided • At minimum secure your UpdateHandler • <delete><query>*:*</query></delete> • Separate Cores
  • 48. Future • Solr 3.1 • Mutual Matching • Faceting / Guided Search • Incorporating spelling • Hierarchies, categories, better ranking models
  • 49. Faceting • Returns counts with query results • Efficient • Guides the user toward precision
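Conceptually, a facet is a count of field values across the documents that matched the query. The following sketch shows the idea on a small in-memory result set; the `results` rows and the `facet_counts` helper are hypothetical, and a real Solr facet is computed server-side over the full match set rather than over one page of results.

```python
from collections import Counter

# Hypothetical matched profiles for some query.
results = [
    {"id": 1, "eyeColor": "EYE_BLUE", "hairColor": "HAIR_BLOND"},
    {"id": 2, "eyeColor": "EYE_BLUE", "hairColor": "HAIR_BLOND"},
    {"id": 8, "eyeColor": "EYE_BROWN", "hairColor": "HAIR_RED"},
]


def facet_counts(docs, field):
    """Count how many matched documents carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


counts = facet_counts(results, "hairColor")
print(counts.most_common())  # [('HAIR_BLOND', 2), ('HAIR_RED', 1)]
```

Showing these counts next to the results lets a user narrow the search with one click, which is how faceting guides the user toward precision.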
  • 50. Thank you • jtuberville@eharmony.com • Twitter: @jtuberville