De Bitmanager, 2016
You Know, for Search
Peter van der Weerd
Who am I?
• Peter van der Weerd
• Search specialist
• Self-employed (De Bitmanager)
• Enormous span of control 
Search
• Common sense:
Easy
Solved
Yeah, true…
• Install ES
• Fill it with some data
• And o/: we can search
But…
• Are the users satisfied?
• Many people struggle with sub-optimal search
results.
Search as a toolbox
• It consists of 1 or more(!) tools to find what
you need
Searchbox
Faceting (intersecting)
Sorting
More like this
Not more like this (this is not what I mean)
Etc…
Search at Booking
• Destination based (city, region, airport, etc)
• Autocomplete
Results in max 5 destinations, query per
keystroke
• Disambiguation
Show a partitioned result that enables people
to choose a destination
Autocomplete in action
Disambiguation in action
Scoring
Scoring
• In general, Lucene scores like: tf * idf
• Tf = term frequency
The more often a term matches, the more important it is
• Idf = inverse document frequency
The more documents match the term,
the less important it is
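A minimal Python sketch of the tf * idf idea, for illustration only. It uses a textbook-style formula rather than Lucene's actual implementation, and the function names and the tiny corpus are made up for this example.

```python
import math

def tf(term, doc_tokens):
    # Term frequency: how often the term occurs in this document.
    return doc_tokens.count(term)

def idf(term, corpus):
    # Inverse document frequency: the more documents contain the term,
    # the lower the value. The +1 terms avoid division by zero.
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1)) + 1.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["the", "house"],
    ["the", "little", "house", "on", "the", "prairie"],
    ["the", "garden"],
]
```

With this corpus, 'the' occurs in every document and gets the lowest idf, while 'little' occurs in only one and gets the highest.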
Term frequency
• Used to give more importance to terms that
occur relatively often.
• Scoring examples for ‘house’
House
The house
The little house on the prairie
The little house on the prairie blah blah blah
[Figure: "score" arrow alongside the four example titles]
Inverse document frequency
• Prefers less frequent tokens.
• Useless on single-token queries: it only serves
to score multiple tokens relative to each other
• Examples:
house
little
on
the
[Figure: "score" arrow alongside the example tokens]
Drawback of idf
• Other example…
Pekela
Haarlem
Amsterdam
Paris
• Booking switched off idf, but could have used
df instead…
[Figure: "score" arrow alongside the list of place names]
When does idf work
• Idf typically works for large text-like queries.
• The documents *must* be evenly distributed
over shards
(or use dfs_query_then_fetch)
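To see why the distribution over shards matters, here is a small sketch that compares idf computed per shard with idf computed over all shards together. The formula and the shard statistics are assumptions chosen for illustration. In Elasticsearch, search_type=dfs_query_then_fetch collects global term statistics before scoring, which corresponds to the global value below.

```python
import math

def idf(doc_freq, doc_count):
    # Textbook-style idf; not Lucene's exact formula.
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1.0

# Hypothetical shards: (documents containing 'paris', total documents).
# The term is very unevenly distributed over the two shards.
shards = [(90, 100), (2, 100)]

# Each shard scoring on its own statistics sees a different idf.
local_idfs = [idf(df, n) for df, n in shards]

# Using global statistics gives every shard the same idf.
global_idf = idf(sum(df for df, _ in shards), sum(n for _, n in shards))
```

The same term is scored as common on the first shard and as rare on the second, so hits from the second shard would be ranked too high when the results are merged.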
Is tf * idf enough?
• Well, no…
• What to deliver on a query for ‘Paris’?
The city (ehm, there are several cities named Paris)
Airports?
Hotels? Which one? There are 1000’s of them.
• Even worse:
What to deliver for query ‘p’ or ‘pa’?
Record boost
• Based on
Popularity
From where booked
Language
o Same (doc language == site language)
o Local translations
o English
o Mismatch
+ or x?
• Boosts are implemented by adding
• Intuitive justification:
Language could be seen as yet another (implicit!)
search term
Same for popularity: people are typically not
searching for unpopular things
• Example (from an English site):
amsterdam -> amsterdam english popular
But wait…
• How big should the record-boost be?
0..1? 100?
• Lucene scores might vary heavily,
sometimes by more than 10x
• So let’s take 10 as max record-boost
But now the record-boost might outweigh smaller
scores
• Argggggg….
Score ranges
• Difficult to tinker with:
For instance use a stemmed token with boost 0.5
house^1.0 vs houses^0.5
What if the Lucene score is more than 2 times
higher than the stem itself?
• We are doing entity search vs text search
Different scorers
Querying for ‘house’:
Title                             Score:default  Score:BM25  Score:custom
House                             1.22           0.77        1.20
The house                         0.76           0.61        1.10
The little house on the prairie   0.46           0.39        1.05
Normalizing scores
• Goal: each term is scored around 1.0
Base score 1.0
Tf is normalized between 0 .. 0.2 and added to the
base score
Idf is normalized between 0 .. 0.2 and added to the
base score
Giving a score varying between 1 and 1.4 per term
(sometimes we don’t use idf)
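The normalization described above could look roughly like this. The slide does not say how tf and idf are mapped into 0 .. 0.2, so the linear clamp and the cap values (tf_cap, idf_cap) are assumptions for illustration.

```python
def clamp01(value):
    # Restrict a value to the range 0..1.
    return max(0.0, min(1.0, value))

def normalized_term_score(tf, idf=None, tf_cap=5.0, idf_cap=10.0):
    # Every matching term starts at a base score of 1.0.
    score = 1.0
    # tf contributes at most 0.2 (tf_cap is an assumed saturation point).
    score += 0.2 * clamp01(tf / tf_cap)
    # idf is optional ("sometimes we don't use idf") and also adds at most 0.2.
    if idf is not None:
        score += 0.2 * clamp01(idf / idf_cap)
    return score
```

Whatever the raw tf and idf values are, a term score stays between 1.0 and 1.4, which makes additive boosts of a fixed size much easier to reason about.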
Language boosting
• Same language or English: +0.7
• Local language: +0.3
(Roma vs Rome in an English site)
• Mismatched language: -0.3
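A sketch of how these additive language boosts might be applied to a normalized term score. The boost values come from the slide; the function names, the language codes and the way a "local" language is detected are hypothetical.

```python
# Boost values as listed on the slide.
LANGUAGE_BOOST = {"same": 0.7, "english": 0.7, "local": 0.3, "mismatch": -0.3}

def language_class(doc_lang, site_lang, local_lang):
    # Classify a document name relative to the site and the destination.
    if doc_lang == site_lang:
        return "same"
    if doc_lang == "en":
        return "english"
    if doc_lang == local_lang:
        return "local"
    return "mismatch"

def boosted_score(term_score, doc_lang, site_lang, local_lang):
    # Boosts are added to the score, not multiplied.
    return term_score + LANGUAGE_BOOST[language_class(doc_lang, site_lang, local_lang)]

# 'Rome' (en) vs 'Roma' (it) vs 'Rom' (de) on an English site, local language Italian.
rome = boosted_score(1.0, "en", "en", "it")
roma = boosted_score(1.0, "it", "en", "it")
rom = boosted_score(1.0, "de", "en", "it")
```

On the English site, 'Rome' ranks above the local name 'Roma', which in turn ranks above a name in an unrelated language.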
About N-grams
• For auto-complete: left-edge N-Grams
• Rome:
rome
rom
ro
r
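Generating left-edge N-grams is straightforward. The helper below is a hypothetical illustration of what an edge N-gram analyzer emits for a single token, not the analyzer actually used.

```python
def edge_ngrams(token, min_len=1):
    # All prefixes of the lowercased token, longest first.
    token = token.lower()
    return [token[:n] for n in range(len(token), min_len - 1, -1)]

print(edge_ngrams("Rome"))
```

Each prefix is indexed as its own token, so a query for 'ro' becomes an exact term lookup instead of a slow prefix scan.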
About N-grams
• When a user types ‘ro’…
Rome
Ródos
Rotterdam
Etc
• Score depends on percentage of match
(or Levenshtein distance)
[Figure: "score" arrow alongside the matching names]
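Scoring by percentage of match can be sketched as the fraction of the name that the typed prefix covers. The accent folding (so that 'ro' matches 'Ródos') and the function names are assumptions; the slide only states that the score depends on the percentage of match.

```python
import unicodedata

def fold(text):
    # Lowercase and strip accents, e.g. 'Ródos' -> 'rodos'.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def match_fraction(typed, name):
    # Fraction of the name covered by the typed prefix; 0.0 if it is no prefix.
    typed, name = fold(typed), fold(name)
    if not typed or not name.startswith(typed):
        return 0.0
    return len(typed) / len(name)

names = ["Rotterdam", "Ródos", "Rome"]
ranked = sorted(names, key=lambda n: match_fraction("ro", n), reverse=True)
```

Shorter names win because the typed characters cover a larger part of them, which gives the order Rome, Ródos, Rotterdam for the input 'ro'.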
Original approach
• Multiple fields (name, city, region, etc)
• Combining them by a weighted dismax query
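In Elasticsearch, a weighted dismax over several fields can be expressed with the dis_max query. The sketch below only builds the request body. The field names follow the slide, while the boost values, the tie_breaker and the helper name are assumptions.

```python
import json

def weighted_dismax(query_text, field_boosts, tie_breaker=0.1):
    # One match query per field, each with its own boost,
    # combined by a dis_max query.
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": [
                    {"match": {field: {"query": query_text, "boost": boost}}}
                    for field, boost in field_boosts.items()
                ],
            }
        }
    }

# Hypothetical weights: the name matters most, the region least.
body = weighted_dismax("amsterdam", {"name": 3.0, "city": 2.0, "region": 1.0})
print(json.dumps(body, indent=2))
```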
Dismax query
• More subtle way of combining scores.
• Score = max + (sum - max) * tieBreaker
In words: the max plus a percentage of the others
• Edge cases:
Tiebreaker=0
Score is the max. score
Tiebreaker=1
Score is the sum of all the individual scores
(same behavior as boolean or)
Dismax example
• Q= the house
Suppose S[the] = 0.8, S[house]=1.2
• Scores for different tiebreakers:
Bool score (tiebreaker=1): 2.0
Max score (tiebreaker=0): 1.2
Score with tiebreaker=0.1: 1.28
this makes documents containing ‘the house’ a
little bit more important than ‘house’ only.
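The dismax formula and the numbers of this example can be checked with a few lines of Python. The function name is chosen for this sketch only.

```python
def dismax_score(scores, tie_breaker):
    # Score = max + (sum - max) * tieBreaker
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

# S[the] = 0.8, S[house] = 1.2
scores = [0.8, 1.2]

for tie_breaker in (1.0, 0.0, 0.1):
    print(tie_breaker, round(dismax_score(scores, tie_breaker), 2))
```

With a tiebreaker of 0.1 the best clause dominates and the other clause contributes only 0.08.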
Difficulties
• Lack of context
• Hard to create a reliable scoring model
Different approach
• Canonical name:
 Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
• Self name (indexed)
Hotel V Frederiksplein
• Rest (indexed)
Amsterdam, Noord-Holland, Netherlands
Weighting fields
• All fields are equal but some fields are more
equal than others…
Self name is most important
Other names (like the city where a hotel resides)
are less important
• Dismax over self name and other
Payload
• Small piece of information that is added to
every occurrence
• Basically a byte[]
Nowadays: payloads
• We need more information per occurrence of
a token:
Length of the original token
Self-name or other location info
Type of the name (hotel, city, landmark, etc)
• All the above info is encoded in a 32 bit
integer, and indexed as a payload
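Packing this information into a 32-bit integer could be done as sketched below. The slide does not give the bit layout, so the layout (8 bits for the length, 1 bit for the self-name flag, 4 bits for the type), the type codes and the function names are all assumptions.

```python
import struct

# Hypothetical type codes.
TYPE_CODES = {"hotel": 0, "city": 1, "landmark": 2}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}

def encode_payload(token_length, is_self_name, name_type):
    # Assumed layout: bits 0-7 length, bit 8 self-name flag, bits 9-12 type.
    if not 0 <= token_length <= 0xFF:
        raise ValueError("token_length must fit in 8 bits")
    return token_length | (int(is_self_name) << 8) | (TYPE_CODES[name_type] << 9)

def decode_payload(value):
    token_length = value & 0xFF
    is_self_name = bool((value >> 8) & 0x1)
    name_type = TYPE_NAMES[(value >> 9) & 0xF]
    return token_length, is_self_name, name_type

def to_bytes(value):
    # A payload is a byte[]; store the integer as 4 big-endian bytes.
    return struct.pack(">I", value)

payload = encode_payload(22, True, "hotel")
```

A custom scorer can decode the payload for every matching occurrence and decide, for example, that a hit in the self name counts more than a hit in the rest of the canonical name.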
Dismax vs payload
• With fieldinfo in the payload we can simulate
dismax behavior
• We query only 1 index-field (instead of 5)
• Context: easier to do advanced scoring: all info
is in 1 scorer.
• Payloads *are* possible in Elasticsearch, but
more difficult to use
Search
• Difficult
• Sensitive equilibrium
• Impossible to serve them all
Suits
Suits
• Reasons for people to wear a suit might
include:
Hiding the fact that you cannot trust them
Hiding their incompetence
etc

Combining fields
• To prevent double counting, a dismax is
advised.
• The fact that a term occurs in both the title and
the abstract doesn’t make it roughly twice as
important.
But it does make it somewhat more important
Combining fields
• Intuitive reaction: query terms in each other’s
neighborhood are more important…
• Example: search for a book:
chamber secrets rowling
• Expected top result:
Harry Potter and the Chamber of Secrets/J.K.
Rowling
Combining fields
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
• More important if in the same field?
Combining fields
• But: we get an excerpt book that contains the
requested terms
(all terms were present in the abstract field)
• Phrases behave even worse
Combining fields
• Suppose:
 we have 2 fields: F1 and F2
 2 query terms: qt1 and qt2
• Now we have choices how to combine…
Combining fields
• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
This will prefer records where both terms are
found in the same field
• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
This behaves more as if there were no
fields
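The difference between the two combinations can be made concrete with a small calculation. In this sketch '|' is treated as a plain sum (boolean or), every matching term scores 1.0, and the tiebreaker of 0.1 and all names are assumptions for illustration.

```python
def dismax(scores, tie_breaker=0.1):
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

def fields_first(s, tie_breaker=0.1):
    # (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
    return dismax([sum(term_scores.values()) for term_scores in s.values()], tie_breaker)

def terms_first(s, tie_breaker=0.1):
    # (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
    terms = next(iter(s.values())).keys()
    return sum(dismax([s[field][t] for field in s], tie_breaker) for t in terms)

# Scores per field and term. Both terms in F1:
same_field = {"F1": {"qt1": 1.0, "qt2": 1.0}, "F2": {"qt1": 0.0, "qt2": 0.0}}
# One term in F1, the other in F2:
split = {"F1": {"qt1": 1.0, "qt2": 0.0}, "F2": {"qt1": 0.0, "qt2": 1.0}}
```

With fields_first the record with both terms in one field scores 2.0 against 1.1 for the split record. With terms_first both records score 2.0, so the field boundaries no longer matter.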
Combining fields
(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
Combining fields
(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
"_score": 2.1447253,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
Combining fields
• Of course: way more possibilities.
See the multi-match query for examples
Most but not all possibilities can be done by hand
(blending)
Combining fields
• Different strategy:
Combine all fields as if they were one field
Do some re-scoring afterwards
Example:
o Search ‘rowling’ anywhere, score 1
o Search ‘potter’ anywhere, score 1
o Combine with additional queries to do a finishing touch
Explain
• Always use explain (in debug mode)
• Did I already tell you to always use explain?
• Create a new application by first making
explain part of your infrastructure
• At least expose the scores in debug mode.
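One way to make explain part of the infrastructure is a debug switch on the function that builds the search request. In Elasticsearch, setting "explain": true in the search body adds an _explanation to every hit. The helper name and the 'title' field below are hypothetical.

```python
def search_body(query_text, debug=False):
    # Build a search request body; 'title' is a hypothetical field name.
    body = {"query": {"match": {"title": query_text}}}
    if debug:
        # Ask Elasticsearch to explain how each hit was scored.
        body["explain"] = True
    return body

print(search_body("house", debug=True))
```

Keeping the switch in one place means every query in the application can be explained without touching the individual query builders.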
Suits: beware the logic rules…
• Cannot be reversed:
• The fact that I am not wearing a suit does not
imply that:
I am trustworthy
I am competent
You Know, for Bits…
Peter @ bitmanager.nl

