Search Engines
How They Work and
Why You Need Them
Agenda
1. Why build search engines?
2. Search indexes
3. Open source tools
4. Interesting challenges
What do you
even do all day?
We have Google.
@scarletdrive
Not all search engines are
web search engines.
google.com                      potatoparcel.com
Large scope (entire internet)   Small scope (just a few potatoes)
No control over content         Total control over content
Many use cases                  Optimize for selling potatoes
Most websites have a
custom search engine.
Why build search engines?
● Keep it local and customize it
Let’s try to
search my store.
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
Search: “cat”
SELECT *
FROM items
WHERE title LIKE '%cat%'
n = items in database
m = max length of title strings = 250
Cost of scanning every title: n·m comparisons, which is O(n) when m is a fixed constant
n            n · m (m = 250)
10           2 500
100          25 000
1 000        250 000
10 000       2 500 000
100 000      25 000 000
1 000 000    250 000 000
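The growth in the table can be reproduced with a small sketch. It emulates the `LIKE '%cat%'` scan in plain Python over the toy catalogue and counts the characters it has to look at. The function name `like_scan` and the character count as a cost proxy are assumptions made for illustration; a real database does its own bookkeeping.

```python
# Toy catalogue copied from the slides: (id, title).
ITEMS = [
    (1, "red cat mittens"),
    (2, "blue dog mittens"),
    (3, "blue hat for cats"),
    (4, "vacation hat for dog"),
    (5, "cat hat"),
    (6, "red and blue dog hat"),
    (7, "kitten mittens"),
    (8, "dog booties"),
]


def like_scan(items, needle):
    """Emulate WHERE title LIKE '%needle%' by checking every title."""
    matches = []
    chars_examined = 0
    for item_id, title in items:
        chars_examined += len(title)  # rough upper bound on work for this row
        if needle in title:
            matches.append(item_id)
    return matches, chars_examined


matches, cost = like_scan(ITEMS, "cat")
print(matches)  # [1, 3, 4, 5]

# Ten times as many rows means ten times as much work.
_, bigger_cost = like_scan(ITEMS * 10, "cat")
print(bigger_cost == 10 * cost)  # True
```

Every row is visited no matter how few rows match, which is why the cost follows the size of the table rather than the size of the result.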
Why build search engines?
● Keep it local and customize it
● Improve performance
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
● Search for “cat” incorrectly
returns “vacation hat for dog”
● Search for “cat” doesn’t return
“kitten mittens”
● Search for “cats” doesn’t return
“cat hat” or “red cat mittens”
SELECT *
FROM items
WHERE title LIKE '%cats%'
SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...
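To get a feel for how quickly this query grows, here is a rough sketch that generates the patterns instead of writing them by hand. The helper `like_patterns` and the choice of four delimiters are assumptions for illustration; real titles contain more punctuation than this.

```python
from itertools import product


def like_patterns(variants, delimiters):
    """Build one LIKE pattern per (word form, left boundary, right boundary)."""
    boundaries = [None, *delimiters]  # None means start or end of the title
    patterns = []
    for word, left, right in product(variants, boundaries, boundaries):
        prefix = "" if left is None else "%" + left
        suffix = "" if right is None else right + "%"
        patterns.append(prefix + word + suffix)
    return patterns


patterns = like_patterns(["cat", "cats"], [" ", ".", ",", "-"])
print(len(patterns))  # 50
print(patterns[:3])   # ['cat', 'cat %', 'cat.%']
```

Two word forms and four delimiters already need 50 `OR` clauses, and the query still knows nothing about kittens.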
Why build search engines?
● Keep it local and customize it
● Improve performance
● Improve quality of results
But how?
Agenda
1. Why build search engines? ✓
2. Search indexes
3. Open source tools
4. Interesting challenges
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
Inverted Index
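A minimal sketch of the data structure, assuming the terms for each title have already been extracted by hand (how that extraction works is a separate topic). The names `ANALYZED` and `build_inverted_index` are made up for this example.

```python
from collections import defaultdict

# Terms per document, written out by hand from the titles in the table.
ANALYZED = {
    1: ["red", "cat", "mitten"],
    2: ["blue", "dog", "mitten"],
    3: ["blue", "hat", "cat"],
    4: ["vacation", "hat", "dog"],
    5: ["cat", "hat"],
    6: ["red", "blue", "dog", "hat"],
    7: ["kitten", "mitten"],
    8: ["dog", "boot"],
}


def build_inverted_index(analyzed):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(analyzed):
        for term in dict.fromkeys(analyzed[doc_id]):  # drop repeats, keep order
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(ANALYZED)
print(index["cat"])  # [1, 3, 5]
print(index["hat"])  # [3, 4, 5, 6]
```

The table maps IDs to titles; the inverted index turns that around and maps terms to IDs, which is where the name comes from.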
Terminology
● A document is a single searchable unit
○ Example: 7 kitten mittens 11.99
● A field is a defined value in a document
○ Example: id, title, price
● A term is a value extracted from the source in order to build the index
● An inverted index is an internal data structure which maps terms to IDs
● An index is a collection of documents (including many inverted indexes)
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
... ...
5.00 [5]
8.00 [3]
0-10.00 [3, 5]
11.99 [7, 8]
... ...
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
... ... ...
Terminology
● A search index can have many inverted indexes
● A search engine can have many search indexes
items index
○ title inverted index
○ price inverted index
blog-posts index
○ title inverted index
○ post inverted index
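One way to picture this nesting in code is sketched below. The class `SearchIndex`, its methods, and the idea of storing a price bucket such as `0-10.00` as just another term are assumptions for illustration, not how any particular engine lays out its data.

```python
from collections import defaultdict


class SearchIndex:
    """A named collection of documents with one inverted index per field."""

    def __init__(self, name):
        self.name = name
        self.documents = {}
        # field name -> term -> list of document IDs
        self.inverted = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, fields):
        """`fields` maps a field name to the terms extracted for that field."""
        self.documents[doc_id] = fields
        for field, terms in fields.items():
            for term in dict.fromkeys(terms):
                self.inverted[field][term].append(doc_id)

    def lookup(self, field, term):
        return self.inverted[field].get(term, [])


# A search engine holding two search indexes.
engine = {"items": SearchIndex("items"), "blog-posts": SearchIndex("blog-posts")}
engine["items"].add(3, {"title": ["blue", "hat", "cat"], "price": ["8.00", "0-10.00"]})
engine["items"].add(5, {"title": ["cat", "hat"], "price": ["5.00", "0-10.00"]})
engine["items"].add(7, {"title": ["kitten", "mitten"], "price": ["11.99"]})

print(engine["items"].lookup("title", "cat"))      # [3, 5]
print(engine["items"].lookup("price", "0-10.00"))  # [3, 5]
```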
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance
● Improve quality of results
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
Search: “cat”
Looking up a term in the index: O(1)
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
r = number of results found
O(1+r)
...but we usually only ask for a fixed
number of results at a time
O(25) → O(1)
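A small sketch of that lookup, assuming the inverted index is held in an ordinary hash map and that a page holds 25 results. The function name `search` and the `limit` parameter are illustrative.

```python
from itertools import islice

INDEX = {
    "cat": [1, 3, 5],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
}


def search(index, term, limit=25):
    """Return at most `limit` document IDs for a single term."""
    postings = index.get(term, [])        # hash lookup: O(1) on average
    return list(islice(postings, limit))  # copies at most `limit` IDs


print(search(INDEX, "cat"))           # [1, 3, 5]
print(search(INDEX, "hat", limit=2))  # [3, 4]
print(search(INDEX, "fish"))          # []
```

The work no longer depends on how many items are in the store, only on how many results are handed back.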
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
● Improve quality of results
But at
what cost?
Trade-offs
● Space
● System complexity
● Pre-processing time
Query time: O(1)
Index time: O(n·m·p)
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results
Let’s talk about
how we build it.
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
How did we do this??
Step 1: Tokenization
string: “cat hat”
array: [“cat”, “hat”]
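A minimal tokenizer sketch. Splitting on anything that is not a word character or an apostrophe is an assumption made here; real tokenizers have many more rules.

```python
import re


def tokenize(text):
    """Split on runs of characters that are not word characters or apostrophes."""
    return [token for token in re.split(r"[^\w']+", text) if token]


print(tokenize("cat hat"))                # ['cat', 'hat']
print(tokenize("red-and-blue dog hat."))  # ['red', 'and', 'blue', 'dog', 'hat']
```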
Step 2: Normalization
● Stemming
○ “cats” → “cat”
○ “walking” → “walk”
● Stop words
○ Remove “the”, “and”, “to”, etc...
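Both normalization steps can be sketched in a few lines. The suffix-stripping `stem` function below is deliberately naive and only an assumption for illustration; production systems use proper stemmers (Porter-style, for example). The stop word list extends the three words on the slide with a few more, also as an assumption.

```python
# "the", "and", "to" come from the slide; the rest are added for illustration.
STOP_WORDS = {"the", "and", "to", "for", "a", "an", "of"}


def stem(token):
    """Very naive stemming: strip one common suffix if enough letters remain."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(tokens):
    """Drop stop words, then stem what is left."""
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(normalize(["blue", "hat", "for", "cats"]))     # ['blue', 'hat', 'cat']
print(normalize(["walking", "to", "the", "store"]))  # ['walk', 'store']
print(stem("booties"))                               # boot
```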
Step 3: Filters
● Lowercase
○ “Dog” → “dog”
● Synonyms
○ “colour” → “color”
○ “t-shirt” → “tshirt”
○ “canadian” → “canada”
○ “kitten” → “cat”
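The two filters can be sketched as a lowercase pass followed by a dictionary lookup. The `SYNONYMS` table holds the four pairs from the slide; the function name `apply_filters` is made up for this example.

```python
SYNONYMS = {
    "colour": "color",
    "t-shirt": "tshirt",
    "canadian": "canada",
    "kitten": "cat",
}


def apply_filters(tokens):
    """Lowercase every token, then replace it with its synonym if one is defined."""
    lowered = [token.lower() for token in tokens]
    return [SYNONYMS.get(token, token) for token in lowered]


print(apply_filters(["Dog", "Kitten", "Colour"]))  # ['dog', 'cat', 'color']
```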
Quality Problems
1. “cat” search returned “vacation hat for dog”
id title price
4 vacation hat for dog 12.99
cat [1, 3, 5]
hat [4]
dog [4]
vacation [4]
Search: “cat”
→ Document 4 is indexed under “vacation”, “hat”, and “dog”, not under “cat”, so it is no longer returned
Quality Problems
2. “cats” search does not return “red cat mittens”
id title price
1 red cat mittens 14.99
red [1]
cat [1]
mitten [1]
→ All transformations performed on the input data for the index are also performed on the query
Search: “cats” → “cat”
Quality Problems
3. “cat” search does not return “kitten mittens”
id title price
7 kitten mittens 11.99
cat [7]
mitten [7]
Search: “cat”
→ The synonym filter indexes “kitten” as “cat”, so a search for “cat” now returns document 7
3½. Search for “kitten” still returns “kitten mittens”
Search: “kitten” → “cat”
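Putting the steps together, the sketch below runs one `analyze` function over both the titles and the query and checks the three quality problems against the toy store. The order of the steps inside `analyze`, the naive `stem` helper, the small stop word list, and treating a multi-word query as "match any term" are all assumptions for illustration.

```python
import re
from collections import defaultdict

ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}
STOP_WORDS = {"the", "and", "to", "for"}
SYNONYMS = {"kitten": "cat"}


def stem(token):
    """Naive suffix stripping, for illustration only."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, stem, then map synonyms."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        token = stem(token)
        terms.append(SYNONYMS.get(token, token))
    return terms


def build_index(items):
    index = defaultdict(list)
    for doc_id, title in sorted(items.items()):
        for term in dict.fromkeys(analyze(title)):
            index[term].append(doc_id)
    return index


def search(index, query):
    """Analyze the query exactly like the titles, then union the postings."""
    hits = set()
    for term in analyze(query):
        hits.update(index.get(term, []))
    return sorted(hits)


index = build_index(ITEMS)
print(search(index, "cat"))     # [1, 3, 5, 7]
print(search(index, "cats"))    # [1, 3, 5, 7]
print(search(index, "kitten"))  # [1, 3, 5, 7]
```

Document 4 ("vacation hat for dog") is absent from all three result lists, and document 7 ("kitten mittens") is present in all of them.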
Did we solve it?
● Keep it local ✓ and customize it ✓
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results ✓
○ By performing special pre-processing steps
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools
4. Interesting challenges
I want a search engine...
do I have to build it myself?
● Inverted index
● Basic tokenization, normalization, and filters
● Replication, sharding, and distribution
● Caching and warming
● Advanced tokenization, normalization, and filters
● Plugins
● ...and more!
Elasticsearch or Solr: which one should I pick?
It doesn’t matter
● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/
Which one should I pick?
Better for advanced customization
Easier to learn, faster to start up, better docs
~ ~ WARNING: Toria’s personal opinion ~ ~
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges
Interesting Challenge: Scalability
Too much traffic?
Replication
Too much data?
Sharding
Distribution
Replication, Sharding, and Distribution
8 shards
(A,B,C,D,E,F,G,H)
3 replicas each
6 servers
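The numbers on this slide work out to 8 × 3 = 24 shard copies, or 4 per server. The sketch below shows one possible placement and one possible way to route a document to a shard. The round-robin rule in `place_replicas` (which assumes the server count is a multiple of the replica count) and the CRC32 routing in `shard_for` are assumptions for illustration, not the layout any particular engine uses.

```python
import zlib

SHARDS = list("ABCDEFGH")
REPLICAS = 3
SERVERS = 6


def place_replicas(shards, replicas, servers):
    """Spread copies so that no server holds two copies of the same shard.

    Assumes `servers` is a multiple of `replicas`.
    """
    step = servers // replicas
    placement = {server: [] for server in range(servers)}
    for i, shard in enumerate(shards):
        for r in range(replicas):
            placement[(i + r * step) % servers].append(shard)
    return placement


def shard_for(doc_id, shards):
    """Route a document to a shard using a stable hash of its ID."""
    return shards[zlib.crc32(str(doc_id).encode()) % len(shards)]


placement = place_replicas(SHARDS, REPLICAS, SERVERS)
for server, held in placement.items():
    print(server, held)  # each server ends up holding 4 shard copies
```

Losing one server still leaves two copies of every shard it held, and queries for any shard can be spread over three machines.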
Interesting Challenge: Relevance
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
22 feather cat toy 7.99
124 cat and mouse t-shirt 24.50
128 cat t-shirt 31.80
329 “cats rule” sticker 0.99
420 catnip joint for cats 5.99
455 cat toy 7.00
... ... ...
When there are many results, what order should we display them in?
tf-idf
TF(term) = (# times this term appears in doc) / (total # terms in doc)
IDF(term) = log_e(total # docs / # docs which contain this term)
Relevance with tf-idf
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give
every cat a special cat toy.
Query: “cat”
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0
Result order = [1, 3, 2]
Relevance with tf-idf
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give
every cat a special cat toy.
Query: “cat”
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0
Result order = [2, 1, 3]
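Because every document contains “cat”, the IDF is the same for all three and the order is decided by TF alone. The sketch below recomputes both orderings. The `terms` helper with its crude plural stripping is an assumption for illustration.

```python
import re

DOCS = {
    1: "The orange cat is a very good cat.",
    2: "My cat ate an orange.",
    3: "Cats are the best and I will give every cat a special cat toy.",
}


def terms(text):
    """Lowercase, split into words, strip a trailing plural 's' (naive)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def tf(term, doc_terms):
    return doc_terms.count(term) / len(doc_terms)


def rank_by_tf(docs, term):
    """Order document IDs by term frequency, highest first."""
    scores = {doc_id: tf(term, terms(text)) for doc_id, text in docs.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


print(rank_by_tf(DOCS, "cat"))  # [1, 3, 2]

# Stuffing extra keywords into document 2 pushes it to the top.
stuffed = dict(DOCS)
stuffed[2] = "My cat ate an orange. Cat cat cat!"
print(rank_by_tf(stuffed, "cat"))  # [2, 1, 3]
```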
Relevance with tf-idf
1. The orange cat is a good cat.
2. My cat ate an orange.
(assume 100 records which all contain “cat” in them)
Query: “orange cat”
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
TF(cat): doc1 = 2/7 = 0.29, doc2 = 1/5 = 0.20
TF(orange): doc1 = 1/7 = 0.14, doc2 = 1/5 = 0.20
score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
Result order = [2, 1]
Fraction of terms matching the query (no IDF): doc1 = 3/7 = 0.43, doc2 = 2/5 = 0.40
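The same calculation as a short sketch. The document frequencies in `DOC_FREQ` encode the slide's assumption of 100 records, all containing “cat” and two containing “orange”; the function names are made up for this example.

```python
import math

DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",
}
TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}  # assumed, as on the slide


def tf(term, words):
    return words.count(term) / len(words)


def idf(term):
    return math.log(TOTAL_DOCS / DOC_FREQ[term])


def score(query, text):
    """Sum TF * IDF over the query terms."""
    words = text.lower().replace(".", "").split()
    return sum(tf(term, words) * idf(term) for term in query.split())


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
order = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
print(order)  # [2, 1]
```

Without rounding the intermediate values, the scores come out near 0.56 and 0.78. “cat” contributes nothing because it appears in every record, so the rarer term “orange” decides the order.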
tf-idf
bm25
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
Relevance Challenges
● Prevent keyword stuffing or other “gaming the system”
● Phrase matching
● Fuzzy matching
● User factors: language, location
● Other factors: quality, recency, randomness, diversity
Interesting Challenges
We covered:
● Scalability
● Relevance
We did not cover:
● Query understanding
● Numeric range search
● Faceted search
● Autocomplete
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges ✓
Thanks!

Search Engines: How They Work and Why You Need Them

  • 1. Search Engines How They Work and Why You Need Them
  • 2.
  • 3.
  • 4. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 5. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 6. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 7. What do you even do all day? We have Google. @scarletdrive
  • 8. Not all search engines are web search engines. @scarletdrive
  • 9. google.com potatoparcel.com Large scope (entire internet) Small scope (just a few potatoes) No control over content Total control over content Many use cases Optimize for selling potatoes
  • 10.
  • 11.
  • 12. Most websites have a custom search engine. @scarletdrive
  • 13. Why build search engines? ● Keep it local and customize it
  • 14.
  • 15. Let’s try to search my store. @scarletdrive
  • 16. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 17. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 cat SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 18. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 cat SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 19. n = items in database m = max length of title strings n·m
  • 20. n = items in database m = max length of title strings = 250 O(n)
  • 21. n n · m (m=250) 10 2 500 100 25 000 1 000 250 000 10 000 2 500 000 100 000 25 000 000 1 000 000 250 000 000
  • 22. Why build search engines? ● Keep it local and customize it ● Improve performance
  • 23. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 SELECT * FROM items WHERE title LIKE '%cat%'
  • 24. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” SELECT * FROM items WHERE title LIKE '%cat%'
  • 25. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” SELECT * FROM items WHERE title LIKE '%cat%'
  • 26. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” ● Search for “cats” doesn’t return “cat hat” or “red cat mittens” SELECT * FROM items WHERE title LIKE '%cats%'
  • 27. SELECT * FROM items
        WHERE title LIKE 'cat' OR title LIKE 'cats'
           OR title LIKE 'cat %' OR title LIKE 'cats %'
           OR title LIKE '% cat' OR title LIKE '% cats'
           OR title LIKE '% cat %' OR title LIKE '% cats %'
           OR title LIKE '% cat.%' OR title LIKE '% cats.%'
           OR title LIKE '%.cat %' OR title LIKE '%.cats %'
           OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
           OR title LIKE '% cat,%' OR title LIKE '% cats,%'
           OR title LIKE '%,cat %' OR title LIKE '%,cats %'
           OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
           OR title LIKE '% cat-%' OR title LIKE '% cats-%'
           OR title LIKE '%-cat %' OR title LIKE '%-cats %'
           OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
           ...
  • 28. Why build search engines? ● Keep it local and customize it ● Improve performance ● Improve quality of results
  • 30. Agenda 1. Why build search engines? ✓ 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 31. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 32. Inverted Index: red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8]
  • 33. Terminology ● A document is a single searchable unit red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] 7 kitten mittens 11.99
  • 34. Terminology ● A document is a single searchable unit ● A field is a defined value in a document red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 7 kitten mittens 11.99
  • 35. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 7 kitten mittens 11.99
  • 36. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to document IDs red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8]
  • 37. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to document IDs ● An index is a collection of documents (including many inverted indexes) title inverted index: red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] ... price inverted index: 5.00 [5] 8.00 [3] 0-10.00 [3, 5] 11.99 [7, 8] ... id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 ... ... ...
  • 38. Terminology ● A search index can have many inverted indexes ● A search engine can have many search indexes items index: title inverted index, price inverted index blog-posts index: title inverted index, post inverted index
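The terminology above can be sketched as a small data structure. This is an illustration under stated assumptions, not how any particular engine stores its data: the class name `SearchIndex` is hypothetical, and terms are taken as raw whitespace-separated tokens with no further processing.

```python
from collections import defaultdict


class SearchIndex:
    """One search index: stored documents plus one inverted index per field."""

    def __init__(self, name, fields):
        self.name = name
        self.documents = {}  # doc ID -> document (a dict of field values)
        # field name -> inverted index (term -> list of doc IDs)
        self.inverted = {field: defaultdict(list) for field in fields}

    def add(self, doc_id, document):
        self.documents[doc_id] = document
        for field, postings in self.inverted.items():
            # Assumption: a term is simply a whitespace-separated token.
            for term in str(document[field]).split():
                postings[term].append(doc_id)


items = SearchIndex("items", fields=["title", "price"])
items.add(7, {"title": "kitten mittens", "price": 11.99})
items.add(8, {"title": "dog booties", "price": 11.99})

print(dict(items.inverted["title"]))
print(dict(items.inverted["price"]))
```

A search engine would then hold several such objects, for example one for `items` and one for `blog-posts`.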
  • 39. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ● Improve quality of results
  • 40. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] Query: “cat”
  • 41. O(1)
  • 42. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] Query: “cat” Results: id title price 1 red cat mittens 14.99 3 blue hat for cats 8.00 5 cat hat 5.00
  • 43. r = number of results found O(1+r)
  • 44. ...but we usually only ask for a fixed number of results at a time O(25) → O(1)
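A minimal sketch of the lookup cost described above, assuming the inverted index is held in a Python dict (a hash map). Only a few terms from the slides are included, and the function name `search` and the default `limit` of 25 are illustrative. The term lookup is a single hash probe, and capping the page size bounds the work spent on results.

```python
# A few postings lists from the slides: term -> doc IDs.
INDEX = {
    "cat": [1, 3, 5],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
}


def search(index, term, limit=25):
    """Return at most `limit` doc IDs for one term."""
    postings = index.get(term, [])   # O(1) average-case hash lookup
    return postings[:limit]          # O(limit), independent of index size


print(search(INDEX, "cat"))
print(search(INDEX, "hat", limit=2))
```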
  • 45. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ● Improve quality of results
  • 47. Trade-offs ● Space ● System complexity ● Pre-processing time
  • 49. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results
  • 50. Let’s talk about how we build it. @scarletdrive
  • 51. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 How did we do this??
  • 52. Step 1: Tokenization string: “cat hat” array: [“cat”, “hat”] Image from aliexpress.com
  • 53. Step 2: Normalization ● Stemming ○ “cats” → “cat” ○ “walking” → “walk” ● Stop words ○ Remove “the”, “and”, “to”, etc...
  • 54. Step 3: Filters ● Lowercase ○ “Dog” → “dog” ● Synonyms ○ “colour” → “color” ○ “t-shirt” → “tshirt” ○ “canadian” → “canada” ○ “kitten” → “cat”
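The three steps above can be sketched as a small pipeline. Everything here is a toy under stated assumptions: the stop-word list and synonym map contain only a few entries (treating "for" and "a" as stop words is an assumption), the suffix-stripping `stem` function is a crude stand-in for a real stemmer, and lowercasing is done first so that the stop-word lookup works, which may differ from the order a real engine uses.

```python
import re

STOP_WORDS = {"the", "and", "to", "for", "a"}       # assumed, partial list
SYNONYMS = {"colour": "color", "kitten": "cat"}     # assumed, partial map


def tokenize(text):
    """Step 1: split a string into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Step 2a: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Step 2b: toy stemmer that strips a few common suffixes."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def apply_synonyms(tokens):
    """Step 3: map each token to its canonical synonym, if it has one."""
    return [SYNONYMS.get(t, t) for t in tokens]


def analyze(text):
    tokens = remove_stop_words(tokenize(text))
    return apply_synonyms([stem(t) for t in tokens])


print(analyze("Blue hat for Cats"))
print(analyze("kitten mittens"))
print(analyze("walking the Dog"))
```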
  • 55. Quality Problems 1. “cat” search returned “vacation hat for dog”
  • 56. Quality Problems 1. “cat” search returned “vacation hat for dog” id title price 4 vacation hat for dog 12.99 cat [1, 3, 5] hat [4] dog [4] vacation [4]
  • 57. Quality Problems 1. “cat” search returned “vacation hat for dog” cat [1, 3, 5] hat [4] dog [4] vacation [4] Query: “cat” id title price 4 vacation hat for dog 12.99
  • 58. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens”
  • 59. Quality Problems 2. “cats” search does not return “red cat mittens” id title price 1 red cat mittens 14.99 red [1] cat [1] mitten [1] →
  • 60. All transformations performed on the input data for the index are also performed on the query
  • 61. Quality Problems 2. “cats” search does not return “red cat mittens” id title price 1 red cat mittens 14.99 red [1] cat [1] mitten [1] Query: “cats” → “cat”
  • 62. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens” 3. “cat” search does not return “kitten mittens”
  • 63. Quality Problems 3. “cat” search does not return “kitten mittens” id title price 7 kitten mittens 11.99 cat [7] mitten [7]
  • 64. Quality Problems 3. “cat” search does not return “kitten mittens” cat [7] mitten [7] id title price 7 kitten mittens 11.99 Query: “cat”
  • 65. Quality Problems 3 ½ search for “kitten” still returns “kitten mittens” cat [7] mitten [7] id title price 7 kitten mittens 11.99 Query: “kitten” → “cat”
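A sketch tying the three quality problems together: the same analysis function is applied to titles at index time and to the query at search time. The compact `analyze` function, its stop-word and synonym lists, and the OR-style `search` are assumptions made for illustration, not a description of any particular engine.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "to", "for", "a"}   # assumed, partial list
SYNONYMS = {"kitten": "cat"}                    # assumed, partial map

TITLES = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def analyze(text):
    """Tokenize, lowercase, drop stop words, toy-stem, apply synonyms."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        if token.endswith("ies"):
            token = token[:-3]                  # "booties" -> "boot"
        elif token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]                  # "cats" -> "cat"
        terms.append(SYNONYMS.get(token, token))
    return terms


def build_index(titles):
    index = defaultdict(list)
    for doc_id, title in titles.items():
        for term in dict.fromkeys(analyze(title)):   # de-duplicate terms
            index[term].append(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing any analyzed query term."""
    hits = set()
    for term in analyze(query):     # same transformations as at index time
        hits.update(index.get(term, []))
    return sorted(hits)


index = build_index(TITLES)
for query in ("cat", "cats", "kitten"):
    print(query, search(index, query))
```

With the "kitten" synonym in place, all three queries resolve to the term "cat", so they return the same documents, and "vacation hat for dog" (id 4) is no longer matched because lookups are by whole term rather than by substring.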
  • 66. Did we solve it? ● Keep it local ✓ and customize it ✓ ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results ✓ ○ By performing special pre-processing steps
  • 67. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools 4. Interesting challenges
  • 68. I want a search engine... do I have to build it myself? @scarletdrive
  • 70. ● Inverted index ● Basic tokenization, normalization, and filters ● Replication, sharding, and distribution ● Caching and warming ● Advanced tokenization, normalization, and filters ● Plugins ● ...and more!
  • 71. Which one should I pick? It doesn’t matter
  • 72. Which one should I pick? ● Most projects work well with either ● Getting configuration right is most important ● Test with your own data, your own queries
      Side by Side with Elasticsearch and Solr, by Rafał Kuć and Radu Gheorghe
        https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
        https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
      Solr vs. Elasticsearch, by Kelvin Tan
        http://solr-vs-elasticsearch.com/
  • 73. Which one should I pick? Better for advanced customization Easier to learn, faster to start up, better docs ~ ~ WARNING: Toria’s personal opinion ~ ~
  • 74. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges
  • 79. Replication, Sharding, and Distribution: 8 shards (A, B, C, D, E, F, G, H), 3 replicas each, 6 servers
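One way to picture the layout above is a simple placement function. This is a sketch under assumptions: "3 replicas each" is read as three copies of every shard in total, and the fixed-stride placement rule in the hypothetical `place_shards` function is only one of many possible strategies, not how Elasticsearch or Solr actually allocate shards.

```python
from collections import defaultdict


def place_shards(shards, copies, servers):
    """Spread `copies` copies of each shard over `servers` servers.

    Copies of the same shard are spaced `servers // copies` servers apart,
    so no server ever holds two copies of one shard.
    """
    if copies > servers:
        raise ValueError("need at least as many servers as copies")
    stride = servers // copies
    placement = defaultdict(list)
    for s, shard in enumerate(shards):
        for r in range(copies):
            placement[(s + r * stride) % servers].append(shard)
    return dict(placement)


layout = place_shards("ABCDEFGH", copies=3, servers=6)
for server in sorted(layout):
    print(server, layout[server])
```

With 8 shards and 3 copies there are 24 shard copies, so each of the 6 servers ends up holding 4, and losing any one server still leaves two copies of every shard it held.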
  • 82. When there are many results, what order should we display them in?
      id   title                    price
      1    red cat mittens          14.99
      3    blue hat for cats         8.00
      5    cat hat                   5.00
      22   feather cat toy           7.99
      124  cat and mouse t-shirt    24.50
      128  cat t-shirt              31.80
      329  “cats rule” sticker       0.99
      420  catnip joint for cats     5.99
      455  cat toy                   7.00
      ...  ...                      ...
  • 84. Relevance with tf-idf
      TF(term) = (# times this term appears in doc) / (total # terms in doc)
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: “cat”
      1. The orange cat is a very good cat.  TF(cat) = 2/8 = 0.25
      2. My cat ate an orange.  TF(cat) = 1/5 = 0.20
      3. Cats are the best and I will give every cat a special cat toy.  TF(cat) = 3/14 = 0.21
      IDF(cat) = log_e(3/3) = 0
      Result order = [1, 3, 2]
  • 85. Relevance with tf-idf
      TF(term) = (# times this term appears in doc) / (total # terms in doc)
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: “cat”
      1. The orange cat is a very good cat.  TF(cat) = 2/8 = 0.25
      2. My cat ate an orange. Cat cat cat!  TF(cat) = 4/8 = 0.50
      3. Cats are the best and I will give every cat a special cat toy.  TF(cat) = 3/14 = 0.21
      IDF(cat) = log_e(3/3) = 0
      Result order = [2, 1, 3]
  • 86. Relevance with tf-idf
      TF(term) = (# times this term appears in doc) / (total # terms in doc)
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: “orange cat”
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      (assume 100 records which all contain “cat” in them)
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
  • 87. Relevance with tf-idf
      TF(term) = (# times this term appears in doc) / (total # terms in doc)
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: “orange cat”
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
      score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
      score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
  • 88. Relevance with tf-idf
      TF(term) = (# times this term appears in doc) / (total # terms in doc)
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: “orange cat”
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
      score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
      score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
      Result order = [2, 1]
      3/7 = 0.43  2/5 = 0.40  1/7 = 0.14  1/5 = 0.20
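The “orange cat” example can be reproduced with a few lines of code. This is a sketch under the slide's own assumption of 100 documents, all containing “cat” and 2 containing “orange”; the document frequencies are hard-coded in `DOC_FREQ` rather than counted, and the names `tokenize`, `tf`, `idf` and `score` are illustrative. Because the code does not round intermediate values, its scores can differ from the slide's in the second decimal place.

```python
import math
import re

TOTAL_DOCS = 100                          # assumed corpus size from the slide
DOC_FREQ = {"cat": 100, "orange": 2}      # assumed # docs containing each term

DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tf(term, tokens):
    """Term frequency: share of the document's terms equal to `term`."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Inverse document frequency with a natural logarithm."""
    return math.log(TOTAL_DOCS / DOC_FREQ[term])


def score(query, text):
    """Sum of tf * idf over the query terms."""
    tokens = tokenize(text)
    return sum(tf(term, tokens) * idf(term) for term in tokenize(query))


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Since every document contains “cat”, its IDF is zero and it contributes nothing to the score; the ranking is decided entirely by how large a share of each document the rarer term “orange” makes up.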
  • 90. Relevance Challenges ● Prevent keyword stuffing or other “gaming the system” ● Phrase matching ● Fuzzy matching ● User factors: language, location ● Other factors: quality, recency, randomness, diversity
  • 91. Interesting Challenges We covered: ● Scalability ● Relevance We did not cover: ● Query understanding ● Numeric range search ● Faceted search ● Autocomplete
  • 92. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges ✓