SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Advanced fulltext search
with Sphinx
Adrian Nuta // Sphinxsearch // 2014
Fulltext search in MySQL
● available for MyISAM and lately for InnoDB
● limited in indexation options
○ only min length and list of stopwords

● limited in search options
○ boolean
○ natural mode
○ with query expansion
Why Sphinx?
● GPLv2
● better performance
● lot of features, both on indexing and
searching
● easy to transit from MySQL:
○ easy to index from MySQL
○ SphinxQL - access and query Sphinx using any
MySQL client
MySQL vs Sphinx fulltext index
● B-tree index
● easy to update
frequently, easy to
access by PK
● columnar storage
● OLTP

● inverted index
● hard to update, fast
to read
● keyword based
storage
● OLAP
Simple fulltext search
MySQL:
mysql> SELECT * FROM myindex
WHERE MATCH('title,content') AGAINST ('find me fast');
Sphinx:
mysql> SELECT * FROM myindex
WHERE MATCH('find me fast');
More complete Sphinx search
mysql> SELECT * FROM index WHERE
MATCH('"a quorum search is made here"/4')
ORDER BY WEIGHT() DESC, id ASC
OPTION ranker = expr(
'sum(
exact_hit+10*(min_hit_pos==1)+lcs*(0.1*my_attr)
)*1000 +
bm25'
);
Searching only on some fields
● Not possible in MySQL, need to declare
separate index
● in Sphinx - syntax operator:
mysql> SELECT * FROM myindex
WHERE MATCH(‘@(title,content) find me fast’);
Indexing features
●
●
●
●
●
●

charset table
stopwords, wordforms
stemming and lemmatization
HTML stripping
blending, ignore chars, bigram words
custom regexp filters
Searching operators
●
●
●
●
●
●
●

wildcard
proximity
phrase
start/end
qourum matching
strict order
sentence, paragraph, HTML zone limitation
Ranking factors formulas
● bm25
● LCS - distance between query and
document
● word and hit counting
● tf_idf and idf
● word positioning
● possible to use attribute values
Ranking without field weighting
mysql> SELECT id,title,weight() FROM wikipedia WHERE MATCH('inverted index') OPTION
ranker=expr('sum(hit_count*user_weight)'), field_weights=(title=1,body=1);
+-----------+----------------------------------------------------------+----------+
| id
| title
| weight() |
+-----------+----------------------------------------------------------+----------+
| 221501516 | Index (search engine)
|
125 |
| 221487412 | Inverted index
|
47 |

Doc. 221501516: 1 hit in ‘title’ x 100 + 124 hits in ‘body’ = 125
Doc. 221487412: 2 hits in ‘title’x 100 +

45 hits in ‘body’ = 47
Ranking with field weighting
mysql> SELECT id,title,WEIGHT() FROM index WHERE MATCH('inverted index') OPTION
ranker=expr('sum(hit_count*user_weight)'), field_weights=(title=100,body=1);
+-----------+------------------------------------------------------+----------+
| id
| title
| WEIGHT() |
+-----------+------------------------------------------------------+----------+
| 221487412 | Inverted index
|
245 |
| 221501516 | Index (search engine)
|
224 |

Doc. 221501516: 1

hit in ‘title’ x 100 + 124 hits in ‘body’ = 100+124 = 224

Doc. 221487412: 2 hits in ‘title’ x 100 +

45 hits in ‘body’ = 200 +45 = 245
Words proximity
mysql> SELECT id,title,WEIGHT() FROM index
WHERE MATCH('@title list of football players') OPTION ranker=expr('sum(lcs)');
+-----------+-----------------------------------------------------+----------+
| id
| title
| weight() |
+-----------+-----------------------------------------------------+----------+
| 207381464 | List of football players from Amsterdam
|
4 |
| 221196229 | List of Football Kingz F.C. players
|
3 |
| 210456301 | List of Florida State University football players
|
2 |
+-----------+-----------------------------------------------------+----------+
word and hit count
mysql> SELECT id,title,WEIGHT() AS w FROM index WHERE MATCH('@title php | api') OPTION
ranker=expr('sum(hit_count)');
+---------+----------------------------------------------------------+------+
| id
| title
| w
|
+---------+----------------------------------------------------------+------+
| 1000671 | PHP API gives PHP Warnings - tips?
|
3 |
...
mysql> SELECT id,title,WEIGHT() AS w FROM index WHERE MATCH('@title php | api') OPTION
ranker=expr('sum(word_count)');
+---------+----------------------------------------------------------+------+
| id
| title
| w
|
+---------+----------------------------------------------------------+------+
| 1000671 | PHP API gives PHP Warnings - tips?
|
2 |
Position
mysql> select id,title,weight() as w from forum where match('@title sphinx php api')
option ranker=expr('sum(min_hit_pos)');
+---------+--------------------------------------------------------------+------+
| id
| title
| w
|
+---------+--------------------------------------------------------------+------+
| 1004955 | how can i do a sample search use sphinx php api
|
9 |
| 1004900 | How to update fulltext field using sphinx api of PHP?
|
7 |
| 1008783 | Update MVA-Attributes with the PHP-API Sphinx 2.0.2
|
6 |
| 1000498 | Limits in sphinx when using PHP sphinx API
|
3 |

how can i do a sample search use sphinx php api
1

2

3 4

5

6

7

8

9
IDF
mysql> select id,title,weight() from wikipedia where match('@title (Polyphonic |
Polysyllabic | Oberheim) ') option ranker=expr('sum(max_idf)*1000');
+-----------+---------------------------+----------+
| id
| title
| weight() |
+-----------+---------------------------+----------+
| 165867281 | The Polysyllabic Spree
|
112 | Polysyllabic - rare
| 208650218 | Oberheim Xpander
|
108 | Oberheim - not so rare
| 209138112 | Oberheim OB-8
|
108 |
| 180503990 | Polyphonic Era
|
85 | Polyphonic - common
| 183135294 | Polyphonic C sharp
|
85 |
| 219939232 | Polyphonic HMI
|
85 |
+-----------+---------------------------+----------+
BM25F
mysql> select … where match('odbc') option ranker=expr('1000*bm25f(1,1)');
+------+------------------------------------+-----------+----------+----------+
| id
| title
| title_len | body_len | weight() |
+------+------------------------------------+-----------+----------+----------+
| 179 | odbc_dsn
| 1
| 69
|
775 |
| 170 | type
| 1
| 124
|
742 |
…
mysql> select … where match('odbc') option ranker=expr('1000*bm25f(1,0)');
+------+------------------------------------+-----------+----------+----------+
| id
| title
| title_len | body_len | weight() |
+------+------------------------------------+-----------+----------+----------+
| 169 | Data source configuration options | 4
| 6246
|
758 |
| 179 | odbc_dsn
| 1
| 69
|
743 |
| 170 | type
| 1
| 124
|
689 |
Language morphology
Will the user search ‘shirt’ or ‘shirts’?
● stemming:
○ shirt = shirts

● index_exact_form for exact matching
● lemmatization:
○

men = man
EF-S 18-200mm f/3.5-5.6

blend_chars
● act as both separators and valid chars
● 10-200mm with - blended will index 3 terms:
10-200mm, 10 and 200mm

● leading or trailing blend char behaviour can
be configured to be stripped or indexed
Sentence delimitation
mysql> INSERT INTO index VALUES(1,
'quick brown fox jumps over the lazy dog');
mysql> INSERT INTO index VALUES(2,
'The quick brown fox made it.
Where was the lazy dog?');
mysql> SELECT * FROM index WHERE
MATCH('brown fox SENTENCE lazy dog');
+------+
| id
|
+------+
|
1 |
+------+
Paragraph delimitation
mysql> INSERT
'<p>The quick
mysql> INSERT
'<p>The quick
dog</p>');

INTO index VALUES(1,
brown fox jumps over the lazy dog</p>');
INTO index VALUES(2,
brown fox jumps</p><p>over the lazy

mysql> SELECT * FROM index WHERE
MATCH('brown fox PARAGRAPH lazy dog');
+------+
| id
|
+------+
|
1 |
+------+
More fulltext features
●
●
●
●
●
●

bigrams
more ranking factors: lccs, wlccs, atc
phrase boundary chars
HTML index attributes, elements removal
RLP Chinese tokenization
position step tunning
Thank you!
http://www.sphinxsearch.com
adrian.nuta@sphinxsearch.com

Contenu connexe

Tendances

Applied Partitioning And Scaling Your Database System Presentation
Applied Partitioning And Scaling Your Database System PresentationApplied Partitioning And Scaling Your Database System Presentation
Applied Partitioning And Scaling Your Database System Presentation
Richard Crowley
 

Tendances (20)

Applied Partitioning And Scaling Your Database System Presentation
Applied Partitioning And Scaling Your Database System PresentationApplied Partitioning And Scaling Your Database System Presentation
Applied Partitioning And Scaling Your Database System Presentation
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
 
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
 
Accelerating Local Search with PostgreSQL (KNN-Search)
Accelerating Local Search with PostgreSQL (KNN-Search)Accelerating Local Search with PostgreSQL (KNN-Search)
Accelerating Local Search with PostgreSQL (KNN-Search)
 
ClickHouse materialized views - a secret weapon for high performance analytic...
ClickHouse materialized views - a secret weapon for high performance analytic...ClickHouse materialized views - a secret weapon for high performance analytic...
ClickHouse materialized views - a secret weapon for high performance analytic...
 
Common Table Expressions in MariaDB 10.2
Common Table Expressions in MariaDB 10.2Common Table Expressions in MariaDB 10.2
Common Table Expressions in MariaDB 10.2
 
Chapter 4
Chapter 4Chapter 4
Chapter 4
 
Using PostgreSQL statistics to optimize performance
Using PostgreSQL statistics to optimize performance Using PostgreSQL statistics to optimize performance
Using PostgreSQL statistics to optimize performance
 
Deep dive to PostgreSQL Indexes
Deep dive to PostgreSQL IndexesDeep dive to PostgreSQL Indexes
Deep dive to PostgreSQL Indexes
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
 
Understanding Erlang Terms
Understanding Erlang TermsUnderstanding Erlang Terms
Understanding Erlang Terms
 
Surface3d in R and rgl package.
Surface3d in R and rgl package.Surface3d in R and rgl package.
Surface3d in R and rgl package.
 
Fun with click house window functions webinar slides 2021-08-19
Fun with click house window functions webinar slides  2021-08-19Fun with click house window functions webinar slides  2021-08-19
Fun with click house window functions webinar slides 2021-08-19
 
Implementing qrcode
Implementing qrcodeImplementing qrcode
Implementing qrcode
 
imager package in R and examples..
imager package in R and examples..imager package in R and examples..
imager package in R and examples..
 
Analysis of Algorithms-Heapsort
Analysis of Algorithms-HeapsortAnalysis of Algorithms-Heapsort
Analysis of Algorithms-Heapsort
 
M|18 Understanding the Query Optimizer
M|18 Understanding the Query OptimizerM|18 Understanding the Query Optimizer
M|18 Understanding the Query Optimizer
 
Program Language - Fall 2013
Program Language - Fall 2013 Program Language - Fall 2013
Program Language - Fall 2013
 
Efficient Pagination Using MySQL
Efficient Pagination Using MySQLEfficient Pagination Using MySQL
Efficient Pagination Using MySQL
 

En vedette

Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
unyil96
 
An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted index
weedge
 

En vedette (9)

Real time fulltext search with sphinx
Real time fulltext search with sphinxReal time fulltext search with sphinx
Real time fulltext search with sphinx
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHP
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012
 
Sphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLSphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQL
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
 
Búsquedas Full Text con esteroides - Sphinx Search
Búsquedas Full Text con esteroides - Sphinx SearchBúsquedas Full Text con esteroides - Sphinx Search
Búsquedas Full Text con esteroides - Sphinx Search
 
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxMYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted index
 

Similaire à Advanced fulltext search with Sphinx

What's New In MySQL 5.6
What's New In MySQL 5.6What's New In MySQL 5.6
What's New In MySQL 5.6
Abdul Manaf
 
15 protips for mysql users pfz
15 protips for mysql users   pfz15 protips for mysql users   pfz
15 protips for mysql users pfz
Joshua Thijssen
 

Similaire à Advanced fulltext search with Sphinx (20)

Modern query optimisation features in MySQL 8.
Modern query optimisation features in MySQL 8.Modern query optimisation features in MySQL 8.
Modern query optimisation features in MySQL 8.
 
Using Optimizer Hints to Improve MySQL Query Performance
Using Optimizer Hints to Improve MySQL Query PerformanceUsing Optimizer Hints to Improve MySQL Query Performance
Using Optimizer Hints to Improve MySQL Query Performance
 
Parallel Query in AWS Aurora MySQL
Parallel Query in AWS Aurora MySQLParallel Query in AWS Aurora MySQL
Parallel Query in AWS Aurora MySQL
 
MySQL Cookbook: Recipes for Developers, Alkin Tezuysal and Sveta Smirnova - P...
MySQL Cookbook: Recipes for Developers, Alkin Tezuysal and Sveta Smirnova - P...MySQL Cookbook: Recipes for Developers, Alkin Tezuysal and Sveta Smirnova - P...
MySQL Cookbook: Recipes for Developers, Alkin Tezuysal and Sveta Smirnova - P...
 
MySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for DevelopersMySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for Developers
 
16 MySQL Optimization #burningkeyboards
16 MySQL Optimization #burningkeyboards16 MySQL Optimization #burningkeyboards
16 MySQL Optimization #burningkeyboards
 
Query Optimization with MySQL 5.6: Old and New Tricks
Query Optimization with MySQL 5.6: Old and New TricksQuery Optimization with MySQL 5.6: Old and New Tricks
Query Optimization with MySQL 5.6: Old and New Tricks
 
MySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your BusinessMySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your Business
 
4. Data Manipulation.ppt
4. Data Manipulation.ppt4. Data Manipulation.ppt
4. Data Manipulation.ppt
 
Performance Schema for MySQL Troubleshooting
Performance Schema for MySQL TroubleshootingPerformance Schema for MySQL Troubleshooting
Performance Schema for MySQL Troubleshooting
 
What's New In MySQL 5.6
What's New In MySQL 5.6What's New In MySQL 5.6
What's New In MySQL 5.6
 
MySQLinsanity
MySQLinsanityMySQLinsanity
MySQLinsanity
 
Explain
ExplainExplain
Explain
 
How to build and run oci containers
How to build and run oci containersHow to build and run oci containers
How to build and run oci containers
 
15 protips for mysql users pfz
15 protips for mysql users   pfz15 protips for mysql users   pfz
15 protips for mysql users pfz
 
Ten Reasons Why You Should Prefer PostgreSQL to MySQL
Ten Reasons Why You Should Prefer PostgreSQL to MySQLTen Reasons Why You Should Prefer PostgreSQL to MySQL
Ten Reasons Why You Should Prefer PostgreSQL to MySQL
 
Why Use EXPLAIN FORMAT=JSON?
 Why Use EXPLAIN FORMAT=JSON?  Why Use EXPLAIN FORMAT=JSON?
Why Use EXPLAIN FORMAT=JSON?
 
MySQL Kitchen : spice up your everyday SQL queries
MySQL Kitchen : spice up your everyday SQL queriesMySQL Kitchen : spice up your everyday SQL queries
MySQL Kitchen : spice up your everyday SQL queries
 
Design and Develop SQL DDL statements which demonstrate the use of SQL objec...
 Design and Develop SQL DDL statements which demonstrate the use of SQL objec... Design and Develop SQL DDL statements which demonstrate the use of SQL objec...
Design and Develop SQL DDL statements which demonstrate the use of SQL objec...
 
NoSql
NoSqlNoSql
NoSql
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Advanced fulltext search with Sphinx

  • 1. Advanced fulltext search with Sphinx Adrian Nuta // Sphinxsearch // 2014
  • 2. Fulltext search in MySQL ● available for MyISAM and lately for InnoDB ● limited in indexation options ○ only min length and list of stopwords ● limited in search options ○ boolean ○ natural mode ○ with query expansion
  • 3. Why Sphinx? ● GPLv2 ● better performance ● lot of features, both on indexing and searching ● easy to transit from MySQL: ○ easy to index from MySQL ○ SphinxQL - access and query Sphinx using any MySQL client
  • 4. MySQL vs Sphinx fulltext index ● B-tree index ● easy to update frequently, easy to access by PK ● columnar storage ● OLTP ● inverted index ● hard to update, fast to read ● keyword based storage ● OLAP
  • 5. Simple fulltext search MySQL: mysql> SELECT * FROM myindex WHERE MATCH('title,content') AGAINST ('find me fast'); Sphinx: mysql> SELECT * FROM myindex WHERE MATCH('find me fast');
  • 6. More complete Sphinx search mysql> SELECT * FROM index WHERE MATCH('"a quorum search is made here"/4') ORDER BY WEIGHT() DESC, id ASC OPTION ranker = expr( 'sum( exact_hit+10*(min_hit_pos==1)+lcs*(0.1*my_attr) )*1000 + bm25' );
  • 7. Searching only on some fields ● Not possible in MySQL, need to declare separate index ● in Sphinx - syntax operator: mysql> SELECT * FROM myindex WHERE MATCH(‘@(title,content) find me fast’);
  • 8. Indexing features ● ● ● ● ● ● charset table stopwords, wordforms stemming and lemmatization HTML stripping blending, ignore chars, bigram words custom regexp filters
  • 10. Ranking factors formulas ● bm25 ● LCS - distance between query and document ● word and hit counting ● tf_idf and idf ● word positioning ● possible to use attribute values
  • 11. Ranking without field weighting mysql> SELECT id,title,weight() FROM wikipedia WHERE MATCH('inverted index') OPTION ranker=expr('sum(hit_count*user_weight)'), field_weights=(title=1,body=1); +-----------+----------------------------------------------------------+----------+ | id | title | weight() | +-----------+----------------------------------------------------------+----------+ | 221501516 | Index (search engine) | 125 | | 221487412 | Inverted index | 47 | Doc. 221501516: 1 hit in ‘title’ x 100 + 124 hits in ‘body’ = 125 Doc. 221487412: 2 hits in ‘title’x 100 + 45 hits in ‘body’ = 47
  • 12. Ranking with field weighting mysql> SELECT id,title,WEIGHT() FROM index WHERE MATCH('inverted index') OPTION ranker=expr('sum(hit_count*user_weight)'), field_weights=(title=100,body=1); +-----------+------------------------------------------------------+----------+ | id | title | WEIGHT() | +-----------+------------------------------------------------------+----------+ | 221487412 | Inverted index | 245 | | 221501516 | Index (search engine) | 224 | Doc. 221501516: 1 hit in ‘title’ x 100 + 124 hits in ‘body’ = 100+124 = 224 Doc. 221487412: 2 hits in ‘title’ x 100 + 45 hits in ‘body’ = 200 +45 = 245
  • 13. Words proximity mysql> SELECT id,title,WEIGHT() FROM index WHERE MATCH('@title list of football players') OPTION ranker=expr('sum(lcs)'); +-----------+-----------------------------------------------------+----------+ | id | title | weight() | +-----------+-----------------------------------------------------+----------+ | 207381464 | List of football players from Amsterdam | 4 | | 221196229 | List of Football Kingz F.C. players | 3 | | 210456301 | List of Florida State University football players | 2 | +-----------+-----------------------------------------------------+----------+
  • 14. word and hit count mysql> SELECT id,title,WEIGHT() AS w FROM index WHERE MATCH('@title php | api') OPTION ranker=expr('sum(hit_count)'); +---------+----------------------------------------------------------+------+ | id | title | w | +---------+----------------------------------------------------------+------+ | 1000671 | PHP API gives PHP Warnings - tips? | 3 | ... mysql> SELECT id,title,WEIGHT() AS w FROM index WHERE MATCH('@title php | api') OPTION ranker=expr('sum(word_count)'); +---------+----------------------------------------------------------+------+ | id | title | w | +---------+----------------------------------------------------------+------+ | 1000671 | PHP API gives PHP Warnings - tips? | 2 |
  • 15. Position mysql> select id,title,weight() as w from forum where match('@title sphinx php api') option ranker=expr('sum(min_hit_pos)'); +---------+--------------------------------------------------------------+------+ | id | title | w | +---------+--------------------------------------------------------------+------+ | 1004955 | how can i do a sample search use sphinx php api | 9 | | 1004900 | How to update fulltext field using sphinx api of PHP? | 7 | | 1008783 | Update MVA-Attributes with the PHP-API Sphinx 2.0.2 | 6 | | 1000498 | Limits in sphinx when using PHP sphinx API | 3 | how can i do a sample search use sphinx php api 1 2 3 4 5 6 7 8 9
  • 16. IDF mysql> select id,title,weight() from wikipedia where match('@title (Polyphonic | Polysyllabic | Oberheim) ') option ranker=expr('sum(max_idf)*1000'); +-----------+---------------------------+----------+ | id | title | weight() | +-----------+---------------------------+----------+ | 165867281 | The Polysyllabic Spree | 112 | Polysyllabic - rare | 208650218 | Oberheim Xpander | 108 | Oberheim - not so rare | 209138112 | Oberheim OB-8 | 108 | | 180503990 | Polyphonic Era | 85 | Polyphonic - common | 183135294 | Polyphonic C sharp | 85 | | 219939232 | Polyphonic HMI | 85 | +-----------+---------------------------+----------+
  • 17. BM25F mysql> select … where match('odbc') option ranker=expr('1000*bm25f(1,1)'); +------+------------------------------------+-----------+----------+----------+ | id | title | title_len | body_len | weight() | +------+------------------------------------+-----------+----------+----------+ | 179 | odbc_dsn | 1 | 69 | 775 | | 170 | type | 1 | 124 | 742 | … mysql> select … where match('odbc') option ranker=expr('1000*bm25f(1,0)'); +------+------------------------------------+-----------+----------+----------+ | id | title | title_len | body_len | weight() | +------+------------------------------------+-----------+----------+----------+ | 169 | Data source configuration options | 4 | 6246 | 758 | | 179 | odbc_dsn | 1 | 69 | 743 | | 170 | type | 1 | 124 | 689 |
  • 18. Language morphology Will the user search ‘shirt’ or ‘shirts’? ● stemming: ○ shirt = shirts ● index_exact_form for exact matching ● lemmatization: ○ men = man
  • 19. EF-S 18-200mm f/3.5-5.6 blend_chars ● act as both separators and valid chars ● 10-200mm with - blended will index 3 terms: 10-200mm, 10 and 200mm ● leading or trailing blend char behaviour can be configured to be stripped or indexed
  • 20. Sentence delimitation mysql> INSERT INTO index VALUES(1, 'quick brown fox jumps over the lazy dog'); mysql> INSERT INTO index VALUES(2, 'The quick brown fox made it. Where was the lazy dog?'); mysql> SELECT * FROM index WHERE MATCH('brown fox SENTENCE lazy dog'); +------+ | id | +------+ | 1 | +------+
  • 21. Paragraph delimitation mysql> INSERT '<p>The quick mysql> INSERT '<p>The quick dog</p>'); INTO index VALUES(1, brown fox jumps over the lazy dog</p>'); INTO index VALUES(2, brown fox jumps</p><p>over the lazy mysql> SELECT * FROM index WHERE MATCH('brown fox PARAGRAPH lazy dog'); +------+ | id | +------+ | 1 | +------+
  • 22. More fulltext features ● ● ● ● ● ● bigrams more ranking factors: lccs, wlccs, atc phrase boundary chars HTML index attributes, elements removal RLP Chinese tokenization position step tunning