SlideShare une entreprise Scribd logo
1  sur  69
Télécharger pour lire hors ligne
Full Text Search
     Throwdown
  Bill Karwin, Percona Inc.
In a full text
                                   search, the search
                                   engine examines
                                   all of the words in
                                   every stored
                                   document as it
                                   tries to match
                                   search words
                                   supplied by the
                                   user.


http://www.flickr.com/photos/tryingyouth/
StackOverflow Test Data
• Data dump, exported December 2011
• 7.4 million Posts = 8.18 GB




                                      www.percona.com
StackOverflow ER diagram


searchable
   text




                      www.percona.com
The Baseline:
Naive Search Predicates
Some people, when confronted with a problem, think
	

 “I know, I’ll use regular expressions.”
	

 Now they have two problems.
	

 	

   	

   	

   	

   	

   	

   	

   	

   	

   — Jamie Zawinsky




                                                                         www.percona.com
Accuracy issue
• Irrelevant or false matching words ‘one’, ‘money’,
   ‘prone’, etc.:
     SELECT * FROM Posts
      WHERE Body LIKE '%one%'
• Regular expressions in MySQL support escapes
   for word boundaries:
     SELECT * FROM Posts
      WHERE Body RLIKE '[[:<:]]one[[:>:]]'




                                          www.percona.com
Performance issue
• LIKE with wildcards:
  ! SELECT * FROM Posts
                                        49 sec
    WHERE title LIKE '%performance%'
    !   OR body LIKE '%performance%'
    !   OR tags LIKE '%performance%';

• POSIX regular expressions:
  ! SELECT * FROM Posts                7 min 57 sec
    WHERE title RLIKE '[[:<:]]performance[[:>:]]'
    !   OR body RLIKE '[[:<:]]performance[[:>:]]'
    !   OR tags RLIKE '[[:<:]]performance[[:>:]]';




                                           www.percona.com
Why so slow?
CREATE TABLE TelephoneBook (
  ! FullName VARCHAR(50));

CREATE INDEX name_idx ON TelephoneBook
  ! (FullName);

INSERT INTO TelephoneBook VALUES
  ! ('Riddle, Thomas'),
  ! ('Thomas, Dean');



                               www.percona.com
Why so slow?
• Search for all with last name “Thomas”
                                                uses index
    SELECT * FROM telephone_book
     WHERE full_name LIKE 'Thomas%'


• Search for all with first name “Thomas”
    SELECT * FROM telephone_book
     WHERE full_name LIKE '%Thomas'

                                       can’t use index



                                            www.percona.com
Because:




    B-Tree indexes can’t
☞   search for substrings



                      www.percona.com
• FULLTEXT in MyISAM
• FULLTEXT in InnoDB
• Apache Solr
• Sphinx Search
• Trigraphs
FULLTEXT
in MyISAM
FULLTEXT Index with MyISAM
• Special index type for MyISAM
• Integrated with SQL queries
• Indexes always in sync with data
• Balances features vs. speed vs. space




                                          www.percona.com
Insert Data into Index (MyISAM)

mysql> INSERT INTO Posts
 SELECT * FROM PostsSource;


          time: 33 min, 34 sec




                                 www.percona.com
Build Index on Data (MyISAM)

mysql> CREATE FULLTEXT INDEX PostText
 ! ON Posts(title, body, tags);


          time: 31 min, 18 sec




                                 www.percona.com
Querying

SELECT * FROM Posts
 WHERE MATCH(column(s))
 AGAINST('query pattern');
                      must include all columns
                      of your index, in the
                      order you defined




                                www.percona.com
Natural Language Mode (MyISAM)

• Searches concepts with free text queries:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags )
    AGAINST('mysql performance'
    IN NATURAL LANGUAGE MODE)
    LIMIT 100;

                     time with index:
                     200 milliseconds

                                         www.percona.com
Query Profile:
     Natural Language Mode (MyISAM)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000068 |
| checking permissions    | 0.000006 |
| Opening tables          | 0.000017 |
| init                    | 0.000032 |
| System lock             | 0.000007 |
| optimizing              | 0.000007 |
| statistics              | 0.000018 |
| preparing               | 0.000006 |
| FULLTEXT initialization | 0.198358 |
| executing               | 0.000012 |
| Sending data            | 0.001921 |
| end                     | 0.000005 |
| query end               | 0.000003 |
| closing tables          | 0.000018 |
| freeing items           | 0.000341 |
| cleaning up             | 0.000012 |
+-------------------------+----------+
                                         www.percona.com
Boolean Mode (MyISAM)

• Searches words using mini-language:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
    AGAINST('+mysql +performance'
    IN BOOLEAN MODE)
    LIMIT 100;
             time with index:
             16 milliseconds



                                        www.percona.com
Query Profile:
            Boolean Mode (MyISAM)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000031 |
| checking permissions    | 0.000003 |
| Opening tables          | 0.000008 |
| init                    | 0.000017 |
| System lock             | 0.000004 |
| optimizing              | 0.000004 |
| statistics              | 0.000008 |
| preparing               | 0.000003 |
| FULLTEXT initialization | 0.000008 |
| executing               | 0.000001 |
| Sending data            | 0.015703 |
| end                     | 0.000004 |
| query end               | 0.000002 |
| closing tables          | 0.000007 |
| freeing items           | 0.000381 |
| cleaning up             | 0.000007 |
+-------------------------+----------+
                                         www.percona.com
FULLTEXT
in InnoDB
FULLTEXT Index with InnoDB
• Under development in MySQL 5.6
  • I’m testing 5.6.6 m1
• Usage very similar to FULLTEXT in MyISAM
• Integrated with SQL queries
• Indexes always* in sync with data
• Read the blogs for more details:
  •   http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/
  •   http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/
  •   http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/
  •   http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/


                                                                             www.percona.com
Insert Data into Index (InnoDB)

mysql> INSERT INTO Posts
 SELECT * FROM PostsSource;


          time: 55 min 46 sec




                                www.percona.com
Build Index on Data (InnoDB)
• Still under development; you might see problems:
mysql> CREATE FULLTEXT INDEX PostText
 ! ON Posts(title, body, tags);
ERROR 2013 (HY000): Lost connection to
 MySQL server during query




                                        www.percona.com
Build Index on Data (InnoDB)
• Solution: make sure you define a primary key
   column `FTS_DOC_ID` explicitly:
mysql> ALTER TABLE Posts
 CHANGE COLUMN PostId
 `FTS_DOC_ID` BIGINT UNSIGNED;
mysql> CREATE FULLTEXT INDEX PostText
 ! ON Posts(title, body, tags);


                   time: 25 min 27 sec

                                         www.percona.com
Natural Language Mode (InnoDB)

• Searches concepts with free text queries:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
    AGAINST('mysql performance'
    IN NATURAL LANGUAGE MODE)
    LIMIT 100;

                     time with index:
                     740 milliseconds

                                         www.percona.com
Query Profile:
     Natural Language Mode (InnoDB)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000074 |
| checking permissions    | 0.000007 |
| Opening tables          | 0.000020 |
| init                    | 0.000034 |
| System lock             | 0.000007 |
| optimizing              | 0.000009 |
| statistics              | 0.000020 |
| preparing               | 0.000008 |
| FULLTEXT initialization | 0.577257 |
| executing               | 0.000013 |
| Sending data            | 0.106279 |
| end                     | 0.000018 |
| query end               | 0.000012 |
| closing tables          | 0.000018 |
| freeing items           | 0.055584 |
| cleaning up             | 0.000039 |
+-------------------------+----------+
                                         www.percona.com
Boolean Mode (InnoDB)

• Searches words using mini-language:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
    AGAINST('+mysql +performance'
    IN BOOLEAN MODE)
    LIMIT 100;
             time with index:
             350 milliseconds



                                        www.percona.com
Query Profile:
             Boolean Mode (InnoDB)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000064 |
| checking permissions    | 0.000005 |
| Opening tables          | 0.000017 |
| init                    | 0.000047 |
| System lock             | 0.000007 |
| optimizing              | 0.000009 |
| statistics              | 0.000019 |
| preparing               | 0.000008 |
| FULLTEXT initialization | 0.347172 |
| executing               | 0.000014 |
| Sending data            | 0.008089 |
| end                     | 0.000011 |
| query end               | 0.000012 |
| closing tables          | 0.000015 |
| freeing items           | 0.001570 |
| cleaning up             | 0.000023 |
+-------------------------+----------+
                                         www.percona.com
Apache Solr
Apache Solr
• http://lucene.apache.org/solr/
• Formerly known as Lucene, started 2001
• Apache License
• Java implementation
• Web service architecture
• Many sophisticated search features



                                     www.percona.com
DataImportHandler
• conf/solrconfig.xml:
. . .
<requestHandler name="/dataimport"
 class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
. . .




                                                  www.percona.com
DataImportHandler
• conf/data-config.xml:
 <dataConfig>
   <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/testpattern?useUnicode=true"
                batchSize="-1"
                user="xxxx"
                password="xxxx"/>
   <document>
     <entity name="id"
              query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts">
     </entity>
   </document>
 </dataConfig>                          extremely important
                                       to avoid buffering the
                                        whole query result!


                                                           www.percona.com
DataImportHandler
•conf/schema.xml:
. . .
<fields>
  <field name="PostId" type="string" indexed="true" stored="true" required="true" />
   <field name="ParentId" type="string" indexed="true" stored="true" required="false" />
   <field name="Title" type="text_general" indexed="false" stored="false"
   required="false" />
   <field name="Body" type="text_general" indexed="false" stored="false" required="false" /
   >
   <field name="Tags" type="text_general" indexed="false" stored="false" required="false" /
   >

   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/
   >
<fields>

<uniqueKey>PostId</uniqueKey>
<defaultSearchField>text</defaultSearchField>

<copyField source="Title" dest="text"/>
<copyField source="Body" dest="text"/>
<copyField source="Tags" dest="text"/>
. . .

                                                                  www.percona.com
Insert Data into Index (Solr)

• http://localhost:8983/solr/dataimport?
   command=full-import


              time: 14 min 28 sec




                                           www.percona.com
Searching Solr

• http://localhost:8983/solr/select/?q=mysql+AND
   +performance


              time: 79ms

           Query results are cached (like MySQL Query Cache),
           so they return much faster on subsequent execution




                                                        www.percona.com
Sphinx Search
Sphinx Search
• http://sphinxsearch.com/
• Started in 2001
• GPLv2 license
• C++ implementation
• SphinxSE storage engine for MySQL
• Supports MySQL protocol, SQL-like queries
• Many sophisticated search features


                                     www.percona.com
sphinx.conf
source src1
{
! type = mysql
! sql_host = localhost
! sql_user = xxxx
! sql_pass = xxxx
! sql_db = testpattern
! sql_query = SELECT PostId, ParentId, Title,
! ! Body, Tags FROM Posts
! sql_query_info = SELECT * FROM Posts 
! ! WHERE PostId=$id
}



                                   www.percona.com
sphinx.conf
! index test1
  {
  ! source = src1
  ! path = C:Sphinxdata
  }




                             www.percona.com
Insert Data into Index (Sphinx)
!   C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1

    Sphinx 2.0.5-release (r3309)

    using config file 'sphinx.conf'...
    indexing index 'test1'...
    collected 7397507 docs, 5731.8 MB         time: 8 min 20 sec
    sorted 920.3 Mhits, 100.0% done
    total 7397507 docs, 5731776959 bytes
    total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec
    total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3
      msec/call avg
    total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/
      call avg

    Execution time: 500.196 s



                                                  www.percona.com
Querying index
$ mysql --port 9306

Server version: 2.0.5-release (r3309)

mysql> SELECT * FROM test1 WHERE MATCH('mysql performance');

+---------+--------+
| id      | weight |
+---------+--------+
| 6016856 |   6600 |
| 4207641 |   6595 |
| 2656325 |   6593 |
| 7192928 |   5605 |
| 8118235 |   5598 |
. . .
20 rows in set (0.02 sec)



                                                www.percona.com
Querying index
mysql> SHOW META;

+---------------+-------------+
| Variable_name | Value       |
+---------------+-------------+
| total         | 1000        |
| total_found   | 7672        |
| time          | 0.013       |
| keyword[0]    | mysql       |
| docs[0]       | 162287      |   time: 13ms
| hits[0]       | 363694      |
| keyword[1]    | performance |
| docs[1]       | 147249      |
| hits[1]       | 210895      |
+---------------+-------------+




                                           www.percona.com
Trigraphs
Trigraphs Overview
• Not very fast, but still better than LIKE / RLIKE
• Generic, portable SQL solution
• No dependency on version, storage engine, third-
   party technology




                                        www.percona.com
Three-Letter Sequences
! CREATE TABLE AtoZ (
  !   c! ! CHAR(1),
  !   PRIMARY KEY (c));
! INSERT INTO AtoZ (c)
  VALUES ('a'), ('b'), ('c'), ...
! CREATE TABLE Trigraphs (
  !   Tri!
         !    CHAR(3),
  !   PRIMARY KEY (Tri));
! INSERT INTO Trigraphs (Tri)
  SELECT CONCAT(t1.c, t2.c, t3.c)
  FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3;




                                            www.percona.com
Insert Data Into Index
my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr;
$sth->execute() or die $dbh1->errstr;
$dbh2->begin_work;
my $i = 0;
while (my $row = $sth->fetchrow_hashref ) {
  my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags})));
  my %tri;
  map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g ));
  next unless %tri;
  my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri));
  my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES
 $tuple_list";
  $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr;
  if (++$i % 1000 == 0) {
    print ".";
    $dbh2->commit;
    $dbh2->begin_work;
                                            time: 116 min 50 sec
}
  }
                                            space: 16.2GiB
print ".n";
$dbh2->commit;                              rows: 519 million

                                                          www.percona.com
Indexed Lookups
! SELECT p.*                        time: 46 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND t1.Tri = 'mys'
  !




                                      www.percona.com
Search Among Fewer Matches
! SELECT p.*                        time: 19 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND t2.Tri = 'per'




                                      www.percona.com
Search Among Fewer Matches
! SELECT p.*                        time: 22 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
  ! t3.PostId = p.PostId AND t3.Tri = 'for'




                                      www.percona.com
Search Among Fewer Matches
! SELECT p.*                          time: 13.6 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND   t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND   t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
  ! t3.PostId = p.PostId AND   t3.Tri = 'for'
  JOIN PostsTrigraph t4 ON
  ! t4.PostId = p.PostId AND   t4.Tri = 'man'




                                      www.percona.com
Narrow Down Further
! SELECT p.*                                    time: 13.8 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND          t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND          t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
  ! t3.PostId = p.PostId AND          t3.Tri = 'for'
  JOIN PostsTrigraph t4 ON
  ! t4.PostId = p.PostId AND          t4.Tri = 'man'
 WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%'
 !   AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%';




                                                www.percona.com
And the winner is...




     Jarrett Campbell
     http://www.flickr.com/people/77744839@N00
Time to Insert Data into Index

LIKE expression   n/a

FULLTEXT MyISAM 33 min, 34 sec

FULLTEXT InnoDB   55 min, 46 sec

Apache Solr       14 min, 28 sec

Sphinx Search     8 min, 20 sec

Trigraphs         116 min, 50 sec
                                    www.percona.com
Insert Data into Index (sec)
8000


6000


4000


2000


   0
       LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                          www.percona.com
Time to Build Index on Data

LIKE expression   n/a

FULLTEXT MyISAM 31 min, 18 sec

FULLTEXT InnoDB   25 min, 27 sec

Apache Solr       n/a

Sphinx Search     n/a

Trigraphs         n/a
                                   www.percona.com
Build Index on Data (sec)
4000


3000


2000


1000
                               n/a     n/a         n/a

   0
        LIKE   MyISAM InnoDB   Solr   Sphinx    Trigraph

                                             www.percona.com
Index Storage

LIKE expression     n/a

FULLTEXT MyISAM 2382 MiB

FULLTEXT InnoDB     ? MiB

Apache Solr         2766 MiB

Sphinx Search       3355 MiB

Trigraphs           16589 MiB
                                  www.percona.com
Index Storage (MiB)
20000


15000


10000


 5000


    0
        LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                           www.percona.com
Query Speed

LIKE expression    49,000ms - 399,000ms

FULLTEXT MyISAM 16-200ms

FULLTEXT InnoDB    350-740ms

Apache Solr        79ms

Sphinx Search      13ms

Trigraphs          13800ms
                                          www.percona.com
Query Speed (ms)
400000
350000
300000
250000
200000
150000
100000
 50000
     0
         LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                            www.percona.com
1000


          Query Speed (ms)
 750




 500




 250




   0
       LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                          www.percona.com
Bottom Line
                    build   insert   storage     query      solution

                                                49k-399k
LIKE expression       0       0         0
                                                   ms
                                                              SQL



FULLTEXT MyISAM     31:18   33:28    2382MiB    16-200ms     MySQL



FULLTEXT InnoDB     25:27   55:46       ?       350-740ms   MySQL 5.6



Apache Solr          n/a    14:28    2766MiB      79ms        Java



Sphinx Search        n/a     8:20    3487MiB      13ms        C++



Trigraphs            n/a    116:50   16.2 GiB   13,800ms      SQL


                                                 www.percona.com
Final Thoughts
• Third-party search engines are complex to keep in
   sync with data, and adding another type of
   server adds more operations work for you.
• Built-in FULLTEXT indexes are therefore useful
    even if they are not absolutely the fastest.
• Different search implementations may return
    different results, so you should evaluate what
    works best for your project.
• Any indexed search solution is orders of
    magnitude better than LIKE!

                                          www.percona.com
New York, October 1-2, 2012
London, December 3-4, 2012
Santa Clara, April 22-25, 2013




                  www.percona.com/live
Expert instructors
        In-person training
      Custom onsite training
       Live virtual training
http://www.percona.com/training

                               www.percona.com
http://www.pragprog.com/titles/bksqla/
                                www.percona.com
Copyright 2012 Bill Karwin
           www.slideshare.net/billkarwin

          Released under a Creative Commons 3.0 License:
           http://creativecommons.org/licenses/by-nc-nd/3.0/
              You are free to share - to copy, distribute and
             transmit this work, under the following conditions:




   Attribution.               Noncommercial.          No Derivative Works.
You must attribute this   You may not use this work       You may not alter,
 work to Bill Karwin.      for commercial purposes.      transform, or build
                                                           upon this work.

                                                               www.percona.com

Contenu connexe

En vedette

Hierarchical data models in Relational Databases
Hierarchical data models in Relational DatabasesHierarchical data models in Relational Databases
Hierarchical data models in Relational Databasesnavicorevn
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)Jamey Hanson
 
Indexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuningIndexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuningOSSCube
 
Inno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structureInno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structurezhaolinjnu
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Facettensuche mit Lucene und Solr
Facettensuche mit Lucene und SolrFacettensuche mit Lucene und Solr
Facettensuche mit Lucene und SolrThomas Koch
 
Efficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String CollectionsEfficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String Collectionsrvernica
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEcommerce Solution Provider SysIQ
 

En vedette (19)

Extensible Data Modeling
Extensible Data ModelingExtensible Data Modeling
Extensible Data Modeling
 
Sql query patterns, optimized
Sql query patterns, optimizedSql query patterns, optimized
Sql query patterns, optimized
 
SQL Outer Joins for Fun and Profit
SQL Outer Joins for Fun and ProfitSQL Outer Joins for Fun and Profit
SQL Outer Joins for Fun and Profit
 
Sql Antipatterns Strike Back
Sql Antipatterns Strike BackSql Antipatterns Strike Back
Sql Antipatterns Strike Back
 
How to Design Indexes, Really
How to Design Indexes, ReallyHow to Design Indexes, Really
How to Design Indexes, Really
 
Sql Injection Myths and Fallacies
Sql Injection Myths and FallaciesSql Injection Myths and Fallacies
Sql Injection Myths and Fallacies
 
Practical Object Oriented Models In Sql
Practical Object Oriented Models In SqlPractical Object Oriented Models In Sql
Practical Object Oriented Models In Sql
 
Models for hierarchical data
Models for hierarchical dataModels for hierarchical data
Models for hierarchical data
 
Hierarchical data models in Relational Databases
Hierarchical data models in Relational DatabasesHierarchical data models in Relational Databases
Hierarchical data models in Relational Databases
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
 
Indexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuningIndexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuning
 
Inno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structureInno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structure
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Facettensuche mit Lucene und Solr
Facettensuche mit Lucene und SolrFacettensuche mit Lucene und Solr
Facettensuche mit Lucene und Solr
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Efficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String CollectionsEfficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String Collections
 
Trees and Hierarchies in SQL
Trees and Hierarchies in SQLTrees and Hierarchies in SQL
Trees and Hierarchies in SQL
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
 

Dernier

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Dernier (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Full Text Search Throwdown

  • 1. Full Text Search Throwdown Bill Karwin, Percona Inc.
  • 2. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. http://www.flickr.com/photos/tryingyouth/
  • 3. StackOverflow Test Data • Data dump, exported December 2011 • 7.4 million Posts = 8.18 GB www.percona.com
  • 4. StackOverflow ER diagram searchable text www.percona.com
  • 6. Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinsky www.percona.com
  • 7. Accuracy issue • Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.: SELECT * FROM Posts WHERE Body LIKE '%one%' • Regular expressions in MySQL support escapes for word boundaries: SELECT * FROM Posts WHERE Body RLIKE '[[:<:]]one[[:>:]]' www.percona.com
  • 8. Performance issue • LIKE with wildcards: ! SELECT * FROM Posts 49 sec WHERE title LIKE '%performance%' ! OR body LIKE '%performance%' ! OR tags LIKE '%performance%'; • POSIX regular expressions: ! SELECT * FROM Posts 7 min 57 sec WHERE title RLIKE '[[:<:]]performance[[:>:]]' ! OR body RLIKE '[[:<:]]performance[[:>:]]' ! OR tags RLIKE '[[:<:]]performance[[:>:]]'; www.percona.com
  • 9. Why so slow? CREATE TABLE TelephoneBook ( ! FullName VARCHAR(50)); CREATE INDEX name_idx ON TelephoneBook ! (FullName); INSERT INTO TelephoneBook VALUES ! ('Riddle, Thomas'), ! ('Thomas, Dean'); www.percona.com
  • 10. Why so slow? • Search for all with last name “Thomas” uses index SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%' • Search for all with first name “Thomas” SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas' can’t use index www.percona.com
  • 11. Because: B-Tree indexes can’t ☞ search for substrings www.percona.com
  • 12. • FULLTEXT in MyISAM • FULLTEXT in InnoDB • Apache Solr • Sphinx Search • Trigraphs
  • 14. FULLTEXT Index with MyISAM • Special index type for MyISAM • Integrated with SQL queries • Indexes always in sync with data • Balances features vs. speed vs. space www.percona.com
  • 15. Insert Data into Index (MyISAM) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 33 min, 34 sec www.percona.com
  • 16. Build Index on Data (MyISAM) mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 31 min, 18 sec www.percona.com
  • 17. Querying SELECT * FROM Posts WHERE MATCH(column(s)) AGAINST('query pattern'); must include all columns of your index, in the order you defined www.percona.com
  • 18. Natural Language Mode (MyISAM) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags ) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 200 milliseconds www.percona.com
  • 19. Query Profile: Natural Language Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000068 | | checking permissions | 0.000006 | | Opening tables | 0.000017 | | init | 0.000032 | | System lock | 0.000007 | | optimizing | 0.000007 | | statistics | 0.000018 | | preparing | 0.000006 | | FULLTEXT initialization | 0.198358 | | executing | 0.000012 | | Sending data | 0.001921 | | end | 0.000005 | | query end | 0.000003 | | closing tables | 0.000018 | | freeing items | 0.000341 | | cleaning up | 0.000012 | +-------------------------+----------+ www.percona.com
  • 20. Boolean Mode (MyISAM) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 16 milliseconds www.percona.com
  • 21. Query Profile: Boolean Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000031 | | checking permissions | 0.000003 | | Opening tables | 0.000008 | | init | 0.000017 | | System lock | 0.000004 | | optimizing | 0.000004 | | statistics | 0.000008 | | preparing | 0.000003 | | FULLTEXT initialization | 0.000008 | | executing | 0.000001 | | Sending data | 0.015703 | | end | 0.000004 | | query end | 0.000002 | | closing tables | 0.000007 | | freeing items | 0.000381 | | cleaning up | 0.000007 | +-------------------------+----------+ www.percona.com
  • 23. FULLTEXT Index with InnoDB • Under development in MySQL 5.6 • I’m testing 5.6.6 m1 • Usage very similar to FULLTEXT in MyISAM • Integrated with SQL queries • Indexes always* in sync with data • Read the blogs for more details: • http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/ • http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/ • http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/ • http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/ www.percona.com
  • 24. Insert Data into Index (InnoDB) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 55 min 46 sec www.percona.com
  • 25. Build Index on Data (InnoDB) • Still under development; you might see problems: mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); ERROR 2013 (HY000): Lost connection to MySQL server during query www.percona.com
  • 26. Build Index on Data (InnoDB) • Solution: make sure you define a primary key column `FTS_DOC_ID` explicitly: mysql> ALTER TABLE Posts CHANGE COLUMN PostId `FTS_DOC_ID` BIGINT UNSIGNED; mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 25 min 27 sec www.percona.com
  • 27. Natural Language Mode (InnoDB) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 740 milliseconds www.percona.com
  • 28. Query Profile: Natural Language Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000074 | | checking permissions | 0.000007 | | Opening tables | 0.000020 | | init | 0.000034 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000020 | | preparing | 0.000008 | | FULLTEXT initialization | 0.577257 | | executing | 0.000013 | | Sending data | 0.106279 | | end | 0.000018 | | query end | 0.000012 | | closing tables | 0.000018 | | freeing items | 0.055584 | | cleaning up | 0.000039 | +-------------------------+----------+ www.percona.com
  • 29. Boolean Mode (InnoDB) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 350 milliseconds www.percona.com
  • 30. Query Profile: Boolean Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000064 | | checking permissions | 0.000005 | | Opening tables | 0.000017 | | init | 0.000047 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000019 | | preparing | 0.000008 | | FULLTEXT initialization | 0.347172 | | executing | 0.000014 | | Sending data | 0.008089 | | end | 0.000011 | | query end | 0.000012 | | closing tables | 0.000015 | | freeing items | 0.001570 | | cleaning up | 0.000023 | +-------------------------+----------+ www.percona.com
  • 32. Apache Solr • http://lucene.apache.org/solr/ • Formerly known as Lucene, started 2001 • Apache License • Java implementation • Web service architecture • Many sophisticated search features www.percona.com
  • 33. DataImportHandler • conf/solrconfig.xml: . . . <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-config.xml</str> </lst> </requestHandler> . . . www.percona.com
  • 34. DataImportHandler • conf/data-config.xml: <dataConfig> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/testpattern?useUnicode=true" batchSize="-1" user="xxxx" password="xxxx"/> <document> <entity name="id" query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts"> </entity> </document> </dataConfig> extremely important to avoid buffering the whole query result! www.percona.com
  • 35. DataImportHandler •conf/schema.xml: . . . <fields> <field name="PostId" type="string" indexed="true" stored="true" required="true" /> <field name="ParentId" type="string" indexed="true" stored="true" required="false" /> <field name="Title" type="text_general" indexed="false" stored="false" required="false" /> <field name="Body" type="text_general" indexed="false" stored="false" required="false" / > <field name="Tags" type="text_general" indexed="false" stored="false" required="false" / > <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/ > <fields> <uniqueKey>PostId</uniqueKey> <defaultSearchField>text</defaultSearchField> <copyField source="Title" dest="text"/> <copyField source="Body" dest="text"/> <copyField source="Tags" dest="text"/> . . . www.percona.com
  • 36. Insert Data into Index (Solr) • http://localhost:8983/solr/dataimport? command=full-import time: 14 min 28 sec www.percona.com
  • 37. Searching Solr • http://localhost:8983/solr/select/?q=mysql+AND +performance time: 79ms Query results are cached (like MySQL Query Cache), so they return much faster on subsequent execution www.percona.com
  • 39. Sphinx Search • http://sphinxsearch.com/ • Started in 2001 • GPLv2 license • C++ implementation • SphinxSE storage engine for MySQL • Supports MySQL protocol, SQL-like queries • Many sophisticated search features www.percona.com
  • 40. sphinx.conf source src1 { ! type = mysql ! sql_host = localhost ! sql_user = xxxx ! sql_pass = xxxx ! sql_db = testpattern ! sql_query = SELECT PostId, ParentId, Title, ! ! Body, Tags FROM Posts ! sql_query_info = SELECT * FROM Posts ! ! WHERE PostId=$id } www.percona.com
  • 41. sphinx.conf ! index test1 { ! source = src1 ! path = C:Sphinxdata } www.percona.com
  • 42. Insert Data into Index (Sphinx) ! C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1 Sphinx 2.0.5-release (r3309) using config file 'sphinx.conf'... indexing index 'test1'... collected 7397507 docs, 5731.8 MB time: 8 min 20 sec sorted 920.3 Mhits, 100.0% done total 7397507 docs, 5731776959 bytes total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3 msec/call avg total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/ call avg Execution time: 500.196 s www.percona.com
  • 43. Querying index $ mysql --port 9306 Server version: 2.0.5-release (r3309) mysql> SELECT * FROM test1 WHERE MATCH('mysql performance'); +---------+--------+ | id | weight | +---------+--------+ | 6016856 | 6600 | | 4207641 | 6595 | | 2656325 | 6593 | | 7192928 | 5605 | | 8118235 | 5598 | . . . 20 rows in set (0.02 sec) www.percona.com
  • 44. Querying index mysql> SHOW META; +---------------+-------------+ | Variable_name | Value | +---------------+-------------+ | total | 1000 | | total_found | 7672 | | time | 0.013 | | keyword[0] | mysql | | docs[0] | 162287 | time: 13ms | hits[0] | 363694 | | keyword[1] | performance | | docs[1] | 147249 | | hits[1] | 210895 | +---------------+-------------+ www.percona.com
  • 46. Trigraphs Overview • Not very fast, but still better than LIKE / RLIKE • Generic, portable SQL solution • No dependency on version, storage engine, third- party technology www.percona.com
  • 47. Three-Letter Sequences ! CREATE TABLE AtoZ ( ! c! ! CHAR(1), ! PRIMARY KEY (c)); ! INSERT INTO AtoZ (c) VALUES ('a'), ('b'), ('c'), ... ! CREATE TABLE Trigraphs ( ! Tri! ! CHAR(3), ! PRIMARY KEY (Tri)); ! INSERT INTO Trigraphs (Tri) SELECT CONCAT(t1.c, t2.c, t3.c) FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3; www.percona.com
  • 48. Insert Data Into Index my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr; $sth->execute() or die $dbh1->errstr; $dbh2->begin_work; my $i = 0; while (my $row = $sth->fetchrow_hashref ) { my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags}))); my %tri; map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g )); next unless %tri; my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri)); my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list"; $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr; if (++$i % 1000 == 0) { print "."; $dbh2->commit; $dbh2->begin_work; time: 116 min 50 sec } } space: 16.2GiB print ".n"; $dbh2->commit; rows: 519 million www.percona.com
  • 49. Indexed Lookups ! SELECT p.* time: 46 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' ! www.percona.com
  • 50. Search Among Fewer Matches ! SELECT p.* time: 19 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' www.percona.com
  • 51. Search Among Fewer Matches ! SELECT p.* time: 22 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' www.percona.com
  • 52. Search Among Fewer Matches ! SELECT p.* time: 13.6 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' www.percona.com
  • 53. Narrow Down Further ! SELECT p.* time: 13.8 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%' ! AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%'; www.percona.com
  • 54. And the winner is... Jarrett Campbell http://www.flickr.com/people/77744839@N00
  • 55. Time to Insert Data into Index LIKE expression n/a FULLTEXT MyISAM 33 min, 34 sec FULLTEXT InnoDB 55 min, 46 sec Apache Solr 14 min, 28 sec Sphinx Search 8 min, 20 sec Trigraphs 116 min, 50 sec www.percona.com
  • 56. Insert Data into Index (sec) 8000 6000 4000 2000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 57. Time to Build Index on Data LIKE expression n/a FULLTEXT MyISAM 31 min, 18 sec FULLTEXT InnoDB 25 min, 27 sec Apache Solr n/a Sphinx Search n/a Trigraphs n/a www.percona.com
  • 58. Build Index on Data (sec) 4000 3000 2000 1000 n/a n/a n/a 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 59. Index Storage LIKE expression n/a FULLTEXT MyISAM 2382 MiB FULLTEXT InnoDB ? MiB Apache Solr 2766 MiB Sphinx Search 3355 MiB Trigraphs 16589 MiB www.percona.com
  • 60. Index Storage (MiB) 20000 15000 10000 5000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 61. Query Speed LIKE expression 49,000ms - 399,000ms FULLTEXT MyISAM 16-200ms FULLTEXT InnoDB 350-740ms Apache Solr 79ms Sphinx Search 13ms Trigraphs 13800ms www.percona.com
  • 62. Query Speed (ms) 400000 350000 300000 250000 200000 150000 100000 50000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 63. 1000 Query Speed (ms) 750 500 250 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 64. Bottom Line build insert storage query solution 49k-399k LIKE expression 0 0 0 ms SQL FULLTEXT MyISAM 31:18 33:28 2382MiB 16-200ms MySQL FULLTEXT InnoDB 25:27 55:46 ? 350-740ms MySQL 5.6 Apache Solr n/a 14:28 2766MiB 79ms Java Sphinx Search n/a 8:20 3487MiB 13ms C++ Trigraphs n/a 116:50 16.2 GiB 13,800ms SQL www.percona.com
  • 65. Final Thoughts • Third-party search engines are complex to keep in sync with data, and adding another type of server adds more operations work for you. • Built-in FULLTEXT indexes are therefore useful even if they are not absolutely the fastest. • Different search implementations may return different results, so you should evaluate what works best for your project. • Any indexed search solution is orders of magnitude better than LIKE! www.percona.com
  • 66. New York, October 1-2, 2012 London, December 3-4, 2012 Santa Clara, April 22-25, 2013 www.percona.com/live
  • 67. Expert instructors In-person training Custom onsite training Live virtual training http://www.percona.com/training www.percona.com
  • 69. Copyright 2012 Bill Karwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions: Attribution. Noncommercial. No Derivative Works. You must attribute this You may not use this work You may not alter, work to Bill Karwin. for commercial purposes. transform, or build upon this work. www.percona.com