Sphinx full-text search engine
                 November 15, 2008
                 OpenSQL Camp


                 Piotr Biel, Percona Inc
                 Andrew Aksyonoff, Sphinx Technologies
                 Peter Zaitsev, Percona Inc
Full Text Search

Full Text Search: a technique for searching indexed
  documents by examining all keywords in each stored
  document and matching them against the keywords supplied in
  the query.
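
As a minimal sketch of this definition, assuming a simple regex tokenizer and AND semantics between query keywords (the names `tokenize`, `build_index` and `search` are illustrative and not part of Sphinx or MySQL):

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every keyword of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "hello world",
    2: "hello world reloaded",
    3: "sample program",
}
index = build_index(docs)
print(sorted(search(index, "hello world")))
```

A real engine adds ranking, stemming and on-disk index structures on top of this basic keyword-to-documents mapping.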
External or local?
Local:
• Native, easy to implement
• No need to change the environment, which is often
  problematic on hosted services
External:
• Independent of storage system (MySQL, PostgreSQL,
  Oracle, XML files…)
• Works with all storage engines (MyISAM, InnoDB, Falcon)
• Ideal for minimising load on databases

Why Sphinx?
• Great indexing / searching speed
• Scalability
• Better resource utilization than traditional, native FTS
  under many concurrent searches
• Indexes not only text
       – Numerical attributes can be used to filter results
• Can work as an SQL replacement
       – sorting, grouping
       – can be more efficient than MySQL by an order of magnitude on large sets
• Online index rebuilds
• Good choice of matching modes and operators
MySQL FTS vs Sphinx
MySQL
• Available only for MyISAM tables
• Slow on moderately sized (1 GB+) collections
• Limited querying capabilities
• Can't index BLOBs
Sphinx
• Available for all table types
• Scales well
• Advanced querying capabilities
• Ability to index all data types

Speed
     Wikipedia database used for tests: 2.5M rows, 15 GB of text.
     Times are in seconds to complete.
     Creating index (just on title):
       MySQL FTS: 273.7 s
       Sphinx: 75.1 s


Speed
     Wikipedia database used for tests: 2.5M rows, 15 GB of text.
     Times are in seconds to complete.
     Creating index (on title and content):
       MySQL: did not finish (+inf)
       Sphinx: 1159.3 s
     Unfortunately, we had to kill MySQL
     after 6+ hours (21600 s) of indexing.




Speed
     Sorting speed
     • MySQL: SELECT id FROM test1 ORDER BY touched DESC LIMIT 1000
     • Sphinx: php test.php -i test1 -s touched,desc

     Results:
       MySQL: 1.87 s
       Sphinx: 0.87 s



Speed
     Grouping speed
     • MySQL: SELECT flet, COUNT(*) AS q FROM test1 GROUP BY flet ORDER BY q DESC
     • Sphinx: php test.php -i test1 -g flet -gs @count,desc

     Results:
       MySQL: 1.10 s
       Sphinx: 0.33 s



Speed (methodology)
     Sorting and grouping speed: what do we measure?
     • We create a trimmed-down MySQL table
     • We benchmark it against a similar Sphinx index
     • We benchmark full scan + ORDER BY + GROUP BY
     • A covering index, of course, makes MySQL a lot faster?
                   – WRONG, it does not. Still got 1.87 sec on sorting
                   – WRONG, it hurts! Got 1.74 sec on grouping (vs 1.10 sec without it)
     • The original (untrimmed) data, of course, makes MySQL a lot slower
                   – 66.0+ sec for both sorting and grouping, IO bound

Speed
     FTS speed: 2000 queries running in 8 threads
     Against just 100K rows of Wiki for MySQL (to save time)
     Against the complete 2500K rows for Sphinx
       MySQL: 86.56 s (on 25x fewer rows!)
       Sphinx: 25.66 s




Speed
     FTS speed: 2000 queries running in 8 threads
     Against just 100K rows of Wiki for MySQL (to save time)
     Against the complete 2500K rows for Sphinx
       MySQL: 23.1 qps (on 25x fewer rows)
       Sphinx: 77.95 qps
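
The throughput figures follow from the batch size and the elapsed times reported on the previous slide. A small sketch of that arithmetic is shown here; the linear scaling by row count is only an assumption made to illustrate the "/25" annotation, not a measured result:

```python
QUERIES = 2000           # batch size stated on the slide
MYSQL_SECONDS = 86.56    # measured against 100K rows
SPHINX_SECONDS = 25.66   # measured against 2500K rows
ROW_RATIO = 2500 / 100   # Sphinx searched 25x more rows

def qps(queries, seconds):
    """Queries per second for a batch that took `seconds` to complete."""
    return queries / seconds

mysql_qps = qps(QUERIES, MYSQL_SECONDS)
sphinx_qps = qps(QUERIES, SPHINX_SECONDS)

# ASSUMPTION: MySQL throughput degrades linearly with row count.
# This is a rough extrapolation for illustration only.
mysql_qps_scaled = mysql_qps / ROW_RATIO

print(f"MySQL:  {mysql_qps:.1f} qps on 100K rows")
print(f"Sphinx: {sphinx_qps:.1f} qps on 2500K rows")
print(f"MySQL extrapolated to 2500K rows: {mysql_qps_scaled:.2f} qps")
```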




Scalability
• Distributed searches
       – Many servers can be used
       – Many CPUs/cores within single server can be used
• Distributed indexes are fully transparent to end users
       – A virtual distributed index overlays many physical indexes, either
         local or remote
• Great utilization of local resources (CPUs)
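
The idea of a virtual index overlaying several physical ones can be sketched as a fan-out followed by a merge. This is a toy model only: the shards are in-memory dictionaries, the weight is a plain occurrence count, and `make_shard` and `distributed_search` are hypothetical names, not the searchd implementation:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def make_shard(docs):
    """Return a search function over one physical index (dict id -> text)."""
    def search(word):
        hits = []
        for doc_id, text in docs.items():
            # Toy weight: number of occurrences of the word in the document.
            count = text.lower().split().count(word)
            if count:
                hits.append((count, doc_id))
        return hits
    return search

def distributed_search(shards, word, limit=10):
    """Query every shard in parallel, then merge and keep the best matches."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard(word), shards))
    merged = [hit for part in partials for hit in part]
    # Highest weight first; ties are broken by the lower document id.
    return heapq.nsmallest(limit, merged, key=lambda hit: (-hit[0], hit[1]))

shards = [
    make_shard({1: "hello world", 2: "hello hello again"}),
    make_shard({3: "goodbye world", 4: "hello there"}),
]
print(distributed_search(shards, "hello"))
```

The caller sees a single result list and does not need to know how many physical indexes were consulted.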




How is it organized?
     searchd – standalone daemon running on the system,
     responsible for answering client queries
          • Filtering – WHERE analogue
          • Sorting – ORDER BY analogue
          • Grouping – GROUP BY analogue

     indexer – tool that builds indexes
          • Fetches documents and splits them into separate words
          • Processes the fetched results




Talking to Sphinx
• Using APIs
       – PHP, Perl, Python, Ruby, pure-C, C++, C#, Haskell...
• SphinxSE – MySQL storage engine dedicated to
  communicating with Sphinx
       – Useful when there's no native API port
       – Also lets you juggle huge datasets directly on the MySQL side,
         without fetching them to the client and then sending them back to MySQL




New features in Sphinx 0.9.9
Why?
• 0.9.9 is the current development version
• Beta will be published shortly (this weekend?)
• Let’s see what’s bleeding at the edge!

Overall
• 34 new features of different caliber
• 10 are major changes
• We had a hard time selecting the best N...
Top-10 new features
(1) added select-list feature with full expression support
       –   lets you choose which columns and expressions to fetch
       –   compute and fetch any number of arbitrary expressions
       –   computed columns can be used for filtering, sorting, grouping
       –   expressions are currently 2-4x slower than native (!) code:

              benchmarking expressions
              run 1: int-eval 49.7M/sec, flt-eval 46.3M/sec, native 129.7M/sec
              run 2: flt-eval 28.2M/sec, native 108.6M/sec
              run 3: flt-eval 269.2M/sec, native 309.9M/sec

       – further optimizations planned (JIT native code for expressions)

Top-10 new features
 (2) added support for arbitrary nesting of brackets and negations in the
   query language
        –    query parser was rewritten from scratch
        –    query still must be "computable"
        –    implicit lists of documents such as "foo|-bar" are not allowed
        –    they usually indicate a programming or querying mistake
             anyway…
 (3) added config reload on SIGHUP
        – lets you add and remove indexes on the fly
        – also, all index settings are now stored within the index
        – (much) fewer index/config inconsistency issues

Top-10 new features
 (4) added signed 64-bit attribute support (sql_attr_bigint
   directive)
        – support means full support
        – filtering, sorting, group-by, expressions: everything should
          work
 (5) added persistent connections, UNIX-socket, and multi-
   interface support (Open(), Close(), listen)
        – self-explanatory (less TCP pressure, more security, etc.)
        – persistent connections are not (yet?) used by master searchd
          instances when talking to remote agents, though
        – something to add for HP/HA... sponsors are welcome :)
Top-10 new features
 (6) added kill-list support
        –    new sql_query_killlist directive
        –    lets you eliminate "phantom results" from older indexes
        –    fetches a list of documents to remove from previous results
        –    example:
                main index
                  doc 1 title is "hello world"
                  doc 2 title is "hello world reloaded"
                  doc 3 title is "hello world revolutions"
                delta index:
                  doc 2 gets deleted
                  doc 3 title becomes "sample program"


Top-10 new features
 (kill-lists, continued)
        – without kill-lists, querying for "hello" will still return documents 2 and 3 from "main"
                • doc 2 could be suppressed by a runtime "deleted" flag update… ugly
                • doc 3 from "main" (!) could not be suppressed at all (phantom result)
        –    solution? kill-lists attached to the "delta" index
        –    the delta (!) kill-list contains ids 2 and 3
        –    how does that work?
                • the result set after searching "main" will be 1, 2, 3
                • then the delta kill-list removes 2 and 3, keeping only result 1
                • then the delta search adds new matches to the result set (if any)
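
The main/delta example from these two slides can be replayed in a few lines. This is a toy model, assuming in-memory dictionaries in place of real indexes and a hypothetical helper named `search_with_killlist`; it only mirrors the order of operations described above:

```python
def search(index, word):
    """Return ids of documents in `index` (dict id -> title) containing `word`."""
    return {doc_id for doc_id, title in index.items() if word in title.split()}

def search_with_killlist(main, delta, kill_list, word):
    """Search main, drop ids named in delta's kill-list, then add delta matches."""
    result = search(main, word)      # for "hello": 1, 2, 3
    result -= set(kill_list)         # the kill-list removes 2 and 3
    result |= search(delta, word)    # delta contributes its own matches, if any
    return result

main = {
    1: "hello world",
    2: "hello world reloaded",
    3: "hello world revolutions",
}
delta = {3: "sample program"}   # doc 2 was deleted, doc 3 was retitled
kill_list = [2, 3]              # ids that delta suppresses in older indexes

print(sorted(search_with_killlist(main, delta, kill_list, "hello")))
print(sorted(search_with_killlist(main, delta, kill_list, "sample")))
```

Searching for "hello" leaves only document 1, while searching for "sample" finds the new version of document 3 through the delta index.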

Top-10 new features
 (7) added MS SQL (aka SQL Server) source type support
 (8) added index_exact_words feature, and exact form
   operator
        – because sometimes you want to partially suppress stemming
        – query “your =business” will match “business of yours”
        – but not “are you busy” any more
 (9) added in-place inversion of .spa and .spp
   (inplace_enable, 1.5-2x less disk space for indexing)
 (10) improved excerpts speed (up to 50x faster!)
        – not exactly new, but a major improvement…
Other 24 features at a glance
 •    added min_stemming_len
 •    indexer-side column unpacker (unpack_mysqlcompress)
 •    builtin Czech stemmer (morphology=stem_cz)
 •    on-disk SPI support, trade RAM for IO (ondisk_dict)
 •    indexer now prints out IO stats
 •    HTML stripper now skips PIs (such as <?php ... ?>)
 •    IsConnectError() API call (API errors vs remote errors)
 •    int64 expressions and BIGINT() cast (lets you avoid 32-bit
      wraparounds when computing A*B)
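
The wraparound that a 64-bit cast avoids can be illustrated by masking a product to a fixed width. This sketch assumes unsigned 32-bit and signed 64-bit integer arithmetic; `wrap_uint32` and `wrap_int64` are hypothetical helpers, not Sphinx functions:

```python
def wrap_uint32(value):
    """Keep only the low 32 bits, as unsigned 32-bit arithmetic would."""
    return value & 0xFFFFFFFF

def wrap_int64(value):
    """Interpret the low 64 bits as a signed two's-complement integer."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - (1 << 64) if value >= (1 << 63) else value

a, b = 100_000, 100_000   # hypothetical attribute values A and B

print(wrap_uint32(a * b))  # 10**10 does not fit in 32 bits and wraps around
print(wrap_int64(a * b))   # the 64-bit result keeps the exact product
```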

Other 24 features at a glance
 • star-syntax support in BuildExcerpts()
 • IDIV(), NOW(), INTERVAL(), IN() functions
 • index-level early-reject based on filters
 • MVA updates feature (mva_updates_pool directive)
 • multiforms support (multiple source words can be
   mapped to single destination word)
 • removed legacy matching code, everything runs using
   new V2 engine now
 • field position limit (syntax: @title[50] hello world)
 • periodic .spa flush (attr_flush_period directive)

Other 24 features at a glance
 •    per-query attribute overrides (see SetOverride() call)
 •    duplicate log messages filter in searchd
 •    --nodetach debugging switch in searchd
 •    blackhole agents support in searchd (agent_blackhole)
 •    max_filters, max_filter_values (were hardcoded)
 •    crash handler for debugging (crash_log_path)
 •    status variables support in SphinxSE
 •    max_packet_size (was hardcoded)
Thank you
 Sphinx website
   http://sphinxsearch.com

 Percona website
   http://percona.com




Sphinx full-text search engine

Contenu connexe

Tendances

Large Scale ETL for Hadoop and Cloudera Search using Morphlines
Large Scale ETL for Hadoop and Cloudera Search using MorphlinesLarge Scale ETL for Hadoop and Cloudera Search using Morphlines
Large Scale ETL for Hadoop and Cloudera Search using Morphlineswhoschek
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4thelabdude
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Shalin Shekhar Mangar
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Shalin Shekhar Mangar
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityStéphane Gamard
 
What's new in Elasticsearch v5
What's new in Elasticsearch v5What's new in Elasticsearch v5
What's new in Elasticsearch v5Idan Tohami
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudthelabdude
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Oleksiy Panchenko
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrlucenerevolution
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013Roy Russo
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudAnshum Gupta
 
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsySearch-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsyLucidworks
 
Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...
Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...
Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...Lucidworks
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 

Tendances (20)

Large Scale ETL for Hadoop and Cloudera Search using Morphlines
Large Scale ETL for Hadoop and Cloudera Search using MorphlinesLarge Scale ETL for Hadoop and Cloudera Search using Morphlines
Large Scale ETL for Hadoop and Cloudera Search using Morphlines
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalability
 
What's new in Elasticsearch v5
What's new in Elasticsearch v5What's new in Elasticsearch v5
What's new in Elasticsearch v5
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solr
 
Intro to Apache Solr
Intro to Apache SolrIntro to Apache Solr
Intro to Apache Solr
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013
 
Solr vs ElasticSearch
Solr vs ElasticSearchSolr vs ElasticSearch
Solr vs ElasticSearch
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsySearch-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
 
Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...
Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...
Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated...
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
Scaling search with SolrCloud
Scaling search with SolrCloudScaling search with SolrCloud
Scaling search with SolrCloud
 

Similaire à Plugin Opensql2008 Sphinx

Sphinx new
Sphinx newSphinx new
Sphinx newrit2010
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Jeremy Zawodny
 
MariaDB with SphinxSE
MariaDB with SphinxSEMariaDB with SphinxSE
MariaDB with SphinxSEColin Charles
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHPMike Lively
 
Real time fulltext search with sphinx
Real time fulltext search with sphinxReal time fulltext search with sphinx
Real time fulltext search with sphinxAdrian Nuta
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyRobert Viseur
 
Maria db 10 and the mariadb foundation(colin)
Maria db 10 and the mariadb foundation(colin)Maria db 10 and the mariadb foundation(colin)
Maria db 10 and the mariadb foundation(colin)kayokogoto
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Jeremy Zawodny
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityNOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityIvan Zoratti
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchAbhishek Andhavarapu
 
Elasticsearch python
Elasticsearch pythonElasticsearch python
Elasticsearch pythonvaliantval2
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018Roy Russo
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017Roy Russo
 
Cassandra
CassandraCassandra
Cassandraexsuns
 

Similaire à Plugin Opensql2008 Sphinx (20)

Sphinx new
Sphinx newSphinx new
Sphinx new
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
 
MariaDB with SphinxSE
MariaDB with SphinxSEMariaDB with SphinxSE
MariaDB with SphinxSE
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHP
 
Real time fulltext search with sphinx
Real time fulltext search with sphinxReal time fulltext search with sphinx
Real time fulltext search with sphinx
 
Elasticsearch Introduction
Elasticsearch IntroductionElasticsearch Introduction
Elasticsearch Introduction
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
Maria db 10 and the mariadb foundation(colin)
Maria db 10 and the mariadb foundation(colin)Maria db 10 and the mariadb foundation(colin)
Maria db 10 and the mariadb foundation(colin)
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityNOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Elasticsearch python
Elasticsearch pythonElasticsearch python
Elasticsearch python
 
Elasticsearch features presentation
Elasticsearch features presentationElasticsearch features presentation
Elasticsearch features presentation
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
Hbase Nosql
Hbase NosqlHbase Nosql
Hbase Nosql
 
L6.sp17.pptx
L6.sp17.pptxL6.sp17.pptx
L6.sp17.pptx
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017
 
Cassandra
CassandraCassandra
Cassandra
 

Dernier

MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 

Dernier (20)

MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )

Plugin Opensql2008 Sphinx

  • 1. Sphinx full-text search engine
      November 15, 2008, OpenSQL Camp
      Piotr Biel, Percona Inc; Andrew Aksyonoff, Sphinx Technologies; Peter Zaitsev, Percona Inc
  • 2. Full Text Search
      Full-text search is a technique for finding documents by indexing every keyword in the stored documents and matching those keywords against the keywords supplied in the query.
  • 3. External or local?
      Local:
      • Native, easy to implement
      • No need to change the environment, which is often problematic on hosted services
      External:
      • Independent of the storage system (MySQL, PostgreSQL, Oracle, XML files…)
      • Works with all storage engines (MyISAM, InnoDB, Falcon)
      • Ideal for minimising load on databases
  • 4. Why Sphinx?
      • Great indexing and searching speed
      • Scalability
      • Better resource utilization than traditional, native FTS under many concurrent searches
      • Indexes more than text
        – Numerical attributes can be used to filter results
      • Can work as an SQL replacement
        – Sorting, grouping
        – Can be an order of magnitude more efficient than MySQL on large sets
      • Online index rebuilds
      • Good choice of matching modes and operators
  • 5. MySQL FTS vs Sphinx
      MySQL:
      • Available only for MyISAM tables
      • Slow on moderately sized (1 GB+) collections
      • Limited querying capabilities
      • Can't index BLOBs
      Sphinx:
      • Available for all types of tables
      • Scales well
      • Advanced querying capabilities
      • Ability to index all data types
  • 6. Speed
      Wikipedia database used for tests: 2.5M rows, 15G text. Times are in seconds to complete.
      Creating index (just on title):
      • MySQL FTS: 273.7 s
      • Sphinx: 75.1 s
  • 7. Speed
      Wikipedia database used for tests: 2.5M rows, 15G text. Times are in seconds to complete.
      Creating index (on title and content):
      • MySQL: +inf. Unfortunately, we had to kill MySQL after 6+ hours (21600s) of indexing.
      • Sphinx: 1159.3 s
  • 8. Speed
      Sorting speed:
      • MySQL query: SELECT id FROM test1 ORDER BY touched DESC LIMIT 1000
      • Sphinx query: php test.php -i test1 -s touched,desc
      Results:
      • MySQL: 1.87 s
      • Sphinx: 0.87 s
  • 9. Speed
      Grouping speed:
      • MySQL query: SELECT flet, COUNT(*) AS q FROM test1 GROUP BY flet ORDER BY q DESC
      • Sphinx query: php test.php -i test1 -g flet -gs @count,desc
      Results:
      • MySQL: 1.10 s
      • Sphinx: 0.33 s
  • 10. Speed (methodology)
      Sorting and grouping speed: what do we measure?
      • We create a trimmed-down MySQL table
      • We benchmark it against a similar Sphinx index
      • We benchmark full scan + ORDER BY + GROUP BY
      • A covering index, of course, makes MySQL a lot faster?
        – WRONG, it does not. Still got 1.87 sec on sorting
        – WRONG, it hurts!!! Got 1.74 sec on grouping
      • The original data, of course, makes MySQL a lot slower
        – 66.0+ sec for both sorting and grouping, IO bound
  • 11. Speed
      FTS speed: 2000 queries running in 8 threads
      • MySQL: against just 100K rows of Wiki (time constraints)
      • Sphinx: against the complete 2500K rows (25x more data)
      Total time:
      • MySQL: 86.56 s
      • Sphinx: 25.66 s
  • 12. Speed
      FTS speed: 2000 queries running in 8 threads
      • MySQL: against just 100K rows of Wiki (time constraints), i.e. 1/25 of the data
      • Sphinx: against the complete 2500K rows
      Throughput:
      • MySQL: 23.1 qps
      • Sphinx: 77.95 qps
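As a quick check of the arithmetic behind the two FTS slides, throughput is just the query count divided by the wall-clock time. The sketch below uses only the figures quoted above; the function name is illustrative, and the results agree with the slide values to within rounding of the measured times.

```python
def queries_per_second(total_queries: int, elapsed_seconds: float) -> float:
    """Throughput = completed queries / wall-clock seconds."""
    return total_queries / elapsed_seconds

# Figures taken from the slides: 2000 queries in total for each engine.
mysql_qps = queries_per_second(2000, 86.56)    # MySQL, 100K rows
sphinx_qps = queries_per_second(2000, 25.66)   # Sphinx, 2500K rows
dataset_ratio = 2_500_000 / 100_000            # Sphinx searched 25x more rows

print(round(mysql_qps, 1), round(sphinx_qps, 1), dataset_ratio)
```

Note that the two engines were not run on the same data volume, so the qps ratio alone understates the gap.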
  • 13. Scalability
      • Distributed searches
        – Many servers can be used
        – Many CPUs/cores within a single server can be used
      • Distributed indexes are fully transparent to end users
        – A virtual distributed index overlays many physical indexes, either local or remote
      • Great utilization of local resources (CPUs)
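The idea of a virtual distributed index overlaying several physical indexes can be illustrated with a small simulation. This is a conceptual sketch only, not how searchd is implemented: the shards are in-memory dictionaries, the weight is a toy keyword count, and `SHARDS`, `search_shard`, and `distributed_search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical shards: each maps doc_id -> text. In a real setup these
# would be local or remote physical indexes behind one distributed index.
SHARDS = [
    {1: "hello world", 2: "sphinx search"},
    {3: "hello sphinx", 4: "mysql full text"},
    {5: "hello hello sphinx"},
]

def search_shard(shard, keyword):
    """Return (weight, doc_id) pairs; weight is a toy keyword count."""
    hits = []
    for doc_id, text in shard.items():
        weight = text.split().count(keyword)
        if weight:
            hits.append((weight, doc_id))
    return hits

def distributed_search(keyword, limit=10):
    """Query every shard in parallel, then merge into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: search_shard(s, keyword), SHARDS)
        all_hits = [hit for part in partials for hit in part]
    # Highest weight first; ties broken by lowest doc_id.
    ranked = heapq.nsmallest(limit, all_hits, key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in ranked]

print(distributed_search("hello"))
```

The caller sees a single ranked result list and never needs to know how many shards were consulted, which is the transparency the slide describes.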
  • 14. How is it organized?
      Searchd – standalone daemon running on the system, responsible for answering client queries
      • Filtering – WHERE analogue
      • Sorting – ORDER BY analogue
      • Grouping – GROUP BY analogue
      Indexer – tool to build indexes
      • Fetching documents and splitting them into separate words
      • Processing fetched results
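To make the WHERE / ORDER BY / GROUP BY analogy concrete, here is a minimal in-memory sketch of the three operations over documents that carry attributes. It does not use the Sphinx API; the rows are invented, the attribute names `flet` and `touched` are borrowed from the benchmark queries earlier in the deck, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical documents with attributes, standing in for an index.
ROWS = [
    {"id": 1, "flet": "a", "touched": 300},
    {"id": 2, "flet": "b", "touched": 100},
    {"id": 3, "flet": "a", "touched": 200},
    {"id": 4, "flet": "c", "touched": 400},
]

def filter_rows(rows, attr, allowed):
    """WHERE analogue: keep rows whose attribute is in the allowed set."""
    return [r for r in rows if r[attr] in allowed]

def sort_rows(rows, attr, limit):
    """ORDER BY attr DESC LIMIT n analogue."""
    return sorted(rows, key=lambda r: r[attr], reverse=True)[:limit]

def group_rows(rows, attr):
    """GROUP BY attr with COUNT(*), ordered by count descending."""
    return Counter(r[attr] for r in rows).most_common()

print([r["id"] for r in filter_rows(ROWS, "flet", {"a"})])
print([r["id"] for r in sort_rows(ROWS, "touched", 2)])
print(group_rows(ROWS, "flet"))
```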
  • 15. Talking to Sphinx
      • Using APIs
        – PHP, Perl, Python, Ruby, pure-C, C++, C#, Haskell...
      • SphinxSE
        – MySQL storage engine dedicated to communication with Sphinx
        – Useful when there's no native API port
        – Also lets you juggle huge datasets directly on the MySQL side
        – Without fetching them to the client and then sending them back to MySQL
  • 16. New features in Sphinx 0.9.9
      Why?
      • 0.9.9 is the current development version
      • Beta will be published shortly (this weekend?)
      • Let’s see what’s bleeding at the edge!
      Overall:
      • 34 new features of different caliber
      • 10 are major changes
      • Had a hard time selecting the best N...
  • 17. Top-10 new features
      (1) added select-list feature with full expressions support
        – lets you specify the columns and expressions to fetch
        – compute and fetch an arbitrary number of arbitrary expressions
        – computed columns can be used for filtering, sorting, grouping
        – expressions are currently 2-4x slower than native (!) code:
            benchmarking expressions
            run 1: int-eval 49.7M/sec, flt-eval 46.3M/sec, native 129.7M/sec
            run 2: flt-eval 28.2M/sec, native 108.6M/sec
            run 3: flt-eval 269.2M/sec, native 309.9M/sec
        – further optimizations planned (JIT native code for expressions)
  • 18. Top-10 new features
      (2) added arbitrary brackets/negations nesting support to query language
        – query parser was rewritten from scratch
        – query still must be "computable"
        – implicit lists of documents such as "foo|-bar" are not allowed
        – they usually indicate a programming or querying mistake anyway…
      (3) added config reload on SIGHUP
        – lets you add and remove indexes on the fly
        – also, all index settings are now stored within the index
        – (much) fewer index/config inconsistency issues
  • 19. Top-10 new features
      (4) added signed 64-bit attrs support (sql_attr_bigint directive)
        – support means support
        – filtering, sorting, groupby, expressions, everything should work
      (5) added persistent connections, UNIX-socket, and multi-interface support (Open(), Close(), listen)
        – self-explanatory (less TCP pressure, more security, etc)
        – pconns are not (yet?) used by master searchd instances when talking to remote agents, though
        – something to add for HP/HA... sponsors are welcome :)
  • 20. Top-10 new features
      (6) added kill-list support
        – new sql_query_killlist directive
        – lets you eliminate "phantom results" from older indexes
        – fetches a list of documents to remove from previous results
        – example:
            main index:
              doc 1 title is "hello world"
              doc 2 title is "hello world reloaded"
              doc 3 title is "hello world revolutions"
            delta index:
              doc 2 gets deleted
              doc 3 title becomes "sample program"
  • 21. Top-10 new features (kill-lists, continued)
        – without kill-lists, querying for "hello" will still return documents 2 and 3
          • doc 2 could be suppressed by a runtime "deleted" flag update… ugly
          • doc 3 from "main" (!) could not be suppressed at all (phantom result)
        – solution? kill-lists attached to the "delta" index
        – the delta (!) kill-list contains ids 2 and 3
        – how does that work?
        – the result set after searching "main" will be 1,2,3
        – then the delta kill-list removes 2,3 and keeps only result 1
        – then the delta search adds new matches to the result set (if any)
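The main + delta example from the kill-list slides can be traced with a small simulation. This is a sketch of the merge logic as the slides describe it, not Sphinx code: the indexes are plain dictionaries, matching is a simple whole-word test, and all names are hypothetical.

```python
# Toy indexes from the slide example: doc_id -> title.
MAIN = {
    1: "hello world",
    2: "hello world reloaded",
    3: "hello world revolutions",
}
DELTA = {3: "sample program"}   # doc 3 was retitled; doc 2 was deleted
DELTA_KILL_LIST = {2, 3}        # ids whose copies in "main" are stale

def search(index, keyword):
    """Return ids of documents whose title contains the keyword."""
    return {doc_id for doc_id, title in index.items()
            if keyword in title.split()}

def search_main_plus_delta(keyword):
    """Search main, drop ids on the delta kill-list, add delta matches."""
    main_hits = search(MAIN, keyword)          # may include phantoms
    surviving = main_hits - DELTA_KILL_LIST    # kill-list removes them
    return sorted(surviving | search(DELTA, keyword))

print(search_main_plus_delta("hello"))    # [1]
print(search_main_plus_delta("program"))  # [3]
```

Document 3 disappears from the "hello" results because its old title lives only in "main", yet it is still found under its new title through "delta".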
  • 22. Top-10 new features
      (7) added MS SQL (aka SQL Server) source type support
      (8) added index_exact_words feature, and exact form operator
        – because sometimes you want to partially suppress stemming
        – query “your =business” will match “business of yours”
        – but not “are you busy” any more
      (9) added in-place inversion of .spa and .spp (inplace_enable, 1.5-2x less disk space for indexing)
      (10) improved excerpts speed (up to 50x faster!)
        – not exactly totally new, but major improvements…
  • 23. Other 24 features at a glance
      • added min_stemming_len
      • indexer-side column unpacker (unpack_mysqlcompress)
      • built-in Czech stemmer (morphology=stem_cz)
      • on-disk SPI support, trade RAM for IO (ondisk_dict)
      • indexer now prints out IO stats
      • HTML stripper now skips PIs (such as <?php ... ?>)
      • IsConnectError() API call (API errors vs remote errors)
      • int64 expressions and BIGINT() cast (lets you avoid 32-bit wraparounds when computing A*B)
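The 32-bit wraparound that the BIGINT() cast is meant to avoid is easy to reproduce. The sketch below emulates fixed-width arithmetic in Python by masking; the operand values are made up for illustration, and treating the 32-bit result as unsigned is an assumption, not something the slides state.

```python
def wrap_uint32(value: int) -> int:
    """Keep only the low 32 bits, as unsigned 32-bit arithmetic would."""
    return value & 0xFFFFFFFF

def wrap_int64(value: int) -> int:
    """Interpret the low 64 bits as a signed 64-bit integer."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - (1 << 64) if value >= (1 << 63) else value

a, b = 100_000, 100_000     # illustrative operands for A*B
print(wrap_uint32(a * b))   # 1410065408, not the true product
print(wrap_int64(a * b))    # 10000000000
```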
  • 24. Other 24 features at a glance
      • star-syntax support in BuildExcerpts()
      • IDIV(), NOW(), INTERVAL(), IN() functions
      • index-level early-reject based on filters
      • MVA updates feature (mva_updates_pool directive)
      • multiforms support (multiple source words can be mapped to a single destination word)
      • removed legacy matching code; everything now runs on the new V2 engine
      • field position limit (syntax: @title[50] hello world)
      • periodic .spa flush (attr_flush_period directive)
  • 25. Other 24 features at a glance
      • per-query attribute overrides (see SetOverride() call)
      • duplicate log messages filter in searchd
      • --nodetach debugging switch in searchd
      • blackhole agents support in searchd (agent_blackhole)
      • max_filters, max_filter_values (were hardcoded)
      • crash handler for debugging (crash_log_path)
      • status variables support in SphinxSE
      • max_packet_size (was hardcoded)
  • 26. Thank you
      Sphinx website: http://sphinxsearch.com
      Percona website: http://percona.com