MongoDB Full Text Search with Sphinx. Pierre Far, PhD. Twitter: @ocwsearch. Web: www.ocwsearch.com. Email: pierre@ocwsearch.com
About: A search engine for the full text of OpenCourseWare course materials. 2600+ courses, 10 universities, 11 OCW collections. Courses in English, Japanese, Spanish, and Dutch.
Why MongoDB? Very helpful community. Document DB. Schemaless.
Technology Stack: Website (HTML), API (JSON), Query, Index, mongos3, xmlpipe2, Amazon S3, Adaptor Scripts.
xmlpipe2: An XML document input format for Sphinx. Any XML source works, so: read courses from MongoDB and stream them as XML. See sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
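As a rough illustration of the streaming step, the sketch below builds an xmlpipe2 document set with PHP's XMLWriter. The sample array stands in for documents read from MongoDB, and the field names (doc_id, title, body) are assumptions rather than the author's schema.

```php
<?php
// Hypothetical sample data standing in for documents read from MongoDB.
// Field names (doc_id, title, body) are assumptions, not the author's schema.
$courses = [
    ['doc_id' => 1000000001, 'title' => 'Intro to Algorithms', 'body' => 'Sorting & searching <basics>'],
    ['doc_id' => 1000000002, 'title' => 'Linear Algebra', 'body' => 'Vectors and matrices'],
];

/**
 * Build an xmlpipe2 document set for the given courses.
 */
function coursesToXmlpipe2(iterable $courses): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('sphinx:docset');

    // Declare the full-text fields once, before any document.
    $xml->startElement('sphinx:schema');
    foreach (['title', 'body'] as $field) {
        $xml->startElement('sphinx:field');
        $xml->writeAttribute('name', $field);
        $xml->endElement(); // sphinx:field
    }
    $xml->endElement(); // sphinx:schema

    foreach ($courses as $course) {
        $xml->startElement('sphinx:document');
        $xml->writeAttribute('id', (string) $course['doc_id']);
        // writeElement() escapes &, < and > in the text content.
        $xml->writeElement('title', $course['title']);
        $xml->writeElement('body', $course['body']);
        $xml->endElement(); // sphinx:document
    }

    $xml->endElement(); // sphinx:docset
    $xml->endDocument();

    return $xml->outputMemory();
}

echo coursesToXmlpipe2($courses);
```

For a large collection, building the whole docset in memory is probably not what you want; opening the writer on php://output and flushing periodically would be one way to stream it instead.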
Pitfall 1: Document ID. “ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS”. Generate a unique 10-digit numeric ID for each course. It must be deterministic. Put a unique index on the field.
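One possible way to get such an ID is to hash stable course metadata and map the digest into the 10-digit range. The slides do not say which algorithm or inputs were used, so the choice of MD5 and of collection name plus course URL below is purely an assumption. The sketch also assumes 64-bit PHP.

```php
<?php
/**
 * Derive a deterministic 10-digit numeric ID from course metadata.
 * The inputs (collection name + course URL) and the use of MD5 are
 * assumptions for illustration only. Requires 64-bit PHP.
 */
function courseDocId(string $collection, string $courseUrl): int
{
    // The same inputs always give the same digest.
    $digest = md5($collection . '|' . $courseUrl);

    // 15 hex digits = 60 bits, which fits in a signed 64-bit integer.
    $n = hexdec(substr($digest, 0, 15));

    // Map into 1000000000..9999999999: non-zero and exactly 10 digits.
    return 1000000000 + ($n % 9000000000);
}

$id = courseDocId('MIT OCW', 'http://example.edu/courses/6-001');
echo $id, "\n";
```

Hashing into a fixed range can collide, which is where the unique index earns its keep: a duplicate ID fails loudly at insert time instead of silently overwriting a course. Note also that 10-digit values can exceed 4294967295, so this assumes a Sphinx build with 64-bit document IDs.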
Pitfall 2: UTF-8. “Fatal error: Uncaught exception 'MongoException' with message 'non-utf8 string”. Encoding: it’s a lie. mb_detect_encoding() is unreliable. Two-part solution: 1. $HTML = @mb_convert_encoding($HTML, 'HTML-ENTITIES', 'utf-8'); 2. $Text = FixEncoding($Text);
FixEncoding(): A set of real encoding detection functions (http://lachy.id.au/dev/2005/11/encoding-functions-source). FixEncoding() is a wrapper for these functions.
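The general idea of validating rather than guessing can be sketched as follows. This is not the author's FixEncoding(); it is a much simpler stand-in that works line by line and assumes that anything which is not valid UTF-8 is Windows-1252. The function name toUtf8() is hypothetical, and the mbstring extension is required.

```php
<?php
/**
 * Return $text as valid UTF-8, converting line by line.
 * Simplified stand-in, not the author's FixEncoding(): it assumes any
 * line that is not valid UTF-8 is encoded as Windows-1252.
 */
function toUtf8(string $text): string
{
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        // Validate instead of guessing: valid UTF-8 lines pass through untouched.
        if (!mb_check_encoding($line, 'UTF-8')) {
            $lines[$i] = mb_convert_encoding($line, 'UTF-8', 'Windows-1252');
        }
    }
    return implode("\n", $lines);
}

// Line 1 is already UTF-8; line 2 is the same word in Windows-1252.
$mixed = "caf\xC3\xA9\ncaf\xE9";
$fixed = toUtf8($mixed);
var_dump(mb_check_encoding($fixed, 'UTF-8'));
```

Working on short pieces keeps each check cheap and lets a single document mix encodings, at the cost of mis-converting input that is neither UTF-8 nor Windows-1252.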
UTF-8 in Sphinx: In sphinx.conf, set charset_type = utf-8, ngram_chars, and charset_table. See sphinxsearch.com/wiki/doku.php?id=charset_tables
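A sphinx.conf fragment along these lines shows how the three directives might fit together with an xmlpipe2 source. The source and index names, the paths, and the exact character ranges are assumptions, not the author's configuration, and charset_type only applies to older Sphinx releases such as the one the slides describe.

```
source courses
{
    type            = xmlpipe2
    # Hypothetical script that streams courses from MongoDB as XML
    xmlpipe_command = php /path/to/stream_courses.php
}

index courses
{
    source        = courses
    path          = /var/lib/sphinx/courses
    charset_type  = utf-8

    # Index CJK characters as 1-grams so Japanese text is searchable
    ngram_len     = 1
    ngram_chars   = U+3000..U+2FA1F

    # Characters treated as letters, with upper case folded to lower case,
    # including the accented Latin-1 letters used in Spanish and Dutch
    charset_table = 0..9, A..Z->a..z, _, a..z, \
        U+C0..U+D6->U+E0..U+F6, U+D8..U+DE->U+F8..U+FE, U+DF..U+F6, U+F8..U+FF
}
```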
mongos3: MongoDB document = S3 object. Backup tool for MongoDB. $Contents = gzencode(json_encode($Course), 9);
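The serialisation step on this slide can be tried out in isolation. The sketch below round-trips a hypothetical course document through json_encode() and gzencode() and back; the field names are assumptions, and the actual S3 upload is left out because it depends on the client library in use. The zlib extension is required.

```php
<?php
// Hypothetical course document; the field names are assumptions.
$Course = ['doc_id' => 1000000001, 'title' => 'Linear Algebra', 'language' => 'en'];

// Object name = the course's unique ID.
$ObjectName = (string) $Course['doc_id'];

// Serialise to JSON and compress at the maximum level, as on the slide.
$Contents = gzencode(json_encode($Course), 9);

// ... upload $Contents under $ObjectName with an S3 client of your choice ...

// Restoring is the reverse: decompress, then decode to an associative array.
$Restored = json_decode(gzdecode($Contents), true);
var_dump($Restored === $Course);
```

Plain arrays of strings and numbers survive this round trip unchanged. MongoDB-specific types such as ObjectIds and dates do not come back as the same types from plain JSON, so a real restore would need to rebuild them.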
Thanks! Any questions? Twitter: @ocwsearch Web: www.ocwsearch.com Email: pierre@ocwsearch.com

Editor's notes

  1. 10gen and users. A course is a (really long) document. Allowed OCW Search to get new features seamlessly.
  2. PHP script. Only fields to be indexed.
  3. Use course metadata in one algorithm that always produces the same output given the same inputs.
  4. Need a way to work with all kinds of input.
  5. Uses regexes. Ugly, but works. PHP crashes with regexes matching really long strings. Split up the string into an array and loop, detecting the encoding and reacting accordingly. It’s probably wrong for cases I’ve yet to see.
  6. Uses the CloudFusion library. Object name = unique ID.