1. MongoDB Full Text Search with Sphinx Pierre Far, PhD Twitter: @ocwsearch Web: www.ocwsearch.com Email: pierre@ocwsearch.com
2. About A search engine of the full text of OpenCourseWare course materials. 2600+ courses, 10 universities, 11 OCW collections Courses in English, Japanese, Spanish, Dutch
4. Technology Stack Website (HTML), API (JSON) Query Index mongos3 xmlpipe2 Amazon S3 Adaptor Scripts
5. xmlpipe2 An XML document format for feeding input into Sphinx. Any XML source works, so... Read courses from MongoDB and stream them as XML. sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
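As a rough illustration of the streaming step, here is a minimal Python sketch that emits an xmlpipe2 document set. It is not the deck's PHP script. The `COURSES` list stands in for documents read from MongoDB, and the field names `sphinx_id`, `title`, and `body` are hypothetical.

```python
import sys
from xml.sax.saxutils import escape, quoteattr

# Hypothetical stand-in for course documents read from a MongoDB collection.
COURSES = [
    {"sphinx_id": 1000000001, "title": "Linear Algebra",
     "body": "Vectors & matrices <intro>"},
    {"sphinx_id": 1000000002, "title": "Intro to Biology",
     "body": "Cells, genes, evolution"},
]

# Only the fields that should be full-text indexed (assumed names).
FIELDS = ("title", "body")


def xmlpipe2_stream(courses, fields=FIELDS):
    """Yield an xmlpipe2 document set piece by piece, so that a long
    course never has to be held in memory as one big XML string."""
    yield '<?xml version="1.0" encoding="utf-8"?>\n'
    yield "<sphinx:docset>\n"
    # Declare the full-text fields once, up front.
    yield "<sphinx:schema>\n"
    for name in fields:
        yield "  <sphinx:field name=%s/>\n" % quoteattr(name)
    yield "</sphinx:schema>\n"
    # One <sphinx:document> per course, with XML-escaped field text.
    for course in courses:
        yield '<sphinx:document id="%d">\n' % course["sphinx_id"]
        for name in fields:
            text = escape(course.get(name, ""))
            yield "  <%s>%s</%s>\n" % (name, text, name)
        yield "</sphinx:document>\n"
    yield "</sphinx:docset>\n"


if __name__ == "__main__":
    # Sphinx reads the stream from the command's standard output.
    sys.stdout.writelines(xmlpipe2_stream(COURSES))
```

A script like this would typically be wired in through the `xmlpipe_command` setting of an `xmlpipe2` source in sphinx.conf.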
6. Pitfall 1: Document ID “ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS” Generate a unique 10-digit numeric ID for each course. Must be deterministic. Unique index on field.
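One way such an ID could be derived is sketched below in Python. This is an assumption, not the algorithm used by OCW Search: it hashes a few metadata values and folds the result into the 10-digit range. The parameter names (`university`, `course_number`, `title`) are hypothetical.

```python
import hashlib


def course_doc_id(university, course_number, title):
    """Map course metadata to a 10-digit, non-zero integer.

    The same inputs always give the same ID, because the ID depends
    only on a hash of the metadata (no counters, no randomness).
    """
    # Join with a separator unlikely to appear in the metadata itself.
    key = "\x1f".join((university, course_number, title)).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    # Take 64 bits of the hash and fold them into
    # the range 1,000,000,000 .. 9,999,999,999.
    n = int.from_bytes(digest[:8], "big")
    return 1_000_000_000 + n % 9_000_000_000


if __name__ == "__main__":
    print(course_doc_id("MIT", "18.06", "Linear Algebra"))
```

Hashing into a fixed range can in principle produce collisions, which is presumably where the unique index on the ID field helps: a duplicate would fail on insert instead of silently breaking the Sphinx index. Whether 10-digit values fit also depends on the document ID width of the Sphinx build, which is worth checking.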
8. FixEncoding(); A set of real encoding detection functions http://lachy.id.au/dev/2005/11/encoding-functions-source FixEncoding() is a wrapper for these functions
9. UTF-8 in Sphinx In sphinx.conf: charset_type = utf-8 ngram_chars charset_table sphinxsearch.com/wiki/doku.php?id=charset_tables
11. Thanks! Any questions? Twitter: @ocwsearch Web: www.ocwsearch.com Email: pierre@ocwsearch.com
Editor's notes
10gen and users. A course is a (really long) document. Allowed OCW Search to get new features seamlessly.
PHP script. Only fields to be indexed.
Use course metadata in a single algorithm that always produces the same output given the same inputs.
Need a way to work with all kinds of input
Uses regexes. Ugly, but works. PHP crashes with regexes matching really long strings. Split up string into array and loop, detecting encoding and reacting accordingly. It’s probably wrong for cases I’ve yet to see.
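The split-and-loop idea can be sketched in Python as follows. This is not FixEncoding() or the regex-based detection functions it wraps: it swaps in a simpler technique, trying UTF-8 on each chunk and falling back to a single legacy encoding. The function name `fix_encoding` and the `cp1252` fallback are assumptions.

```python
def fix_encoding(raw, fallback="cp1252"):
    """Decode bytes of mixed encoding into text, one line at a time.

    Working on small chunks avoids running detection over one very
    long string, and lets each chunk be handled on its own terms.
    """
    fixed = []
    for chunk in raw.splitlines(keepends=True):
        try:
            # Most chunks are expected to be valid UTF-8 already.
            fixed.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumed fallback: treat the chunk as Windows-1252 and
            # replace any byte that has no mapping in that code page.
            fixed.append(chunk.decode(fallback, errors="replace"))
    return "".join(fixed)


if __name__ == "__main__":
    mixed = "café\n".encode("utf-8") + "naïve\n".encode("cp1252")
    print(fix_encoding(mixed), end="")
```

As with the original, a fallback like this is a guess: a chunk in some other legacy encoding would decode without error but produce the wrong characters.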