Smart Search and Beyond
Who?

Chris Davenport
Production Leadership Team




Solving the search problem




Old Joomla Search Sucks!
 ‣ Cannot rank by relevance across content types
 ‣ Only very crude filtering
 ‣ Can be slow to search



Table of Contents
01   Smart Search so far
02   Smart Search in action
03   Smart Search under the hood
04   Smart Search tips and tricks
05   Smart Search where next?




A Short History
 ‣ Old Joomla Search
  • Introduced in Mambo
  • Largely unchanged since

 ‣ JXTended Finder for Joomla 1.5
 ‣ Finder Integration Working Group
  • Smart Search for Joomla 2.5

 ‣ Search Working Group



Smart Search for Joomla 2.5
 ‣ Separate index
 ‣ Auto-completion
 ‣ Facetted search
 ‣ Relevancy ordering
 ‣ Did you mean?
 ‣ ...and more besides







Auto-completion




Another example




Another example








Under the hood




A problem in two halves




First half: Indexing
Raw data → INDEX




Second half: Querying
Search queries → INDEX → Search results




Search results
Search results are rendered purely from
data in the index, not the raw data.
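
The two halves can be sketched in a few lines. This is a toy illustration in Python rather than Joomla's PHP code; the `index_item` and `search` names and the in-memory dictionaries are assumptions made for the example. The point it shows is that a result is built from what was stored at index time, without going back to the raw data.

```python
from collections import defaultdict

# Toy in-memory index (an assumption for illustration; Smart Search
# keeps its index in database tables).
terms = defaultdict(set)   # term -> ids of the items that contain it
stored = {}                # id -> title/url captured at index time


def index_item(item_id, title, url, text):
    """First half: turn raw data into index entries."""
    stored[item_id] = {"title": title, "url": url}
    for term in (title + " " + text).lower().split():
        terms[term].add(item_id)


def search(query):
    """Second half: answer a query using only the index."""
    ids = None
    for term in query.lower().split():
        matches = terms.get(term, set())
        ids = matches if ids is None else ids & matches
    return [stored[i] for i in sorted(ids or [])]
```

Because `search` never touches the original content, anything needed to render a result has to be captured when the item is indexed.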




Indexing




Indexing

 ‣ Parsing
 ‣ Tokenisation
 ‣ Token aggregation
 ‣ Filtration
 ‣ Stemming
 ‣ Analysis
 ‣ Term weighting
 ‣ Classification




Terms index




Parsing
 ‣ Extract plain text from raw data
  • HTML, RTF supported out-of-the-box
  • PDF, MS Word could be supported

 ‣ For example, HTML
  • Essentially the same as PHP strip_tags
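
As a rough picture of what tag stripping does, here is a minimal Python sketch built on the standard library's `html.parser`. It is not the Smart Search parser; the `TextExtractor` and `extract_text` names are made up for the example. Only text nodes are kept, so anything inside attributes is dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops tags, roughly like PHP strip_tags."""

    def __init__(self):
        # Character references are decoded here, which strip_tags
        # itself does not do.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For instance, `extract_text('<p>Hello <a href="x" title="ignored">world</a></p>')` keeps the words between the tags and discards the `title` attribute.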




Tokenisation
 ‣ Fold to lowercase
 ‣ Special handling for plus, dash, comma,
   dot and quotes
 ‣ Remove non-alphanumerics
 ‣ Replace multiple spaces with one space
 ‣ Special support for Chinese
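
The simpler steps in this list can be sketched as follows. This is a Python approximation, not the Smart Search tokeniser: the `tokenise` name is an assumption, and the special handling for plus, dash, comma, dot and quotes and the Chinese support are deliberately left out.

```python
import re


def tokenise(text):
    """Lowercase, drop non-alphanumerics, collapse spaces, split into words."""
    text = text.lower()
    # Replace anything that is not a letter, digit or whitespace with a space.
    # (The special cases for + - , . and quotes are not modelled here.)
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.split(" ") if text else []
```

Calling `tokenise("On a clear disk, you can  SEEK forever!")` yields the eight lowercase words of the sentence with the punctuation gone.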




Token aggregation
On a clear disk you can seek forever
 ‣ on, on a, on a clear
 ‣ a, a clear, a clear disk
 ‣ clear, clear disk, clear disk you
 ‣ disk, disk you, disk you can
 ‣ you, you can, you can seek
 ‣ can, can seek, can seek forever
 ‣ seek, seek forever
 ‣ forever
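
The aggregation shown for the example sentence amounts to taking every run of one, two or three consecutive tokens. A minimal Python sketch follows; the `aggregate` name and the `max_words` parameter are assumptions for illustration.

```python
def aggregate(tokens, max_words=3):
    """Return every run of 1..max_words consecutive tokens as a phrase."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_words + 1):
            if start + length <= len(tokens):
                phrases.append(" ".join(tokens[start:start + length]))
    return phrases
```

For the eight-word sentence this produces 21 phrases: eight single words, seven pairs and six triples.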


Filtration
 ‣ “Stop word removal”
  • Not removed, just given a low weight

 ‣ jos_finder_terms_common
 ‣ English only
  • Other languages need to add their common
    words to the table




Stemming
fishing, fished, fisher, fish → fish
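
To make the idea concrete, here is a deliberately naive suffix stripper in Python. It is not Snowball and not the stemmer Smart Search ships with; the suffix list and the `toy_stem` name are invented for the example, and a real stemmer applies far more careful rules and may not conflate all of these words.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # illustrative only, not Snowball's rules


def toy_stem(word):
    """Strip one common English suffix; a deliberately naive stand-in."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this toy rule set, all four example words reduce to the same stem, so a search for one of them can match documents containing any of the others.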




Stemming
 ‣ “Snowball” is used by default
  • Danish, German, English, Spanish, Finnish,
    French, Hungarian, Italian, Norwegian, Dutch,
    Portuguese, Romanian, Russian, Swedish and
    Turkish
  • BUT it requires a PHP extension

 ‣ “English only” uses a pure PHP stemmer
  • Recommended for all English sites



Morphological analysis
 ‣ Currently uses Soundex
 ‣ Not used in search as such
 ‣ Used for the “Did you mean?” feature
 ‣ If no search results found, then...
  • Match on Soundex code
  • Return nearest term/phrase by Levenshtein
    distance
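
The two-step lookup can be sketched in Python as below. This is an illustration under stated assumptions, not the Smart Search code: Soundex and Levenshtein distance are written out by hand in their textbook forms, and the `did_you_mean` helper and its list of known terms are hypothetical.

```python
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Classic four-character Soundex code for an ASCII word."""
    letters = [c for c in word.lower() if c.isalpha() and c.isascii()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        if c not in "hw":  # h and w do not separate equal codes
            previous = code
    return (result + "000")[:4]


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def did_you_mean(query, known_terms):
    """Suggest the known term that sounds like the query and is closest to it."""
    candidates = [t for t in known_terms if soundex(t) == soundex(query)]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query, t))
```

The Soundex match narrows the candidates cheaply, and the edit distance then picks the closest spelling among them.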



Term weighting

Context         Multiplier
Title           1.7
Text            0.7
Meta            1.2
Path            2.0
Miscellaneous   0.3
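
One way to picture how the multipliers affect ranking is the Python sketch below. The multipliers come from the table; treating a term's weight as a plain sum over its occurrences is an assumption made for illustration, and the real calculation may take other factors into account. The `term_weight` name and the lowercase context keys are also invented.

```python
# Context multipliers from the table above.
MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}


def term_weight(occurrences):
    """Sum the multiplier once for every occurrence of a term.

    `occurrences` maps a context name to how often the term appears there.
    Treating the weight as a plain sum is an assumption for illustration.
    """
    return sum(MULTIPLIERS[context] * count
               for context, count in occurrences.items())
```

Under this simplification, a single occurrence in the path outweighs two occurrences in the body text.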




Classification




Taxonomies
 ‣ “Content maps” in Administrator
 ‣ Basis for facetted search
 ‣ Multi-level taxonomies not fully
   supported (yet)




Taxonomies - drop-downs




Taxonomies - checkboxes




Taxonomies - links




Database ERD




Smart Search Plug-ins
/plugins
  /content
    /finder
  /finder
    /categories
    /contacts
    /content
    /newsfeeds
    /weblinks
  /system
    /highlight




Smart Search Plug-ins
content/finder            finder/[type]
onContentBeforeSave       onFinderBeforeSave
onContentAfterSave        onFinderAfterSave
onContentAfterDelete      onFinderAfterDelete
onContentChangeState      onFinderChangeState
onCategoryChangeState     onFinderCategoryChangeState
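
Reading the two columns as "the content/finder plug-in receives the event on the left and passes it on to the finder/[type] plug-ins as the event on the right", the relay can be sketched in Python as below. The event names are taken from the table; the `RecordingPlugin` class, the `relay` function and the dispatch mechanism are hypothetical stand-ins, not Joomla's plug-in API.

```python
# Event names as listed above; the relay mechanism itself is a sketch.
RELAY = {
    "onContentBeforeSave": "onFinderBeforeSave",
    "onContentAfterSave": "onFinderAfterSave",
    "onContentAfterDelete": "onFinderAfterDelete",
    "onContentChangeState": "onFinderChangeState",
    "onCategoryChangeState": "onFinderCategoryChangeState",
}


class RecordingPlugin:
    """Hypothetical stand-in for a finder/[type] plug-in."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def handle(self, event, payload):
        self.received.append((event, payload))


def relay(event, payload, finder_plugins):
    """Forward a content event to every finder plug-in under its finder name."""
    finder_event = RELAY.get(event)
    if finder_event is None:
        return 0  # not an event the indexer cares about
    for plugin in finder_plugins:
        plugin.handle(finder_event, payload)
    return len(finder_plugins)
```

This also suggests why content changed without triggering the standard content events never reaches the index: nothing is relayed, so no finder plug-in hears about it.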




Query parsing
                      URI argument      Query string
Terms                 q=Some+text       Some text
Phrases               q="Some+text"     "Some text"
Logical operators     q=This+and+that   This and that
Before a date         d1=2012-05-16     before:2012-05-16
After a date          d2=2012-05-18     after:2012-05-18
Content type filter   t[]=98233         type:Articles
Taxonomy filter       t[]=30922         author:Chris Davenport
Static filter         f=2
Highlight             qh=Some+text
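
The relationship between the two columns can be sketched for the date operators. The Python below is an illustration, not the Smart Search query parser: the `to_uri_arguments` name is an assumption, only `before:` and `after:` are handled, and type or taxonomy filters are left out because they need the numeric ids held in the index.

```python
import re
from urllib.parse import urlencode

# Query-string operators and the URI argument each one maps to (from the table).
DATE_OPERATORS = {"before": "d1", "after": "d2"}


def to_uri_arguments(query):
    """Split date operators out of a query string and URL-encode the rest."""
    args = {}

    def take_date(match):
        args[DATE_OPERATORS[match.group(1)]] = match.group(2)
        return ""

    remaining = re.sub(r"\b(before|after):(\d{4}-\d{2}-\d{2})", take_date, query)
    args["q"] = " ".join(remaining.split())
    # Sort the arguments so the output order is predictable.
    return urlencode(sorted(args.items()))
```

Whatever is left after the operators are removed becomes the `q` argument, with spaces encoded as `+` as in the table.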




Results rendering
 ‣ com_finder
  • search
    Search results page:
    ‣ default.php
    ‣ form.php
    ‣ default_results.php
    For custom types:
    ‣ default_result.php
    ‣ default_[type].php

 ‣ mod_finder
    Search module:
    ‣ default.php


Layout overrides example




Alternative override








Tips and tricks




Tips and tricks
 ‣ HTML Parser
  • Invalid HTML can confuse the parser
  • Invalid UTF-8 is ignored
  • Text in attributes is ignored




When to do a purge
 ‣ Indexing is incremental, so most of the time you don't need to.
 ‣ Purge and re-index after:
  • Changes to taxonomies that do not involve changes to content items
  • Changes to term weights
  • Changing the stemmer
  • Changes to content items that do not trigger the standard content events
 ‣ IMPORTANT
  • If you have static filters, they will be lost when you do a purge.




Tuning Smart Search
 ‣ Use the CLI for indexing
  • http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing

 ‣ Out of memory issues
  • Please report out of memory issues so we can
    understand them better.
  • Reduce batch size
    ‣ Default is 50. Drop it to 5 or even 1.

  • Terms per batch
    ‣ Can be increased, BUT this needs an Apache server config change
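
Why a smaller batch size helps with memory can be seen in a small Python sketch. It is only an illustration of batching in general: the `batches` and `index_all` names and the `index_batch` callback are assumptions, and the default of 50 mirrors the batch size mentioned above.

```python
def batches(items, batch_size=50):
    """Yield successive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_all(items, index_batch, batch_size=50):
    """Index items batch by batch and return the number of batches used."""
    count = 0
    for batch in batches(items, batch_size):
        index_batch(batch)  # only this batch is processed at one time
        count += 1
    return count
```

Dropping `batch_size` from 50 to 5 means ten times as many passes, each working on a tenth of the items, which trades speed for a lower peak memory footprint.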







Where next?




Search Working Group
 ‣ Meeting at J and Beyond
  • 19 May 2012 11:30 AM

 ‣ Stable ready for merge July 2012
 ‣ Joomla 3.0 release September 2012
 ‣ Meeting at Joomla World Conference
  • San Jose, California, November 2012




Improved language support
 ‣ Improve common word support
 ‣ Improve stemmer support
  • Native PHP stemmers?

 ‣ Improve morphological coding
  • Non-English alternatives to Soundex

 ‣ Mixed language content items
  • Language tagging of tokens/terms?


Other possibilities
 ‣ Preserve static filters on purge/index
 ‣ Decouple indexing via message queues
 ‣ Easier support for range queries
 ‣ Search logging via JLog
 ‣ Variable-length token aggregation
 ‣ Multi-level taxonomies
 ‣ Add parsers for PDF, MS Word

Search API
 ‣ Very important going forward
 ‣ Too big a leap for Joomla 3.0
 ‣ Develop in parallel during 3.x cycle
 ‣ Use in Smart Search for Joomla 4.0




Documentation


http://docs.joomla.org/Category:Smart_Search




Questions?




Don't forget


Search Working Group Meeting
Saturday 19 May 2012, 11:30 AM




Image Credits

Haystack - Mark Duncan CC-BY-SA 2.0 Generic
http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg

Under the hood - ilovebutter CC-BY 2.0 Generic
http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg

Child sucking thumb - Thahira CC-BY-SA 3.0 Unported
http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg

Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing - Public domain
http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg

Magician - Kellar: Levitation, magician poster, ca. 1894 - CC-BY 2.0 Generic
http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg

Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.) - Public domain
http://commons.wikimedia.org/wiki/File:Index_Pages.jpg

Twenty Questions - DuMont Television/Rosen Studios, New York-photographer. - Public domain
http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG

Linnaeus taxonomy - Public domain
http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png

All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.

JAB2012 Smart Search Presentation
