Needle in an enterprise haystack: search engine integrations

Who am I?

Andrew Mleczko
Plone Integrator
RedTurtle Technology (Ferrara/Italy)
andrew.mleczko@redturtle.net

So why do you need an external search engine?

• Plone's portal_catalog is slow
  with big sites (large number of
  indexed objects)

• You want to reduce Plone
  memory consumption (by
  removing heavy indexes like
  SearchableText)

• You want to query Plone's
  content from external
  applications

• You want to use advanced
  search features



There are several solutions that you can use.

Plone external indexing
and searching

• Out-of-the-box:

  • collective.gsa
    (Google Search Appliance)

  • collective.solr
    (Apache Solr)

• Custom integrations:

  • Solr

  • Tsearch2




http://www.flickr.com/photos/jenny-pics/3527749814

http://www.flickr.com/photos/st3f4n/2767217547

Solr?
A search engine based on Lucene.

Lucene?
A full-text search library, 100% in Java.

Solr
XML/HTTP and JSON interfaces, open source.

collective.solr
Python API and Plone integration.

Document format

solr

  <add><doc>
    <field name="id">123</field>
    <field name="title">The Trap</field>
    <field name="author">Agatha Christie</field>
    <field name="genre">thriller</field>
  </doc></add>

collective.solr

  >>> conn = SolrConnection(host='127.0.0.1', ...)
  >>> book = {'title': 'The Trap',
  ...         'author': 'Agatha Christie',
  ...         'genre': 'thriller'}
  >>> conn.add(**book)

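To make the mapping between the two formats concrete, here is a minimal sketch that turns a flat Python dict into the Solr `<add><doc>` message shown above. It uses only the standard library; `to_solr_add_xml` is a hypothetical helper written for illustration and is not how collective.solr builds its requests.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(doc):
    """Serialize a flat dict as a Solr <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # One <field name="..."> element per dict entry.
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


book = {"id": 123, "title": "The Trap",
        "author": "Agatha Christie", "genre": "thriller"}
xml_message = to_solr_add_xml(book)
print(xml_message)
```

The sketch ignores multi-valued fields and boosts, which a real client would need to handle.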
Response format

solr

  <response><result numFound="2" start="0">
    <doc><str name="title">Coma</str>
         <str name="author">Robin Cook</str></doc>
    <doc><str name="title">The Trap</str>
         <str name="author">Agatha Christie</str></doc>
  </result></response>

collective.solr

  >>> query = {'genre': 'thriller'}
  >>> response = conn.search(q=query)
  >>> results = SolrResponse(response).response
  >>> results.numFound
  2
  >>> results[0].title
  'Coma'
  >>> results[0].author
  'Robin Cook'

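As a rough illustration of what a response wrapper has to do, the sketch below parses the XML response shown above into plain Python values with the standard library. `parse_solr_response` is a hypothetical function that only handles `<str>` fields; it is not the `SolrResponse` implementation.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<response><result numFound="2" start="0">
  <doc><str name="title">Coma</str>
       <str name="author">Robin Cook</str></doc>
  <doc><str name="title">The Trap</str>
       <str name="author">Agatha Christie</str></doc>
</result></response>"""


def parse_solr_response(xml_text):
    """Return (numFound, list of dicts) from a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        # Each child element carries the field name in its "name" attribute.
        docs.append({el.get("name"): el.text for el in doc})
    return int(result.get("numFound")), docs


num_found, docs = parse_solr_response(SAMPLE)
print(num_found, docs[0]["title"])
```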
Who uses Solr/Lucene?

"Biblioteca Virtuale Italiana di Testi in Formato Alternativo"

Architecture

sources (CSV, Books, Z39.50, web site) -> retriever -> populator -> solr, ... -> search

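The retriever/populator split can be sketched in a few lines. Everything here is an assumption made for illustration: `csv_retriever`, `ListPopulator` and `run` are hypothetical names, and the in-memory populator stands in for the real Solr one.

```python
import csv
import io


def csv_retriever(text):
    """Hypothetical retriever: yield one dict per CSV row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "title": row["title"].strip(),
            "authors": [a.strip() for a in row["authors"].split(";")],
        }


class ListPopulator:
    """Hypothetical populator that keeps records in memory."""

    def __init__(self):
        self.records = []

    def populate(self, record):
        self.records.append(record)


def run(retrievers, populators):
    """Feed every record from every retriever to every populator."""
    count = 0
    for retriever in retrievers:
        for record in retriever:
            for populator in populators:
                populator.populate(record)
            count += 1
    return count


source = "title,authors\nComa,Robin Cook\nThe Trap,Agatha Christie\n"
target = ListPopulator()
processed = run([csv_retriever(source)], [target])
print(processed, target.records)
```

Because retrievers only yield dicts and populators only consume them, either side can be added or replaced without touching the other.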
Retrievers

 • they normalize sources into a single common format

 • a source can be anything from a CSV file to a public site

Public sites

• make a query

• grab the HTML results

• transform the HTML results into
  Python data structures using a
  configurable XPath parser

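A configurable XPath parser of this kind might look like the sketch below. It assumes the result page is well-formed XHTML so that the standard library's `xml.etree.ElementTree` (which supports a subset of XPath) can read it; real-world HTML usually needs a more lenient parser such as `lxml.html`. The sample markup, `CONFIG` and `parse_results` are all hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = """<html><body><ul>
  <li class="book"><span class="title">Coma</span>
      <span class="author">Robin Cook</span></li>
  <li class="book"><span class="title">The Trap</span>
      <span class="author">Agatha Christie</span></li>
</ul></body></html>"""

# Per-site configuration: one XPath for result rows, one per field.
CONFIG = {
    "row": ".//li[@class='book']",
    "fields": {
        "title": "./span[@class='title']",
        "author": "./span[@class='author']",
    },
}


def parse_results(html, config):
    """Extract one dict per result row using the configured XPaths."""
    root = ET.fromstring(html)
    records = []
    for row in root.findall(config["row"]):
        record = {}
        for name, path in config["fields"].items():
            node = row.find(path)
            has_text = node is not None and node.text
            record[name] = node.text.strip() if has_text else None
        records.append(record)
    return records


records = parse_results(HTML, CONFIG)
print(records)
```

Supporting a new public site then means writing a new configuration rather than new parsing code.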
Normalize it!
Every book needs a minimal set of
metadata:

• Title         • Format
• Description   • ISBN
• Authors       • ISSN
• Publisher     • Data

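One way to pin down the minimal metadata is a small record type plus a key-mapping step. This is only a sketch: the `Book` class, the `normalize` function and the Italian source keys are hypothetical, and `date` stands in for the slide's "Data" entry on the assumption that it means a publication date.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Book:
    title: str
    description: str = ""
    authors: List[str] = field(default_factory=list)
    publisher: str = ""
    format: str = ""
    isbn: Optional[str] = None
    issn: Optional[str] = None
    date: Optional[str] = None  # assumed meaning of "Data"


def normalize(raw, mapping):
    """Rename source-specific keys to Book fields; ignore unmapped keys."""
    values = {target: raw[source]
              for source, target in mapping.items() if source in raw}
    if isinstance(values.get("authors"), str):
        # Accept "A; B" style author strings from flat sources.
        values["authors"] = [a.strip()
                             for a in values["authors"].split(";")
                             if a.strip()]
    if not values.get("title"):
        raise ValueError("every book needs a title")
    return Book(**values)


raw = {"Titolo": "Coma", "Autori": "Robin Cook; Jane Doe", "Pagine": "300"}
book = normalize(raw, {"Titolo": "title", "Autori": "authors"})
print(book)
```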
Populators
Today:

• only one Solr populator

In the future:

• populate other sites
• populate RDBMS
• ...

Conclusions

• multiple retrievers,
  multiple populators

• we have used only the collective.solr
  SolrConnection API

• 120,000 books indexed so far in
  Solr; querying and indexing are
  extremely fast

tsearch2?
A search engine fully integrated in PostgreSQL 8.3.x.

tsearch2 main features

• Flexible and rich linguistic support
  (dictionaries, stop words), thesaurus

• Full UTF-8 support

• Sophisticated ranking functions
  with support of proximity and
  structure information (rank, rank_cd)

• Rich query language with query
  rewriting support

• Headline support (text fragments
  with highlighted search terms)

• It is mature (5 years of development)



first steps with tsearch2

1. PostgreSQL >= 8.4
  (but 8.3 will work as well)

2. COLUMN
  ALTER TABLE content ADD
  COLUMN search_vector tsvector;

3. INDEX
  CREATE INDEX search_index ON
  content USING gin(search_vector);




4. TRIGGER
  CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
  begin
    new.search_vector :=
      setweight(to_tsvector('pg_catalog.english',
          coalesce(new.subject,'')), 'A') ||
      setweight(to_tsvector('pg_catalog.english',
          coalesce(new.title,'')), 'B') ||
      setweight(to_tsvector('pg_catalog.english',
          coalesce(new.description,'')), 'C');
    return new;
  end
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
  ON content FOR EACH ROW EXECUTE PROCEDURE
  fullsearch_trigger();

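Once the column, index and trigger are in place, a ranked search is a single query. The sketch below only builds the SQL and its parameters; it uses the `content` table and `search_vector` column from the steps above together with PostgreSQL's `plainto_tsquery`, `@@` and `ts_rank`. The function name is hypothetical, and the `%(name)s` placeholders assume a DB-API driver with pyformat parameters, such as psycopg2.

```python
def build_search_query(term, limit=10):
    """Return (sql, params) for a ranked full-text search on content."""
    sql = (
        "SELECT content.*, ts_rank(search_vector, query) AS rank "
        "FROM content, "
        "plainto_tsquery('pg_catalog.english', %(term)s) AS query "
        "WHERE search_vector @@ query "
        "ORDER BY rank DESC "
        "LIMIT %(limit)s"
    )
    # The term is passed as a bound parameter, never pasted into the SQL.
    return sql, {"term": term, "limit": limit}


sql, params = build_search_query("Ferrara")
print(sql)
print(params)
```

With such a driver the pair would be passed as `cursor.execute(sql, params)`.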
tsearch2
How to serialize Plone content to SQL?

ore.contentmirror
"it focuses and supports out of the box, content deployment to a relational database"

How to add tsearch2 to the ore.contentmirror DDL?

   >>> from ore.contentmirror.schema import content

   >>> def setup_search(event, schema_item, bind):
   ...     bind.execute("alter table content add "
   ...                  "column search_vector tsvector")
   ...
   >>> content.append_ddl_listener('after-create',
   ...                              setup_search)

Geco - community
portal for Italian youth

Geco

• Started in 2009 for
  Emilia-Romagna
• Multiple content types,
  including video, polls, articles
  and more
• 95 editors (Emilia-Romagna)
• 100,000 documents (Emilia-Romagna)
• This year: 2 other regions join
• Future: all 20 regions join the
  project
• Every region has its own server
  deployment

Objectives
✓   fast and efficient search engine
    that can integrate multiple
    different Plone sites
✓   search results should be ordered
    by rank
✓   content should be serialized in
    SQL so it can be reused by other
    applications (ratings, comments)




rt.tsearch2

• integrates tsearch2 in PostgreSQL

• extends SQLAlchemy queries with rank
  sorting

>>> rank = '{0,0.05,0.05,0.9}'
>>> term = 'Ferrara'
>>> query = query.order_by(desc(
...     "ts_rank('%s', Content.search_vector, "
...     "to_tsquery('%s'))" % (rank, term)))

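The `rank` literal is easier to read once the ordering is known: PostgreSQL's `ts_rank` takes its weights array in the order {D, C, B, A}. Read against the trigger shown earlier, '{0,0.05,0.05,0.9}' gives 0.9 to label A (subject), 0.05 each to B (title) and C (description), and 0 to D. The helper below is a hypothetical convenience for building that literal by label; its defaults are PostgreSQL's default weights.

```python
def rank_weights(a=1.0, b=0.4, c=0.2, d=0.1):
    """Format ts_rank weights as a PostgreSQL array literal.

    PostgreSQL expects the array in the order {D, C, B, A}.
    """
    return "{%s}" % ",".join("%g" % w for w in (d, c, b, a))


weights = rank_weights(a=0.9, b=0.05, c=0.05, d=0)
print(weights)
```

Naming the labels at the call site makes it harder to swap the A and D positions by accident.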
http://www.flickr.com/photos/vramak/3499502280

Conclusions
✓ Integrating an external
  search engine in Plone is
  easy!
✓ You can find a solution
  that suits your needs!

Questions
Andrew Mleczko
RedTurtle Technology
andrew.mleczko@redturtle.net




Thank you.

Contenu connexe

Tendances

Python client api
Python client apiPython client api
Python client apidreampuf
 
第2回 Hadoop 輪読会
第2回 Hadoop 輪読会第2回 Hadoop 輪読会
第2回 Hadoop 輪読会Toshihiro Suzuki
 
анатолий шарифулин Mojolicious
анатолий шарифулин Mojoliciousанатолий шарифулин Mojolicious
анатолий шарифулин Mojoliciousrit2010
 
Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!Anatoly Sharifulin
 
NoSQL & MongoDB
NoSQL & MongoDBNoSQL & MongoDB
NoSQL & MongoDBShuai Liu
 
анатолий шарифулин Mojolicious
анатолий шарифулин Mojoliciousанатолий шарифулин Mojolicious
анатолий шарифулин Mojoliciousrit2010
 
Webinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDBWebinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDBMongoDB
 
Leveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPLeveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPJeremy Kendall
 
Php 102: Out with the Bad, In with the Good
Php 102: Out with the Bad, In with the GoodPhp 102: Out with the Bad, In with the Good
Php 102: Out with the Bad, In with the GoodJeremy Kendall
 
Leveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPLeveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPJeremy Kendall
 
The amazing world behind your ORM
The amazing world behind your ORMThe amazing world behind your ORM
The amazing world behind your ORMLouise Grandjonc
 
sf bay area dfir meetup (2016-04-30) - OsxCollector
sf bay area dfir meetup (2016-04-30) - OsxCollector   sf bay area dfir meetup (2016-04-30) - OsxCollector
sf bay area dfir meetup (2016-04-30) - OsxCollector Rishi Bhargava
 
Black Hat: XML Out-Of-Band Data Retrieval
Black Hat: XML Out-Of-Band Data RetrievalBlack Hat: XML Out-Of-Band Data Retrieval
Black Hat: XML Out-Of-Band Data Retrievalqqlan
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchLuiz Messias
 
Euroscipy SemNews 2011
Euroscipy SemNews 2011Euroscipy SemNews 2011
Euroscipy SemNews 2011Logilab
 

Tendances (18)

Python client api
Python client apiPython client api
Python client api
 
第2回 Hadoop 輪読会
第2回 Hadoop 輪読会第2回 Hadoop 輪読会
第2回 Hadoop 輪読会
 
анатолий шарифулин Mojolicious
анатолий шарифулин Mojoliciousанатолий шарифулин Mojolicious
анатолий шарифулин Mojolicious
 
Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!
 
NoSQL & MongoDB
NoSQL & MongoDBNoSQL & MongoDB
NoSQL & MongoDB
 
анатолий шарифулин Mojolicious
анатолий шарифулин Mojoliciousанатолий шарифулин Mojolicious
анатолий шарифулин Mojolicious
 
Webinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDBWebinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDB
 
Wsomdp
WsomdpWsomdp
Wsomdp
 
3. javascript bangla tutorials
3. javascript bangla tutorials3. javascript bangla tutorials
3. javascript bangla tutorials
 
Leveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPLeveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHP
 
Php 102: Out with the Bad, In with the Good
Php 102: Out with the Bad, In with the GoodPhp 102: Out with the Bad, In with the Good
Php 102: Out with the Bad, In with the Good
 
Leveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHPLeveraging the Power of Graph Databases in PHP
Leveraging the Power of Graph Databases in PHP
 
The amazing world behind your ORM
The amazing world behind your ORMThe amazing world behind your ORM
The amazing world behind your ORM
 
sf bay area dfir meetup (2016-04-30) - OsxCollector
sf bay area dfir meetup (2016-04-30) - OsxCollector   sf bay area dfir meetup (2016-04-30) - OsxCollector
sf bay area dfir meetup (2016-04-30) - OsxCollector
 
Conf orm - explain
Conf orm - explainConf orm - explain
Conf orm - explain
 
Black Hat: XML Out-Of-Band Data Retrieval
Black Hat: XML Out-Of-Band Data RetrievalBlack Hat: XML Out-Of-Band Data Retrieval
Black Hat: XML Out-Of-Band Data Retrieval
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Euroscipy SemNews 2011
Euroscipy SemNews 2011Euroscipy SemNews 2011
Euroscipy SemNews 2011
 

En vedette

Project management software of your dreams
Project management software of your dreamsProject management software of your dreams
Project management software of your dreamsAndrew Mleczko
 
Collective.amberjack ploneconf2010
Collective.amberjack ploneconf2010Collective.amberjack ploneconf2010
Collective.amberjack ploneconf2010Massimo Azzolini
 
Resoconto dalla Plone Conference 2010
Resoconto dalla Plone Conference 2010Resoconto dalla Plone Conference 2010
Resoconto dalla Plone Conference 2010Stefano Marchetti
 
Fast content import in Plone
Fast content import in PloneFast content import in Plone
Fast content import in PloneAndrew Mleczko
 
Bringing "real life" relations to Plone
Bringing "real life" relations to PloneBringing "real life" relations to Plone
Bringing "real life" relations to PloneMassimo Azzolini
 
Ferrara Eventi - la nostra applicazione iPhone per vivere al meglio Ferrara
Ferrara Eventi - la nostra applicazione iPhone per vivere al meglio FerraraFerrara Eventi - la nostra applicazione iPhone per vivere al meglio Ferrara
Ferrara Eventi - la nostra applicazione iPhone per vivere al meglio FerraraRedTurtle S.r.l.
 
ItalianSkin: an improvement in the accessibility of the Plone interface in or...
ItalianSkin: an improvement in the accessibility of the Plone interface in or...ItalianSkin: an improvement in the accessibility of the Plone interface in or...
ItalianSkin: an improvement in the accessibility of the Plone interface in or...Vincenzo Barone
 
Plone TuneUp challenges
Plone TuneUp challengesPlone TuneUp challenges
Plone TuneUp challengesAndrew Mleczko
 
Collective Amberjack - European Plone Symposium
Collective Amberjack - European Plone SymposiumCollective Amberjack - European Plone Symposium
Collective Amberjack - European Plone SymposiumMassimo Azzolini
 
Breve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 Novembre
Breve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 NovembreBreve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 Novembre
Breve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 NovembreStefano Marchetti
 
Strategie e comunicazione per il turismo sul web
Strategie e comunicazione per il turismo sul webStrategie e comunicazione per il turismo sul web
Strategie e comunicazione per il turismo sul webMassimo Azzolini
 
3M per Plone Mockup, Mediacore, Mailchimp
3M per Plone Mockup, Mediacore, Mailchimp3M per Plone Mockup, Mediacore, Mailchimp
3M per Plone Mockup, Mediacore, MailchimpStefano Marchetti
 
Future is bright, future is Plone
Future is bright, future is PloneFuture is bright, future is Plone
Future is bright, future is PloneAndrew Mleczko
 

En vedette (20)

Project management software of your dreams
Project management software of your dreamsProject management software of your dreams
Project management software of your dreams
 
Il futuro di Plone
Il futuro di PloneIl futuro di Plone
Il futuro di Plone
 
Collective.amberjack ploneconf2010
Collective.amberjack ploneconf2010Collective.amberjack ploneconf2010
Collective.amberjack ploneconf2010
 
Plone per tutte le stagioni
Plone per tutte le stagioniPlone per tutte le stagioni
Plone per tutte le stagioni
 
Resoconto dalla Plone Conference 2010
Resoconto dalla Plone Conference 2010Resoconto dalla Plone Conference 2010
Resoconto dalla Plone Conference 2010
 
Migrazione Plone4
Migrazione Plone4Migrazione Plone4
Migrazione Plone4
 
BibliotecaAccessibile
BibliotecaAccessibileBibliotecaAccessibile
BibliotecaAccessibile
 
Fast content import in Plone
Fast content import in PloneFast content import in Plone
Fast content import in Plone
 
Bringing "real life" relations to Plone
Bringing "real life" relations to PloneBringing "real life" relations to Plone
Bringing "real life" relations to Plone
 
Ferrara Eventi - la nostra applicazione iPhone per vivere al meglio Ferrara
Ferrara Eventi - la nostra applicazione iPhone per vivere al meglio FerraraFerrara Eventi - la nostra applicazione iPhone per vivere al meglio Ferrara
Ferrara Eventi - la nostra applicazione iPhone per vivere al meglio Ferrara
 
ItalianSkin: an improvement in the accessibility of the Plone interface in or...
ItalianSkin: an improvement in the accessibility of the Plone interface in or...ItalianSkin: an improvement in the accessibility of the Plone interface in or...
ItalianSkin: an improvement in the accessibility of the Plone interface in or...
 
Plone e Web 2.0
Plone e Web 2.0Plone e Web 2.0
Plone e Web 2.0
 
Plone Konferenz 2012
Plone Konferenz 2012Plone Konferenz 2012
Plone Konferenz 2012
 
Plone TuneUp challenges
Plone TuneUp challengesPlone TuneUp challenges
Plone TuneUp challenges
 
Collective Amberjack - European Plone Symposium
Collective Amberjack - European Plone SymposiumCollective Amberjack - European Plone Symposium
Collective Amberjack - European Plone Symposium
 
Breve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 Novembre
Breve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 NovembreBreve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 Novembre
Breve resoconto dalla World Plone Conference 2009 26 Ottobre - 1 Novembre
 
Strategie e comunicazione per il turismo sul web
Strategie e comunicazione per il turismo sul webStrategie e comunicazione per il turismo sul web
Strategie e comunicazione per il turismo sul web
 
3M per Plone Mockup, Mediacore, Mailchimp
3M per Plone Mockup, Mediacore, Mailchimp3M per Plone Mockup, Mediacore, Mailchimp
3M per Plone Mockup, Mediacore, Mailchimp
 
Social intranet
Social intranetSocial intranet
Social intranet
 
Future is bright, future is Plone
Future is bright, future is PloneFuture is bright, future is Plone
Future is bright, future is Plone
 

Similaire à Needle in an enterprise haystack

Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher lucenerevolution
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 
Small wins in a small time with Apache Solr
Small wins in a small time with Apache SolrSmall wins in a small time with Apache Solr
Small wins in a small time with Apache SolrSourcesense
 
REST teori og praksis; REST in theory and practice
REST teori og praksis; REST in theory and practiceREST teori og praksis; REST in theory and practice
REST teori og praksis; REST in theory and practicehamnis
 
Apache Solr for TYPO3 what's new 2018
Apache Solr for TYPO3 what's new 2018Apache Solr for TYPO3 what's new 2018
Apache Solr for TYPO3 what's new 2018timohund
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Real-time Semantic Web with Twitter Annotations
Real-time Semantic Web with Twitter AnnotationsReal-time Semantic Web with Twitter Annotations
Real-time Semantic Web with Twitter AnnotationsJoshua Shinavier
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Ecossistema Ruby - versão SCTI UNF 2013
Ecossistema Ruby - versão SCTI UNF 2013Ecossistema Ruby - versão SCTI UNF 2013
Ecossistema Ruby - versão SCTI UNF 2013Fabio Akita
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialSourcesense
 
Solr introduction
Solr introductionSolr introduction
Solr introductionLap Tran
 
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, LucidworksLifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, LucidworksLucidworks
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Rails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search EngineRails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search EngineDavid Keener
 
Backbonejs for beginners
Backbonejs for beginnersBackbonejs for beginners
Backbonejs for beginnersDivakar Gu
 

Similaire à Needle in an enterprise haystack (20)

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 
Small wins in a small time with Apache Solr
Small wins in a small time with Apache SolrSmall wins in a small time with Apache Solr
Small wins in a small time with Apache Solr
 
REST teori og praksis; REST in theory and practice
REST teori og praksis; REST in theory and practiceREST teori og praksis; REST in theory and practice
REST teori og praksis; REST in theory and practice
 
Apache Solr for TYPO3 what's new 2018
Apache Solr for TYPO3 what's new 2018Apache Solr for TYPO3 what's new 2018
Apache Solr for TYPO3 what's new 2018
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Real-time Semantic Web with Twitter Annotations
Real-time Semantic Web with Twitter AnnotationsReal-time Semantic Web with Twitter Annotations
Real-time Semantic Web with Twitter Annotations
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Ecossistema Ruby - versão SCTI UNF 2013
Ecossistema Ruby - versão SCTI UNF 2013Ecossistema Ruby - versão SCTI UNF 2013
Ecossistema Ruby - versão SCTI UNF 2013
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
 
Solr introduction
Solr introductionSolr introduction
Solr introduction
 
iSoligorsk #3 2013
iSoligorsk #3 2013iSoligorsk #3 2013
iSoligorsk #3 2013
 
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, LucidworksLifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Solr Flair
Solr FlairSolr Flair
Solr Flair
 
Rails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search EngineRails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search Engine
 
Backbonejs for beginners
Backbonejs for beginnersBackbonejs for beginners
Backbonejs for beginners
 

Plus de Andrew Mleczko

Lost in o auth? learn velruse and get your life back
Lost in o auth? learn velruse and get your life backLost in o auth? learn velruse and get your life back
Lost in o auth? learn velruse and get your life backAndrew Mleczko
 
Celery and the social networks
Celery and the social networksCelery and the social networks
Celery and the social networksAndrew Mleczko
 
PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...
PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...
PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...Andrew Mleczko
 
Bootstrap your app in 45 seconds
Bootstrap your app in 45 secondsBootstrap your app in 45 seconds
Bootstrap your app in 45 secondsAndrew Mleczko
 
PyconUA - How to build ERP application having fun?
PyconUA - How to build ERP application having fun?PyconUA - How to build ERP application having fun?
PyconUA - How to build ERP application having fun?Andrew Mleczko
 
EuroPython 2011 - How to build complex web applications having fun?
EuroPython 2011 - How to build complex web applications having fun?EuroPython 2011 - How to build complex web applications having fun?
EuroPython 2011 - How to build complex web applications having fun?Andrew Mleczko
 

Plus de Andrew Mleczko (6)

Lost in o auth? learn velruse and get your life back
Lost in o auth? learn velruse and get your life backLost in o auth? learn velruse and get your life back
Lost in o auth? learn velruse and get your life back
 
Celery and the social networks
Celery and the social networksCelery and the social networks
Celery and the social networks
 
PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...
PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...
PloneConf2012 - Are you in a hole and still digging? Or how to become an agil...
 
Bootstrap your app in 45 seconds
Bootstrap your app in 45 secondsBootstrap your app in 45 seconds
Bootstrap your app in 45 seconds
 
PyconUA - How to build ERP application having fun?
PyconUA - How to build ERP application having fun?PyconUA - How to build ERP application having fun?
PyconUA - How to build ERP application having fun?
 
EuroPython 2011 - How to build complex web applications having fun?
EuroPython 2011 - How to build complex web applications having fun?EuroPython 2011 - How to build complex web applications having fun?
EuroPython 2011 - How to build complex web applications having fun?
 

Needle in an enterprise haystack

  • 1. Needle in an enterprise haystack: search engine integrations
  • 2. Who am I? Andrew Mleczko, Plone Integrator, Redturtle Technology (Ferrara/Italy), andrew.mleczko@redturtle.net
  • 3. so why do you need an external search engine?
  • 4. why do you need an external search engine... • Plone's portal_catalog is slow with big sites (large number of indexed objects) • You want to reduce Plone memory consumption (by removing heavy indexes like SearchableText) • You want to query Plone's content from external applications • You want to use advanced search features
  • 5. there are several solutions that you can use
  • 6. Plone external indexing and searching • Out-of-the-box: • collective.gsa (Google Search Appliance) • collective.solr (Apache Solr) • Custom integrations: • Solr • Tsearch2 (photo: http://www.flickr.com/photos/jenny-pics/3527749814)
  • 8. Solr: a search engine based on Lucene (photo: http://www.flickr.com/photos/st3f4n/2767217547)
  • 10. Lucene: full-text search library, 100% in Java
  • 11. Solr: XML/HTTP, JSON interface, Open Source
  • 12. collective.solr: Python API and Plone integration
  • 13. Document format (solr, collective.solr)
  • 14. Document format (solr):
        <add><doc>
          <field name="id">123</field>
          <field name="title">The Trap</field>
          <field name="author">Agatha Christie</field>
          <field name="genre">thriller</field>
        </doc></add>
  • 15. Document format (collective.solr):
        >>> from collective.solr.solr import SolrConnection
        >>> conn = SolrConnection(host='127.0.0.1', ...)
        >>> book = {'title': 'The Trap',
        ...         'author': 'Agatha Christie',
        ...         'genre': 'thriller'}
        >>> conn.add(**book)
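
To make the link between the two halves of this slide concrete, here is a minimal sketch of how a flat Python dict can be serialized into the Solr `<add><doc>` message shown above. It uses only the standard library and is not collective.solr's own code; the helper name `build_add_message` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_add_message(document):
    """Serialize a flat dict into a Solr <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in document.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Same book as on the slide, including the id field from the XML version.
book = {
    "id": "123",
    "title": "The Trap",
    "author": "Agatha Christie",
    "genre": "thriller",
}
message = build_add_message(book)
print(message)
```

In a real integration this string would be POSTed to Solr's update handler, which is the part `SolrConnection.add` takes care of.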
  • 17. Response format (solr):
        <response><result numFound="2" start="0">
          <doc><str name="title">Coma</str>
               <str name="author">Robin Cook</str></doc>
          <doc><str name="title">The Trap</str>
               <str name="author">Agatha Christie</str></doc>
        </result></response>
  • 18. Response format (collective.solr):
        >>> from collective.solr.parser import SolrResponse
        >>> response = conn.search(q='genre:thriller')
        >>> results = SolrResponse(response).response
        >>> results.numFound
        2
        >>> results[0].title
        'Coma'
        >>> results[0].author
        'Robin Cook'
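
As a rough illustration of what a response parser has to do, the sketch below turns the XML response from the previous slide into a hit count and a list of dicts. It is a standard-library approximation, not the `SolrResponse` implementation, and the function name `parse_response` is an assumption.

```python
import xml.etree.ElementTree as ET

RESPONSE = """
<response><result numFound="2" start="0">
  <doc><str name="title">Coma</str>
       <str name="author">Robin Cook</str></doc>
  <doc><str name="title">The Trap</str>
       <str name="author">Agatha Christie</str></doc>
</result></response>
"""


def parse_response(xml_text):
    """Return (numFound, list of dicts) from a Solr XML response."""
    result = ET.fromstring(xml_text.strip()).find("result")
    docs = [
        # Each child of <doc> is a typed element (<str>, <int>, ...) whose
        # "name" attribute is the field name; only scalar text is handled here.
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


num_found, docs = parse_response(RESPONSE)
print(num_found, docs[0]["title"], docs[0]["author"])
```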
  • 20. Who uses Solr/Lucene?
  • 21. "Biblioteca Virtuale Italiana di Testi in Formato Alternativo" (Italian Virtual Library of Texts in Alternative Formats)
  • 22. Architecture [diagram]: sources (CSV, Z39.50, web site, ...) → one retriever per source → Books → populators → solr → search
  • 23. Retrievers • they normalize sources to a single common format • a source can be anything from a CSV file to a public site
  • 24. Public sites • the retriever makes a query • grabs the HTML results • transforms the HTML results into Python data structures using a configurable XPath parser
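
The "configurable XPath parser" idea could look roughly like the sketch below: one XPath selects the result rows and one XPath per field extracts the values. The page markup, the `SITE_CONFIG` layout and the function name are all made up for illustration. The standard-library parser used here only accepts well-formed markup and a subset of XPath, so real-world HTML would more likely need a library such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical result page, kept well-formed so the stdlib parser accepts it.
PAGE = """
<html><body><table>
  <tr class="result"><td class="title">Coma</td><td class="author">Robin Cook</td></tr>
  <tr class="result"><td class="title">The Trap</td><td class="author">Agatha Christie</td></tr>
</table></body></html>
"""

# Per-site configuration: one XPath for the rows, one per extracted field.
SITE_CONFIG = {
    "row": ".//tr[@class='result']",
    "fields": {"title": "td[@class='title']", "author": "td[@class='author']"},
}


def parse_results(html, config):
    """Turn result rows into plain dicts using the configured XPaths."""
    root = ET.fromstring(html.strip())
    books = []
    for row in root.findall(config["row"]):
        book = {}
        for name, path in config["fields"].items():
            # findtext returns None when the path does not match.
            book[name] = (row.findtext(path) or "").strip()
        books.append(book)
    return books


books = parse_results(PAGE, SITE_CONFIG)
print(books)
```

Supporting a new public site then means writing a new configuration dict rather than new parsing code.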
  • 25. Normalize it! Every Book needs to have minimal metadata: • Title • Description • Authors • Publisher • Format • ISBN • ISSN • Date
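
A minimal sketch of the normalization step, assuming a CSV source: each row is mapped onto the eight metadata fields listed above. The column names in `CSV_COLUMNS`, the sample row and the choice to fill missing fields with an empty string are illustrative assumptions, not details from the talk.

```python
import csv
import io

REQUIRED_FIELDS = ("title", "description", "authors", "publisher",
                   "format", "isbn", "issn", "date")

# Hypothetical mapping from one source's column names to the common names.
CSV_COLUMNS = {"Titolo": "title", "Autori": "authors", "Editore": "publisher",
               "Formato": "format", "ISBN": "isbn", "Data": "date"}


def normalize(row, column_map):
    """Map a source row onto the common Book format; missing fields become ''."""
    book = dict.fromkeys(REQUIRED_FIELDS, "")
    for source_name, value in row.items():
        target = column_map.get(source_name)
        if target:
            book[target] = (value or "").strip()
    return book


# Made-up sample data, only here to exercise the function.
SOURCE = (
    "Titolo,Autori,Editore,Formato,ISBN,Data\n"
    "Coma, Robin Cook ,Example Press,audio,0000000000,2010\n"
)
books = [normalize(row, CSV_COLUMNS) for row in csv.DictReader(io.StringIO(SOURCE))]
print(books[0])
```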
  • 26. Populators • Today: only one solr populator • In the future: populate other sites, populate RDBMS, ...
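
One way to keep retrievers and populators interchangeable is to give each a tiny interface, sketched below. The class and method names (`Retriever.retrieve`, `Populator.populate`, `run`) are hypothetical, and the in-memory populator only stands in for the Solr one, which would call `conn.add(**book)` for each book.

```python
from typing import Iterable, Protocol


class Retriever(Protocol):
    def retrieve(self) -> Iterable[dict]: ...


class Populator(Protocol):
    def populate(self, books: Iterable[dict]) -> int: ...


class ListRetriever:
    """Stand-in retriever that yields already-normalized books."""

    def __init__(self, books):
        self.books = books

    def retrieve(self):
        return iter(self.books)


class MemoryPopulator:
    """Stand-in populator that stores books in a dict keyed by ISBN."""

    def __init__(self):
        self.index = {}

    def populate(self, books):
        count = 0
        for book in books:
            self.index[book["isbn"]] = book
            count += 1
        return count


def run(retrievers, populators):
    """Feed every retrieved book to every populator; return books processed."""
    books = [book for retriever in retrievers for book in retriever.retrieve()]
    for populator in populators:
        populator.populate(books)
    return len(books)


store = MemoryPopulator()
processed = run(
    [ListRetriever([{"isbn": "1", "title": "Coma"}]),
     ListRetriever([{"isbn": "2", "title": "The Trap"}])],
    [store],
)
print(processed, sorted(store.index))
```

Adding an RDBMS or "other site" populator later would then mean adding one more class with a `populate` method, without touching the retrievers.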
  • 27. Conclusions • multiple retrievers, multiple populators • we have used only the collective.solr SolrConnection API • 120,000 books indexed so far in solr; querying and indexing are extremely fast
  • 29. tsearch2? search engine fully integrated in PostgreSQL 8.3.x
  • 30. tsearch2 main features • Flexible and rich linguistic support (dictionaries, stop words), thesaurus • Full UTF-8 support • Sophisticated ranking functions with support of proximity and structure information (rank, rank_cd) • Rich query language with query rewriting support • Headline support (text fragments with highlighted search terms) • It is mature (5 years of development)
  • 31. first steps with tsearch2
        1. PostgreSQL >= 8.4 (but 8.3 will work as well)
        2. COLUMN: ALTER TABLE content ADD COLUMN search_vector tsvector;
        3. INDEX: CREATE INDEX search_index ON content USING gin(search_vector);
  • 32. first steps with tsearch2
        4. TRIGGER:
           CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
           begin
             new.search_vector :=
               setweight(to_tsvector('pg_catalog.english', coalesce(new.subject,'')), 'A') ||
               setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'B') ||
               setweight(to_tsvector('pg_catalog.english', coalesce(new.description,'')), 'C');
             return new;
           end
           $$ LANGUAGE plpgsql;

           CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON content
             FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
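
Once the column, index and trigger exist, a ranked search is a single query against `search_vector`. The sketch below only builds the statement and its parameters so it can run without a database. The `content_id` column, the use of `plainto_tsquery` and the psycopg2-style `%s` placeholders are assumptions rather than details from the slides.

```python
SEARCH_SQL = """
SELECT content_id, title, ts_rank(search_vector, query) AS rank
FROM content, plainto_tsquery('pg_catalog.english', %s) AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT %s
"""


def build_search(term, limit=10):
    """Return (sql, params) ready for a DB-API cursor.execute() call."""
    term = term.strip()
    if not term:
        raise ValueError("empty search term")
    # The term travels as a bound parameter, never interpolated into the SQL.
    return SEARCH_SQL, (term, int(limit))


sql, params = build_search("  Ferrara ", limit=5)
print(params)
# With a PostgreSQL driver such as psycopg2 (assumed): cursor.execute(sql, params)
```

The `@@` match operator can use the GIN index created in step 3, while `ts_rank` orders the matching rows.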
  • 33. tsearch2: how to serialize Plone content to SQL?
  • 34. ore.contentmirror: "it focuses and supports, out of the box, content deployment to a relational database"
  • 35. how to add tsearch2 to ore.contentmirror ddl?
  • 36. How to add tsearch2 to ore.contentmirror ddl?
        >>> from ore.contentmirror.schema import content
        >>> def setup_search(event, schema_item, bind):
        ...     # runs after the content table is created
        ...     bind.execute("alter table content add "
        ...                  "column search_vector tsvector")
        ...
        >>> content.append_ddl_listener('after-create', setup_search)
  • 37. Geco - community portal for Italian youth
  • 38. Geco • Started in 2009 for Emilia-Romagna • Multiple content types, including video, polls, articles and more
  • 39. Geco • 95 editors (Emilia-Romagna) • 100,000 documents (Emilia-Romagna) • This year: 2 other regions join • Future: all 20 regions join the project • Every region has its own server deployment
  • 40. Objectives ✓ fast and efficient search engine that can integrate multiple different Plone sites ✓ search results should be ordered by rank ✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments)
  • 41. rt.tsearch2 • integrates tsearch2 in PostgreSQL • extends sqlalchemy query with rank sorting
  • 42. rt.tsearch2 • integrates tsearch2 in PostgreSQL • extends sqlalchemy query with rank sorting
        >>> from sqlalchemy import desc
        >>> rank = '{0,0.05,0.05,0.9}'
        >>> term = 'Ferrara'
        >>> query = query.order_by(desc(
        ...     "ts_rank('%s', Content.search_vector, to_tsquery('%s'))"
        ...     % (rank, term)))
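
The `rank` literal is easier to read once its positions are labelled. PostgreSQL interprets the `ts_rank` weights array in the order {D, C, B, A}, so the small sketch below (the helper name is hypothetical) maps the literal from the slide back to the labels used by `setweight` in the trigger.

```python
def parse_rank_weights(literal):
    """Map a ts_rank weights array literal to the labels A-D.

    PostgreSQL reads the array as {D-weight, C-weight, B-weight, A-weight}.
    """
    values = [float(value) for value in literal.strip("{}").split(",")]
    if len(values) != 4:
        raise ValueError("ts_rank expects exactly four weights")
    return dict(zip("DCBA", values))


weights = parse_rank_weights("{0,0.05,0.05,0.9}")
print(weights)
```

With the trigger from slide 32 (subject = A, title = B, description = C), these weights let a match in the subject dominate the ranking, while title and description contribute only a little.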
  • 44. Conclusions ✓ Integrating an external search engine in Plone is easy! ✓ You can find a solution that suits your needs!