Building search app for public mailing lists in 15 minutes with ElasticSearch

by Lukáš Vlček, #BBUZZ - 2011
Agenda

1. Who & Why
2. Searching mailing lists
3. How to do it
4. Implementation details
5. Challenges
6. Giveaways (free stuff inside!)
Who am I?

   Lukáš Vlček
Senior Software Engineer
 JBoss Community team
Red Hat, Czech Republic



        @lukasvlcek
Why am I here?
My mission is to improve search
for JBoss.org content (and make it really rock!).

Search...

Mailing lists   Blogs   Documentation   Forums   Project pages   IRC logs   ...
And we are starting with...




             mailing lists
     of JBoss community projects (†)




     (†) https://lists.jboss.org/mailman/listinfo/
It is show time!
             
  Searching mailing lists
Starring:

    Front end side:
        jQuery
Sammy.js & {{ mustache }}
      Highcharts
        Isotope

     Back end side:
     Elastic Search
Note: the web UI can be different once we go into production.
Facets




Search query in detail

         {
           from : 0,
           query : { ... },
           fields : [ ... ],
           highlight : { ... },
           facets : { ... }
         }

 Full example: https://gist.github.com/1012051
query : {...}



query_string : { query : "jboss server" }




query : {...}
filtered : {
    query : { query_string : { query : "jboss server" } },
    filter: {
      and : [
        { range : { date : { from : "2007-07-25", to : "2010-12-16" }}},
        { terms : { _index : ["weld"]}},
        { terms : { mail_list : ["dev"]}},
        { terms : { "from.not_analyzed" : [
            "Galder Zamarreno <galder.zamarreno@redhat.com>",
            "Pete Muir <pmuir@redhat.com>"]}}
      ]}
    }
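
A minimal sketch of how a client might assemble the filtered query above from the user's selections. The helper name `build_filtered_query` and its parameters are hypothetical and do not come from the talk; the output mirrors the 2011-era query DSL shown on the slide (the `filtered` query and the `and` filter were removed from later Elasticsearch releases).

```python
import json


def build_filtered_query(text, date_from=None, date_to=None,
                         projects=(), lists=(), authors=()):
    """Build a 2011-style `filtered` query (hypothetical helper, not from the talk)."""
    filters = []
    if date_from or date_to:
        date_range = {}
        if date_from:
            date_range["from"] = date_from
        if date_to:
            date_range["to"] = date_to
        filters.append({"range": {"date": date_range}})
    if projects:
        # On the slides each project appears to live in its own index.
        filters.append({"terms": {"_index": list(projects)}})
    if lists:
        filters.append({"terms": {"mail_list": list(lists)}})
    if authors:
        filters.append({"terms": {"from.not_analyzed": list(authors)}})

    query = {"query_string": {"query": text}}
    if not filters:
        # Nothing selected: fall back to the plain query_string form.
        return query
    return {"filtered": {"query": query, "filter": {"and": filters}}}


example = build_filtered_query(
    "jboss server",
    date_from="2007-07-25", date_to="2010-12-16",
    projects=["weld"], lists=["dev"],
    authors=["Pete Muir <pmuir@redhat.com>"],
)
print(json.dumps(example, indent=2))
```

Keeping the free-text part in `query` and the exact-match selections in `filter` follows the split shown on the slide.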
fields : [...]


    "date", "document_url", "from", "mail_list", 
   "message_id", "message_snippet", "subject", 
"subject_original", "to", "in_reply_to", "references"




highlight : {...}

    pre_tags : ["<span class='hlt'>"],
    post_tags : ["</span>"],
    fields : {
      first_text_message : { number_of_fragments : 3 },
      first_html_message :{ number_of_fragments : 3 },
      message_attachments :{ number_of_fragments : 3 },
      subject : { number_of_fragments : 0 }
    }


facets : {...}


         histogram : {...},
         projects : {...},
         listType : {...},
         author : {...},
         author_filter : {...}




facets.histogram : {...}



date_histogram : { field : "date", interval : "month" }




facets.author : {...}
terms : { field : "from.not_analyzed", size : 15 },
facet_filter : {
  and : [
    { query :{ query_string : { query : "jboss server" }}},
    { range :{ date :{ from :"2007-07-25", to :"2010-12-16"}}},
    { terms :{ _index :["weld"]}},
    { terms :{ mail_list :["dev"]}}
  ]
},
global : true

facets.author_filter : {...}
terms : { field : "from.not_analyzed", size : 15 },
facet_filter : {
  and : [
    { query :{ query_string : { query : "jboss server" }}},
    { terms :{ "from.not_analyzed" : [
        "Galder Zamarreno <galder.zamarreno@redhat.com>",
        "Pete Muir <pmuir@redhat.com>" ]}
    },
    { range :{ date :{ from :"2007-07-25", to :"2010-12-16"}}},
    { terms :{ _index :["weld"]}},
    { terms :{ mail_list :["dev"]}}
  ]
},
global : true
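
Comparing the two facets above, `author` leaves out the author terms filter while `author_filter` includes it. Presumably this lets the UI show counts for all candidate authors next to counts for the selected ones. The sketch below builds both from one set of active filters; the helper `terms_facet` and the filter names used as dictionary keys are hypothetical.

```python
def terms_facet(field, text, filters, exclude=None, size=15):
    """Build a global terms facet whose facet_filter repeats the query and the
    active filters, skipping the one named by `exclude` (hypothetical helper)."""
    clauses = [{"query": {"query_string": {"query": text}}}]
    for name, clause in filters.items():
        if name != exclude:
            clauses.append(clause)
    return {
        "terms": {"field": field, "size": size},
        "facet_filter": {"and": clauses},
        "global": True,
    }


# Active selections, keyed by an arbitrary name chosen for this sketch.
active = {
    "author": {"terms": {"from.not_analyzed": ["Pete Muir <pmuir@redhat.com>"]}},
    "date": {"range": {"date": {"from": "2007-07-25", "to": "2010-12-16"}}},
    "project": {"terms": {"_index": ["weld"]}},
    "list": {"terms": {"mail_list": ["dev"]}},
}

facets = {
    # Counts per author, ignoring which authors are currently selected.
    "author": terms_facet("from.not_analyzed", "jboss server", active, exclude="author"),
    # Counts per author with the author selection applied as well.
    "author_filter": terms_facet("from.not_analyzed", "jboss server", active),
}
```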
Document Detail Preview
Search query in detail

         {
           fields : [ ... ],
           query : { ... },
           filter : { ... },
           highlight : { ... }
         }


 Full example: https://gist.github.com/1012096
query : {...} & filter : {...}
  query : {
    bool : {
      should : [{
        query_string : { query : _userQuery_ }
      }],
      must : { term : { message_id : _documentId_ }}
    }
  },
  filter : { ids : { type : "mail", values : [ _documentId_ ] } }

highlight : {...}
  highlight : {
    pre_tags : [ "<span class='phlt'>" ],
    post_tags : [ "</span>" ],
    number_of_fragments : 0,
    fields : {
      subject_original : { },
      first_text_message : { },
      first_html_message : { },
      message_attachments : { number_of_fragments : 3 }
    }
  }
Document Detail Preview

                      Threads
{
  from : 0,
  size : _count_,
  fields : [ ... ],
  script_fields : {
    millis : { script : "doc['date'].date.millis" }
  },
  query : {
    constant_score : {
      filter : {
        or : {
          filters : [
            { terms : { references : _references_ } },
            { terms : { message_id_original : _references_ } }
          ]
        }
      }
    }
  }
}


              Full example: https://gist.github.com/1012115
Challenges

Parsing and indexing emails
Mail subject normalization
Mail conversations
Quoted content
Author identification
...
Challenges

1) Parsing and indexing emails

Emails can have a very complex hierarchical
structure with different content types in it. They
can even be malformed (typically spam emails).

Apache Mime4J and jsoup can do a very good job here.



 ES mapping example: https://gist.github.com/1011913
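
To illustrate the kind of traversal involved, here is a rough sketch that walks a MIME tree and pulls out the first plain-text body, the first HTML body (reduced to text) and the attachment names. It uses Python's standard `email` and `html.parser` modules as stand-ins for Mime4J and jsoup, so it is not the talk's implementation. The output keys reuse field names from the highlight slides; storing only attachment file names is a simplification made for this sketch.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes only; a very rough stand-in for what jsoup offers."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace left over from the markup.
    return " ".join(" ".join(parser.chunks).split())


def extract_parts(raw):
    """Return the first text body, first HTML body (as text) and attachment names."""
    msg = message_from_string(raw, policy=policy.default)
    doc = {"subject": str(msg.get("subject", "")),
           "first_text_message": None,
           "first_html_message": None,
           "message_attachments": []}
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        ctype = part.get_content_type()
        if part.is_attachment():
            doc["message_attachments"].append(part.get_filename())
        elif ctype == "text/plain" and doc["first_text_message"] is None:
            doc["first_text_message"] = part.get_content().strip()
        elif ctype == "text/html" and doc["first_html_message"] is None:
            doc["first_html_message"] = html_to_text(part.get_content())
    return doc


RAW = """\
From: Pete Muir <pmuir@redhat.com>
Subject: [weld-dev] Hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset="utf-8"

Hello list
--b1
Content-Type: text/html; charset="utf-8"

<html><body><p>Hello <b>list</b></p></body></html>
--b1--
"""

doc = extract_parts(RAW)
print(doc)
```

A real indexer would also need error handling around decoding, since malformed messages (unknown charsets, broken boundaries) are exactly the case the slide warns about.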
Challenges

 2) Mail subject normalization

Emails in public mailing lists contain specific
patterns in the subject that should be handled
carefully. You probably do not want to index them
as a part of the mail subject at all.

Typically something like the following:

● [infinispan-dev] Feedback on Infinispan patch
● Re: [infinispan-dev] Feedback on Infinispan patch
● Re: [hibernate-dev] [infinispan-dev] Feedback on Infi...

But it can be very funny too:

[jbossws-users] [JBossWS] – Re:
[jbossws-users] [JBossWS] - "
[rules-dev] ;

Or quite complex:

[infinispan-dev] [Fwd: [Fwd: Sometimes TCP responses not getting through on localhost]]

[hibernate-dev] [Fwd: [redhat.com #1341720] [Fwd: Re: Unable to checkout core/trunk/core]]
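
One possible way to normalize such subjects is to peel off reply/forward prefixes, bracketed tags and `[Fwd: ...]` wrappers until nothing more can be removed. The sketch below is an assumption about how this could be done, not the approach used in the talk; the regular expressions and the function name `normalize_subject` are made up for illustration, and the original subject would presumably still be kept in a separate field such as `subject_original`.

```python
import re

# "Re:", "Fwd:" or "Fw:" at the very start of the subject.
REPLY_PREFIX = re.compile(r"^(?:re|fwd?)\s*:\s*", re.IGNORECASE)
# A subject wrapped entirely in "[Fwd: ...]" (or "[Fw: ...]").
FWD_WRAPPER = re.compile(r"^\[fwd?\s*:\s*(.*)\]$", re.IGNORECASE | re.DOTALL)
# A leading bracketed tag without nested brackets, e.g. "[infinispan-dev]".
BRACKET_TAG = re.compile(r"^\[[^\[\]]*\]\s*")


def normalize_subject(subject):
    """Strip list tags and reply/forward markers from the front of a subject."""
    s = subject.strip()
    while True:
        wrapped = FWD_WRAPPER.match(s)
        if wrapped:
            s = wrapped.group(1).strip()
            continue
        stripped = REPLY_PREFIX.sub("", s, count=1)
        if stripped == s:
            stripped = BRACKET_TAG.sub("", s, count=1)
        if stripped == s:
            return s  # nothing left to remove
        s = stripped.strip()


for raw in [
    "Re: [hibernate-dev] [infinispan-dev] Feedback on Infinispan patch",
    "[infinispan-dev] [Fwd: [Fwd: Sometimes TCP responses not getting through on localhost]]",
    "[hibernate-dev] [Fwd: [redhat.com #1341720] [Fwd: Re: Unable to checkout core/trunk/core]]",
]:
    print(normalize_subject(raw))
```

This removes every leading bracketed tag, including ones that are not list names, so a production version would likely check tags against the known mailing lists first.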
Challenges

3) Identification of conversation threads

Probably the two biggest challenges are thread
hijacking and non-linear threads.

Probably the two most straightforward approaches
are using mail headers (In-Reply-To, References)
and mail subjects.
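
A small sketch of the header-based approach: messages are grouped with a union-find structure, linking each message to every id it names in In-Reply-To or References. The input keys `message_id`, `in_reply_to` and `references` follow the field names listed on the earlier slides; the function `group_threads` and the sample ids are hypothetical.

```python
from collections import defaultdict


def group_threads(messages):
    """Group messages into threads by following Message-ID links (sketch)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for m in messages:
        find(m["message_id"])
        linked = list(m.get("references") or [])
        if m.get("in_reply_to"):
            linked.append(m["in_reply_to"])
        for ref in linked:
            # Ids of messages missing from the archive still connect their replies.
            union(m["message_id"], ref)

    threads = defaultdict(list)
    for m in messages:
        threads[find(m["message_id"])].append(m["message_id"])
    return sorted(sorted(ids) for ids in threads.values())


messages = [
    {"message_id": "<a@x>", "in_reply_to": None, "references": []},
    {"message_id": "<b@x>", "in_reply_to": "<a@x>", "references": ["<a@x>"]},
    {"message_id": "<c@x>", "in_reply_to": "<b@x>", "references": ["<a@x>", "<b@x>"]},
    {"message_id": "<d@x>", "in_reply_to": None, "references": []},
]
print(group_threads(messages))
```

Branching (non-linear) threads fall out naturally because any shared ancestor joins the groups. Thread hijacking is not handled here: a hijacked reply keeps the old headers, so detecting it would need an extra signal such as a changed normalized subject.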
Challenges

     4) Quoted content

Should the quoted content be included?




   (I think it should be included...)
Challenges

   5) Author identification
The From header value is not always enough to identify an author.

"이희승 (Trustin Lee)" <trustin@gmail.com>
Trustin Lee <tlee@redhat.com>
"Trustin Lee (이희승)" <trustin@gmail.com>
Trustin Lee (이희승) <trustin@gmail.com>
Trustin Lee <trustin@gmail.com>
Trustin Lee <trustin@gleamynode.net>
이희승 <trustin@gmail.com>
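
As a sketch of the problem, the snippet below groups the From values above first by bare address and then through an alias table. The alias table is entirely hypothetical (the talk does not say how authors were merged); it only shows that address parsing removes the display-name variants but still leaves several addresses for one person.

```python
from collections import defaultdict
from email.utils import parseaddr

FROM_HEADERS = [
    '"이희승 (Trustin Lee)" <trustin@gmail.com>',
    'Trustin Lee <tlee@redhat.com>',
    '"Trustin Lee (이희승)" <trustin@gmail.com>',
    'Trustin Lee (이희승) <trustin@gmail.com>',
    'Trustin Lee <trustin@gmail.com>',
    'Trustin Lee <trustin@gleamynode.net>',
    '이희승 <trustin@gmail.com>',
]

# Hypothetical alias table mapping secondary addresses to a canonical one.
ALIASES = {
    "tlee@redhat.com": "trustin@gmail.com",
    "trustin@gleamynode.net": "trustin@gmail.com",
}


def canonical_author(header, aliases):
    """Map a From value to a canonical author key via its address."""
    _name, address = parseaddr(header)
    address = address.lower()
    return aliases.get(address, address)


by_address = defaultdict(list)
by_author = defaultdict(list)
for header in FROM_HEADERS:
    by_address[parseaddr(header)[1].lower()].append(header)
    by_author[canonical_author(header, ALIASES)].append(header)

print(len(by_address), "distinct addresses")
print(len(by_author), "distinct authors after alias merging")
```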
Want to try it yourself?

You are welcome!
Some parts of the search application will soon be
available on GitHub so anybody can index and
search mbox files.
And one more thing...
        Don't go home empty-handed.



               BigDesk
 A tiny monitoring tool for Elastic Search.

https://github.com/lukas-vlcek/bigdesk
Thank you!



   lvlcek@redhat.com
 lukas.vlcek@gmail.com
