Building an open-source based search solution –
                  first steps

                      Roman Kern

             Institute of Knowledge Management
                Graz University of Technology
                       Know-Center Graz
            rkern@tugraz.at, rkern@know-center.at


          Data Science Meetup / 2012-04-12
Overview




  Motivation


  Background


  Solr Ecosystem


  Solr Features


  Conclusions




Motivation




   Search
       Change in users' expectations
       Missing or sub-optimal search causes frustration

   Science
       Information retrieval
       Success story
       Mostly focused on web search

   Industry
       Enterprise search
       Heterogeneous data sources

Background of the Speaker




http://a1.net




                            http://wissen.de
Apache Lucene Umbrella Project




   Components
       Search engine ⇒ Lucene
       Search server ⇒ Solr
       Web search engine ⇒ Nutch
       Lightweight crawler ⇒ Droids
       File-format parsing ⇒ Tika
       Communicate with CMS ⇒ ManifoldCF
       Distributed coordination ⇒ ZooKeeper
       Natural language processing ⇒ OpenNLP
       Related projects: Hadoop, Mahout, Carrot2, ...

   Common aspects
   Apache license, implemented in Java, community
Lucene




  Search Engine Library
         Java API
             Only for expert users
         Search-Index
             File-system
             In-memory index
         Advanced features
             Incremental indexing
             Update while searching
         Base for many projects
             Solr
             ir-lib
             elasticsearch
         LIA (Lucene in Action)


  http://lucene.apache.org/core/
Nutch




  Web search engine
        Builds upon Solr
        Web crawler
            Link database, crawl database
        Distributed
            Runs on Hadoop
        Mode of operation
            Crawl a single domain
            Crawl the web with seed sites



  http://nutch.apache.org/




Droids




   Crawler component
         Lightweight crawler
         Main features
             Throttling
             Multi-threaded
             Well behaved (robots.txt)
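
   A minimal sketch of what "throttling" and "well behaved" could mean for a
   crawler, written in Python with the standard library only. It is not Droids
   code: the class name PoliteFetcher, the user agent "example-bot", the delay
   value and the robots.txt content are all assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class PoliteFetcher:
    """Checks robots.txt rules and enforces a minimum delay between requests."""

    def __init__(self, robots_txt, user_agent="example-bot", min_delay=0.05):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self.rules.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until at least min_delay seconds have passed since the last request."""
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


fetcher = PoliteFetcher(ROBOTS_TXT)
```

   A crawl loop would call fetcher.allowed(url) before each download and
   fetcher.wait_turn() right before issuing the request.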



   http://incubator.apache.org/droids/




Tika




   Text extraction
        Text & meta-data
        File-formats
             Office
                  Microsoft Formats (Apache POI)
                  OpenDocument
             Common text formats
                  PDF (PDFBox)
                  HTML (tagsoup)
             Non-text
                  Images
                  Sound



   http://tika.apache.org/


ManifoldCF




  Content Management System Connectors
       Communicate with CMS/DMS
       Connectors
            FileNet P8 (IBM)
            Documentum (EMC)
            LiveLink (OpenText)
            Meridio (Autonomy)
            Windows shares (Microsoft)
            SharePoint (Microsoft)
            More: Alfresco, JDBC, ...
       Data is then stored and indexed
            e.g. Solr



  http://incubator.apache.org/connectors/


ZooKeeper




  Distributed coordination
       Orchestrate servers
       Distributed
            Configuration
            Name lookup
            Synchronization




  http://zookeeper.apache.org/
OpenNLP




  Natural language processing
       Process plain text
       Maximum entropy classification with beam search
       Models
            Sentence splitting
            Token splitting
            Part-of-speech (POS) tagging
            Named entity recognition
            more: chunker, parser, co-reference resolution



  http://opennlp.sourceforge.net/




Hadoop




  Distributed computing
       Scale out framework
       Distributed file-system
            Data is partitioned
            Stored on multiple nodes
       Map/Reduce paradigm
            Map your algorithms to mappers & reducers
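
  To make the Map/Reduce paradigm concrete, here is a single-process word
  count in Python. It only mimics the programming model (map, group by key,
  reduce) and does not use Hadoop; the function names are illustrative
  assumptions.

```python
from collections import defaultdict


def mapper(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: combine all values that were emitted for one key."""
    return word, sum(counts)


def map_reduce(documents):
    # Shuffle phase: group the mapper output by key before reducing.
    grouped = defaultdict(list)
    for document in documents:
        for word, count in mapper(document):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


word_counts = map_reduce(["solr builds on lucene", "nutch builds on solr"])
```

  In a distributed setting the mappers and reducers would run on different
  nodes, and the grouping step would be handled by the framework.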



  Related projects: HBase, Pig, Hive, ...


  http://hadoop.apache.org/



Mahout




  Distributed machine learning
       Scale out framework
       Machine learning
            Recommender systems
            Clustering
            Classification
       Integration
            Standalone
            Hadoop
            Amazon EC2



  http://mahout.apache.org/



Details




Search Server




   What Solr is
       Web-Service
       Full-text indexing & search
       Support to store arbitrary content

   What Solr isn’t
       Solr ≠ grep
       Database
            But somewhat similar to NoSQL databases

   Solr vs. IR-Lib
       Solr: easy to use, easy to integrate, XML configuration
       IR-Lib: expert knowledge to use, Java configuration, fast

Index Structure




   Inverted Index
       Dictionary of words (terms)
       Map from term to document
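
   The inverted index idea can be sketched in a few lines of Python. This is
   a toy data structure, not Lucene's on-disk format; whitespace splitting
   stands in for real text analysis, and the sample documents are made up.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "open source search", 2: "search server", 3: "open standards"}
inverted = build_inverted_index(docs)
```

   Looking up a term is then a dictionary access, for example
   inverted["search"], instead of a scan over all documents.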

   Document
       List of fields
       Input fields are then mapped according to the schema

   Field-types
       Defined in the schema
       Type (string, boolean, date, number), internally mapped to string


Index Management




  API
        HTTP Server
        Various formats (XML, binary, JavaScript, ...)
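
  As an illustration of the XML format, the following Python sketch builds an
  update message in the add/doc/field shape used by Solr's XML update format.
  The field names and values are assumptions; sending the message (typically
  an HTTP POST to the update handler) is left out.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Serialize a batch of documents into an XML <add> update message."""
    add = ET.Element("add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the fields must exist in the schema of the target index.
message = build_add_message([{"id": "doc-1", "title": "First steps with Solr"}])
```

  Putting several documents into one message is one way to follow the
  batch-update advice from this slide.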

  Document life-cycle
        There is no update
        Delete (done automatically by Solr)
        Insert
        Implications
            A unique ID is necessary
            Use batch updates
        Commit, rollback (and optimize)
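
  The life-cycle above can be modelled with a toy in-memory index in Python.
  This is a sketch of the semantics only (update as delete plus insert keyed
  by a unique ID, changes visible after commit); the class ToyIndex and its
  methods are hypothetical and not a Solr API.

```python
class ToyIndex:
    """In-memory stand-in for an index with add, delete, commit and rollback."""

    def __init__(self):
        self.committed = {}  # documents visible to searches, keyed by unique ID
        self.pending = {}    # staged changes; None marks a deletion

    def add(self, doc):
        # Adding a document whose ID already exists replaces the old version,
        # so an "update" is effectively a delete followed by an insert.
        self.pending[doc["id"]] = doc

    def add_batch(self, docs):
        for doc in docs:
            self.add(doc)

    def delete(self, doc_id):
        self.pending[doc_id] = None

    def commit(self):
        """Make all staged changes visible."""
        for doc_id, doc in self.pending.items():
            if doc is None:
                self.committed.pop(doc_id, None)
            else:
                self.committed[doc_id] = doc
        self.pending.clear()

    def rollback(self):
        """Discard all changes staged since the last commit."""
        self.pending.clear()


toy_index = ToyIndex()
toy_index.add_batch([{"id": "1", "title": "draft"}, {"id": "2", "title": "other"}])
toy_index.commit()
toy_index.add({"id": "1", "title": "final"})  # replaces document 1 on commit
toy_index.commit()
toy_index.delete("2")
toy_index.rollback()                          # the staged deletion is discarded
```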


Input Handling




   Different input formats
       XML
       CSV
       JDBC (database)
            DIH (data import handler)
            Support incremental updates (via timestamps)
       Solr Cell
            Binary content
            Apache Tika
            Text content and metadata




Text Processing




   Scope
       During indexing & query

   Tokenization
       Split text into tokens
       Lower-casing
       Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic; ...)
       Synonyms (via a thesaurus)
       Stop-word filtering
       Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
       n-grams, soundex, umlauts
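
   A few of these steps chained together, as a Python sketch. It covers
   tokenization with multi-word splitting, lower-casing, stop-word filtering
   and character n-grams; stemming, synonyms and soundex are left out. The
   stop-word list and function names are assumptions, and the same chain
   would have to run at index time and at query time.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # tiny illustrative list


def tokenize(text):
    """Split on anything that is not a letter or digit, so "Wi-Fi" yields two tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def analyze(text):
    """Tokenize, lower-case, and drop stop words."""
    tokens = [token.lower() for token in tokenize(text)]
    return [token for token in tokens if token not in STOP_WORDS]


def char_ngrams(token, n):
    """Return all character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


terms = analyze("The Wi-Fi of Graz")
grams = char_ngrams("solr", 2)
```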


Query Processing




   Query parsers
        Lucene query parser (rich syntax)
              AND, OR, NOT, range queries, wildcards, fuzzy queries, phrase queries
              Boosting of individual parts
              Example: ((boltzmann OR schroedinger) NOT einstein)
        Dismax query parser
              No query syntax
              Searches over multiple fields (separate boost for each field)
              Configure how many terms are mandatory (minimum should match)
              Distance between terms is used for ranking (phrase boosting)
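
   A hedged sketch of how a Dismax request might be assembled in Python. The
   parameter names defType, qf, mm, pf and fl are Solr request parameters;
   the host, port, path and the field names "title" and "body" are
   assumptions that depend on the actual installation and schema.

```python
from urllib.parse import urlencode


def dismax_url(base_url, user_query):
    """Build a select URL that hands the raw user input to the Dismax parser."""
    params = {
        "defType": "dismax",     # choose the Dismax query parser
        "q": user_query,         # plain user input, no query syntax required
        "qf": "title^2.0 body",  # fields to search, with a per-field boost
        "mm": "2",               # at least two query terms must match
        "pf": "title",           # phrase boosting on this field
        "fl": "id,score",        # fields to return
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint of a local test installation.
url = dismax_url("http://localhost:8983/solr/select", "open source search")
```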


   Dismax is a good starting point, but may become expensive




Search Features




   Query filter
       Additional query
       No impact on ranking
       Results are cached
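
   The three properties of a query filter can be illustrated with a toy
   Python search function: the filter is evaluated separately, its matching
   document set is cached, and it only restricts the hits without changing
   their scores. Documents, field names and the scoring function are made up
   for this sketch and do not reflect Solr's internals.

```python
from functools import lru_cache

DOCS = {
    1: {"text": "solr search server", "lang": "en"},
    2: {"text": "search engine library", "lang": "en"},
    3: {"text": "solr suchserver", "lang": "de"},
}


@lru_cache(maxsize=None)
def filter_ids(field, value):
    """Document IDs matching a filter; computed once, then served from the cache."""
    return frozenset(i for i, doc in DOCS.items() if doc[field] == value)


def score(doc, query):
    # Toy relevance: the number of query terms that occur in the text.
    words = doc["text"].split()
    return sum(1 for term in query.split() if term in words)


def search(query, filters=()):
    """Rank documents by the main query, then restrict them by the filters."""
    hits = {i: score(doc, query) for i, doc in DOCS.items()}
    hits = {i: s for i, s in hits.items() if s > 0}
    for field, value in filters:
        allowed = filter_ids(field, value)
        hits = {i: s for i, s in hits.items() if i in allowed}  # scores untouched
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


results = search("solr search", filters=[("lang", "en")])
```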

   Boosting query
       Only in Dismax

   Query elevation
       Fix certain queries

   Request handler
       Pre-define clauses
       Invariants
Search Result




   Ranking
       Relevance
       Sort on field value (only single term per document)

   Available data & features
       Sequence of IDs & score
       Stored fields
       Snippets (plus highlighting)
       Facets
             Count the search hits
             Types: field value, dates, queries
             Sort, prefix, ...
             Can be used for term suggestion (a.k.a. query suggestion)
       Field collapsing (grouping)
       Spell checking (did-you-mean)
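
   Field-value faceting boils down to counting hits per value. The Python
   sketch below shows that idea, including sorting by count and an optional
   prefix restriction; the hit list and the field name "category" are
   invented for illustration.

```python
from collections import Counter


def facet_counts(hits, field, prefix=""):
    """Count how many hits carry each value of `field`, most frequent first."""
    counts = Counter(
        hit[field]
        for hit in hits
        if field in hit and hit[field].startswith(prefix)
    )
    return counts.most_common()


hits = [
    {"id": 1, "category": "search"},
    {"id": 2, "category": "crawler"},
    {"id": 3, "category": "search"},
]
facets = facet_counts(hits, "category")
```

   Restricting the facet values with a prefix, as in
   facet_counts(hits, "category", prefix="cr"), is what makes facets usable
   for term suggestion while the user is typing.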
Additional Solr Features




   Query by Example
       More like this

   Stats
       Per field
       Min, max, sum, missing, ...
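
   What per-field statistics amount to, as a small Python sketch over an
   in-memory list of documents. The function name, the returned keys and the
   sample data are assumptions chosen to mirror the list above.

```python
def field_stats(docs, field):
    """Compute simple statistics over one numeric field of a result set."""
    values = [doc[field] for doc in docs if doc.get(field) is not None]
    return {
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "sum": sum(values),
        "count": len(values),
        "missing": len(docs) - len(values),  # documents without a value
    }


stats = field_stats([{"price": 10}, {"price": 25}, {"title": "no price"}], "price")
```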

   Admin-GUI
       Webapp to troubleshoot queries
       Browse schema

   JMX
       Read properties & statistics
       Can be accessed remotely
Integration




   Deployment
       Within a web application server
       Embedded

   Monitor
       Log output

   Access
       Various language bindings
       Java, Ruby, JavaScript, PHP, ...



Multi-core




   Multiple indices
       Each index has its own configuration

   Operations
       Reload (when configuration has been changed)
       Rename
       Swap
       Merge
       Create, Status




Scale Solr




   Replication
       Master and slave nodes
       Replication
       Slaves poll master

   Dispatch search request
       Load balancer




Sharding Indexes




   Single index
       Index spread over multiple machines
       Search is done in parallel

   Mapping
       Application has to provide a deterministic mapping
       Document ⇒ index
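
   One possible deterministic mapping, sketched in Python: hash the unique
   document ID and take it modulo the number of shards. The choice of CRC32
   and the shard count of three are assumptions; any stable hash would do.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Deterministically map a document ID to a shard number."""
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Distribute a few hypothetical document IDs over three shards.
shards = {n: [] for n in range(3)}
for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    shards[shard_for(doc_id, 3)].append(doc_id)
```

   Because the mapping depends on the number of shards, changing that number
   moves most documents to a different index, so it is worth fixing early.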




Conclusions




   Ecosystem
          Vibrant community
          Corporate backing

   Solr
          Easy to get started
          Hard to optimize for specific requirements




The End




  Thank you!




DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps

  • 1. Building an open-source based search solution – first steps. Roman Kern, Institute of Knowledge Management, Graz University of Technology; Know-Center Graz. rkern@tugraz.at, rkern@know-center.at. Data Science Meetup, 2012-04-12.
  • 2. Overview: Motivation; Background; Solr Ecosystem; Solr Features; Conclusions.
  • 3. Motivation. Search: change in users' expectations; missing or sub-optimal search causes frustration. Science: information retrieval; success story; mostly focused on web search. Industry: enterprise search; heterogeneous data sources.
  • 4. Background of the Speaker: http://a1.net, http://wissen.de
  • 5. Apache Lucene Umbrella Project. Components: search engine ⇒ Lucene; search server ⇒ Solr; web search engine ⇒ Nutch; lightweight crawler ⇒ Droids; file-format parsing ⇒ Tika; communication with CMS ⇒ ManifoldCF; distributed coordination ⇒ ZooKeeper; natural language processing ⇒ OpenNLP. Related projects: Hadoop, Mahout, Carrot2, ... Common aspects: Apache license, implemented in Java, community.
  • 6. Lucene. Search engine library: Java API; only for expert users. Search index: file-system index; in-memory index. Advanced features: incremental indexing; update while searching. Base for many projects: Solr, ir-lib, elasticsearch. LIA (Lucene in Action). http://lucene.apache.org/core/
  • 7. Nutch. Web search engine: builds upon Solr; web crawler; link database, crawl database. Distributed: runs on Hadoop. Mode of operation: crawl a single domain; crawl the web with seed sites. http://nutch.apache.org/
  • 8. Droids. Crawler component: lightweight crawler. Main features: throttling; multi-threaded; well behaved (robots.txt). http://incubator.apache.org/droids/
  • 9. Tika. Text extraction: text & meta-data. File formats. Office: Microsoft formats (Apache POI), OpenDocument. Common text formats: PDF (PDFBox), HTML (tagsoup). Non-text: images, sound. http://tika.apache.org/
  • 10. ManifoldCF. Content management system connectors: communicate with CMS/DMS. Connectors: FileNet P8 (IBM); Documentum (EMC); LiveLink (OpenText); Meridio (Autonomy); Windows shares (Microsoft); SharePoint (Microsoft); more: Alfresco, JDBC, ... Data is then stored and indexed, e.g. in Solr. http://incubator.apache.org/connectors/
  • 11. ZooKeeper. Distributed coordination: orchestrate servers. Distributed: configuration; name lookup; synchronization. http://zookeeper.apache.org/
  • 12. OpenNLP. Natural language processing: process plain text; maximum entropy classification with beam search. Models: sentence splitting; token splitting; part-of-speech (POS) tagging; named entity recognition; more: chunker, parser, co-reference resolution. http://opennlp.sourceforge.net/
  • 13. Hadoop. Distributed computing: scale-out framework. Distributed file-system: data is partitioned; stored on multiple nodes. Map/Reduce paradigm: map your algorithms to mappers & reducers. Related projects: HBase, Pig, Hive, ... http://hadoop.apache.org/
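
The Map/Reduce paradigm named on slide 13 can be illustrated with a word count. The following is a minimal single-process sketch in plain Python, not Hadoop code; the function names `mapper`, `reducer` and `run_job` are made up for this illustration, and the grouping step that Hadoop performs across nodes is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Combine all counts emitted for one word into a single total."""
    return word, sum(counts)


def run_job(lines):
    """Simulate the shuffle phase: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = run_job(["solr builds on lucene", "nutch builds on solr"])
print(word_counts["solr"], word_counts["lucene"])
```

In a real cluster the mappers and reducers would run on different nodes, close to the partitions of the distributed file-system that hold their input.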
  • 14. Mahout. Distributed machine learning: scale-out framework. Machine learning: recommender systems; clustering; classification. Integration: standalone; Hadoop; Amazon EC2. http://mahout.apache.org/
  • 15. Details
  • 16. Search Server. What Solr is: a web service; full-text indexing & search; support to store arbitrary content. What Solr isn't: Solr ≠ grep; not a database, but somewhat similar to NoSQL databases. Solr vs. IR-Lib: Solr is easy to use, easy to integrate, XML configuration; IR-Lib requires expert knowledge to use, Java configuration, fast.
  • 17. Index Structure. Inverted index: dictionary of words (terms); map from term to document. Document: list of fields; input fields are then mapped according to the schema. Field types: defined in the schema; type (string, boolean, date, number), internally mapped to string.
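
The inverted index from slide 17 (a dictionary of terms, each mapped to the documents that contain it) can be sketched in a few lines of Python. This is a toy illustration, not how Lucene stores its index; the function name `build_inverted_index` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Very simple analysis: lower-case and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Solr is a search server",
    2: "Lucene is a search engine library",
    3: "Tika parses file formats",
}
index = build_inverted_index(docs)
print(index["search"])  # [1, 2]
print(index["tika"])    # [3]
```

Answering a one-term query is then a dictionary lookup instead of a scan over all documents, which is the main difference to a grep-style search.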
  • 18. Index Management. API: HTTP server; various formats (XML, binary, JavaScript, ...). Document life-cycle: there is no update; delete (done automatically by Solr), then insert. Implications: a unique id is necessary; use batch updates; commit, rollback (and optimize).
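
The document life-cycle on slide 18 (no in-place update, only delete plus insert keyed by a unique id, made visible on commit) can be mimicked with a small in-memory model. This is a sketch of the semantics only, not Solr's API; the class `ToyIndex` and its method names are hypothetical.

```python
class ToyIndex:
    """In-memory model of add / commit / rollback keyed by a unique id."""

    def __init__(self, id_field="id"):
        self.id_field = id_field
        self.docs = {}     # committed documents, visible to searches
        self.pending = {}  # added but not yet committed

    def add(self, doc):
        # The unique id decides which existing document gets replaced.
        self.pending[doc[self.id_field]] = dict(doc)

    def add_batch(self, docs):
        for doc in docs:
            self.add(doc)

    def commit(self):
        # "Update" means: the old document is dropped as a whole and the
        # new one is inserted; fields are not merged.
        self.docs.update(self.pending)
        self.pending.clear()

    def rollback(self):
        # Discard everything added since the last commit.
        self.pending.clear()


toy = ToyIndex()
toy.add_batch([{"id": "1", "title": "Solr", "author": "rk"}])
toy.commit()
toy.add({"id": "1", "title": "Solr intro"})  # same id, no author field
toy.commit()
print(toy.docs["1"])  # {'id': '1', 'title': 'Solr intro'}
```

The `author` field is gone after the second commit, which shows why a client has to resend the complete document and why batching many adds before one commit is advisable.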
  • 19. Input Handling. Different input formats: XML; CSV. JDBC (database): DIH (data import handler); supports incremental updates (via timestamps). Solr Cell: binary content; Apache Tika; text content and metadata.
  • 20. Text Processing. Scope: during indexing & query. Tokenization: split text into tokens. Lower-case normalization. Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic; ...). Synonyms (via thesaurus). Stop-word filtering. Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi). n-grams, soundex, umlauts.
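
A few of the text processing steps from slide 20 can be chained as follows. This is a simplified sketch and not Solr's analyzer configuration: the stop-word list is an assumption, the tokenizer only handles ASCII letters and digits (so no umlaut handling), and stemming and synonyms are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the"}


def tokenize(text):
    """Split on every non-alphanumeric character, so Wi-Fi becomes Wi, Fi."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def analyze(text):
    """Tokenize, lower-case, then drop stop words."""
    tokens = [token.lower() for token in tokenize(text)]
    return [token for token in tokens if token not in STOP_WORDS]


print(analyze("The Wi-Fi router is a Solr server"))
# ['wi', 'fi', 'router', 'solr', 'server']
```

The same chain has to run on the documents at indexing time and on the query string at search time; otherwise a query for `Wi-Fi` would not match the indexed terms `wi` and `fi`.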
  • 21. Query Processing. Query parsers. Lucene query parser (rich syntax): AND, OR, NOT, range queries, wildcards, fuzzy queries, phrase queries; boosting of individual parts; example: ((boltzmann OR schroedinger) NOT einstein). Dismax query parser: no query syntax; searches over multiple fields (separate boost for each field); configure the number of terms that are mandatory; distance between terms is used for ranking (phrase boosting). Dismax is a good starting point, but may become expensive.
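
The Dismax options on slide 21 correspond to request parameters of Solr's DisMax parser: `qf` (fields with boosts), `mm` (number of terms that must match) and `pf` (phrase boosting). The sketch below only builds the request URL and sends nothing; the host, port and path as well as the field names `title` and `body` are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to the actual deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def dismax_url(user_query, base_url=SOLR_SELECT_URL):
    """Build a select URL that hands the raw user input to the DisMax parser."""
    params = {
        "q": user_query,         # plain user input, no query syntax needed
        "defType": "dismax",     # select the DisMax query parser
        "qf": "title^2.0 body",  # search both fields, boost title matches
        "mm": "2",               # at least two query terms must match
        "pf": "title",           # boost documents where the terms form a phrase
    }
    return base_url + "?" + urlencode(params)


url = dismax_url("boltzmann schroedinger")
print(url)
```

Each additional field in `qf` and `pf` adds work per query, which is one way the parser can become expensive.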
  • 22. Search Features. Query filter: additional query; no impact on ranking; results are cached. Boosting query: only in Dismax. Query elevation: fix the top results for certain queries. Request handler: pre-define clauses; invariants.
  • 23. Search Result. Ranking: relevance; sort on field value (only a single term per document). Available data & features: sequence of IDs & scores; stored fields; snippets (plus highlighting). Facets: count the search hits; types: field value, dates, queries; sort, prefix, ...; could be used for term suggestion (a.k.a. query suggestion). Field collapsing (grouping). Spell checking (did-you-mean).
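
Field-value facets as described on slide 23 are counts of the search hits per value of a field. The following sketch computes them over an in-memory hit list; it illustrates the idea and is not Solr's faceting implementation. The function name `field_facet`, the field `project` and the sample hits are assumptions.

```python
from collections import Counter


def field_facet(hits, field, prefix=""):
    """Count hits per field value, optionally restricted to a value prefix."""
    counts = Counter(
        doc[field]
        for doc in hits
        if field in doc and doc[field].startswith(prefix)
    )
    # Sort by descending count, ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


hits = [
    {"id": 1, "project": "solr"},
    {"id": 2, "project": "lucene"},
    {"id": 3, "project": "solr"},
    {"id": 4, "project": "tika"},
]
print(field_facet(hits, "project"))
# [('solr', 2), ('lucene', 1), ('tika', 1)]
print(field_facet(hits, "project", prefix="l"))
# [('lucene', 1)]
```

The prefix variant hints at the term suggestion use case: the characters typed so far serve as the prefix and the most frequent matching values are offered as suggestions.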
  • 24. Additional Solr Features. Query by example: more like this. Stats: per field; min, max, sum, missing, ... Admin GUI: webapp to troubleshoot queries; browse schema. JMX: read properties & statistics; can be accessed remotely.
  • 25. Integration. Deployment: within a web application server; embedded. Monitor: log output. Access: various language bindings (Java, Ruby, JavaScript, PHP, ...).
  • 26. Multi-core. Multiple indices: each index has its own configuration. Operations: reload (when the configuration has been changed); rename; swap; merge; create, status.
  • 27. Scale Solr. Replication: master and slave nodes; slaves poll the master. Dispatch search requests: load balancer.
  • 28. Sharding Indexes. Single index: index spread over multiple machines; search is done in parallel. Mapping: the application has to provide a deterministic mapping document ⇒ index.
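
One common way to obtain the deterministic document ⇒ index mapping from slide 28 is to hash the unique document id and take the result modulo the number of shards. The sketch below assumes string ids and a fixed shard count; the function names are made up for this example, and the slides do not prescribe a particular hash.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Return the shard number for a document id.

    CRC32 is used instead of Python's built-in hash(), because hash() of a
    string can differ between interpreter runs and would break determinism.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards):
    """Distribute document ids over the shards according to shard_for."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


shards = partition(["doc-%d" % n for n in range(10)], num_shards=3)
print([len(shard) for shard in shards])
```

Because the mapping depends on the shard count, changing the number of shards moves most documents to a different index, so the count should be chosen with growth in mind.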
  • 29. Conclusions. Ecosystem: vibrant community; corporate backing. Solr: easy to get started; hard to optimize for specific requirements.
  • 30. The End. Thank you!