Improving the Solr Update Chain
           Jan Høydahl
What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion




Jan Høydahl
          1995: Developer telecom
          1998: Java developer
          2000: Search - FAST
          2006: Lucene
          2007: Cominvent
          2011: Lucene committer


          > 100 projects



Cominvent AS




Consulting & support: Lucene/Solr, FAST
www.solrtraining.com


Why document processing?
Analysis is Field oriented
Filters only see the “local” field




But what if you want to:
  Add or remove fields?
  Make decisions based on other fields?
We need a way to modify the Document




Example query: programmer near Barcelona
Doc1 as submitted: name, postcode, cv_pdf_url
Doc1 as needed to match the query: name, postcode, latlong, cv_pdf_url, cv_text
Who adds latlong and cv_text: the client itself, or a 3rd party pipeline?




Solr’s Update Chain




The Update Chain

Stage                        Fields in the document after the stage
Doc (as submitted)           name, postcode, cv_pdf_url
PostcodeToLatLongProcessor   name, postcode, latlong, cv_pdf_url
UrlFetcherProcessor          name, postcode, latlong, cv_pdf_url, cv_pdf_bin
TikaExtractingProcessor      name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text
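
The flow through the three processors can be sketched in plain Java. This is only an illustration of the idea that each stage sees the whole document and may add fields. It does not use Solr's API: the class name, the `runChain` and `cvChain` helpers, and all field values are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Simplified stand-in for an update chain. This is NOT Solr's API.
class UpdateChainSketch {

    // Runs every stage in order; each stage sees the whole document.
    static Map<String, Object> runChain(Map<String, Object> input,
                                        List<Consumer<Map<String, Object>>> stages) {
        Map<String, Object> doc = new LinkedHashMap<>(input); // work on a copy
        for (Consumer<Map<String, Object>> stage : stages) {
            stage.accept(doc);
        }
        return doc;
    }

    // Hypothetical stages named after the slide; all values are placeholders.
    static List<Consumer<Map<String, Object>>> cvChain() {
        List<Consumer<Map<String, Object>>> stages = new ArrayList<>();
        // PostcodeToLatLong: a real processor would look the postcode up.
        stages.add(doc -> doc.put("latlong", "latlong-for-" + doc.get("postcode")));
        // UrlFetcher: a real processor would download cv_pdf_url.
        stages.add(doc -> doc.put("cv_pdf_bin", new byte[0]));
        // TikaExtracting: a real processor would extract text from cv_pdf_bin.
        stages.add(doc -> doc.put("cv_text", "text-extracted-from-cv_pdf_bin"));
        return stages;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        // Prints the field names present after the last stage.
        System.out.println(runChain(doc, cvChain()).keySet());
    }
}
```

The order of the stages matters: the extraction stage can only work because the fetch stage has already added cv_pdf_bin.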



How it’s wired
Chain definition in solrconfig.xml:




Choose chain in your update request:
.../solr/update/xml?..&update.chain=cv-chain



Other examples




Language Identification
Entity extraction (Company, Location, Date), for example from this text:
  The Apache Software Foundation (ASF) is a non-profit corporation to
  support Apache software projects. The ASF was formed from the
  Apache Group and incorporated in Delaware, U.S., in June 1999.

Writing your own processor




•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and
  ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
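
A minimal sketch of the first and third points (generic, parameterized processors with prefixed parameter names). In Solr the real base classes are UpdateRequestProcessorFactory and UpdateRequestProcessor; the types below are simplified stand-ins so that the example runs without Solr on the classpath. All class names and the "regexreplace." parameter prefix are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Demo entry point; every type in this file is a stand-in, not Solr's API.
class ProcessorSketchDemo {
    static String clean(String text) {
        Map<String, String> params = new HashMap<>();
        params.put("regexreplace.fl", "content_no content_en");
        params.put("regexreplace.pattern", "[\\s\\u00A0]+");
        params.put("regexreplace.replacement", " ");
        RegexReplaceSketchFactory factory = new RegexReplaceSketchFactory(params);

        Map<String, Object> doc = new HashMap<>();
        doc.put("content_en", text);
        factory.getInstance(null).processAdd(doc); // null: last link in the chain
        return (String) doc.get("content_en");
    }

    public static void main(String[] args) {
        System.out.println(clean("Solr\u00A0\u00A0 update   chain"));
    }
}

// Stand-in for one link in the chain.
abstract class SketchProcessor {
    protected final SketchProcessor next;

    SketchProcessor(SketchProcessor next) {
        this.next = next;
    }

    // Default behaviour: hand the document to the next processor, if any.
    void processAdd(Map<String, Object> doc) {
        if (next != null) {
            next.processAdd(doc);
        }
    }
}

// Generic processor: fields, pattern and replacement all come from configuration.
class RegexReplaceSketchProcessor extends SketchProcessor {
    private final String[] fields;
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceSketchProcessor(String[] fields, Pattern pattern,
                                String replacement, SketchProcessor next) {
        super(next);
        this.fields = fields;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        for (String field : fields) {
            Object value = doc.get(field);
            if (value instanceof String) {
                doc.put(field, pattern.matcher((String) value).replaceAll(replacement));
            }
        }
        super.processAdd(doc); // always pass the document on
    }
}

// Factory: parses the parameters once, then creates one processor per request.
class RegexReplaceSketchFactory {
    private final String[] fields;
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceSketchFactory(Map<String, String> params) {
        this.fields = params.get("regexreplace.fl").trim().split("[,\\s]+");
        this.pattern = Pattern.compile(params.get("regexreplace.pattern"));
        this.replacement = params.getOrDefault("regexreplace.replacement", "");
    }

    SketchProcessor getInstance(SketchProcessor next) {
        return new RegexReplaceSketchProcessor(fields, pattern, replacement, next);
    }
}
```

Keeping the regular expression handling in small methods that take and return plain values also makes the processor easy to unit test, which is the fourth point in the list.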

Web crawl with Language Detection @ Oslo University

Solr @ Oslo University
<?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[\s\u00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
    <str name="content.fl">content_no,content_en</str>
    <str name="contentType.field">tika_Content-Type</str>
    <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
  </processor>
  <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="domainOutputField">host</str>
    <str name="canonicalUrlOutputField">canonicalurl</str>
  </processor>
  <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">url</str>
    <str name="boostField">urlboost</str>
    <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
  </processor>
  <processor class="no.uio.update.processor.StaticRankProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">canonicalurl</str>
    <str name="anchorField">anchortext</str>
    <str name="rankField">docrank</str>
    <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- the slide ends here; any remaining processors are not shown -->
</updateRequestProcessorChain>

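
The later processors in this chain refer to fields such as content_no and title_en, which come from the language mapping in the first processor. The sketch below illustrates that mapping under stated assumptions: that langid.map renames each field in langid.map.fl to field_language, that a detected language outside langid.whitelist is replaced by langid.fallback, and that the language code is written to langid.langField. Detection itself and langid.threshold are not modelled; the detected language is simply passed in. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only; this is not the Solr language identifier implementation.
class LangIdMappingSketch {
    // Values copied from the "web" chain configuration.
    static final Set<String> WHITELIST = Set.of("no", "en");
    static final String FALLBACK = "no";
    static final List<String> MAP_FIELDS = List.of("title", "content");
    static final String LANG_FIELD = "language";

    // Assumption: anything outside the whitelist falls back to the default.
    static String resolveLanguage(String detected) {
        return (detected != null && WHITELIST.contains(detected)) ? detected : FALLBACK;
    }

    // Renames each mapped field to <field>_<language> and records the language.
    static Map<String, Object> mapFields(Map<String, Object> doc, String detected) {
        String lang = resolveLanguage(detected);
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : doc.entrySet()) {
            String name = entry.getKey();
            if (MAP_FIELDS.contains(name)) {
                name = name + "_" + lang;
            }
            out.put(name, entry.getValue());
        }
        out.put(LANG_FIELD, lang);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Welcome");
        doc.put("content", "Some English text");
        System.out.println(mapFields(doc, "en").keySet());
    }
}
```

With a whitelist of no,en the schema only needs the _no and _en variants of each mapped field, which matches the field lists used by the processors further down the chain.
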
Donations back to Apache

SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)

Many thanks for the donations!


Room for improvement?



Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub chains
Does not scale very well
Lacks native scripting language support




Improvements
Pain:
  Potentially expensive initialization
  StaticRankProcessor: read and parse 50,000 lines for every request

Proposed cure:
  Keep a persistent state object in the factory:
  private final Map<Object,Object> sharedObjCache;
  new StaticRankProcessor(params, request,
      response, nextProcessor, sharedObjCache);
  Processor uses sharedObjCache for state
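
A rough sketch of the proposed cure, assuming the factory lives as long as the core while processors are created per request. Only the name sharedObjCache comes from the slide; the classes, the "ranks" cache key and the rank data are hypothetical stand-ins, and no Solr types are used.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only; none of these classes are Solr's.
class SharedCacheSketch {
    // Counts how often the expensive load actually runs.
    static final AtomicInteger loads = new AtomicInteger();

    // Stand-in for reading and parsing the rank file.
    static Map<String, Double> loadRanks() {
        loads.incrementAndGet();
        Map<String, Double> ranks = new ConcurrentHashMap<>();
        ranks.put("http://example.com/", 1.5);
        return ranks;
    }

    public static void main(String[] args) {
        StaticRankSketchFactory factory = new StaticRankSketchFactory();
        // Two "requests": each gets a new processor, both share the parsed data.
        System.out.println(factory.getInstance().rankFor("http://example.com/"));
        System.out.println(factory.getInstance().rankFor("http://example.com/other"));
        System.out.println("loads: " + loads.get());
    }
}

// Long-lived factory that owns the shared state.
class StaticRankSketchFactory {
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();

    StaticRankSketchProcessor getInstance() {
        return new StaticRankSketchProcessor(sharedObjCache);
    }
}

// Short-lived processor, created for every request.
class StaticRankSketchProcessor {
    private final Map<Object, Object> sharedObjCache;

    StaticRankSketchProcessor(Map<Object, Object> sharedObjCache) {
        this.sharedObjCache = sharedObjCache;
    }

    @SuppressWarnings("unchecked")
    double rankFor(String url) {
        // Loads the ranks on first use only; later processors reuse them.
        Map<String, Double> ranks = (Map<String, Double>) sharedObjCache
                .computeIfAbsent("ranks", key -> SharedCacheSketch.loadRanks());
        return ranks.getOrDefault(url, 0.0);
    }
}
```

ConcurrentHashMap is used here because several requests may create processors at the same time; computeIfAbsent then ensures the load runs once per factory.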



Improvements
Pain:
  Multiple chains often need identical processors
  UiO’s two chains share 80% of their config -> copy/paste

Proposed cure:
  Allow sharing of named instances
  Define:
  <processor name="langid" class="..">
  Refer:
  <processor ref="langid" />
  See SOLR-2823

Improvements
Pain:
  Chains are linear only
  Hard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):
  New scriptable Update Chain - alternative to XML
  Script chain logic in solr/conf/updateproc.groovy
  Full flexibility:
  chain myChain {
    if (doc.getFieldValue("type").equals("pdf"))
      process(tikaproc)
  }
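
The same branching logic, written as plain Java for comparison. This is not the proposed Groovy syntax and not Solr code: the class, the `tikaProc` stand-in and the placeholder text are assumptions. The comparison is written as `"pdf".equals(...)` so that a document without a type field skips the branch instead of failing with a NullPointerException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a chain with one conditional step.
class ConditionalChainSketch {

    // Stand-in for process(tikaproc); a real step would extract text.
    static void tikaProc(Map<String, Object> doc) {
        doc.put("cv_text", "text-extracted-by-tika");
    }

    // Runs the extraction step only for documents whose type is "pdf".
    static Map<String, Object> myChain(Map<String, Object> input) {
        Map<String, Object> doc = new LinkedHashMap<>(input);
        if ("pdf".equals(doc.get("type"))) {
            tikaProc(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> pdf = new LinkedHashMap<>();
        pdf.put("type", "pdf");
        Map<String, Object> html = new LinkedHashMap<>();
        html.put("type", "html");
        System.out.println(myChain(pdf).containsKey("cv_text"));
        System.out.println(myChain(html).containsKey("cv_text"));
    }
}
```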


Improvements
Pain:
  Single threaded
  Heavy processing not efficient

Proposed cure:
  Local: Use multi threaded update requests
  SolrCloud: Dedicated nodes, role=“processor” ?
  Wrap an external pipeline in UpdateProcessor
    Example: OpenPipelineUpdateProcessor ?
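
The first cure (multi threaded update requests) can be sketched on the client side as follows. The assumption is that each update request runs through its own chain, so sending several batches in parallel spreads the processing over several threads. The `sendBatch` callback is a placeholder for whatever actually posts a batch to Solr; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Illustration only: feeds documents as parallel batches.
class ParallelUpdateSketch {

    // Splits docs into batches and sends each batch as its own request.
    // Returns the number of documents sent.
    static int feed(List<String> docs, int batchSize, int threads,
                    Consumer<List<String>> sendBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < docs.size(); i += batchSize) {
                List<String> batch = docs.subList(i, Math.min(i + batchSize, docs.size()));
                pending.add(pool.submit(() -> {
                    sendBatch.accept(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> result : pending) {
                sent += result.get(); // waits for the batch and surfaces failures
            }
            return sent;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("update batch failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc" + i);
        }
        int sent = feed(docs, 3, 2, batch ->
                System.out.println(Thread.currentThread().getName() + " sends " + batch));
        System.out.println("sent " + sent + " documents");
    }
}
```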




Improvements
Pain:
  Not really a “problem” :-)
  Nice to write processors in Python, Groovy, JS...

Proposed cure:
  Now: Finish SOLR-1725: Script based Processor
  Later: Make scripts first-class processors
    <processor script="myScript.py" />
    or
    <processor ref="myScript" />




One last thing...


New standalone framework?
•The UpdateChain is Solr specific
•Interest for a pure pipeline framework
  •Search engine independent
  •Scalable
  •Rich pool of processors
  •Several existing candidates
•Some initial thoughts:
  http://wiki.apache.org/solr/DocumentProcessing




Summary
•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!




Questions?
Jan Høydahl, Cominvent AS
@cominvent
www.cominvent.com
Extra


Alternative pipelines
•OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache Commons Pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitalPebble)
•FindWise and TwigKit also have some technology



Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr.

The main benefit of such a method is you can continue to feed content with SolrJ, DIH or other Update Request Handlers.


Scaling with external pipeline
Here is a more advanced, distributed case, where a Solr node is dedicated for processing, and the entry point Solr only dispatches the requests.


 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Chef
 
Innovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open InterfacesInnovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open InterfacesSteve Speicher
 
CrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef WorkshopsCrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef WorkshopsCrossref
 
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays
 
Software Engineering - RS4
Software Engineering - RS4Software Engineering - RS4
Software Engineering - RS4AtakanAral
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Apex behind the scenes
Apex behind the scenesApex behind the scenes
Apex behind the scenesEnkitec
 
OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009marpierc
 

Similaire à Improving the Solr Update Chain (20)

Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
 
Organizing the Data Chaos of Scientists
Organizing the Data Chaos of ScientistsOrganizing the Data Chaos of Scientists
Organizing the Data Chaos of Scientists
 
Python tools for testing web services over HTTP
Python tools for testing web services over HTTPPython tools for testing web services over HTTP
Python tools for testing web services over HTTP
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
 
Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015
 
Innovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open InterfacesInnovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open Interfaces
 
CrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef WorkshopsCrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef Workshops
 
Soap Toolkit Dcphp
Soap Toolkit DcphpSoap Toolkit Dcphp
Soap Toolkit Dcphp
 
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
 
Software Engineering - RS4
Software Engineering - RS4Software Engineering - RS4
Software Engineering - RS4
 
Mufix Network Programming Lecture
Mufix Network Programming LectureMufix Network Programming Lecture
Mufix Network Programming Lecture
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Apex behind the scenes
Apex behind the scenesApex behind the scenes
Apex behind the scenes
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...
Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...
Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009OGCE Overview for SciDAC 2009
OGCE Overview for SciDAC 2009
 

Plus de Cominvent AS

Solr's missing plugin ecosystem
Solr's missing plugin ecosystemSolr's missing plugin ecosystem
Solr's missing plugin ecosystemCominvent AS
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr CloudCominvent AS
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyCominvent AS
 
Dagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søkDagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søkCominvent AS
 
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan HøydahlOslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan HøydahlCominvent AS
 
Open source breakfast norge findwise
Open source breakfast norge findwiseOpen source breakfast norge findwise
Open source breakfast norge findwiseCominvent AS
 
Frokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent asFrokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent asCominvent AS
 
Migrating Fast to Solr
Migrating Fast to SolrMigrating Fast to Solr
Migrating Fast to SolrCominvent AS
 
Cominvent AS company Presentation
Cominvent AS company PresentationCominvent AS company Presentation
Cominvent AS company PresentationCominvent AS
 

Plus de Cominvent AS (9)

Solr's missing plugin ecosystem
Solr's missing plugin ecosystemSolr's missing plugin ecosystem
Solr's missing plugin ecosystem
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr Cloud
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoy
 
Dagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søkDagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søk
 
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan HøydahlOslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
 
Open source breakfast norge findwise
Open source breakfast norge findwiseOpen source breakfast norge findwise
Open source breakfast norge findwise
 
Frokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent asFrokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent as
 
Migrating Fast to Solr
Migrating Fast to SolrMigrating Fast to Solr
Migrating Fast to Solr
 
Cominvent AS company Presentation
Cominvent AS company PresentationCominvent AS company Presentation
Cominvent AS company Presentation
 

Dernier

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Dernier (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Improving the Solr Update Chain

  • 1. Improving the Solr Update Chain Jan Høydahl
  • 2. What will I cover?
      • Who is Jan Høydahl?
      • Intro to Solr’s (hidden) UpdateChain
      • How to write your own UpdateProcessors
      • Example: Web crawl @ Oslo University
      • A vision for future improvements
      • Conclusion
  • 5. Jan Høydahl
      • 1995: Developer telecom
      • 1998: Java developer
      • 2000: Search - FAST
      • 2006: Lucene
      • 2007: Cominvent
      • 2011: Lucene committer
      • > 100 projects
  • 6. Cominvent AS
      • Consulting & support: Lucene/Solr, FAST
      • www.solrtraining.com
  • 7. Why document processing?
      • Analysis is field-oriented
      • Filters only see the “local” field
  • 8. Why document processing?
      • But what if you want to:
          • Add or remove fields?
          • Make decisions based on other fields?
      • We need a way to modify the Document
  • 9. Why document processing? [Diagram: Doc1 with fields name, postcode, cv_pdf_url; query "programmer near Barcelona"]
  • 10. Why document processing? [Diagram: Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text; query "programmer near Barcelona"]
  • 11. Why document processing? [Diagram: Client; Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text]
  • 12. Why document processing? [Diagram: Client; 3rd party pipeline; Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text]
  • 16. The Update Chain
      [Diagram] Doc (name, postcode, cv_pdf_url) -> PostcodeToLatLongProcessor -> Doc (name, postcode, latlong, cv_pdf_url)
  • 17. The Update Chain
      [Diagram] ... -> UrlFetcherProcessor -> Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin)
  • 18. The Update Chain
      [Diagram] ... -> TikaExtractingProcessor -> Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text)
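The chain pictured on these slides can be modelled in a few lines of plain Java. This is a minimal sketch, not Solr's actual API: `SketchProcessor`, the three processor classes, and the hard-coded values are hypothetical stand-ins for illustration. Only the field names come from the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of a linear update chain. All class names are hypothetical;
// real Solr processors extend UpdateRequestProcessor instead.
final class UpdateChainSketch {

    // A processor may modify the document, then hands it to the next one.
    abstract static class SketchProcessor {
        private final SketchProcessor next;

        SketchProcessor(SketchProcessor next) {
            this.next = next;
        }

        abstract void modify(Map<String, Object> doc);

        final void processAdd(Map<String, Object> doc) {
            modify(doc);
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    static final class PostcodeToLatLong extends SketchProcessor {
        PostcodeToLatLong(SketchProcessor next) { super(next); }

        @Override
        void modify(Map<String, Object> doc) {
            // Assumption: a real processor would look the postcode up in a table.
            if (doc.containsKey("postcode")) {
                doc.put("latlong", "41.39,2.17");
            }
        }
    }

    static final class UrlFetcher extends SketchProcessor {
        UrlFetcher(SketchProcessor next) { super(next); }

        @Override
        void modify(Map<String, Object> doc) {
            // Assumption: no network access here, only an empty placeholder.
            if (doc.containsKey("cv_pdf_url")) {
                doc.put("cv_pdf_bin", new byte[0]);
            }
        }
    }

    static final class TextExtractor extends SketchProcessor {
        TextExtractor(SketchProcessor next) { super(next); }

        @Override
        void modify(Map<String, Object> doc) {
            // Assumption: stands in for Tika-based text extraction.
            if (doc.containsKey("cv_pdf_bin")) {
                doc.put("cv_text", "placeholder for extracted text");
            }
        }
    }

    // Copies the input and runs the copy through the three-step chain.
    static Map<String, Object> run(Map<String, Object> input) {
        Map<String, Object> doc = new LinkedHashMap<>(input);
        SketchProcessor chain =
                new PostcodeToLatLong(new UrlFetcher(new TextExtractor(null)));
        chain.processAdd(doc);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        System.out.println(run(doc).keySet());
    }
}
```

Each step sees the whole document rather than a single field, which is the difference from field-level analysis that the earlier slides point out.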
  • 21. How it’s wired
      • Chain definition in solrconfig.xml
      • Choose chain in your update request: .../solr/update/xml?..&update.chain=cv-chain
  • 23. Other examples: Entity extraction
      • Sample text: "The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999."
      • Entity labels: Company, Location, Date
  • 25. Writing your own processor
  • 29. Writing your own processor
      • Make generic processors - parameterized
      • Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
      • Prefix param names to avoid name clashes
      • Testing and testable methods
      • Donate back to Apache & document on Wiki
  • 30. Web crawl with Language Detection @ Oslo University
  • 31. Solr @ Oslo University
  • 32. Solr @ Oslo University
        <?xml version="1.0"?>
        <updateRequestProcessorChain name="web">
          <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
            <lst name="defaults">
              <str name="langid.fl">title,content</str>
              <str name="langid.idField">url</str>
              <str name="langid.fallbackFields"></str>
              <str name="langid.fallback">no</str>
              <str name="langid.whitelist">no,en</str>
              <str name="langid.langField">language</str>
              <str name="langid.langsField">languages</str>
              <bool name="langid.overwrite">true</bool>
              <bool name="langid.map">true</bool>
              <str name="langid.map.fl">title,content</str>
              <double name="langid.threshold">0.5</double>
            </lst>
          </processor>
          <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="fl">content_no content_en</str>
            <str name="pattern">[\s\u00A0]+</str>
            <str name="replacement"> </str>
          </processor>
          <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="title.fl">title_no,title_en</str>
            <str name="content.fl">content_no,content_en</str>
  • 35. Solr @ Oslo University (continued)
            <str name="contentType.field">tika_Content-Type</str>
            <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
          </processor>
          <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="domainOutputField">host</str>
            <str name="canonicalUrlOutputField">canonicalurl</str>
          </processor>
          <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="inputField">url</str>
            <str name="boostField">urlboost</str>
            <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
          </processor>
          <processor class="no.uio.update.processor.StaticRankProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="inputField">canonicalurl</str>
            <str name="anchorField">anchortext</str>
            <str name="rankField">docrank</str>
            <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory"/>
  • 39. Donations back to Apache
      • SOLR-2599: FieldCopyProcessor
      • SOLR-2825: RegexReplaceProcessor
      • SOLR-2826: URLClassifyProcessor
      • SOLR-2827: RegexpBoostProcessor
      • SOLR-2828: StaticRankProcessor
      • Binary Document Dumper (?)
      • Many thanks for the donations!
  • 43. Improvements •Processors are re-created for every request •Duplication of config between chains •No support for non-linear chains or sub chains •Does not scale very well •Lacks native scripting language support
  • 44. Improvements •Pain: Potentially expensive initialization (StaticRankProcessor: read and parse 50,000 lines) •Proposed cure: Keep a persistent state object in the factory: private final Map<Object,Object> sharedObjCache and pass it to every new processor: new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache); the processor then uses sharedObjCache for its state
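The proposed cure can be sketched without any Solr dependency. In this minimal sketch the factory owns a `sharedObjCache` and hands it to each short-lived processor, so the expensive load runs once per factory rather than once per request. All class names, the `"ranks"` cache key and the example URL are hypothetical stand-ins, not Solr's API or the UiO implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedStateSketch {
    /** Counts how often the expensive load runs; for demonstration only. */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Hypothetical expensive initialization, standing in for parsing a large rank file. */
    static Map<String, Double> loadRanks() {
        LOADS.incrementAndGet();
        return Map.of("http://example.org/", 1.0);
    }

    /** Stand-in for a per-request processor; this is not Solr's UpdateRequestProcessor. */
    static final class RankProcessor {
        private final Map<Object, Object> sharedObjCache;

        RankProcessor(Map<Object, Object> sharedObjCache) {
            this.sharedObjCache = sharedObjCache;
        }

        @SuppressWarnings("unchecked")
        double rankFor(String url) {
            // Load once; later processors created by the same factory reuse the result.
            Map<String, Double> ranks = (Map<String, Double>)
                    sharedObjCache.computeIfAbsent("ranks", key -> loadRanks());
            return ranks.getOrDefault(url, 0.0);
        }
    }

    /** Stand-in for the factory, which outlives individual requests and owns the cache. */
    static final class RankProcessorFactory {
        private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();

        RankProcessor getInstance() {
            return new RankProcessor(sharedObjCache);
        }
    }

    public static void main(String[] args) {
        RankProcessorFactory factory = new RankProcessorFactory();
        factory.getInstance().rankFor("http://example.org/");
        factory.getInstance().rankFor("http://example.org/other");
        System.out.println("loads = " + LOADS.get());
    }
}
```

A concurrent map is used here because several requests may create processors at the same time; whether that matches the proposal's intent is an assumption.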
  • 46. Improvements •Pain: Multiple chains often need identical processors; UiO's two chains share 80% of their config, which means copy/paste •Proposed cure: Allow sharing of named instances. Define: <processor name="langid" class=".."> Refer: <processor ref="langid" /> See SOLR-2823
  • 48. Improvements •Pain: Chains are linear only; branching, sub chains and conditional processing are hard to do •Proposed cure (SOLR-2841): A new scriptable Update Chain as an alternative to XML; script the chain logic in solr/conf/updateproc.groovy for full flexibility: chain myChain { if(doc.getFieldValue("type").equals("pdf")) process(tikaproc) }
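The idea of a non-linear chain can be illustrated in plain Java, independent of the Groovy syntax proposed on the slide. The sketch below guards each step with a condition on the document, so a step such as text extraction runs only for PDF documents. The class `ConditionalChainSketch`, its `when`/`always` methods and the field names are hypothetical; documents are modelled as simple maps rather than Solr input documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

final class ConditionalChainSketch {
    /** One step: an action guarded by a condition on the document. */
    private static final class Step {
        final Predicate<Map<String, Object>> condition;
        final Consumer<Map<String, Object>> action;

        Step(Predicate<Map<String, Object>> condition, Consumer<Map<String, Object>> action) {
            this.condition = condition;
            this.action = action;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    /** Adds a step that runs only when the condition holds for the document. */
    ConditionalChainSketch when(Predicate<Map<String, Object>> condition,
                                Consumer<Map<String, Object>> action) {
        steps.add(new Step(condition, action));
        return this;
    }

    /** Adds a step that runs for every document. */
    ConditionalChainSketch always(Consumer<Map<String, Object>> action) {
        return when(doc -> true, action);
    }

    /** Runs the steps in order, skipping those whose condition is false. */
    void process(Map<String, Object> doc) {
        for (Step step : steps) {
            if (step.condition.test(doc)) {
                step.action.accept(doc);
            }
        }
    }

    public static void main(String[] args) {
        ConditionalChainSketch chain = new ConditionalChainSketch()
                .when(doc -> "pdf".equals(doc.get("type")), doc -> doc.put("extracted", true))
                .always(doc -> doc.put("processed", true));
        Map<String, Object> pdf = new HashMap<>();
        pdf.put("type", "pdf");
        chain.process(pdf);
        System.out.println(pdf);
    }
}
```

Writing the test as `"pdf".equals(...)` avoids a null pointer when a document has no `type` field.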
  • 50. Improvements •Pain: Single threaded; heavy processing is not efficient •Proposed cure: Local: use multi-threaded update requests. SolrCloud: dedicated nodes, role="processor"? Or wrap an external pipeline in an UpdateProcessor, for example an OpenPipelineUpdateProcessor (?)
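The "multi-threaded update requests" option can be sketched on the client side: split the documents into batches and send the batches from a small thread pool, so several update requests are processed at once. This is a minimal sketch under stated assumptions; `ParallelFeederSketch`, `feed` and the `sender` callback are hypothetical, and `sender` stands in for whatever actually posts a batch to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ParallelFeederSketch {
    /** Sends docs in batches from a fixed-size thread pool; returns the number of docs sent. */
    static int feed(List<String> docs, int batchSize, int threads, Consumer<List<String>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += batchSize) {
                List<String> batch = docs.subList(start, Math.min(start + batchSize, docs.size()));
                // Each batch becomes one update request, run on a pool thread.
                pending.add(pool.submit(() -> {
                    sender.accept(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> result : pending) {
                sent += result.get(); // waits for the batch and surfaces its failure, if any
            }
            return sent;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while feeding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = List.of("doc1", "doc2", "doc3", "doc4", "doc5");
        int sent = feed(docs, 2, 2, batch -> System.out.println("sending " + batch));
        System.out.println("sent = " + sent);
    }
}
```

Because each request passes through the chain on its own thread, processors that share state would need to be thread-safe.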
  • 52. Improvements •Pain: Not really a “problem” :-) but it would be nice to write processors in Python, Groovy, JS... •Proposed cure: Now: finish SOLR-1725 (script-based processor). Later: make scripts first-class processors: <processor script="myScript.py" /> or <processor ref="myScript" />
  • 54. New standalone framework? •The UpdateChain is Solr specific •There is interest in a pure pipeline framework •Search engine independent •Scalable •Rich pool of processors •Several existing candidates •Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing
  • 56. Summary •Document centric vs field centric processing •UpdateChain is there - use it! •Works well for most “light” cases •Scaling issues, but caching config may help •More processors welcome!
  • 57. Questions? Jan Høydahl, Cominvent AS @cominvent www.cominvent.com
  • 59. Alternative pipelines •OpenPipeline (Dieselpoint) •OpenPipe (T-Rank, now on GitHub) •Pypes (ESR) •UIMA (Apache) •Eclipse SMILA •Apache Commons Pipeline •Piped (FoundIT, Norway) •Behemoth (DigitalPebble) •Findwise and TwigKit also have some technology
  • 60. Calling out from UpdateChain: Calling out from the UpdateChain to an external pipeline is one way to integrate such a system with Solr. The main benefit is that you can continue to feed content with SolrJ, DIH or other Update Request Handlers.
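A minimal sketch of this integration style: one link in the chain hands each document to an external pipeline and then passes the result on to the next link, so feeding clients are unaffected. The `Processor` interface, `PipelineCallingProcessor` and the `latlong` value are hypothetical stand-ins; documents are modelled as maps, and the external pipeline is any function from document to document.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

final class ExternalPipelineSketch {
    /** Stand-in for one link in an update chain; not Solr's UpdateRequestProcessor. */
    interface Processor {
        void processAdd(Map<String, Object> doc);
    }

    /** Sends each document through an external pipeline, then continues down the chain. */
    static final class PipelineCallingProcessor implements Processor {
        private final UnaryOperator<Map<String, Object>> pipeline;
        private final Processor next;

        PipelineCallingProcessor(UnaryOperator<Map<String, Object>> pipeline, Processor next) {
            this.pipeline = pipeline;
            this.next = next;
        }

        @Override
        public void processAdd(Map<String, Object> doc) {
            Map<String, Object> enriched = pipeline.apply(doc);
            if (next != null) {
                next.processAdd(enriched);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical external pipeline that adds a latlong field (illustrative value).
        UnaryOperator<Map<String, Object>> pipeline = doc -> {
            Map<String, Object> out = new HashMap<>(doc);
            out.put("latlong", "0.0,0.0");
            return out;
        };
        Processor chain = new PipelineCallingProcessor(pipeline, doc -> System.out.println(doc));
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "Doc1");
        chain.processAdd(doc);
    }
}
```

In a real deployment the `pipeline` function would wrap a remote or embedded call, which is where latency and error handling would need attention.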
  • 61. Scaling with external pipeline: A more advanced, distributed case, where one Solr node is dedicated to processing and the entry-point Solr node only dispatches the requests.
