See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Solr features a little-known internal document processing pipeline called the UpdateRequestProcessorChain, or simply the UpdateChain.
In this talk we'll discuss the importance of document processing, when the UpdateChain works well, and what its limitations are. We'll then go on to propose a range of possible improvements.
Topics include:
Examples of use with demo
How to write your own UpdateProcessor, best practices
Example: Tika as an UpdateProcessor
A vision for future improvements
2. What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion
8. Why document processing?
But what if you want to:
Add or remove fields?
Make decisions based on other fields?
We need a way to modify the Document
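A minimal sketch of what "modifying the Document" can mean, assuming a document modelled as a plain field map. The class name and the field names (url, category, tmp_raw) are hypothetical and not part of Solr:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: the document is a plain field map,
// not Solr's own document class.
class DocumentTweaks {

    // Adds a "category" field based on the value of another field ("url")
    // and removes a temporary field that should not be indexed.
    static Map<String, Object> process(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        Object url = out.get("url");
        if (url != null && url.toString().endsWith(".pdf")) {
            out.put("category", "pdf");   // decision based on another field
        } else {
            out.put("category", "web");
        }
        out.remove("tmp_raw");            // remove a field
        return out;
    }
}
```

The point is that the whole document is visible at once, so one field can drive what happens to another.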
23. Other examples
Entity extraction (Company, Location, Date):
The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.
29. Writing your own processor
•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
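The practices above could look roughly like this. This is a simplified sketch: RegexCleanProcessor, its "regexclean." prefix and its parameters are hypothetical, and the class deliberately does not extend Solr's processor base class so that it stays self-contained:

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical, simplified processor used only to illustrate the list above.
class RegexCleanProcessor {
    // Param names share a prefix so they cannot clash with other processors.
    static final String PREFIX = "regexclean.";

    private final String field;
    private final Pattern pattern;
    private final String replacement;

    // Generic: field, pattern and replacement all come from parameters.
    RegexCleanProcessor(Map<String, String> params) {
        this.field = params.getOrDefault(PREFIX + "field", "title");
        this.pattern = Pattern.compile(params.getOrDefault(PREFIX + "pattern", "\\s+"));
        this.replacement = params.getOrDefault(PREFIX + "replacement", " ");
    }

    // Pure helper with no request or core dependencies, so it is easy to unit test.
    String clean(String value) {
        return pattern.matcher(value).replaceAll(replacement).trim();
    }

    // Applies the helper to the configured field if it holds a string.
    void process(Map<String, Object> doc) {
        Object value = doc.get(field);
        if (value instanceof String) {
            doc.put(field, clean((String) value));
        }
    }
}
```

Keeping the real work in a small method like clean() means most tests need neither a running Solr nor a request object.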
39. Donations back to Apache
SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)
Many thanks for the donations!
43. Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear or sub chains
Does not scale very well
Lacks native scripting language support
44. Improvements
Pain:
Potentially expensive initialization
StaticRankProcessor: read & parse 50,000 lines
Proposed cure:
Keep persistent state object in factory:
private final Map<Object,Object> sharedObjCache
new StaticRankProcessor(params, request,
response, nextProcessor, sharedObjCache);
Processor uses sharedObjCache for state
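A sketch of the proposed cure, under assumptions: the classes below are simplified stand-ins (they do not use Solr's factory or processor base classes), the tab-separated "url, rank" line format is invented for illustration, and the cache key "ranks" is hypothetical. The factory parses the data once and hands the shared cache to every short-lived processor:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a per-request processor: cheap to create, reads shared state.
class StaticRankProcessorSketch {
    private final Map<Object, Object> sharedObjCache;

    StaticRankProcessorSketch(Map<Object, Object> sharedObjCache) {
        this.sharedObjCache = sharedObjCache;
    }

    @SuppressWarnings("unchecked")
    double rankFor(String url) {
        Map<String, Double> ranks = (Map<String, Double>) sharedObjCache.get("ranks");
        return ranks.getOrDefault(url, 0.0);
    }
}

// Stand-in for the long-lived factory that owns the persistent state.
class StaticRankFactorySketch {
    final AtomicInteger loads = new AtomicInteger();   // counts expensive parses
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
    private final List<String> lines;

    StaticRankFactorySketch(List<String> lines) {
        this.lines = lines;
    }

    // Called once per request; the expensive parse only runs the first time.
    StaticRankProcessorSketch getInstance() {
        sharedObjCache.computeIfAbsent("ranks", key -> parse());
        return new StaticRankProcessorSketch(sharedObjCache);
    }

    // Assumed format: "<url>\t<rank>" per line.
    private Map<String, Double> parse() {
        loads.incrementAndGet();
        Map<String, Double> ranks = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            ranks.put(parts[0], Double.parseDouble(parts[1]));
        }
        return ranks;
    }
}
```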
46. Improvements
Pain:
Multiple chains often need identical processors
UiO’s two chains share 80% of their processors -> copy/paste
Proposed cure:
Allow sharing of named instances
Define:
<processor name="langid" class="..">
Refer:
<processor ref="langid" />
See SOLR-2823
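One way to picture the define/refer idea is a small registry that resolves refs to shared instances. This is only a sketch of the concept: DocProcessor and ProcessorRegistrySketch are hypothetical names, and nothing here reflects how SOLR-2823 is or would be implemented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal processor contract for this sketch.
interface DocProcessor {
    void process(Map<String, Object> doc);
}

class ProcessorRegistrySketch {
    private final Map<String, DocProcessor> named = new HashMap<>();

    // "Define": register a processor once under a name.
    void define(String name, DocProcessor processor) {
        if (named.putIfAbsent(name, processor) != null) {
            throw new IllegalArgumentException("Duplicate processor name: " + name);
        }
    }

    // "Refer": build a chain from names, reusing the registered instances.
    List<DocProcessor> buildChain(String... refs) {
        List<DocProcessor> chain = new ArrayList<>();
        for (String ref : refs) {
            DocProcessor processor = named.get(ref);
            if (processor == null) {
                throw new IllegalArgumentException("Unknown processor ref: " + ref);
            }
            chain.add(processor);
        }
        return chain;
    }
}
```

Two chains built with the same ref then share one configured instance instead of two copy/pasted definitions.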
48. Improvements
Pain:
Chains are linear only
Hard to do branching, sub chains, conditional...
Proposed cure (SOLR-2841):
New scriptable Update Chain - alternative to XML
Script chain logic in solr/conf/updateproc.groovy
Full flexibility:
chain myChain {
if(doc.getFieldValue("type").equals("pdf"))
process(tikaproc)
}
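For comparison, the same branching logic written as a plain Java sketch. The document is modelled as a field map and tikaproc as a generic callback; both are simplifications for illustration, not the proposed script API:

```java
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a conditional chain.
class ConditionalChainSketch {
    private final Consumer<Map<String, Object>> tikaproc;

    ConditionalChainSketch(Consumer<Map<String, Object>> tikaproc) {
        this.tikaproc = tikaproc;
    }

    // Only PDF documents are sent through the (expensive) Tika step.
    void myChain(Map<String, Object> doc) {
        // "pdf".equals(...) also copes with documents that have no "type" field.
        if ("pdf".equals(doc.get("type"))) {
            tikaproc.accept(doc);
        }
    }
}
```

A linear XML chain cannot express this "if" without pushing the condition into every processor.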
50. Improvements
Pain:
Single threaded
Heavy processing not efficient
Proposed cure:
Local: Use multi threaded update requests
SolrCloud: Dedicated nodes, role=“processor” ?
Wrap an external pipeline in UpdateProcessor
Example: OpenPipelineUpdateProcessor ?
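The "local" cure can be sketched on the client side: split the documents into batches and send several update requests in parallel, so that several chains run at once. The sender function below is a hypothetical stand-in for whatever actually posts one batch; batch size and thread count are arbitrary:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelFeederSketch {

    // Sends each batch on a worker thread and returns the total number of
    // documents the sender reported as accepted.
    static <D> int feed(List<D> docs, int batchSize, int threads,
                        Function<List<D>, Integer> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < docs.size(); i += batchSize) {
                int end = Math.min(i + batchSize, docs.size());
                List<D> batch = new ArrayList<>(docs.subList(i, end));
                Callable<Integer> task = () -> sender.apply(batch);
                pending.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : pending) {
                total += result.get();   // waits for the batch to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while feeding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each request still runs its chain single threaded on the server; the gain comes from having several requests in flight.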
52. Improvements
Pain:
Not really a “problem” :-)
Nice to write processors in Python, Groovy, JS...
Proposed cure:
Now: Finish SOLR-1725: Script based Processor
Later: Make scripts first-class processors
<processor script="myScript.py" />
or
<processor ref="myScript" />
54. New standalone framework?
•The UpdateChain is Solr specific
•Interest in a pure pipeline framework
•Search engine independent
•Scalable
•Rich pool of processors
•Several existing candidates
•Some initial thoughts:
http://wiki.apache.org/solr/DocumentProcessing
56. Summary
•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!
59. Alternative pipelines
•OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache Commons Pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitalPebble)
•FindWise and TwigKit also have some technology
60. Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr. The main benefit of such a method is that you can continue to feed content with SolrJ, DIH or other Update Request Handlers.
61. Scaling with external pipeline
Here is a more advanced, distributed case, where a Solr node is dedicated to processing, and the entry point Solr only dispatches the requests.