Presented during Lucene EuroCon 2010 in Prague. This presentation assumes no prior experience with FAST ESP, but some idea of what Solr/Lucene is. It gives you some hints on what to expect when migrating.
5. Cominvent AS: Training
Certified Solr Training Partner with Lucid Imagination
Certified FAST ESP Training Partner
Apache Lucene EuroCon 05/21/10
Photo: fluidpowerzone.com
7. Assumptions
The decision to migrate to Solr has already been made
This is not a "sales talk" for any particular technology
Basic knowledge of Solr
None or limited knowledge of FAST ESP
Migration to plain Solr or LucidWorks
(LucidWorks Enterprise edition not considered)
12. Very strong & scalable document processing framework
[Figure: document processing pipeline. Stages: Format Conversion, Language Detection, Linguistic Normalization, Entities, Taxonomy, Sentiment, Ontology, Custom Plug-in. Outputs: Search, Alert. Sample input: a Reuters news item about Venus Williams at the French Open.]
13. FAST Document Processors (DP)
DPs transform documents prior to indexing
This is different from Solr's field-centric analysis
Examples of stages:
Encoding normalization, language identification
Text extraction (HTML, PDF, MS Office, etc.)
Tokenization, lemmatization, entity extraction
DPs are chained in pipelines
ESP ships with lots of useful DPs and pipelines
Written in Python, very easy to script new ones
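To make the idea of chained stages concrete, here is a minimal Python sketch of a document passing through a pipeline. It does not use the FAST ESP API: the dict-based document, the two stage functions and run_pipeline are hypothetical names invented for this illustration.

```python
import re

def normalize_whitespace(doc):
    """Hypothetical stage: collapse runs of whitespace in the body."""
    doc["body"] = re.sub(r"\s+", " ", doc["body"]).strip()
    return doc

def tokenize(doc):
    """Hypothetical stage: naive lowercase word tokenization."""
    doc["tokens"] = re.findall(r"\w+", doc["body"].lower())
    return doc

def run_pipeline(doc, stages):
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc

# The pipeline is just an ordered list of stages.
pipeline = [normalize_whitespace, tokenize]
result = run_pipeline({"body": "Venus  Williams raced\ninto the second round"}, pipeline)
```

Each stage sees the whole document, not a single field, which is the document-centric model the slide contrasts with Solr's per-field analysis.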
14. Terminology
Lucene/Solr | FAST
Replica | Search row
Shard | Column
Facet | Navigator
Spellcheck | Did you mean
Update processor | Document processor
Request Handler | Query Transformer (QT)
Response Writer | Result Processor (RP)/TWM
15. Terminology
Lucene/Solr | FAST
Schema | Index profile
Index segment | Index partition
Lucene IndexWriter/IndexReader | indexer/fsearch (RTS)
~Multi core | ~Multi cluster
(Documents receiving same processing) | Collection
16. Important differences
Lucene/Solr | FAST
Most features query-time | Most features index-time
Field centric analysis | Document centric analysis
One language per field | Multi lingual fields
One Update handler per input type (XML, CSV) | Format conversion in document pipeline
Slim disk & memory footprint | Quite fat disk & memory footprint
One Java Web app | 15-20 processes
17. Solr Architecture
Thanks to Christian Moen/ATILIKA for graphics
19. Steps of the migration
Review current features & architecture
Keep all features? Add new?
Install Solr and do a quick iteration (1-2 days):
Draft schema.xml & solrconfig.xml
Dump & index some real data
Play around with queries – Solritas is nice here
Design spec covering all migration areas:
Schema, Content, Feeding & Analysis
Frontends, Querying & API
Admin & Operational
Implement :)
21. Migrating index profile -> Solr schema
ESP index profile -> Solr schema.xml
FAST fields example:
Solr equivalent:
Example: A field with "tokenize=auto" in FAST → type="text"
Create new <fieldType>s as needed
22. Product facets & generic fields
With FAST you often use «generic1», «generic2» etc. to model product facets which may vary between product groups. Front ends need logic to convert.
23. Product facets & generic fields
With Solr, using dynamic fields, each document can have as many facets as you like.
This makes it easy to, e.g., introduce a new «color» facet for cars or a «MegaPixels» facet for digital cameras
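As a rough sketch of how per-product facets could be fed through dynamic fields, the Python below builds a Solr XML update message in which every facet gets a suffixed field name. The "_facet" suffix, the build_add_xml helper and the sample values are assumptions for this example; the suffix would have to match a <dynamicField> pattern declared in your own schema.xml.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc_id, facets, suffix="_facet"):
    """Build an <add><doc> message; each facet becomes a dynamic field.

    The suffix is an assumed naming convention, not a Solr default.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id
    for name, value in sorted(facets.items()):
        field = ET.SubElement(doc, "field", name=name + suffix)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# A camera gets a megapixels facet without any schema change.
camera_xml = build_add_xml("cam-1", {"megapixels": 12, "color": "black"})
```

A car document could pass a different set of facet names to the same helper, which is the flexibility the generic1/generic2 approach lacks.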
24. Composite fields -> DisMax ReqHandler
FAST uses composite fields to search across multiple fields, with weighting defined in Rank Profiles
FAST's composite fields & rank profiles can be modelled as Solr «DisMax» queries
Set suitable defaults in solrconfig.xml using named requestHandler instances.
In case of many fields & performance issues, use <copyField> to group similarly ranked fields!
Freshness boost, GEO boost etc. are handled through Function Queries
25. Composite fields -> DisMax ReqHandler
Given a FAST composite field / Rank Profile
26. Composite fields -> DisMax ReqHandler
This Solr query will do the same, configurable per query:
qt=dismax
q=oslo
qf=title^5.0 teaser^1.5 body^0.1
bf=recip(rord(last_modified),1,1000,1000)
...
DisjunctionMaxQuery((teaser:foo^1.5 ||title:foo^5.0 ||body:foo^0.1)~0.01)
DisjunctionMaxQuery((teaser:bar^1.5 ||title:bar^5.0 ||body:bar^0.1)~0.01)
FunctionQuery(1000.0/(1.0*float(top(rord(last_modified)))
...
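The DisMax request parameters on this slide can be assembled per query from the front end. The following is a minimal Python sketch using only the standard library; dismax_params is a hypothetical helper name, and the result is just the URL-encoded query string, to be appended to whatever select URL your Solr instance exposes.

```python
from urllib.parse import urlencode, parse_qs

def dismax_params(user_query, field_boosts, boost_function=None):
    """Return a URL-encoded DisMax query string.

    field_boosts is a list of (field, weight) pairs, playing the role
    of the weights a FAST rank profile gives a composite field.
    """
    params = {
        "qt": "dismax",
        "q": user_query,
        "qf": " ".join(f"{field}^{boost}" for field, boost in field_boosts),
    }
    if boost_function:
        params["bf"] = boost_function
    return urlencode(params)

query_string = dismax_params(
    "oslo",
    [("title", 5.0), ("teaser", 1.5), ("body", 0.1)],
    boost_function="recip(rord(last_modified),1,1000,1000)",
)
```

Keeping the weights in application code like this is one option; the other is to set them as defaults on a named request handler and send only q.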
27. Static document boosts
FAST uses the «hwboost» field to add a static quality boost to each document.
In Solr, you have more flexibility:
Add a boost to each document
<doc boost="10.0">
Add a boost to each field
<field name="title" boost="10.0">
Include any numeric document field in a BoostFunction
bf=sum(sqrt(popularity)^100.0, staticboost^20.0)
28. Navigator statistics
FAST navigators provide statistics metadata (min/max/avg/sum)
Solution: Use the StatsComponent
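A request that asks the StatsComponent for statistics sets stats=true and names each numeric field with stats.field. The Python sketch below only builds that query string; the stats_params helper and the field names price and megapixels are assumptions for the example.

```python
from urllib.parse import urlencode

def stats_params(user_query, numeric_fields):
    """Build a query string requesting statistics for the given fields."""
    params = [("q", user_query), ("stats", "true")]
    # stats.field may be repeated, once per field.
    params += [("stats.field", field) for field in numeric_fields]
    return urlencode(params)

stats_query = stats_params("camera", ["price", "megapixels"])
```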
29. Navigator auto-buckets
FAST numeric navigators give auto-bucketing based on equal-frequency, equal-width, manual
Solution:
Create a new field which is pre-computed
Example: Document A has price=200.000, add pricerange="150.000 – 1.299.999"
Or use facet queries (expensive)
Or implement auto-bucketing and contribute the patch :-)
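The pre-computed field option can be handled at feeding time. Below is a minimal Python sketch that maps a price to a range label before the document is sent to Solr. The boundaries, the label format and the price_range name are all hypothetical; in practice they would come from your own bucket definitions.

```python
import bisect

# Hypothetical bucket boundaries; each value is the inclusive lower
# bound of the next bucket.
BOUNDARIES = [150_000, 1_300_000, 5_000_000]
LABELS = ["0 - 149999", "150000 - 1299999", "1300000 - 4999999", "5000000+"]

def price_range(price):
    """Return the label of the bucket that contains the given price."""
    return LABELS[bisect.bisect_right(BOUNDARIES, price)]

# Enrich a document with the pre-computed facet value before feeding.
doc = {"id": "A", "price": 200_000}
doc["pricerange"] = price_range(doc["price"])
```

The trade-off is that changing the buckets requires re-feeding the documents, whereas facet queries can be changed per request at a higher query cost.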
30. XRANK
FAST has a feature to boost documents satisfying an "XRANK" sub-query with a certain static boost
In Solr, you can solve most XRANK use cases using FunctionQueries
31. Scope search
FAST offers a field type which holds arbitrary XML
Search in XPath-style:
xml:companies:company:and(revenue:>1000, employees:>=100)
Have not found a similar field type in Lucene.
Anyone?
32. Migrating Connectors
FAST's connectors are many and mature
For simple use cases, consider Solr's DIH:
Supports DB, RSS, Web-services, Local filesystem
Additionally through Lucene Connectors Framework:
EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
New connectors should be written for LCF and be submitted back to the community :)
33. Migrating Web Crawler
FAST's crawler is mature, performing & scalable
Solr has no built-in web crawler
Prepare for a lot of extra work migrating crawler
Alternatives:
The Apache Nutch crawler (steep learning curve)
Apache Droids
Heritrix + Solr (example in Solr 1.4 book)
OpenPipeline has a (very) simple crawler
34. Migrating document processing
Solr lacks a sophisticated processing pipeline.
Alternatives:
Solr's UpdateProcessorChain for simple pipelines:
Write a Solr UpdateProcessor (in Java, Jython etc, see SOLR-1725)
OpenPipeline for more advanced requirements:
Check out FindWise's talk
Integrated with Solr
LingPipe NamedEntityExtractor plugin
35. Document processing examples
Binary documents with metadata
Actual customer request: Enrich library records with PDF content
Use OpenPipeline with Apache Tika processor
Implement Tika as an UpdateRequestProcessor (SOLR-1763)
Custom XML using FAST's XMLMapper
DIH's built-in XPath support
XSLT to Solr input XML
Write a new XMLMapper Update Request Handler?
36. Multi lingual
FAST is state of the art on linguistics
FAST is language aware, e.g. the title field is "analyzed" depending on detected language
Solr is not language aware
Each field type has one and only one language
Most common solution:
One field type per language: text_no, text_en, text_de
Dynamic fields: <dynamicField name="*_en" type="text_en"..../>
Implement language awareness in application layer (feeding + querying)
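Language awareness in the application layer mostly means choosing the right language-suffixed field name when feeding. A minimal Python sketch follows; the SUPPORTED set, the fallback language and both helper names are assumptions for this example, and the detected language is taken as given because language detection itself is outside the sketch.

```python
# Languages for which a text_<lang> field type and a matching
# dynamic field pattern are assumed to exist in schema.xml.
SUPPORTED = {"no", "en", "de"}

def localized_field(base, lang, fallback="en"):
    """Return the language-specific field name, e.g. title_no."""
    if lang not in SUPPORTED:
        lang = fallback
    return f"{base}_{lang}"

def localize_doc(doc, lang, text_fields=("title", "body")):
    """Rename text fields so they hit the analysis chain for lang."""
    out = {}
    for name, value in doc.items():
        key = localized_field(name, lang) if name in text_fields else name
        out[key] = value
    return out

fed = localize_doc({"id": "42", "title": "Hei verden"}, "no")
```

The same localized_field helper can be reused on the query side to pick the field to search once the query language is known.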
37. Multi lingual – advanced
FAST ships with Lemmatization for most languages
Solr ships with Stemming – has limitations
Solutions for multi lingual needs:
KStem is tighter. Free with
License 3rd party linguistics
Example:
BasisTech Rosette Linguistic Platform
Lemmatization, POS etc.
38. Multi lingual – very advanced
FAST allows lemmatization by index expansion
This can be useful if your frontend does not know what languages are being queried, as all the word inflections are stored in the index.
There is no solution for this in Solr today.
Workaround: DisMax query spanning all languages:
q=eurocon&qf=text_en^2.0 text_no text_de text_it
Downside: This gets ugly and slow with increasing number of languages
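The qf value for the workaround can be generated from the list of indexed languages, so that adding a language does not mean editing every query. A small Python sketch, assuming one text_<lang> field per language as on the slide; multilingual_qf and its parameters are hypothetical names.

```python
def multilingual_qf(languages, preferred=None, preferred_boost=2.0):
    """Build a DisMax qf string spanning one text field per language.

    Only the preferred language, if any, gets an explicit boost.
    """
    parts = []
    for lang in languages:
        field = f"text_{lang}"
        if lang == preferred:
            field = f"{field}^{preferred_boost}"
        parts.append(field)
    return " ".join(parts)

qf = multilingual_qf(["en", "no", "de", "it"], preferred="en")
```

The generated string grows with every language, which is the downside the slide points out.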
39. Migrating Front ends / Query
Using a search middleware with Solr support? Lucky you!
If not, consider introducing one now:
Using FAST Java/.NET APIs?
Choose SolrJ or SolrNET/SolrSharp
Query language differences. &fq= instead of filter()
Solr facets do not require session/state as FAST's do
40. Result views
FAST uses "result-view" and "search profile" to specify what fields to return.
Migrate FAST's «views» into named RequestHandler configs with all default presets
No need to define fields to return up-front! Use fl=a,b,c...
41. Operations
Solr has no central admin-server (until "SolrCloud")
For GUI installer, use
Multiple cores – allows smooth schema upgrade etc.
No built-in query reporting, log analysis or monitoring.
But have a look at:
42. Summary
Many migrations are (quite) straight-forward!
Warning flags
Multi-lingual and advanced linguistics
Heavy use of Document Processing, including Entity Extraction
Scope search
Other enterprise complexities (security, connectors etc)
Follow a structured process
Quick prototyping
Design spec for each area
Don't forget to analyze logs and measure user satisfaction!
43. Thank You
www.cominvent.com
jh@cominvent.com
www.twitter.com/cominvent
linkedin.com/in/janhoy
This presentation is licensed under a CC-by-sa license
You must attribute Cominvent with name and link