3. Consulting
– Cominvent delivers independent search consulting
– Focus on Apache Lucene/Solr & Microsoft FAST ESP
– We know both the proprietary and Open Source worlds,
their benefits and disadvantages. We help you choose.
We help you maximize your chosen engine, and we
help you migrate as your requirements change.
cominvent as
4. Training
– Cominvent AS delivers public and on-site training
– Certified Solr Training Partner for Lucid Imagination
– Certified FAST ESP Training Partner
– Read more: http://www.cominvent.com/training/
Photo: fluidpowerzone.com
5. Commercial Support
– When community & mailing list support is not enough...
– Paid support agreement for Apache Solr/Lucene
– In cooperation with Lucid Imagination
– Read more: http://www.cominvent.com/support/
6. Jan Høydahl – experience
● IT architect, 15 years with
search, telecom, mobile
● Helped build FAST's Global
Services as first engineer
● Founder of Cominvent AS
● Search consultant 10 years
● Certified Solr instructor
7. Recommendations
«His skills on Fast ESP is in-depth, thorough, and
probably amongst the best you can get. Jan is
working independently, but also well in teams.
Whether it is technical or business work, Jan does
not fall behind. His excellent skills to see things from
the holistic perspective is great.»
-Knut Stenmark, DPM AS
8. Sample consulting projects
World wide news agency
Chief architect of FAST ESP search solution, migrating from Autonomy
IDOL. Real-time news, alerting etc.
Major Swedish newspaper
Architect for new Topic Page solution, letting editors define topics based on
keywords and regex rules.
Norwegian Yellow Pages actor
Architect for migrating traditional DB-backed catalog search to a modern
one-search-box solution.
Classifieds and real estate online broker
Advised on migrating from DB to search. Architect for FAST ESP solution
with Norwegian linguistics, search middleware and relevance tuning.
Leading news surveillance company
Helped implement and tune real-time search using FAST ESP and real-time
alerting using FAST RTA.
9. Sample Solr Training references
Library organization
– Danish national library organization serving all Danish libraries
• Migrating from in-house search to Apache Solr for all their search
• Delivered Solr training course in 2010
– Global library org, serving hundreds of libraries world wide
• Helping them migrate from FAST to Solr
• First step is Classroom Training in March 2010
11. About Apache Solr
– Open Source enterprise search server
– Built on the popular Apache Lucene library
– 100% Java, runs on all platforms and environments
– Supports billions of documents, high scalability and
advanced features like faceting, highlighting,
document format conversions, GEO search etc
– Indexes most languages including CJK
– The platform is not language aware, but each field can
be configured for language-specific tokenization,
stemming, stop word processing etc
– Very active developer and user communities
– Apache 2.0 license – commercially friendly
– Rapid growth in companies providing support etc
12. Solr-user community growth
[Chart: Solr-user growth. Messages per month, January 2006 to February 2010; vertical axis from 0 to 1600 messages.]
13. Lucene/Solr deployments
– More: http://wiki.apache.org/solr/PublicServers
Thanks to Lucid Imagination for logo collection
14. Solr in media & newspapers
– News search. Also exposes API
– Danish news search
– Swedish news search
– Swedish news search
– Faceted search through classifieds
– Eastern European classifieds
15. Sample FAST-Solr switchers
– Human Rights search
• hurisearch.org (blog)
– FINN katalog (former Sesam)
• katalog.finn.no (announce)
– Mocality – African business search
• mocality.co.ke (linkedin)
– International library search
• Large multi-lingual index
– Norwegian media house
• Multiple newspapers
18. Migration objectives
– Possible objectives include:
• Lower maintenance cost
• Deeper in-house competency
• Less dependent on external consultants
• Ownership and visibility of source code
• Shorter time to market for new features
• Bugs fixed faster – or even fix them ourselves
• Larger community, mailing lists that work!
• More choice in external consultants
• Contribute back to Open Source
• Lower HW footprint
19. Migration steps
– Knowledge gathering & Training
– Review current features & arch
• Want to keep all features? Add new?
– Migration areas:
• Index profile
• Content
• Feeding
• Document Processing
• Querying
• Search middleware?
• Admin & Operational
– What to do in Application space vs Search space?
20. Feature comparison ESP – Solr (similarities)
Feature | ESP | Solr
Full-text, boolean, range search, sorting, sub-second, facets, did-you-mean, synonyms | Yes | Yes
Scaling for QPS | Add rows | Add rows
Scaling for document volume | Add columns | Add shards
Synonyms | Index/query side | Index/query side
GEO search | Yes | Yes (1.5)
Boolean query language | Yes (FQL) | Yes (Lucene or (e)DisMax)
APIs | HTTP, Java, .NET, C++, PHP | HTTP, Java, .NET, Ruby, Python, PHP, Perl, JS
21. Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Admin server | Yes | No (coming 1.5)
Processes | Many (C++, Java, Python) | One WAR in Java app-server, 100% Java
Navigators / Facets | Index-time | Query-time
Did-you-mean | Dictionary based | Dictionary or index based
Feeding | API only | HTTP POST or API
Document processing | Pipeline (py) | Simple pipeline (Java, JS, Groovy, Jython, JRuby..)
Multi field querying | Composite fields | DisMax handler
22. Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Relevancy tuning | Rank profiles, term boosting | Dynamic function queries and boost functions
XRANK | XRANK operator | Function Queries
Freshness boost | Freshness in rank profile | Function Queries
Boost GEO distance | Rank profile and special | Function Queries
Major schema or software updates | Cold update, use stage environment | Stage new content into new Solr core
Pluggability | Docprocs, clients | Everything :) Request Handlers, Query Parsers, Docprocs, Rank, Spell, tokenizer++
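The freshness boost row above relies on Solr function queries. As a rough worked example: Solr documents recip(x,m,a,b) as a/(m*x+b), and the sketch below evaluates that formula in plain Ruby for the function recip(rord(creationDate),1,1000,1000) used in the DisMax slide of this deck. The rord inputs (1 for the newest document, larger for older ones) are made-up values for illustration, not measurements.

```ruby
# Solr's recip function query: recip(x, m, a, b) = a / (m*x + b).
def recip(x, m, a, b)
  a.to_f / (m * x + b)
end

# Hypothetical reverse ordinals: 1 = newest document, higher = older.
[1, 1000, 10_000].each do |rord|
  boost = recip(rord, 1, 1000, 1000)
  puts format('rord=%-6d boost=%.3f', rord, boost)
end
```

With these constants the boost stays close to 1 for the newest documents and halves once rord reaches 1000, so the constants control how quickly older content fades.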
23. Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Lemmatization | Can be licensed for many languages | Can be licensed from 3rd party
Query syntax | and(a:foo, b:bar) | a:foo AND b:bar
Query syntax | i:range(0, 100) | i:[0 TO 100]
Query syntax | d:range(2000-01-01T00:00:00, 2010-03-03T12:00:00) | d:[2000-01-01T00:00:00Z TO NOW]
Query params | query= | q=
Query params | offset= | start=
Query params | hits= | rows=
Query params | spell=1 | spellcheck=true
What fields to return | view=viewname | fl=title,price,body...
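The parameter rows in the table above lend themselves to a small translation layer in the application. The following is a minimal sketch, assuming the application already holds ESP-style parameters in a hash; the helper name esp_to_solr_params is hypothetical, and it only renames parameters, it does not rewrite FQL query syntax into Lucene syntax.

```ruby
require 'uri'

# ESP parameter names from the table above and their Solr counterparts.
ESP_TO_SOLR = {
  'query'  => 'q',
  'offset' => 'start',
  'hits'   => 'rows'
}.freeze

# Translate a hash of ESP-style request parameters into Solr ones.
# Parameters not listed in the table pass through unchanged.
def esp_to_solr_params(esp_params)
  esp_params.each_with_object({}) do |(name, value), solr|
    if name == 'spell'
      # spell=1 becomes spellcheck=true, anything else spellcheck=false
      solr['spellcheck'] = (value.to_s == '1').to_s
    else
      solr[ESP_TO_SOLR.fetch(name, name)] = value
    end
  end
end

params = esp_to_solr_params({ 'query' => 'car', 'offset' => 10, 'hits' => 20, 'spell' => 1 })
puts URI.encode_www_form(params)
```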
24. Your FAST system - overview
Your web-app
Search middleware?
Graphics diagram: www.microsoft.com
25. Migrating index profile
– ESP index profile -> Solr schema.xml
– Set up field types; use the defaults or create your own
– Set up the static fields. ESP:
– Solr equivalent:
– No need for generic*, use dynamic fields:
26. Migrating index profile
– Composite fields?
• Solr can use <copyField> to copy multiple fields into
one, e.g. as we did to map many attributes into one
field
• However, to rank with a different boost for each
field, Solr does not need a composite field. Use the
DisMax query handler instead. Very powerful!
– No need to edit the schema to add new fields. Using
dynamic fields, it is easy to e.g. introduce a color facet
for cars or an Mpixels facet for digital cameras
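To make the index profile migration more concrete, here is a rough sketch that prints the kind of schema.xml entries discussed above: static fields, one dynamic field pattern, and copyField rules. The field names (id, title, body, alltext, *_facet) are invented for illustration, and the type names "string" and "text" are assumed to be defined in the schema's types section; this is not the original slide's example.

```ruby
# Hypothetical static fields carried over from an ESP index profile.
STATIC_FIELDS = [
  { name: 'id',    type: 'string' },
  { name: 'title', type: 'text' },
  { name: 'body',  type: 'text' }
].freeze

# Render one self-closing schema.xml element from a tag name and attributes.
def field_xml(tag, attrs)
  pairs = attrs.map { |key, value| %(#{key}="#{value}") }.join(' ')
  "<#{tag} #{pairs}/>"
end

lines = STATIC_FIELDS.map do |f|
  field_xml('field', name: f[:name], type: f[:type], indexed: true, stored: true)
end
# One dynamic field pattern stands in for ESP's generic* fields: any field
# named e.g. color_facet or mpixels_facet is accepted without a schema change.
lines << field_xml('dynamicField', name: '*_facet', type: 'string', indexed: true, stored: true)
# copyField gathers several source fields into one catch-all field.
lines << field_xml('field', name: 'alltext', type: 'text', indexed: true, stored: false, multiValued: true)
lines << field_xml('copyField', source: 'title', dest: 'alltext')
lines << field_xml('copyField', source: 'body', dest: 'alltext')

puts lines
```

The printed lines would go inside the fields section of schema.xml, with the copyField rules placed alongside them at the schema level.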
27. DisMax query example
– This Solr query can replace use of composite-field
• qt=dismax
• q=oslo
• qf=title^0.7 highpriorityfields^1.5
mediumpriorityfields^0.6 lowpriorityfields^0.2
recallfields^0.0 body^0.0
• bf=recip(rord(creationDate),1,1000,1000)
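The parameters above have to be URL-encoded before they are sent to Solr. A minimal sketch using only Ruby's standard library follows; the parameter values are taken from this slide, while the base URL is borrowed from the querying example elsewhere in this deck and may differ in your installation.

```ruby
require 'uri'

# Base URL is an assumption; adjust host, port and path to your Solr.
base = 'http://localhost:8080/solr/select'
params = {
  'qt' => 'dismax',
  'q'  => 'oslo',
  'qf' => 'title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 ' \
          'lowpriorityfields^0.2 recallfields^0.0 body^0.0',
  'bf' => 'recip(rord(creationDate),1,1000,1000)'
}

# encode_www_form percent-encodes ^, commas and parentheses,
# and turns the spaces between the qf entries into +.
url = "#{base}?#{URI.encode_www_form(params)}"
puts url
```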
28. Migrating content
– If using FAST ContentAPI to push programmatically
• Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
– If feeding FastXML using FileTraverser
• Feed as Solr XML using HTTP POST or a POST client
– If you feed custom XML with XMLMapper
• Have a look at DIH's import and mapping features
29. Push Feeding example
– Feed XML using HTTP POST:
• curl 'http://localhost:8080/solr/update?commit=true' \
  -H "Content-Type: text/xml" \
  --data-binary @mydoc.xml
– Ruby example:
• > gem sources -a http://gemcutter.org
  > sudo gem install rsolr

  require 'rsolr'
  # Base URL includes the /solr path, as in the curl command
  solr = RSolr.connect :url => 'http://localhost:8080/solr'
  documents = [{:id => 1, :price => 1.00},
               {:id => 2, :price => 10.50}]
  solr.add documents   # send the documents to Solr
  solr.commit          # make them searchable
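The curl command posts a file called mydoc.xml, but the deck does not show its contents. As an illustration, the sketch below builds a Solr XML add message for the same two documents used in the Ruby example; the helper names xml_escape and solr_add_xml are hypothetical.

```ruby
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape the characters that are not allowed raw in XML text or attributes.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build a Solr <add> message from an array of document hashes.
def solr_add_xml(documents)
  docs = documents.map do |doc|
    fields = doc.map do |name, value|
      %(    <field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
    end
    "  <doc>\n#{fields.join("\n")}\n  </doc>"
  end
  "<add>\n#{docs.join("\n")}\n</add>\n"
end

documents = [{ id: 1, price: 1.00 }, { id: 2, price: 10.50 }]
xml = solr_add_xml(documents)
puts xml
```

Saving the printed output as mydoc.xml gives a file that the curl command can post.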
31. Querying examples
– http://localhost:8080/solr/select?q=car&fl=id,title
– Ruby
• res = solr.select :q => 'roses', :fq => ['red', 'white']
  res['response']['docs'].each do |doc|
    puts doc['title']
  end
32. Migrating document processing
– Solr lacks a sophisticated pipeline with entity
extraction etc. Alternatives:
• Do extraction in Application space (Ruby)
• Write own stage in Solr pipeline for simple cases
• Integrate an external pipeline (e.g. OpenPipeline) to do more advanced stuff
– Matchers/extractors
• LingPipe NamedEntityExtractor inside of OpenPipeline
– Synonyms:
• Use Solr's synonym handling index/query side
– Custom stages:
• Write a Solr UpdateProcessor (in Java, Jython etc)
– Got a LOT of custom FAST docproc stages?
• Have a look at SESAT's PY ProcServer for Solr (GPL)
33. Migrating linguistics (lemmatization)
– Solr ships with Stemming instead of Lemmatization
– Stemming has limitations
• Biler, bilen, bilene -> bil
BUT
• Bøker, bøkene -> bøk; boka, bok -> bok
– KStem is better. Free with LucidWorks for Solr
– If you need singular/plural handling only
• Free dictionaries? Check lucene-hunspell
– Lemmatization can be licensed from 3rd party
such as Basistech, who also has language
identification & entity extraction
– Language identification also from Sematext
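The stemming limitation above can be reproduced with a toy suffix stripper. This sketch is NOT the stemmer that ships with Solr; it simply removes a few common Norwegian noun endings (an assumed, simplified rule set) to show why "bøker" and "bok" end up as different index terms, whereas a lemmatizer would map both to the same base form.

```ruby
# Toy suffix stripper for illustration only; real stemmers use many more rules.
SUFFIXES = /(ene|en|er|a)\z/

def toy_stem(word)
  word.downcase.sub(SUFFIXES, '')
end

%w[biler bilen bilene bøker bøkene boka bok].each do |word|
  puts "#{word} -> #{toy_stem(word)}"
end
```

All forms of "bil" collapse to one term, so they match each other at query time, while the vowel change in "bøker" leaves it as "bøk" and it never meets "bok".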
34. Basistech Rosette for Lucene
– High-end linguistics capabilities for
19 languages
– Language Identification
– Segmentation and tokenization
– Lemmatization
– Noun decompounding
– Part-of-speech tagging
– Entity extraction
– Easily integrated with Lucene/Solr
– More: http://www.basistech.com/lucene/
35. Migrating search middleware
– Using FAST Unity?
• Consider migrating middleware logic such as external
source querying and federation to SESAT (AGPL)
– Using Comperio Front?
• Must migrate custom query and response formats
• Consider SESAT as well for migrating flow logic
– Or is plain Solr enough?
• Solr has built-in support for shards
• A shard query will query multiple shards
and merge the results into one
• Add custom processing as Query
Components in Solr
• Check contrib & patches!
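For the plain Solr option, a shard query is an ordinary select request with an extra shards parameter listing the shards as host:port/path entries. The sketch below builds such a request URL with Ruby's standard library; the host names are placeholders and the query values are taken from the querying example in this deck.

```ruby
require 'uri'

# Hypothetical shard locations; each entry is host:port/path without a scheme.
shards = ['solr1.example.com:8080/solr', 'solr2.example.com:8080/solr']

params = {
  'q'      => 'car',
  'fl'     => 'id,title',
  'shards' => shards.join(',')   # Solr fans the query out to every shard listed
}

# Any one node can receive the request; it merges the per-shard results.
url = "http://#{shards.first}/select?#{URI.encode_www_form(params)}"
puts url
```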
36. Migrating Web Crawler
– Solr has no built-in web crawler
• Instead you can choose from several integrations
– The Apache Nutch crawler
• Proven with hundreds of millions of pages
• http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
– Apache Droids
• Still in the Apache Incubator, but aims at becoming a full crawler
• http://incubator.apache.org/droids/
– Heritrix + Solr (example in Solr 1.4 book)
– OpenPipeline has a (very) simple crawler
– Lucene Connectors Framework
• Preparing crawler support
37. Migrating Connectors
– Solr handles these sources internally through DIH:
• Database, RSS, Web-services, Local filesystem
– Additionally through Lucene Connectors Framework:
• EMC Documentum, FileNet, JDBC, LiveLink, Patriarch
(Memex), Meridio, SharePoint, RSS
• New connectors should be written for LCF
– Another option: OpenPipeline, supporting:
• SharePoint, IMAP, Documentum, Vignette, Filesystem
38. Operations
– Solr has no admin-server (coming in 1.5)
– Possible to run multiple Tomcat instances on the same server
– Multiple cores in same Tomcat – easier migration
– No built-in query reports, use 3rd party tools
– No built-in monitoring, have a look at Nagios?
39. More info
– Solr WIKI: http://wiki.apache.org/solr/
– Deployments: http://wiki.apache.org/solr/PublicServers
– Reference Guide: http://tinyurl.com/ygj3q9j
– Solr Book: http://tinyurl.com/solrbook
– Solr training: http://www.solrtraining.com/
40. Thank You
www.cominvent.com
jh@cominvent.com
www.twitter.com/cominvent
This presentation is licensed under a CC-by-sa license
You must attribute Cominvent with name and link