2. What is it?
• Text search index (engine)
• Open source
• Not a search product
• A tool that allows you to create a search solution
3. What is it like?
• Google, Google Appliance.
• FAST
• Oracle Secure Enterprise Search
• etc.
4. Google Appliance:
• Sucks data in
• Can’t really configure
• Stuck with results
• Bonnet is locked
5. Solr:
• You need to feed data in
• Highly configurable
• Search results can be tuned
• There is no bonnet
6. Why am I doing a talk?
• Did a course
• LucidWorks content
• Presented by FindWise
• FindWise is a search specialist that uses a range of search engines
7. Caveats
• Course used Solr 4.1.0; we use 3.6.1 for APVMA
• Course focussed on search, not ingestion or
presentation
• Java API recommended for ingestion
• ‘Browse’ interface uses Velocity templates for
presentation, but probably isn’t good enough
for most projects.
10. Apache Tika
• Extracts text from files, e.g. via the Data Import Handler
• Used to be part of Lucene
• XML
• PDF
• Word
• Excel
• etc.
11. ManifoldCF
• Apache
• Connector framework
• Used to connect to content repositories (source)
• SharePoint
• Documentum
• CMIS
• JDBC
• RSS
12. Hydra
• FindWise
• Although Solr supports validation (e.g.
‘required’), don’t use it for data cleanup.
• Validation failure inconvenient: whole job fails
• Feed in clean data.
• Use Hydra for cleanup.
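The "feed in clean data" advice can be sketched in Python. This is not Hydra's API: the helper `clean_records` and the required field names `id` and `title` are hypothetical, chosen only for illustration. The idea is that bad records are set aside before indexing, so one invalid document does not fail the whole job.

```python
REQUIRED_FIELDS = ("id", "title")  # hypothetical required fields


def clean_records(records, required=REQUIRED_FIELDS):
    """Split records into (clean, rejected) before anything is sent to Solr."""
    clean, rejected = [], []
    for record in records:
        # Trim whitespace on string values; leave other types untouched.
        tidy = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # A required field counts as missing if it is absent, None, or empty.
        missing = [name for name in required if tidy.get(name) in (None, "")]
        if missing:
            rejected.append((record, missing))
        else:
            clean.append(tidy)
    return clean, rejected


if __name__ == "__main__":
    raw = [
        {"id": "1", "title": "  Solr intro  "},
        {"id": "2", "title": ""},
        {"title": "No id"},
    ]
    good, bad = clean_records(raw)
    print(len(good), "clean;", len(bad), "rejected")
```

Only the `clean` list would be sent on for indexing; the `rejected` list, with the names of the missing fields, can be logged and fixed at the source.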
13. Apache ZooKeeper
• Used for SolrCloud
• Clustering and sharding
• Solr 4.x only (not in 3.6.1)
• Side project for Hadoop
• Used to manage Hadoop clusters
16. Design Schema
• A data modelling exercise
• schema.xml
• Dynamic fields can be useful in the first pass:
<dynamicField name="*" type="string" indexed="true" />
17. Prototyping
• Get the data in (index)
• csv, XML, JSON
• post.jar
• URL to search and inspect raw results
• ‘browse’ interface allows developer to
understand how the search is working
• solrconfig.xml
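The two prototyping steps (get data in, then inspect raw results by URL) can be sketched in Python. This is a rough illustration, not part of the course material: the base URL `http://localhost:8983/solr` assumes the stock example server, a multi-core setup would add a core name to the path, and the field names `id` and `title` are hypothetical. The first function builds the XML update message of the kind post.jar sends; the second builds a search URL to paste into a browser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed location of the example server


def build_add_xml(docs):
    """Build a Solr XML update message (<add><doc>...</doc></add>) from dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_select_url(query, rows=10, base=SOLR_BASE):
    """Build a /select URL that returns raw, indented JSON results."""
    params = urlencode({"q": query, "rows": rows, "wt": "json", "indent": "true"})
    return f"{base}/select?{params}"


if __name__ == "__main__":
    print(build_add_xml([{"id": 1, "title": "Solr intro"}]))
    print(build_select_url("title:solr"))
```

The XML string can be saved to a file and posted with post.jar; the URL can be opened directly to look at the raw response while tuning the schema.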
18. Integration
• Not covered
• Content ingestion
• Presentation of results
• Up to you…