DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps
1. Building an open-source based search solution: first steps
Roman Kern
Institute of Knowledge Management
Graz University of Technology
Know-Center Graz
rkern@tugraz.at, rkern@know-center.at
Data Science Meetup / 2012-04-12
2. Overview
Motivation
Background
Solr Ecosystem
Solr Features
Conclusions
3. Motivation
Search
Change in users' expectations
Missing or sub-optimal search causes frustration
Science
Information retrieval
Success story
Mostly focused on web search
Industry
Enterprise search
Heterogeneous data sources
4. Background of the Speaker
http://a1.net
http://wissen.de
5. Apache Lucene Umbrella Project
Components
Search engine ⇒ Lucene
Search server ⇒ Solr
Web search engine ⇒ Nutch
Lightweight crawler ⇒ Droids
File-format parsing ⇒ Tika
Communicate with CMS ⇒ ManifoldCF
Distributed coordination ⇒ ZooKeeper
Natural language processing ⇒ OpenNLP
Related projects: Hadoop, Mahout, Carrot2, ...
Common aspects
Apache license, implemented in Java, community
6. Lucene
Search Engine Library
Java API
Only for expert users
Search-Index
File-system
In-memory index
Advanced features
Incremental indexing
Update while searching
Base for many projects
Solr
IR-Lib
elasticsearch
LIA (Lucene in Action)
http://lucene.apache.org/core/
7. Nutch
Web search engine
Builds upon Solr
Web crawler
Link database, crawl database
Distributed
Runs on Hadoop
Mode of operation
Crawl a single domain
Crawl the web with seed sites
http://nutch.apache.org/
8. Droids
Crawler component
Lightweight crawler
Main features
Throttling
Multi-threaded
Well behaved (robots.txt)
http://incubator.apache.org/droids/
9. Tika
Text extraction
Text & meta-data
File-formats
Office
Microsoft Formats (Apache POI)
OpenDocument
Common text formats
PDF (PDFBox)
HTML (tagsoup)
Non-text
Images
Sound
http://tika.apache.org/
10. ManifoldCF
Content Management System Connectors
Communicate with CMS/DMS
Connectors
FileNet P8 (IBM)
Documentum (EMC)
LiveLink (OpenText)
Meridio (Autonomy)
Windows shares (Microsoft)
SharePoint (Microsoft)
More: Alfresco, JDBC, ...
Data is then stored and indexed
e.g. Solr
http://incubator.apache.org/connectors/
11. ZooKeeper
Distributed coordination
Orchestrate servers
Distributed
Configuration
Name lookup
Synchronization
http://zookeeper.apache.org/
12. OpenNLP
Natural language processing
Process plain text
Maximum entropy classification with beam search
Models
Sentence splitting
Token splitting
Part-of-speech (POS) tagging
Named entity recognition
more: chunker, parser, co-reference resolution
http://opennlp.sourceforge.net/
13. Hadoop
Distributed computing
Scale out framework
Distributed file-system
Data is partitioned
Stored on multiple nodes
Map/Reduce paradigm
Map your algorithms to mappers & reducers
Related projects: HBase, Pig, Hive, ...
http://hadoop.apache.org/
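The Map/Reduce paradigm can be illustrated without a cluster. The following is a minimal single-process sketch in Python, not Hadoop code: the function names (mapper, shuffle, reducer, word_count) are made up for this illustration, and the shuffle step stands in for the grouping that the framework would perform between the map and reduce phases.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word of one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["solr lucene", "Lucene nutch"])
```

Because each mapper call only sees one line and each reducer call only sees one key, both steps could run on different nodes, which is the property the partitioned storage relies on.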
14. Mahout
Distributed machine learning
Scale out framework
Machine learning
Recommender systems
Clustering
Classification
Integration
Standalone
Hadoop
Amazon EC2
http://mahout.apache.org/
16. Search Server
What Solr is
Web-Service
Full-text indexing & search
Support to store arbitrary content
What Solr isn’t
A grep replacement
Database
But somewhat similar to NoSQL databases
Solr vs. IR-Lib
Solr: easy to use, easy to integrate, XML configuration
IR-Lib: requires expert knowledge, Java configuration, fast
17. Index Structure
Inverted Index
Dictionary of words (terms)
Map from each term to the documents that contain it
Document
List of fields
Input fields are then mapped according to the schema
Field-types
Defined in the schema
Type (string, boolean, date, number), internally mapped to string
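To make the inverted index concrete, here is a toy sketch in Python. It is not how Lucene stores its index; the function name build_inverted_index and the sample documents are invented for illustration, and tokenization is reduced to lower-casing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "open source search",
    2: "enterprise search server",
    3: "open data",
}
index = build_inverted_index(docs)

# A lookup is a dictionary access instead of a scan over all documents.
hits_for_search = index["search"]  # [1, 2]
```

Answering a query for a term then only touches that term's posting list, which is why search stays fast as the collection grows.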
18. Index Management
API
HTTP Server
Various formats (XML, binary, JSON, ...)
Document life-cycle
There is no in-place update; an update is a delete followed by an insert
Delete (done automatically by Solr)
Insert
Implications
A unique id is necessary
Use batch updates
Commit, rollback (and optimize)
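The document life-cycle can be sketched with Solr's XML update messages. The snippet below only builds the message bodies in Python and sends nothing; the helper name add_message and the field names id and title are assumptions for illustration, and the fields must match whatever the schema actually defines.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build one <add> message holding a batch of documents."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical batch; "id" is assumed to be the unique key of the schema.
batch = [
    {"id": "doc-1", "title": "Lucene"},
    {"id": "doc-2", "title": "Solr"},
]
add_xml = add_message(batch)

# Sending a document with an existing id replaces the old version.
replace_xml = add_message([{"id": "doc-1", "title": "Lucene in Action"}])

# Changes become visible to searches only after a commit.
commit_xml = "<commit/>"
```

Grouping many documents into one add message and committing once at the end is the batch pattern the slide recommends; the messages would be posted to the update handler of the Solr server.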
19. Input Handling
Different input formats
XML
CSV
JDBC (database)
DIH (data import handler)
Support incremental updates (via timestamps)
Solr Cell
Binary content
Apache Tika
Text content and metadata
20. Text Processing
Scope
During indexing and querying
Tokenization
Split text into tokens
Lower-casing (case normalization)
Stemming (e.g. ponies, pony ⇒ poni, triplicate ⇒ triplic, ...)
Synonyms (via Thesaurus)
Stop-word filtering
Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
n-grams, soundex, umlauts
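A few of these steps can be chained in a toy analyzer. This Python sketch is not Solr's analysis chain: the stop-word list is a tiny assumed sample, the function names are invented, and stemming, synonyms, soundex and umlaut handling are left out.

```python
import re

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "is"}


def tokenize(text):
    """Split on anything that is not a letter or digit, so 'Wi-Fi' gives 'Wi', 'Fi'."""
    return re.findall(r"[^\W_]+", text)


def analyze(text):
    """Tokenize, lower-case, then drop stop words."""
    tokens = [token.lower() for token in tokenize(text)]
    return [token for token in tokens if token not in STOP_WORDS]


def char_ngrams(token, n=3):
    """Return all character n-grams of a token, e.g. for partial matching."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


terms = analyze("The Wi-Fi of Graz")  # ['wi', 'fi', 'graz']
grams = char_ngrams("solr")           # ['sol', 'olr']
```

The same function has to run on documents and on queries; otherwise a query for "Wi-Fi" would never meet the indexed terms "wi" and "fi".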
21. Query Processing
Query parsers
Lucene query parser (rich syntax)
AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
Boosting of individual parts
Example: ((boltzmann OR schroedinger) NOT einstein)
Dismax query parser
No query syntax
Searches over multiple fields (separate boost for each field)
Configure the number of terms that must match
Distance between terms is used for ranking (phrase boosting)
Dismax is a good starting point, but may become expensive
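The Dismax options above map onto request parameters. The following Python sketch only assembles a query string: the field names title and body, the boost value and the mm setting are assumptions chosen for illustration, and nothing is sent to a server.

```python
from urllib.parse import parse_qs, urlencode


def dismax_query_string(user_query):
    """Build the request parameters for a Dismax search over two fields."""
    params = {
        "defType": "dismax",      # select the Dismax query parser
        "q": user_query,          # raw user input, no query syntax needed
        "qf": "title^2.0 body",   # hypothetical fields, title boosted
        "pf": "title",            # phrase boosting when terms are close together
        "mm": "2",                # at least two terms must match
    }
    return urlencode(params)


query_string = dismax_query_string("boltzmann schroedinger")

# Round-trip through the parser to check what a server would receive.
received = parse_qs(query_string)
```

The query string would be appended to the search handler URL of the Solr installation. Each additional field in qf and pf adds work per query, which is one reason Dismax may become expensive.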
22. Search Features
Query filter
Additional query
No impact on ranking
Results are cached
Boosting query
Only in Dismax
Query elevation
Fix certain queries
Request handler
Pre-define clauses
Invariants
23. Search Result
Ranking
Relevance
Sort on field value (only single term per document)
Available data & features
Sequence of IDs & score
Stored fields
Snippets (plus highlighting)
Facets
Count the search hits
Types: field value, dates, queries
Sort, prefix, ...
Could be used for term suggestion (a.k.a. query suggestion)
Field collapsing (grouping)
Spell checking (did-you-mean)
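Field-value facets boil down to counting the search hits per value. The sketch below mimics that idea in plain Python and is not Solr's implementation; the function name facet_counts, the field name type and the sample hits are invented for illustration.

```python
from collections import Counter


def facet_counts(hits, field, prefix=""):
    """Count hits per value of one field, most frequent value first."""
    counts = Counter(
        doc[field]
        for doc in hits
        if field in doc and str(doc[field]).startswith(prefix)
    )
    return counts.most_common()


# Hypothetical search hits with a single-valued "type" field.
hits = [
    {"id": 1, "type": "pdf"},
    {"id": 2, "type": "html"},
    {"id": 3, "type": "pdf"},
    {"id": 4, "type": "presentation"},
]

by_type = facet_counts(hits, "type")        # pdf counted twice, listed first
p_only = facet_counts(hits, "type", "p")    # only values starting with "p"
```

The prefix variant hints at how facets can drive term suggestion: the characters typed so far serve as the prefix, and the most frequent matching values are offered as completions.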
24. Additional Solr Features
Query by Example
More like this
Stats
Per field
Min, max, sum, missing, ...
Admin-GUI
Webapp to troubleshoot queries
Browse schema
JMX
Read properties & statistics
Can be accessed remotely
25. Integration
Deployment
Within a web application server
Embedded
Monitor
Log output
Access
Various language bindings
Java, Ruby, JavaScript, PHP, ...
26. Multi-core
Multiple indices
Each index has its own configuration
Operations
Reload (when configuration has been changed)
Rename
Swap
Merge
Create, Status
27. Scale Solr
Replication
Master and slave nodes
Replication
Slaves poll master
Dispatch search request
Load balancer
28. Sharding Indexes
Single index
Index spread over multiple machines
Search is done in parallel
Mapping
Application has to provide a deterministic mapping
Document ⇒ index
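One common way to obtain such a deterministic mapping is to hash the document id. This is a sketch under assumptions, not a prescribed scheme: the function name shard_for, the use of MD5 and the shard count are illustrative choices.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in the range 0..num_shards-1."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 4  # assumed number of index shards

# The same id always lands on the same shard, on every machine and run.
assignments = {f"doc-{i}": shard_for(f"doc-{i}", NUM_SHARDS) for i in range(8)}
```

A cryptographic digest is used here instead of Python's built-in hash(), because string hashes from hash() differ between interpreter runs by default. Note that changing the number of shards remaps most documents, so the index would need to be rebuilt.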
29. Conclusions
Ecosystem
Vibrant community
Corporate backing
Solr
Easy to get started
Hard to optimize for specific requirements
30. The End
Thank you!