2. What is Solr?
Solr is the popular open source enterprise
search platform from the Apache Lucene
project.
Solr powers the search and navigation
features of many of the world's largest
internet sites.
3. Lucene
Apache Lucene is a high-performance,
full-featured text search engine library
written entirely in Java. It is a technology
suitable for nearly any application that
requires full-text search, especially
cross-platform.
4. Lucene Vs Solr
Lucene is a search library built in Java. Solr is a
web application built on top of Lucene.
In short, Solr = Lucene + added features. A common
question is when to choose Solr and when to
choose Lucene.
For more control, use Lucene. For faster
development and an easier learning curve, choose Solr.
http://www.findbestopensource.com/article-detail/lucene-vs-solr
5. Why do we need Solr?
Full-text Search
– MySQL “LIKE '%keyword%'”
Too slow (a leading wildcard forces a scan of every row)
and too weak (no relevance ranking or text analysis)!
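A minimal sketch of why a search index beats a substring scan. It contrasts a linear scan (the LIKE approach) with a toy inverted index, the kind of data structure Lucene is built around. The sample documents, the function names, and the whitespace tokenizer are assumptions made for illustration; Lucene's real text analysis is far richer.

```python
from collections import defaultdict

def like_scan(docs, keyword):
    """Linear scan, analogous to LIKE '%keyword%': inspects every document."""
    return [doc_id for doc_id, text in docs.items() if keyword in text.lower()]

def build_inverted_index(docs):
    """Map each lowercase, whitespace-separated term to the ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, keyword):
    """Look the term up directly instead of scanning every document."""
    return sorted(index.get(keyword.lower(), set()))

# Hypothetical sample data.
docs = {
    1: "Solr is a search platform",
    2: "MySQL is a database",
    3: "Lucene powers Solr search",
}
index = build_inverted_index(docs)
print(search(index, "solr"))
```

The scan cost grows with the total amount of text, while the index lookup is a single dictionary access once the index has been built.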
6. Major Features of Solr
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Server statistics exposed over JMX for monitoring
Scalability - Efficient Replication to other Solr Search
Servers
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture
http://lucene.apache.org/solr/
7. Typical Application Architecture
[Diagram: an HTTP request reaches the Web Server, which works with
a Cache (memcached, Redis, etc.), a Database (MySQL), and
Solr / Lucene; DIH connects the Database to Solr.]
All the components could be distributed, to
make the architecture scalable.
8. Lucene/Solr Architecture
[Diagram: Lucene/Solr architecture, top to bottom]
– Request Handlers: /admin, /select, /spell
– Response Writers: XML, Binary, JSON
– Update Handlers: XML, CSV, binary
– Search Components: Query, Highlighting, Spelling, Statistics,
Faceting, Debug, More like this, Clustering
– Update Processors: Signature, Logging, Indexing
– Extracting Request Handler (PDF/WORD), using Apache Tika
– Data Import Handler (SQL/RSS)
– Distributed Search, Schema, Config, Query Parsing, Analysis
– Faceting, Filtering, Search, Caching, Highlighting
– Index Replication
– Core Search and Indexing: Apache Lucene
(IndexReader/Searcher, Text Analysis, IndexWriter)
9. Demo – A live website powered by Solr
I’ll be showing you more later!
13. Demo – Run Solr!
java -jar start.jar
Production environment:
– java -Xms200m -Xmx1400m -jar start.jar >>/home/web_logs/solr/solr$date.log 2>&1 &
– tailf /home/web_logs/solr/solr20120423.log
2012-04-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
14. Demo – Web Admin Interface
http://localhost:8983/solr/admin
15. Demo – Web Admin Interface
http://localhost:8983/solr/admin
• SCHEMA: This downloads the schema configuration file
(XML) directly to the browser.
• CONFIG: It is similar to the SCHEMA choice, but this is the
main configuration file for Solr.
• ANALYSIS: It is used for diagnosing potential
query/indexing problems having to do with the text analysis.
This is a somewhat advanced screen and will be discussed
later.
• SCHEMA BROWSER: This is a neat view of the schema
reflecting various heuristics of the actual data in the index.
We'll return here later.
• STATISTICS: Here you will find stats such as timing and
cache hit ratios. In Chapter 9, we will visit this screen to
evaluate Solr's performance.
16. Demo – Web Admin Interface
http://localhost:8983/solr/admin
• INFO: This lists static versioning information
about Solr's internal components. Frankly, it's
not very useful.
• DISTRIBUTION: It contains
Distributed/Replicated status information, only
applicable for such configurations.
• PING: Ignore this, although it can be used
for a health-check in distributed mode.
• LOGGING: This allows you to adjust the
logging levels for different parts of Solr at
runtime. For Jetty as we're running it, this
output goes to the console and nowhere else.
24. Indexing Data - Communicating with Solr
– Direct HTTP or a convenient client API
– Data streamed remotely or from Solr's
filesystem
25. Indexing Data - Data formats/sources
– Solr-XML:
– Solr-binary:
This is only supported by the SolrJ client API.
– CSV:
CSV is a character separated value format (often a comma).
– Rich documents like PDF, XLS, DOC, PPT
– Solr's DataImportHandler (DIH) contrib add-on is a powerful
capability that can communicate with both databases and XML
sources (for example: web services). It supports configurable
relational and schema mapping options and supports custom
transformation additions if needed. The DIH uniquely supports
delta updates if the source data has modification dates.
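As an illustration of the Solr-XML format sent over direct HTTP, here is a minimal sketch that builds an add message and prepares a POST to the update handler. The field names (id, title) and the helper names are assumptions made for the example; the fields must exist in your own schema, and the URL assumes the local demo instance used elsewhere in these slides.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed URL: the local demo instance and its XML update handler.
UPDATE_URL = "http://localhost:8983/solr/update"

def build_add_xml(docs):
    """Build a Solr-XML <add> message from a list of field dictionaries."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def post_xml(payload, url=UPDATE_URL):
    """POST the message to Solr (needs a running server, so not called here)."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

# Hypothetical document with hypothetical field names.
payload = build_add_xml([{"id": "1", "title": "Hello Solr"}])
print(payload)
```

Added documents only become searchable after a commit, which can be sent the same way as a separate `<commit/>` message.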
26. Lucene/Solr Indexing
[Diagram: Lucene/Solr indexing paths]
– Documents (for example a PDF, or XML <doc> with a <title>) arrive by HTTP POST at:
/update: XML Update Handler
/update/csv: CSV Update Handler
/update/xml: XML Update Handler with custom processor chain
/update/extract: Extracting RequestHandler (PDF, Word, …)
– Data Import Handler: pulls from RSS feeds and SQL databases, with simple transforms
– Update Processor Chain (per handler): Remove Duplicates processor,
Custom Transform processor, Logging processor, Index processor
– Text Index Analyzers
– Lucene Index
33. DIH (Data Import Handler)
Most applications store data in relational databases or XML files
and searching over such data is a common use-case.
The DataImportHandler is a Solr contrib that provides a configuration driven way to
import this data into Solr in both "full builds" and using incremental delta imports.
MySQL -> (JDBC / DIH) -> Solr
• full-import
• delta-import
34. DIH (Data Import Handler)
• Imports data from databases through JDBC (Java Database Connectivity)
• Imports XML data from a URL (HTTP GET) or a file
• Can combine data from different tables or sources in various ways
• Extraction/Transformation of the data
• Import of updated (delta) data from a database, assuming a
last-updated date
• A diagnostic/development web page
• Extensible to support alternative data sources and transformation steps
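The difference between a full build and a delta import can be sketched in a few lines. This is only a model of the idea, not DIH's implementation: the rows, the last_modified column name, and the in-memory dictionary standing in for the index are all assumptions.

```python
from datetime import datetime

def full_import(rows):
    """Rebuild the whole index from every source row."""
    return {row["id"]: row for row in rows}

def delta_import(index, rows, last_index_time):
    """Re-index only rows modified since the previous import; return how many."""
    changed = [row for row in rows if row["last_modified"] > last_index_time]
    for row in changed:
        index[row["id"]] = row
    return len(changed)

# Hypothetical source table with a last-updated column.
rows = [
    {"id": 1, "title": "old row", "last_modified": datetime(2012, 4, 1)},
    {"id": 2, "title": "edited row", "last_modified": datetime(2012, 4, 20)},
]
index = full_import(rows)
last_index_time = datetime(2012, 4, 10)

rows[1]["title"] = "edited again"
updated = delta_import(index, rows, last_index_time)
print(updated)
```

The delta path depends entirely on the source keeping a trustworthy modification date; rows without one can only be picked up by a full import.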
35. DIH (Data Import Handler)
• curl http://localhost:8983/solr/dataimport to verify the configuration.
• curl 'http://localhost:8983/solr/dataimport?command=full-import'
• curl 'http://localhost:8983/solr/dataimport?command=delta-import'
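The same calls can be made from a script. A minimal sketch, assuming the local demo URL from the curl commands above; the helper names dih_url and run_dih are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed URL, taken from the curl commands on this slide.
DIH_URL = "http://localhost:8983/solr/dataimport"

def dih_url(command=None, base=DIH_URL):
    """Build the DIH URL; with no command it is the plain status/verification URL."""
    if command is None:
        return base
    return base + "?" + urlencode({"command": command})

def run_dih(command=None):
    """Call DIH and return the raw response (needs a running Solr, so not called here)."""
    with urlopen(dih_url(command)) as response:
        return response.read().decode("utf-8")

print(dih_url("full-import"))
print(dih_url("delta-import"))
```

A scheduled job could call run_dih("delta-import") at regular intervals and reserve run_dih("full-import") for rebuilds.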
36. DIH (Data Import Handler) - Full Import Example (full index)
data-config.xml
38. DIH (Data Import Handler) - Demo
Linux aaa 2.6.18-243.el5 #1 SMP Mon Feb 7 18:47:27 EST
2011 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
cpu cores :1
MemTotal: 2058400 kB
2 million rows imported in about 20 minutes.
39. Sharding
Sharding is the process of breaking a
single logical index in a horizontal fashion
across records versus breaking it up
vertically by entities.
S1 S2 S3 S4
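One simple way to split records horizontally is to hash each document id to a shard number. The sketch below assumes the indexing client decides where each document goes; the hashing scheme and the function names are illustrative choices, not something Solr prescribes.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Pick a shard deterministically from the document id."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards

def partition(doc_ids, num_shards):
    """Group document ids by the shard they are routed to."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards

# Four shards, as in S1..S4 above; the ids are hypothetical.
shards = partition(range(1, 101), 4)
for number, ids in enumerate(shards, start=1):
    print("S%d holds %d documents" % (number, len(ids)))
```

Because the mapping is deterministic, an update to a document always goes to the shard that already holds it.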
41. Sharding-Query
The ability to search across shards is built
into the query request handlers. You do
not need to do any special configuration
to activate it.
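A distributed query is an ordinary query with a shards parameter listing the shards to search. A minimal sketch of building such a URL; the host names are hypothetical and the helper name is made up for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def distributed_query_url(base, query, shards):
    """Build a /select URL that asks one Solr node to search all listed shards."""
    params = {"q": query, "shards": ",".join(shards)}
    return base + "/select?" + urlencode(params)

# Hypothetical shard hosts.
shards = ["host1:8983/solr", "host2:8983/solr"]
url = distributed_query_url("http://host1:8983/solr", "title:solr", shards)
print(url)

# Decoding the query string shows the parameters Solr will receive.
print(parse_qs(urlsplit(url).query))
```

The node that receives the request fans the query out to every listed shard and merges the results before responding.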
43. Combining replication and sharding
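[Diagram: combining replication and sharding]
Sharding: the index is split across masters M1, M2, M3
Replication: each master replicates to its slave (S1, S2, S3)
in Slave Pool 1 and in Slave Pool 2
Queries are sent to pools of slave shards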
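On the query side, this layout means choosing one slave pool per request and searching every shard in that pool. A minimal round-robin sketch; the class name and host names are assumptions, and a real deployment would more likely put a load balancer in front of the pools.

```python
import itertools

class PoolRouter:
    """Rotate through slave pools; each pool holds one replica of every shard."""

    def __init__(self, pools):
        self._cycle = itertools.cycle(pools)

    def next_pool(self):
        """Return the list of shard hosts to use for the next query."""
        return next(self._cycle)

# Hypothetical hosts: two pools, each with slaves S1..S3.
pools = [
    ["pool1-s1:8983/solr", "pool1-s2:8983/solr", "pool1-s3:8983/solr"],
    ["pool2-s1:8983/solr", "pool2-s2:8983/solr", "pool2-s3:8983/solr"],
]
router = PoolRouter(pools)
first = router.next_pool()
second = router.next_pool()
third = router.next_pool()
print(",".join(first))
```

The joined string is the value that would go into the shards parameter of a distributed query.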
44. Combining replication and sharding
http://wiki.apache.org/solr/SolrCloud
http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html
46. Solr Caching
Caching is a key part of what makes Solr
fast and scalable
There are a number of different caches
configured in solrconfig.xml:
– filterCache
– queryResultCache
– documentCache
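To show the general idea behind such caches, here is a minimal least-recently-used cache. It is a sketch only, assuming LRU eviction and a made-up filter-query key; it is not Solr's implementation, and the real caches are sized and tuned in solrconfig.xml.

```python
from collections import OrderedDict

class LRUCache:
    """Keep at most `size` entries, evicting the least recently used one."""

    def __init__(self, size):
        self.size = size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None  # cache miss: the caller must compute the value
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.size:
            self._data.popitem(last=False)  # drop the oldest entry

# Hypothetical use: cache the matching doc ids for each filter query.
cache = LRUCache(size=2)
cache.put("category:books", {1, 5, 9})
cache.put("inStock:true", {1, 2, 3})
cache.get("category:books")           # touch it, so it becomes most recent
cache.put("price:[0 TO 10]", {2, 9})  # evicts "inStock:true"
print(cache.get("inStock:true"))
```

A hit saves recomputing the result, which is why cache hit ratios are worth watching on the statistics screen.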
47. More Info
《 Solr 1.4 Enterprise Search Server 》
http://wiki.apache.org/solr/
http://solr.pl/en/
《解密搜索引擎技术实战》