Solr
Hao Chen 2012.04
What is Solr?
 Solr is the popular open source enterprise
  search platform from the Apache Lucene
  project.
 Solr powers the search and navigation
  features of many of the world's largest
  internet sites.
Lucene
   Apache Lucene is a high-performance,
    full-featured text search engine library
    written entirely in Java. It is a technology
    suitable for nearly any application that
    requires full-text search, especially cross-
    platform.
Lucene Vs Solr
   Lucene is a search library built in Java. Solr is a
    web application built on top of Lucene.
   Roughly, Solr = Lucene + added features. A common
    question is when to choose Solr and when to choose
    Lucene.
   For more control, use Lucene. For faster development
    and an easier learning curve, choose Solr.




             http://www.findbestopensource.com/article-detail/lucene-vs-solr
Why do we need Solr?
   Full-text Search
    – MySQL “like %keyword%”



         Too slow! And weak!
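
To make the contrast concrete, here is a minimal Python sketch. It is not Solr or Lucene code, and the sample documents and function names are made up for illustration. A LIKE '%keyword%' query inspects every row because a leading wildcard generally rules out an ordinary B-tree index, whereas a full-text engine looks terms up in an inverted index built ahead of time.

```python
from collections import defaultdict

# Hypothetical sample data standing in for rows of a product table.
DOCS = {
    1: "Apple iPad 2 tablet",
    2: "Leather case for iPad",
    3: "Android tablet 10 inch",
}

def like_scan(docs, keyword):
    """Emulates LIKE '%keyword%': every row is inspected on every query."""
    needle = keyword.lower()
    return sorted(doc_id for doc_id, text in docs.items() if needle in text.lower())

def build_inverted_index(docs):
    """Maps each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def index_lookup(index, keyword):
    """Answers a single-term query with one dictionary lookup."""
    return sorted(index.get(keyword.lower(), set()))

INDEX = build_inverted_index(DOCS)
print(like_scan(DOCS, "ipad"))      # [1, 2]
print(index_lookup(INDEX, "ipad"))  # [1, 2]
```

Both calls return the same ids, but the scan's cost grows with the table while the lookup's does not. Real engines add analysis (stemming, tokenization) and relevance ranking on top of this idea.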
Major Features of Solr
   Advanced Full-Text Search Capabilities
   Optimized for High Volume Web Traffic
   Standards Based Open Interfaces - XML,JSON and HTTP
   Comprehensive HTML Administration Interfaces
   Server statistics exposed over JMX for monitoring
   Scalability - Efficient Replication to other Solr Search
    Servers
   Flexible and Adaptable with XML configuration
   Extensible Plugin Architecture



                                             http://lucene.apache.org/solr/
Typical Application Architecture

                                         Cache
                                      (memcached,
                                       Redis, etc.)
http request
               Web Server
                                                       Database
                                                        (MySQL)

                                                              DIH


                                                      Solr / Lucene


  All the components could be distributed, to
  make the architecture scalable.
Lucene/Solr Architecture
[Figure: Lucene/Solr architecture, layers from top to bottom]
Request Handlers: /admin, /select, /spell
Response Writers: XML, Binary, JSON
Update Handlers: XML, CSV, binary
Search Components: Query, Highlighting, Spelling, Statistics, Faceting, Debug, More like this, Clustering, Distributed Search
Update Processors: Signature, Logging, Indexing
Shared parts: Schema, Config, Query Parsing, Analysis
Other handlers: Extracting Request Handler (PDF/WORD, Apache Tika), Data Import Handler (SQL/RSS), Index Replication
Solr core services: Faceting, Filtering, Search, Caching, Highlighting
Apache Lucene: Core Search (IndexReader/Searcher), Text Analysis, Indexing (IndexWriter)
Demo – A live website powered by Solr


     I’ll be showing you more later!
Demo – The backend of the website
Demo - Standard directory layout
Demo - Multiple cores
Demo – Run Solr!
   java -jar start.jar
   Production environment:
    – java -Xms200m -Xmx1400m -jar start.jar
      >>/home/web_logs/solr/solr$date.log 2>&1 &
    – tailf /home/web_logs/solr/solr20120423.log
       2012-04-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
Demo – Web Admin Interface
  http://localhost:8983/solr/admin

  • SCHEMA: This downloads the schema configuration file
  (XML) directly to the browser.
  • CONFIG: It is similar to the SCHEMA choice, but this is the
  main configuration file for Solr.
  • ANALYSIS: It is used for diagnosing potential
  query/indexing problems having to do with the text analysis.
  This is a somewhat advanced screen and will be discussed
  later.
  •SCHEMA BROWSER: This is a neat view of the schema
  reflecting various heuristics of the actual data in the index.
  We'll return here later.
  •STATISTICS: Here you will find stats such as timing and
  cache hit ratios. In Chapter 9, we will visit this screen to
  evaluate Solr's performance.


                 • INFO: This lists static versioning information
                 about internal components to Solr. Frankly, it's
                 not very useful.

                 • DISTRIBUTION: It contains
                 Distributed/Replicated status information, only
                 applicable for such configurations.

                 • PING: Ignore this, although it can be used
                 for a health-check in distributed mode.

                 • LOGGING: This allows you to adjust the
                 logging levels for different parts of Solr at
                 runtime. For Jetty as we're running it, this
                 output goes to the console and nowhere else.
Query
Indexing
Query
 INFO: [core1] webapp=/solr path=/admin/ping params={}
  status=0 QTime=2
 Apr 23, 2012 5:42:46 PM org.apache.solr.core.SolrCore execute

   INFO: [core1] webapp=/solr path=/select params={wt=json&rows=100&json.nl=map&start=0&q=searchKeyword:ipad2} hits=48 status=0 QTime=0
Query
   INFO: [] webapp=/solr path=/select params={wt=json&rows=20&json.nl=map&start=0&sort=volume+desc&q=CId:50011744+AND+price:[100+TO+*]} hits=1547 status=0 QTime=41

   q=CId:50011744+AND+price:[100+TO+*]
   sort=volume+desc
   start=0
   rows=20

   hits=1547 status=0 QTime=41
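Query
   q - the query string; required.
   fl - the fields to return, separated by commas or spaces.
   start - the offset of the first returned record within the full result
    set, starting at 0; typically used for paging.
   rows - the maximum number of records to return; combine with start to
    implement paging.
   sort - the sort order, in the form sort=<field name>+<desc|asc>[,<field
    name>+<desc|asc>]… For example, (inStock desc, price asc) sorts by
    inStock descending, then by price ascending. The default is descending
    relevance.
   wt - (writer type) the output format: xml, json, php, or phps. The latter
    formats were added in Solr 1.3; let us know if you need them, because
    they are not enabled by default.
   fq - (filter query) restricts the results of q to those that also match
    fq. For example, q=mm&fq=date_time:[20081001 TO 20091031] finds the
    keyword mm where date_time is between 20081001 and 20091031.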




                  More: http://wiki.apache.org/solr/CommonQueryParameters
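
The parameters above can be combined into a request URL. The following Python sketch assumes the local example server from the demo slides and a single-core /solr/select path (a multi-core setup would add the core name to the path); the function name is made up for illustration. It reuses the query from the log line shown earlier.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed base URL: the local example server used elsewhere in these slides.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_select_url(q, fq=None, fl=None, sort=None, start=0, rows=20, wt="json"):
    """Returns a /select URL; optional parameters are left out when None."""
    params = [("q", q), ("start", start), ("rows", rows), ("wt", wt)]
    if fq:
        params.append(("fq", fq))
    if fl:
        params.append(("fl", fl))
    if sort:
        params.append(("sort", sort))
    # urlencode escapes spaces, brackets and colons for us.
    return SOLR_SELECT + "?" + urlencode(params)

url = build_select_url(
    q="CId:50011744 AND price:[100 TO *]",
    sort="volume desc",
    start=0,
    rows=20,
)
print(url)
```

To fetch the second page of results, keep rows=20 and pass start=20.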
Demo – PHP Solr Client
Query - Demo
Indexing Data
Indexing Data - Communicating with Solr

  – Direct HTTP or a convenient client API
  – Data streamed remotely or from Solr's
    filesystem
Indexing Data - Data formats/sources

  – Solr-XML:

  – Solr-binary:
    This is only supported by the SolrJ client API.

  – CSV:
    CSV is a character separated value format (often a comma).

  – Rich documents like PDF, XLS, DOC, PPT

  – Solr's DataImportHandler (DIH) contrib add-on can import from both
    databases and XML sources (for example, web services). It supports
    configurable relational and schema mapping options, plus custom
    transformations where needed. The DIH uniquely supports delta
    updates if the source data has modification dates.
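
As an illustration of the Solr-XML format over direct HTTP, the Python sketch below builds an <add> message and prepares a POST request without sending it. The endpoint assumes the local example server; the document and its field names are hypothetical and would have to exist in your schema.xml.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

# Assumed endpoint: the XML update handler of a local example server.
UPDATE_URL = "http://localhost:8983/solr/update"

def build_add_message(doc):
    """Serializes one document as a Solr-XML <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="utf-8")

def build_update_request(payload):
    """Prepares (but does not send) the HTTP POST carrying the message."""
    return Request(
        UPDATE_URL,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )

# Hypothetical document; the field names must exist in your schema.xml.
payload = build_add_message({"id": "1001", "title": "iPad 2", "price": 3688})
request = build_update_request(payload)
print(payload.decode("utf-8"))
```

Sending the request would be a call to urllib.request.urlopen(request), followed by a <commit/> message so that the new document becomes searchable.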
Lucene/Solr Indexing
[Figure: Lucene/Solr indexing pipeline]
Inputs: XML documents (<doc>, <title>) and PDF files sent by HTTP POST; RSS feeds and SQL databases pulled by the Data Import Handler.
Handlers:
  /update - XML Update Handler
  /update/csv - CSV Update Handler
  /update/xml - XML Update Handler with custom processor chain
  /update/extract - Extracting RequestHandler (PDF, Word, …)
  Data Import Handler - database pull, RSS pull, simple transforms
Update Processor Chain (per handler): Remove Duplicates processor, Custom Transform processor, Logging processor, Index processor
Output: Text Index Analyzers, then Lucene, producing the Lucene Index
Indexing Data -   Schema


   schema.xml
Advanced
   Chinese Word Segmentation (中文分词)
   DIH (Data Import Handler)
   Sharding
   Replication
   Performance Tuning
Chinese Word Segmentation ( 中文分词 )
IKAnalyzer3.2.8.jar
Chinese Word Segmentation ( 中文分词 )


  For the underlying principles, see the book 《解密搜索引擎技术实战》.
DIH (Data Import Handler)
Most applications store data in relational databases or XML files
and searching over such data is a common use-case.

The DataImportHandler is a Solr contrib that provides a configuration driven way to
import this data into Solr in both "full builds" and using incremental delta imports.




                                    jdbc/DIH

                   MySQL                                   Solr


                                •    full-import
                                •    delta-import
DIH (Data Import Handler)
•   Imports data from databases through JDBC (Java Database Connectivity)

•   Imports XML data from a URL (HTTP GET) or a file

•   Can combine data from different tables or sources in various ways

•   Extraction/Transformation of the data

•   Import of updated (delta) data from a database, assuming a last-
    updated date

•   A diagnostic/development web page

•   Extensible to support alternative data sources and transformation steps
DIH (Data Import Handler)
•   curl http://localhost:8983/solr/dataimport to verify the configuration.

•   curl http://localhost:8983/solr/dataimport?command=full-import
•   curl http://localhost:8983/solr/dataimport?command=delta-import
DIH (Data Import Handler) - Full Import Example (full index)
data-config.xml
DIH (Data Import Handler) - Delta Import Example (incremental index)
data-config.xml
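
The original slides showed data-config.xml as images, so the following is only a sketch of what such a file might look like, held in a Python string so it can be checked for well-formedness. The database name, credentials, table and column names are placeholders. The query attribute drives full-import, while deltaQuery and deltaImportQuery drive delta-import using the DIH variables ${dataimporter.last_index_time} and ${dataimporter.delta.id}.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-config.xml: database name, credentials, table and column
# names are placeholders and must be adapted to your own schema.
DATA_CONFIG = """\
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/shop"
              user="solr" password="secret"/>
  <document>
    <entity name="item" pk="id"
            query="SELECT id, title, price FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, price FROM item
                              WHERE id = '${dataimporter.delta.id}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="price" name="price"/>
    </entity>
  </document>
</dataConfig>
"""

def summarize(config_xml):
    """Parses the config and reports which import modes each entity supports."""
    root = ET.fromstring(config_xml)
    summary = {}
    for entity in root.iter("entity"):
        modes = ["full-import"] if entity.get("query") else []
        if entity.get("deltaQuery") and entity.get("deltaImportQuery"):
            modes.append("delta-import")
        summary[entity.get("name")] = modes
    return summary

print(summarize(DATA_CONFIG))  # {'item': ['full-import', 'delta-import']}
```

The delta import relies on a last-modified column in the source table, which is the "last-updated date" assumption listed in the DIH feature slide.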
DIH (Data Import Handler) - Demo

     Linux aaa 2.6.18-243.el5 #1 SMP Mon Feb 7 18:47:27 EST
     2011 x86_64 x86_64 x86_64 GNU/Linux
     Intel(R) Xeon(R) CPU       E5620 @ 2.40GHz
     cpu cores    :1

     MemTotal:    2058400 kB



     2 million rows imported in about 20 minutes.
Sharding
   Sharding is the process of breaking a
    single logical index in a horizontal fashion
    across records versus breaking it up
    vertically by entities.




                S1    S2   S3   S4
Sharding-Indexing
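require 'zlib'

SHARDS = ['http://server1:8983/solr/',
          'http://server2:8983/solr/']

document = { id: 'doc-42' }  # placeholder; supplied by the indexing job
local_thread_id = 0          # placeholder; index of this worker's shard

# String#hash changes between Ruby processes, so use a stable CRC32 instead.
unique_id = document[:id]
if Zlib.crc32(unique_id.to_s) % SHARDS.size == local_thread_id
  # index the document to SHARDS[local_thread_id]
end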
Sharding-Query
The ability to search across shards is built
 into the query request handlers. You do
 not need to do any special configuration
 to activate it.
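
In practice, a distributed search is requested by adding a shards parameter to an ordinary query. The Python sketch below builds such a URL; the host names follow the indexing slide and are hypothetical, as is the function name.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical shard hosts, matching the server names on the indexing slide.
SHARDS = ["server1:8983/solr", "server2:8983/solr"]

def build_distributed_url(entry_point, q, shards, rows=10):
    """Builds a /select URL that asks one node to query all listed shards."""
    params = {
        "q": q,
        "rows": rows,
        # Comma-separated host:port/path entries, without the http:// prefix.
        "shards": ",".join(shards),
    }
    return "http://" + entry_point + "/select?" + urlencode(params)

url = build_distributed_url(SHARDS[0], "ipad2", SHARDS)
print(url)
```

The node that receives the request fans the query out to every listed shard and merges the results, so any shard can serve as the entry point.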
Replication

          Master




                   Slaves
Combining replication and sharding

                                            Sharding
                    M1   M2   M3            Masters

                                    Replication



     S1    S2      S3          S1      S2    S3



    Slave Pool 1              Slave Pool 2




                              Queries sent to pools of slave shards
Combining replication and sharding




              http://wiki.apache.org/solr/SolrCloud
              http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html
Performance Tuning
 JVM
 http cache
 Solr Cache
 Better schema
 Better indexing strategy
Solr Caching
 Caching is a key part of what makes Solr
  fast and scalable
 There are a number of different caches
  configured in solrconfig.xml:
    – filterCache
    – queryResultCache
    – documentCache
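
To illustrate why such caches help, here is a conceptual Python sketch of a least-recently-used cache keyed by a filter, loosely analogous to the filterCache. It is not Solr's implementation; the class, the price data, and the filter function are invented for the example. The hit and miss counters correspond to the cache hit ratios shown on the admin STATISTICS page.

```python
from collections import OrderedDict

class LRUCache:
    """A tiny least-recently-used cache with hit/miss counters."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = compute(key)
        self.entries[key] = value
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict the least recently used
        return value

# Hypothetical index: document id -> price.
PRICES = {1: 50, 2: 150, 3: 300}

def docs_matching(min_price):
    """Stands in for the expensive step of evaluating a filter query."""
    return frozenset(d for d, p in PRICES.items() if p >= min_price)

filter_cache = LRUCache(size=2)
for min_price in (100, 100, 200, 100):
    filter_cache.get_or_compute(min_price, docs_matching)

print(filter_cache.hits, filter_cache.misses)  # 2 2
```

Repeated filters are answered from memory, which is why a frequently reused fq clause tends to be cheaper than folding the same restriction into q.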
More Info
 《 Solr 1.4 Enterprise Search Server 》
 http://wiki.apache.org/solr/
 http://solr.pl/en/
 《解密搜索引擎技术实战》
Thank you!

How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Solr -

  • 2. What is Solr?
      – Solr is the popular open source enterprise search platform from the Apache Lucene project.
      – Solr powers the search and navigation features of many of the world's largest internet sites.
  • 3. Lucene
      – Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  • 4. Lucene Vs Solr
      – Lucene is a search library built in Java. Solr is a web application built on top of Lucene.
      – In short, Solr = Lucene + added features. A common question is when to choose Solr and when to choose Lucene.
      – For more control, use Lucene. For faster development and an easier learning curve, choose Solr.
      Source: http://www.findbestopensource.com/article-detail/lucene-vs-solr
  • 5. Why do we need Solr?
      – Full-text search
      – MySQL “like %keyword%”: too slow, and weak!
  • 6. Major Features of Solr
      – Advanced full-text search capabilities
      – Optimized for high-volume web traffic
      – Standards-based open interfaces: XML, JSON and HTTP
      – Comprehensive HTML administration interfaces
      – Server statistics exposed over JMX for monitoring
      – Scalability: efficient replication to other Solr search servers
      – Flexible and adaptable with XML configuration
      – Extensible plugin architecture
      Source: http://lucene.apache.org/solr/
  • 7. Typical Application Architecture
      [Diagram components: HTTP request, Web Server, Cache (memcached, Redis, etc.), Database (MySQL), DIH, Solr / Lucene]
      All the components can be distributed to make the architecture scalable.
  • 8. Lucene/Solr Architecture
      [Diagram. Recoverable labels:]
      – Request Handlers: /admin, /select, /spell
      – Response Writers: XML, Binary, JSON
      – Update Handlers: XML, CSV, binary
      – Search Components: Query, Highlighting, Spelling, Statistics, Faceting, Debug, More like this, Clustering, Distributed Search
      – Update Processors: Signature, Logging, Indexing
      – Other blocks: Extracting Request Handler (PDF/WORD, Apache Tika), Data Import Handler (SQL/RSS), Schema, Config, Query Parsing, Analysis
      – Core features: Faceting, Filtering, Search, Caching, Highlighting, Index Replication
      – Apache Lucene: Core Search (IndexReader/Searcher), Indexing (IndexWriter), Text Analysis
  • 9. Demo – A live website powered by Solr I’ll be showing you more later!
  • 10. Demo – The backend of the website
  • 11. Demo - Standard directory layout
  • 13. Demo – Run Solr!
      – java -jar start.jar
      – Production environment:
          java -Xms200m -Xmx1400m -jar start.jar >>/home/web_logs/solr/solr$date.log 2>&1 &
          tailf /home/web_logs/solr/solr20120423.log
          2012-04-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
  • 14. Demo – Web Admin Interface http://localhost:8983/solr/admin
  • 15. Demo – Web Admin Interface http://localhost:8983/solr/admin
      – SCHEMA: This downloads the schema configuration file (XML) directly to the browser.
      – CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr.
      – ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen and will be discussed later.
      – SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index. We'll return here later.
      – STATISTICS: Here you will find stats such as timing and cache hit ratios. In Chapter 9, we will visit this screen to evaluate Solr's performance.
  • 16. Demo – Web Admin Interface http://localhost:8983/solr/admin
      – INFO: This lists static versioning information about internal components to Solr. Frankly, it's not very useful.
      – DISTRIBUTION: It contains Distributed/Replicated status information, only applicable for such configurations.
      – PING: Ignore this, although it can be used for a health-check in distributed mode.
      – LOGGING: This allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty as we're running it, this output goes to the console and nowhere else.
  • 18. Query
      INFO: [core1] webapp=/solr path=/admin/ping params={} status=0 QTime=2
      Apr 23, 2012 5:42:46 PM org.apache.solr.core.SolrCore execute
      INFO: [core1] webapp=/solr path=/select params={wt=json&rows=100&json.nl=map&start=0&q=searchKeyword:ipad2} hits=48 status=0 QTime=0
  • 19. Query
      INFO: [] webapp=/solr path=/select params={wt=json&rows=20&json.nl=map&start=0&sort=volume+desc&q=CId:50011744+AND+price:[100+TO+*]} hits=1547 status=0 QTime=41
      – q=CId:50011744+AND+price:[100+TO+*]
      – sort=volume+desc
      – start=0
      – rows=20
      – hits=1547 status=0 QTime=41
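
The request log line on slide 19 carries the numbers most useful for monitoring: the hit count and the query time (QTime, in milliseconds). As a minimal sketch, assuming log lines follow the layout shown on the slide, they could be extracted in Python like this. The helper name parse_request_line and the regular expression are illustrative, not part of Solr.

```python
import re

# Sample line copied from slide 19.
LOG_LINE = (
    "INFO: [] webapp=/solr path=/select "
    "params={wt=json&rows=20&json.nl=map&start=0&sort=volume+desc"
    "&q=CId:50011744+AND+price:[100+TO+*]} hits=1547 status=0 QTime=41"
)

# hits= is optional because some requests (for example /admin/ping) omit it.
PATTERN = re.compile(
    r"path=(?P<path>\S+) params=\{(?P<params>.*)\} "
    r"(?:hits=(?P<hits>\d+) )?status=(?P<status>\d+) QTime=(?P<qtime>\d+)"
)


def parse_request_line(line):
    """Return path, params, hits, status and qtime from a request log line."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("hits", "status", "qtime"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


print(parse_request_line(LOG_LINE))
```

Aggregating qtime per path over a log file is one simple way to spot slow queries before tuning caches or the schema.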
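  • 20. Query
      – q: the query string; required.
      – fl: which fields to return; separate multiple fields with commas or spaces.
      – start: the offset of the first returned record within the full result set, starting from 0; typically used for pagination.
      – rows: the maximum number of records to return; used together with start to implement pagination.
      – sort: the sort order, in the form sort=<field name>+<desc|asc>[,<field name>+<desc|asc>]… . Example: (inStock desc, price asc) sorts by inStock descending, then by price ascending. The default is relevance, descending.
      – wt: (writer type) the output format; can be xml, json, php, phps. The latter ones were added in Solr 1.3; let us know if you need them, because they are not enabled by default.
      – fq: (filter query) restricts the results of q to those that also match fq. Example: q=mm&fq=date_time:[20081001 TO 20091031] finds the keyword mm where date_time is between 20081001 and 20091031.
      More: http://wiki.apache.org/solr/CommonQueryParameters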
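
The parameters above can be assembled into a request URL programmatically. The following is a minimal sketch in Python, assuming a Solr instance at http://localhost:8983/solr as in the demo slides. The helper name build_select_url is hypothetical, and the field names CId, price and volume are taken from the log line on slide 19.

```python
from urllib.parse import urlencode

# Assumed base URL, matching the demo slides; adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(q, start=0, rows=20, sort=None, fl=None, fq=None, wt="json"):
    """Build a /select URL from the common query parameters (hypothetical helper)."""
    params = [("q", q), ("start", start), ("rows", rows), ("wt", wt)]
    if sort:
        params.append(("sort", sort))
    if fl:
        params.append(("fl", fl))
    if fq:
        params.append(("fq", fq))
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return SOLR_BASE + "/select?" + urlencode(params)


url = build_select_url("CId:50011744 AND price:[100 TO *]", sort="volume desc")
print(url)
```

Paging through results then only means increasing start by rows on each request while keeping q and sort unchanged.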
  • 21. Demo – PHP Solr Client
  • 24. Indexing Data - Communicating with Solr
      – Direct HTTP or a convenient client API
      – Data streamed remotely or from Solr's filesystem
  • 25. Indexing Data - Data formats/sources
      – Solr-XML:
      – Solr-binary: This is only supported by the SolrJ client API.
      – CSV: CSV is a character-separated value format (the separator is often a comma).
      – Rich documents like PDF, XLS, DOC, PPT
      – Solr's DataImportHandler (DIH) contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example, web services). It supports configurable relational and schema mapping options and supports custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates.
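
The slide leaves the Solr-XML entry blank. In that format, documents are wrapped in an <add> element, each as a <doc> containing <field name="..."> children. As a minimal sketch, the message body could be built in Python as follows. The field names id and title are hypothetical and would have to exist in schema.xml, and the helper name build_add_xml is illustrative.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts as a Solr-XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names must match the schema.
payload = build_add_xml([{"id": "1", "title": "ipad2 & case"}])
print(payload)
```

The resulting string would be sent by HTTP POST to the XML update handler, and a commit is needed before the documents become searchable.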
  • 26. Lucene/Solr Indexing
      [Diagram. Recoverable labels:]
      – Inputs: PDF and <doc> <title> documents sent by HTTP POST; RSS feed and SQL DB pulled by the Data Import Handler
      – Endpoints: /update, /update/csv, /update/xml, /update/extract
      – Handlers: XML Update Handler, CSV Update Handler, XML Update with custom processor chain, Extracting RequestHandler (PDF, Word, …), Data Import Handler (RSS pull, database pull, simple transforms)
      – Update Processor Chain (per handler): Remove Duplicates processor, Custom Transform processor, Logging processor, Index processor
      – Text Index Analyzers, Lucene Index
  • 27. Indexing Data - Schema  schema.xml
  • 28. Advanced
      – Chinese Word Segmentation ( 中文分词 )
      – DIH (Data Import Handler)
      – Sharding
      – Replication
      – Performance Tuning
  • 29. Chinese Word Segmentation ( 中文分词 )
  • 30. Chinese Word Segmentation ( 中文分词 )
  • 31. Chinese Word Segmentation ( 中文分词 ) IKAnalyzer3.2.8.jar
  • 32. Chinese Word Segmentation ( 中文分词 ) For the underlying principles, see the book 《解密搜索引擎技术实战》.
  • 33. DIH (Data Import Handler)
      Most applications store data in relational databases or XML files, and searching over such data is a common use case. The DataImportHandler is a Solr contrib that provides a configuration-driven way to import this data into Solr, both as "full builds" and as incremental delta imports.
      [Diagram: MySQL to Solr via JDBC/DIH, with two modes: full-import and delta-import]
  • 34. DIH (Data Import Handler)
      – Imports data from databases through JDBC (Java Database Connectivity)
      – Imports XML data from a URL (HTTP GET) or a file
      – Can combine data from different tables or sources in various ways
      – Extraction/transformation of the data
      – Import of updated (delta) data from a database, assuming a last-updated date
      – A diagnostic/development web page
      – Extensible to support alternative data sources and transformation steps
  • 35. DIH (Data Import Handler)
      – curl http://localhost:8983/solr/dataimport (to verify the configuration)
      – curl http://localhost:8983/solr/dataimport?command=full-import
      – curl http://localhost:8983/solr/dataimport?command=delta-import
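
The same DIH requests can be prepared from application code, for example to schedule delta imports. The following is a minimal sketch in Python, assuming the handler is mounted at /dataimport as in the curl commands on slide 35. The helper name dih_url is hypothetical. The status command and the commit option are DataImportHandler features that do not appear on the slide.

```python
from urllib.parse import urlencode

# Assumed handler location, as used in the curl commands on the slide.
DIH_URL = "http://localhost:8983/solr/dataimport"

# full-import and delta-import appear on the slide; status is another
# DataImportHandler command, included here to check on a running import.
COMMANDS = {"full-import", "delta-import", "status"}


def dih_url(command=None, **options):
    """Return the DIH URL for a command (hypothetical helper)."""
    if command is None:
        return DIH_URL  # no command: used on the slide to verify the configuration
    if command not in COMMANDS:
        raise ValueError("unsupported DIH command: " + command)
    params = {"command": command}
    params.update(options)
    return DIH_URL + "?" + urlencode(params)


print(dih_url())
print(dih_url("full-import"))
print(dih_url("delta-import", commit="true"))
```

Each URL can then be fetched with any HTTP client, such as urllib.request.urlopen, in place of curl.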
  • 36. DIH (Data Import Handler) - Full Import Example (full indexing) data-config.xml
  • 37. DIH (Data Import Handler) - Delta Import Example (incremental indexing) data-config.xml
  • 38. DIH (Data Import Handler) - Demo
      – Linux aaa 2.6.18-243.el5 #1 SMP Mon Feb 7 18:47:27 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
      – Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, cpu cores: 1
      – MemTotal: 2058400 kB
      – 2 million rows imported in about 20 minutes.
  • 39. Sharding
      – Sharding splits a single logical index horizontally, across records, rather than vertically, by entity.
      [Diagram: shards S1, S2, S3, S4]
  • 40. Sharding-Indexing
      SHARDS = ['http://server1:8983/solr/', 'http://server2:8983/solr/']
      unique_id = document[:id]
      if unique_id.hash % SHARDS.size == local_thread_id
        # index to shard
      end
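
The routing rule in the slide's pseudocode (hash of the unique id, modulo the number of shards) can be sketched in Python as follows. This version returns the shard URL directly rather than comparing against a thread id, and it swaps in CRC32 for the slide's hash call, because Python's built-in hash() for strings varies between runs. The helper name shard_for is hypothetical.

```python
import zlib

# Shard URLs as listed on the slide.
SHARDS = ["http://server1:8983/solr/", "http://server2:8983/solr/"]


def shard_for(unique_id, shards=SHARDS):
    """Pick a shard deterministically from the document's unique id."""
    # CRC32 is stable across processes, unlike Python's built-in hash() on str.
    index = zlib.crc32(str(unique_id).encode("utf-8")) % len(shards)
    return shards[index]


for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, "->", shard_for(doc_id))
```

Because the mapping depends on the number of shards, adding a shard later changes where most documents belong, so a reindex is usually needed after resizing.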
  • 41. Sharding-Query The ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it.
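
Although no handler configuration is needed, the request itself still has to say which shards to consult. In Solr's distributed search this is done with the shards request parameter, a comma-separated list of host:port/path entries without the http:// prefix. The following is a minimal sketch in Python; the server names follow slide 40, and the helper name distributed_select_url is hypothetical.

```python
from urllib.parse import urlencode

# Shard addresses in the form expected by the shards parameter (no scheme).
SHARD_HOSTS = ["server1:8983/solr", "server2:8983/solr"]


def distributed_select_url(entry_point, q, shard_hosts=SHARD_HOSTS):
    """Build a /select URL that fans the query out to all listed shards."""
    params = {"q": q, "shards": ",".join(shard_hosts)}
    # safe=":/," leaves the shard list unescaped so the URL stays readable.
    return entry_point.rstrip("/") + "/select?" + urlencode(params, safe=":/,")


print(distributed_select_url("http://server1:8983/solr/", "ipad2"))
```

Any shard can serve as the entry point: it forwards the query to the others and merges their results.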
  • 42. Replication
      [Diagram: one Master replicating to several Slaves]
  • 43. Combining replication and sharding
      [Diagram: the index is sharded across masters M1, M2, M3; each master replicates to slaves S1, S2, S3 in Slave Pool 1 and Slave Pool 2]
      Queries are sent to pools of slave shards.
  • 44. Combining replication and sharding http://wiki.apache.org/solr/SolrCloud http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html
  • 45. Performance Tuning
      – JVM
      – HTTP cache
      – Solr cache
      – Better schema
      – Better indexing strategy
  • 46. Solr Caching
      – Caching is a key part of what makes Solr fast and scalable.
      – There are a number of different caches configured in solrconfig.xml:
          – filterCache
          – queryResultCache
          – documentCache
  • 47. More Info
      – 《Solr 1.4 Enterprise Search Server》
      – http://wiki.apache.org/solr/
      – http://solr.pl/en/
      – 《解密搜索引擎技术实战》

Editor's Notes

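  1.    SolrCloud is a distributed search solution based on Solr and ZooKeeper. It is one of the core components of Solr 4.0, which is under development. Its main idea is to use ZooKeeper as the central store for the cluster's configuration. Its distinguishing features are: 1) centralized configuration, 2) automatic fault tolerance, 3) near-real-time search, 4) automatic load balancing at query time.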