SlideShare une entreprise Scribd logo
1  sur  34
Télécharger pour lire hors ligne
Using	
  Hadoop/MapReduce	
  with	
  
Solr/Lucene	
  for	
  Large	
  Scale	
  
Distributed	
  Search
简报单位:精诚资讯(SYSTEX	
  Corp)云中心/Big	
  Data事业	
  
简报时间:2011.12.2	
  Hadoop	
  in	
  China	
  2011	
  
简报人	
  	
  	
  	
  	
  	
  :	
  	
  陈昭宇(James	
  Chen)	
  




                                                             1
About	
  Myself
陈昭宇 (James	
  Chen)	
  	
             关于精诠Big Data事业
•  精诚资讯云中心Big Data事业首席顾问                  精诠Big Data事业团队以北京为
•  十年网络资安产品研发经验,专长Email               基地,以发展中国的Big Data Killer App
    Web Security                     为使命。奠基于全球的Hadoop
•  5年Hadoop相关技术实务经验,曾经                Ecosystem,与所有的Hadoop
   参与多项海量数据平台的架构设计与开                  Distributions、Tools厂商友好,致力于
                                      海量数据的End-to-End解决方案,直接
    •    海量Web日志处理
                                      面对行业,直接解决特定的商业问题,
    •    垃圾邮件采集及分析                    凸显团队的专业价值。
    •    电信行业CDR帐务分析
                                         团队中有亚洲少见的多年Hadoop实
    •    电信加值服务用户行为分析                 战经验成员,在商业Production环境中
    •    智能电网数据采集处理与分析                运行过数百到上万个节点的服务丛集,
    •    在线服务推荐引擎与电子商务应用
                                      并有数位拥有Cloudera Certified
                                      Developer/Administrator for Apache
•  长期关注Hadoop ecosystem与Solr          Hadoop的专业认证。
   /Lucene等开源软件的发展
•  Cloudera Certified Developer for
   Apache Hadoop (CCDH)。
Agenda

•      Why	
  search	
  on	
  big	
  data	
  ?	
  
•      Lucene,	
  Solr,	
  ZooKeeper	
  
•      Distributed	
  Searching	
  
•      Distributed	
  Indexing	
  with	
  Hadoop	
  
•      Case	
  Study:	
  Web	
  Log	
  CategorizaTon	
  
	
  
Why	
  Search	
  on	
  Big	
  Data	
  ?

                                  Structured	
  	
  Unstructured


                                  No	
  Schema	
  LimitaTon


                                  Ad-­‐hoc	
  Analysis


                                  Keep	
  Everything
A	
  java	
  library	
  for	
  search

•  Mature,	
  feature	
  riched	
  and	
  high	
  performance
•  indexing,	
  searching,	
  highlighTng,	
  spell	
  checking,	
  etc.

Not	
  a	
  crawler

•  Doesn’t	
  know	
  anything	
  about	
  Adobe	
  PDF,	
  MS	
  Word,	
  
   etc.

Not	
  fully	
  distributed

•  Does	
  not	
  support	
  distributed	
  index
Ease	
  of	
  Integra@on             Easy	
  setup	
  and	
  
•  Without	
  knowing	
  Java!       configura@on
•  HTTP/XML	
  interface	
           •  Simply	
  deploy	
  in	
  any	
  web	
  
                                        container


                                 Lucene

More	
  Features                     Distributed
•  FaceTng                           •  ReplicaTon
•  HighlighTng                       •  Sharding
•  Caching                           •  MulT	
  Core


                                                              h_p://search.lucidimaginaTon.com	
  
Lucene Default Query Syntax
Lucene Query Syntax [; sort specification]

1.    mission impossible; releaseDate desc
2.    +mission +impossible –actor:cruise
3.    “mission impossible” –actor:cruise
4.    title:spiderman^10 description:spiderman
5.    description:“spiderman movie”~10
6.    +HDTV +weight:[0 TO 100]
7.    Wildcard queries: te?t, te*t, test*

for more information:
http://lucene.apache.org/java/3_4_0/queryparsersyntax.html

                                                 7
Solr Default Parameters

Query Arguments for HTTP GET/POST to /select

 param default     description
 q                The query
 start   0        Offset into the list of matches
 rows    10       Number of documents to return
 fl      *        Stored fields to return
 qt      standard Query type; maps to query handler
 df      (schema) Default field to search
Solr Search Results Example
http://localhost:8983/solr/select
   ?q=videostart=0rows=2fl=name,price

responseresponseHeaderstatus0/status
 QTime1/QTime/responseHeader
 result numFound=16173 start=0
   doc
    str name=nameApple 60 GB iPod with Video/str
    float name=price399.0/float
   /doc
   doc
    str name=nameASUS Extreme N7800GTX/2DHTV/str
    float name=price479.95/float
   /doc
 /result
/response

                                               9
Zookeeper




     Configuration                Group Service              Synchronization
•  Sharing configuration      •  Master/Leader Election   •  Coordinating searches
   across nodes               •  Central Management          across shards/load
•  Maintaining status about   •  Replication                 balancing
   shards                     •  Name Space




                              Coordina@on
Zookeeper	
  Architecture


                                            All	
  servers	
  store	
  a	
  copy	
  of	
  the	
  
                                                      data	
  (in	
  memory)‫‏‬

                    Leader

                                             A	
  leader	
  is	
  elected	
  at	
  startup
  Server     Server          Server


                                              Followers	
  service	
  clients,	
  all	
  
                                               updates	
  go	
  through	
  leader


                                             Update	
  responses	
  are	
  sent	
  
Client     Client     Client      Client   when	
  a	
  majority	
  of	
  servers	
  have	
  
                                               persisted	
  the	
  change
Distributed	
  Search

                                       •  Shard	
  
                                          Management
                         Scalability
                                       •  Sharding	
  
                                          Algorithm


                                            •  Query	
  
                             Performance       DistribuTon
                                            •  Load	
  Balancing



                                       •  ReplicaTon
                        Availability
                                       •  Server	
  Fail-­‐Over
SolrCloud	
  –	
  Distributed	
  Solr	
  

            Cluster	
                    •  Central	
  configuraTon	
  for	
  the	
  enTre	
  cluster
          Management                     •  Cluster	
  state	
  and	
  layout	
  stored	
  in	
  central	
  system


            Zookeeper	
                  •  Live	
  node	
  tracking
            IntegraTon                   •  Central	
  Repository	
  for	
  configuraTon	
  and	
  state


       Remove	
  external	
              •  AutomaTc	
  fail-­‐over	
  and	
  load	
  balancing
        load	
  balancing

     Client	
  don’t	
  need	
  to	
     •  Any	
  node	
  can	
  serve	
  any	
  request
      know	
  shard	
  at	
  all
SolrCloud	
  Architecture

                 Share Storage
                      or                             ZooKeepoer
                                                              	
                     DFS 	

         download                                                 configuration	
         shards	




           Solr Shard 1            Solr Shard 2         Solr Shard 3
             Solr Shard              Solr Shard           Solr Shard

 replication	
              replication	
         replication	


                                    Collection
Start	
  a	
  SolrCloud	
  node
solr.xml	
solr persistent=false!
!
   !--!
   adminPath: RequestHandler path to manage cores.!
     If 'null' (or absent), cores will not be manageable via request
  handler!
   --!
   cores adminPath=/admin/cores defaultCoreName=collection1!
     core name=collection1 instanceDir=. shard=shard1/!
   /cores!
/solr	

start solr instance to serve a shard	
 l    java –Dbootstrap_confdir=./solr/conf !
 l         –Dcollection.configName=myconf !
 l         –DzkRun –jar start.jar
Live Node Tracking	



     Shards  Replication
Distributed	
  Requests	
  of	
  SolrCloud
     Explicitly	
  specify	
  the	
  address	
  of	
  shards

     •  shards=localhost:8983/solr,localhost:7574/solr

     Load	
  balance	
  and	
  fail	
  over	
  

     •  shards=localhost:8983/solr	
  |	
  localhost:8900/solr,localhost:7574/
        solr	
  |	
  localhost:7500/solr	
  

     Query	
  all	
  shards	
  of	
  a	
  collecTon

     •  h_p://localhost:8983/solr/collecTon1/select?distrib=true

     Query	
  specific	
  shard	
  ids

     •  h_p://localhost:8983/solr/collecTon1/select?
        shards=shard_1,shard_2distrib=true
More	
  about	
  SolrCloud


                       STll	
  under	
  development.	
  Not	
  in	
  
                       official	
  release	
  
                       	
  
                       Get	
  SolrCloud:	
  
                             h_ps://svn.apache.org/repos/asf/lucene/
                             dev/trunk	
  
                       Wiki:	
  
                             h_p://wiki.apache.org/solr/SolrCloud	
  
Ka_a	
  –	
  Distributed	
  Lucene



                   Master	
  Slave	
  Architecture

                      Master	
  Fail-­‐Over

                        Index	
  	
  Shard	
  Management

                      ReplicaTon

                   Fault	
  Tolerance	
  
Ka_a	
  Architecture
         fail over	
                                          HDFS	
                 Secondary
 Master
      	
          Master
                       	


   ZK
    	
                 ZK
                        	
    download
                              shards	
 assign
 shards	


 Node
    	
          Node
                   	
        Node
                                	
       Node
                                            	


  shard replication	
                    multicast query	

                                            Java client API
                                            Distributed ranking plug-able
                                            selection policy
Ka_a	
  Index

       Katta Index	

        lucene Index
                   	
        lucene Index
                   	
        lucene Index
                   	




•  Folder	
  with	
  Lucene	
  indexes	
  
•  Shard	
  Index	
  can	
  be	
  zipped
The	
  different	
  between	
  Ka_a	
  and	
  SolrCloud
•  Ka_a	
  is	
  based	
  on	
  lucene	
  not	
  solr	
  
•  Ka_a	
  is	
  master-­‐slave	
  architecture.	
  Master	
  provide	
  management
   	
  command	
  and	
  API	
  
•  Ka_a	
  will	
  automaTcally	
  assign	
  shards	
  and	
  replicaTon	
  for	
  ka_a
   	
  nodes.	
  
•  Client	
  should	
  use	
  Ka_a	
  client	
  API	
  for	
  querying.	
  
•  Ka_a	
  client	
  is	
  Java	
  API,	
  require	
  java	
  programming.	
  
•  Ka_a	
  support	
  read	
  index	
  from	
  HDFS.	
  Plays	
  well	
  with	
  Hadoop	
  
Configure	
  Ka_a
  conf/masters

  • Required	
  on	
  Master

      server0	


  conf/nodes

  • Required	
  fo	
  all	
  nodes

      server1!
      server2!
      server3	


  conf/ka_a.zk.properTes

  • zk	
  can	
  be	
  embedded	
  with	
  master	
  or	
  standalone
  • Standalone	
  zk	
  is	
  required	
  for	
  master	
  fail-­‐over

      zookeeper.server=server0:2181!
      Zookeeper.embedded=true!


  conf/ka_a-­‐env.sh

  • JAVA_HOME	
  required
  • KATTA_MASTER	
  opTonal
    export JAVA_HOME=/usr/lib/j2sdk1.5-sun!
    Export KATTA_MASTER=server0:/home/$USER/katta-distribution!
Ka_a	
  CLI
Search

 • search	
  index	
  name[,index	
  name,…]	
  “query”	
  [count]


Cluster	
  Management

 • listIndexes
 • listNodes	
  
 • startMaster	
  
 • startNode	
  
 • showStructure	
  


Index	
  Management	
  

 • check	
  
 • addIndex	
  index	
  name	
  path	
  to	
  index	
  lucene	
  analyzer	
  class	
  [replicaTon	
  level]	
  
 • removeIndex	
  index	
  name	
  
 • redeployIndex	
  index	
  name	
  
 • listErrors	
  index	
  name	
  
Deploy	
  Ka_a	
  Index

   Deploying	
  Ka_a	
  index	
  on	
  local	
  filesystem

  bin/katta addIndex name of index !
      ![file:///path to index] rep lv	




   Deploying	
  Ka_a	
  index	
  on	
  HDFS

  bin/katta addIndex name of index !
      ![hdfs://namenode:9000/path] rep lv
Building	
  Lucene	
  Index	
  by	
  MapReduce
    Reduce Side Indexing	


                map
                  	
       K: shard id
                           V: doc to be indexed	


                                                                 Index
                                                                shard 1
                                                                      	
                map
                  	
                                  reduce
                                                           	
Input
HDFS                         Shuffle                lucene
                                                         	
Blocks                         /
                              Sort
                                 	
                              Index
                                                                shard 2
                                                                      	
                map
                  	
                                  reduce
                                                           	
                                                    lucene
                                                         	


                map
Building	
  Lucene	
  Index	
  by	
  MapReduce
    Map Side Indexing	


                map
                  	
       K: shard id
                           V: path of indices	
                lucene
                     	
                                                               Index
                                                              shard 1
                                                                    	
                map
                  	
                                reduce
                                                         	
Input
HDFS            lucene
                     	
      Shuffle              Merge
                                                      	
Blocks                         /
                              Sort
                                 	
                            Index
                                                              shard 2
                                                                    	
                map
                  	
                                reduce
                                                         	
                 lucene
                      	
                          Merge
                                                      	


                map
                  	
                 lucene
Mapper
public class IndexMapper extends MapperObject, Text, Text, Text {!
    private String [] shardList;!
   int k=0;!
    !
   protected void setup(Context context) {!
  !!     !shardList=context.getConfiguration().getStrings(shardList);!
    }!
    !
    public void map(Object key, Text value, Context context) !
  !!     !throws IOException, InterruptedException!
  !{!
         context.write(new Text(shardList[k++]), value);!
   !     !k %= shardList.length;!
    }!
Reducer
public class IndexReducer extends ReducerText, Text, Text, IntWritable
    {!
     !Configuration conf;!
     !public void reduce(Text key, IterableText values, Context context)!
     !!    !throws Exception!
     !{!
     !!    !conf = context.getConfiguration();!
     !!    !String indexName = conf.get(indexName);!
     !!    !String shard = key.toString();!
     !!    !writeIndex(indexName, shard, values);!
     !}    !!
     !private void writeIndex(String indexName, String shard,
       IterableText values)!
     !{!
     !!    !IteratorText it=values.iterator();!
     !!    !while(it.hasNext()){!                               use lucene here !	
     !!    !     !//...index using lucene index writer...!
     !!    !}!
     !}!
}!
Sharding	
  Algorithm	
  

•  Good	
  document	
  distribuTon	
  across	
  shards	
  is	
  important	
  
•  Simple	
  approach:	
  
     •  hash(id)	
  %	
  numShards	
  
     •  Fine	
  if	
  number	
  of	
  shards	
  doesn’t	
  change	
  or	
  easy	
  to	
  reindex	
  
•  Be_er:	
  
     •  Consistent	
  Hashing	
  
           •  h_p://en.wikipedia.org/wiki/Consistent_hashing	
  
Case Study – Web Log Categorization
 Customer Background
 •  Tire-1 telco in APAC
 •  Daily web log ~ 20TB
 •  Marketing guys would like to understand more about
    browsing behavior of their customers
 Requirements
 •  The logs collected from the network should be input
    through a “big data” solution.
 •  Provide dashboard of URL category vs. hit count
 •  Perform analysis based on the required attributes and
    algorithms
Hardware Spec

                                   •      1 Master Node (Name Node, Job Tracker,
                                           Katta Master)
  查询             索引        搜索
                                   •  64 TB raw storage size (8 Data Node)
   海量数据库               并行计算
  Hbase / Hive        MapReduce    •  64 cores (8 Data Node)
        分布式文件系统
  Hadoop Distributed File System
                                   •  8 Task Tracker
                                   •  3 katta nodes
                                   •  5 zookeeper nodes
                                   Master	
  Node	
  
                                        •  CPU:	
  Intel®	
  Xeon®	
  Processor	
  E5620	
  2.4G
                                        •  RAM:	
  32G	
  ECC-­‐Registered
                                        •  HDD:	
  SAS	
  300G	
  @	
  15000	
  rpm	
  (RAID-­‐1)	
  
                                   	
  
                                   Data	
  Node	
  
                                        •  CPU:	
  Intel®	
  Xeon®	
  Processor	
  E5620	
  2.4G
                                        •  RAM:	
  32G	
  ECC-­‐Registered
                                        •  HDD:	
  SATA	
  2Tx4,	
  Total	
  8T	
  


                                                                                                        32
System Overview
                                                                              Dash Board  Search	
                   web gateway	




  Log aggregator	
                                                                                               XML response	

                                                                           Web Service API
                                                                                         	

                                                                            Ka_a	
  Client	
  API
                                                                                                	
                        3rd-party                                                                     Search Query	
                         URL DB 	

                                                                Ka_a	
             Ka_a	
            Ka_a	
  
                                                                Node
                                                                   	
              Node
                                                                                      	
             Node
                                                                                                        	
                   MR Jobs	
          URL
      categorization
                   	
                Build	
  Index
                                                  	
                                                                                                        Load Index Shard	


                                                                                                                   Name	
  Node 	
  
     /raw_log	
         /url_cat	
                     HDFS   /shardA/url	
 /shardB/url	
                                                                                          /shardN/url	
                                                                                                                   KaLa	
  Master
Thank You
      
We are hiring !!
mailto: jameschen@systex.com.tw

Contenu connexe

Tendances

Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedInDataWorks Summit
 
MySQL Binary Log API Presentation - OSCON 2011
MySQL Binary Log API Presentation - OSCON 2011MySQL Binary Log API Presentation - OSCON 2011
MySQL Binary Log API Presentation - OSCON 2011Mats Kindahl
 
Expanding dr with_zfssa_110810
Expanding dr with_zfssa_110810Expanding dr with_zfssa_110810
Expanding dr with_zfssa_110810rjmurphyslideshare
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHortonworks
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010
Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010
Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010xlight
 
Benchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseBenchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseChristopher Choi
 
Customer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise editionCustomer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise editionsolarisyougood
 
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architecturesMySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architecturesFromDual GmbH
 
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMRACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMMaaz Anjum
 
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2Jeroen Burgers
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
Cloud Foundry Open Tour - London
Cloud Foundry Open Tour - LondonCloud Foundry Open Tour - London
Cloud Foundry Open Tour - Londonmarklucovsky
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Deep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL UniverseDeep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL UniverseJignesh Shah
 
Novell Identity Manager Tips, Tricks and Best Practices
Novell Identity Manager Tips, Tricks and Best PracticesNovell Identity Manager Tips, Tricks and Best Practices
Novell Identity Manager Tips, Tricks and Best PracticesNovell
 

Tendances (20)

Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Google Compute and MapR
Google Compute and MapRGoogle Compute and MapR
Google Compute and MapR
 
dbaas-clone
dbaas-clonedbaas-clone
dbaas-clone
 
MySQL Binary Log API Presentation - OSCON 2011
MySQL Binary Log API Presentation - OSCON 2011MySQL Binary Log API Presentation - OSCON 2011
MySQL Binary Log API Presentation - OSCON 2011
 
Methods of NoSQL database systems benchmarking
Methods of NoSQL database systems benchmarkingMethods of NoSQL database systems benchmarking
Methods of NoSQL database systems benchmarking
 
Expanding dr with_zfssa_110810
Expanding dr with_zfssa_110810Expanding dr with_zfssa_110810
Expanding dr with_zfssa_110810
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
hadoop_module6
hadoop_module6hadoop_module6
hadoop_module6
 
Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010
Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010
Optimizing Drupal Performance Zend Acquia Whitepaper Feb2010
 
Benchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseBenchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBase
 
Customer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise editionCustomer overview oracle solaris cluster, enterprise edition
Customer overview oracle solaris cluster, enterprise edition
 
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architecturesMySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architectures
 
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASMRACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
RACATTACK Lab Handbook - Enable Flex Cluster and Flex ASM
 
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
Cloud Foundry Open Tour - London
Cloud Foundry Open Tour - LondonCloud Foundry Open Tour - London
Cloud Foundry Open Tour - London
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Deep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL UniverseDeep Dive into RDS PostgreSQL Universe
Deep Dive into RDS PostgreSQL Universe
 
Novell Identity Manager Tips, Tricks and Best Practices
Novell Identity Manager Tips, Tricks and Best PracticesNovell Identity Manager Tips, Tricks and Best Practices
Novell Identity Manager Tips, Tricks and Best Practices
 

Similaire à [Hic2011] using hadoop lucene-solr-for-large-scale-search by systex

Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Docker Swarm and Traefik 2.0
Docker Swarm and Traefik 2.0Docker Swarm and Traefik 2.0
Docker Swarm and Traefik 2.0Jakub Hajek
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr PerformanceLucidworks
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudthelabdude
 
SQL Server Clustering for Dummies
SQL Server Clustering for DummiesSQL Server Clustering for Dummies
SQL Server Clustering for DummiesMark Broadbent
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Eran Gampel
 
NetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talksNetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talksRuslan Meshenberg
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetupsEran Gampel
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesRahul Singh
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesAnant Corporation
 
MySQL Webinar 2/4 Performance tuning, hardware, optimisation
MySQL Webinar 2/4 Performance tuning, hardware, optimisationMySQL Webinar 2/4 Performance tuning, hardware, optimisation
MySQL Webinar 2/4 Performance tuning, hardware, optimisationMark Swarbrick
 
ApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataOpenSource Connections
 
20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptx20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptxIvan Ma
 
OOW09 Ebs Tuning Final
OOW09 Ebs Tuning FinalOOW09 Ebs Tuning Final
OOW09 Ebs Tuning Finaljucaab
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talkAmrit Sarkar
 

Similaire à [Hic2011] using hadoop lucene-solr-for-large-scale-search by systex (20)

Big Search with Big Data Principles
Big Search with Big Data PrinciplesBig Search with Big Data Principles
Big Search with Big Data Principles
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Docker Swarm and Traefik 2.0
Docker Swarm and Traefik 2.0Docker Swarm and Traefik 2.0
Docker Swarm and Traefik 2.0
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
SQL Server Clustering for Dummies
SQL Server Clustering for DummiesSQL Server Clustering for Dummies
SQL Server Clustering for Dummies
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
 
NetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talksNetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talks
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetups
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
 
MySQL Webinar 2/4 Performance tuning, hardware, optimisation
MySQL Webinar 2/4 Performance tuning, hardware, optimisationMySQL Webinar 2/4 Performance tuning, hardware, optimisation
MySQL Webinar 2/4 Performance tuning, hardware, optimisation
 
ApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big Data
 
20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptx20141011 my sql clusterv01pptx
20141011 my sql clusterv01pptx
 
OOW09 Ebs Tuning Final
OOW09 Ebs Tuning FinalOOW09 Ebs Tuning Final
OOW09 Ebs Tuning Final
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talk
 
Weblogic
WeblogicWeblogic
Weblogic
 

Plus de James Chen

Hadoop con 2015 hadoop enables enterprise data lake
Hadoop con 2015   hadoop enables enterprise data lakeHadoop con 2015   hadoop enables enterprise data lake
Hadoop con 2015 hadoop enables enterprise data lakeJames Chen
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用James Chen
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Hadoop的典型应用与企业化之路 for HBTC 2012
Hadoop的典型应用与企业化之路 for HBTC 2012Hadoop的典型应用与企业化之路 for HBTC 2012
Hadoop的典型应用与企业化之路 for HBTC 2012James Chen
 
Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結James Chen
 

Plus de James Chen (6)

Hadoop con 2015 hadoop enables enterprise data lake
Hadoop con 2015   hadoop enables enterprise data lakeHadoop con 2015   hadoop enables enterprise data lake
Hadoop con 2015 hadoop enables enterprise data lake
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Hadoop的典型应用与企业化之路 for HBTC 2012
Hadoop的典型应用与企业化之路 for HBTC 2012Hadoop的典型应用与企业化之路 for HBTC 2012
Hadoop的典型应用与企业化之路 for HBTC 2012
 
Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結
 

Dernier

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Dernier (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex

  • 1. Using  Hadoop/MapReduce  with   Solr/Lucene  for  Large  Scale   Distributed  Search 简报单位:精诚资讯(SYSTEX  Corp)云中心/Big  Data事业   简报时间:2011.12.2  Hadoop  in  China  2011   简报人            :    陈昭宇(James  Chen)   1
  • 2. About  Myself 陈昭宇 (James  Chen)     关于精诠Big Data事业 •  精诚资讯云中心Big Data事业首席顾问 精诠Big Data事业团队以北京为 •  十年网络资安产品研发经验,专长Email 基地,以发展中国的Big Data Killer App Web Security 为使命。奠基于全球的Hadoop •  5年Hadoop相关技术实务经验,曾经 Ecosystem,与所有的Hadoop 参与多项海量数据平台的架构设计与开 Distributions、Tools厂商友好,致力于 海量数据的End-to-End解决方案,直接 •  海量Web日志处理 面对行业,直接解决特定的商业问题, •  垃圾邮件采集及分析 凸显团队的专业价值。 •  电信行业CDR帐务分析 团队中有亚洲少见的多年Hadoop实 •  电信加值服务用户行为分析 战经验成员,在商业Production环境中 •  智能电网数据采集处理与分析 运行过数百到上万个节点的服务丛集, •  在线服务推荐引擎与电子商务应用 并有数位拥有Cloudera Certified Developer/Administrator for Apache •  长期关注Hadoop ecosystem与Solr Hadoop的专业认证。 /Lucene等开源软件的发展 •  Cloudera Certified Developer for Apache Hadoop (CCDH)。
  • 3. Agenda •  Why  search  on  big  data  ?   •  Lucene,  Solr,  ZooKeeper   •  Distributed  Searching   •  Distributed  Indexing  with  Hadoop   •  Case  Study:  Web  Log  CategorizaTon    
  • 4. Why  Search  on  Big  Data  ? Structured    Unstructured No  Schema  LimitaTon Ad-­‐hoc  Analysis Keep  Everything
  • 5. A  java  library  for  search •  Mature,  feature  riched  and  high  performance •  indexing,  searching,  highlighTng,  spell  checking,  etc. Not  a  crawler •  Doesn’t  know  anything  about  Adobe  PDF,  MS  Word,   etc. Not  fully  distributed •  Does  not  support  distributed  index
  • 6. Ease  of  Integra@on Easy  setup  and   •  Without  knowing  Java! configura@on •  HTTP/XML  interface   •  Simply  deploy  in  any  web   container Lucene More  Features Distributed •  FaceTng •  ReplicaTon •  HighlighTng •  Sharding •  Caching •  MulT  Core h_p://search.lucidimaginaTon.com  
  • 7. Lucene Default Query Syntax Lucene Query Syntax [; sort specification] 1.  mission impossible; releaseDate desc 2.  +mission +impossible –actor:cruise 3.  “mission impossible” –actor:cruise 4.  title:spiderman^10 description:spiderman 5.  description:“spiderman movie”~10 6.  +HDTV +weight:[0 TO 100] 7.  Wildcard queries: te?t, te*t, test* for more information: http://lucene.apache.org/java/3_4_0/queryparsersyntax.html 7
  • 8. Solr Default Parameters Query Arguments for HTTP GET/POST to /select param default description q The query start 0 Offset into the list of matches rows 10 Number of documents to return fl * Stored fields to return qt standard Query type; maps to query handler df (schema) Default field to search
  • 9. Solr Search Results Example http://localhost:8983/solr/select ?q=videostart=0rows=2fl=name,price responseresponseHeaderstatus0/status QTime1/QTime/responseHeader result numFound=16173 start=0 doc str name=nameApple 60 GB iPod with Video/str float name=price399.0/float /doc doc str name=nameASUS Extreme N7800GTX/2DHTV/str float name=price479.95/float /doc /result /response 9
  • 10. Zookeeper Configuration Group Service Synchronization •  Sharing configuration •  Master/Leader Election •  Coordinating searches across nodes •  Central Management across shards/load •  Maintaining status about •  Replication balancing shards •  Name Space Coordina@on
  • 11. Zookeeper  Architecture All  servers  store  a  copy  of  the   data  (in  memory)‫‏‬ Leader A  leader  is  elected  at  startup Server Server Server Followers  service  clients,  all   updates  go  through  leader Update  responses  are  sent   Client Client Client Client when  a  majority  of  servers  have   persisted  the  change
  • 12. Distributed  Search •  Shard   Management Scalability •  Sharding   Algorithm •  Query   Performance DistribuTon •  Load  Balancing •  ReplicaTon Availability •  Server  Fail-­‐Over
  • 13. SolrCloud  –  Distributed  Solr   Cluster   •  Central  configuraTon  for  the  enTre  cluster Management •  Cluster  state  and  layout  stored  in  central  system Zookeeper   •  Live  node  tracking IntegraTon •  Central  Repository  for  configuraTon  and  state Remove  external   •  AutomaTc  fail-­‐over  and  load  balancing load  balancing Client  don’t  need  to   •  Any  node  can  serve  any  request know  shard  at  all
  • 14. SolrCloud  Architecture Share Storage or ZooKeepoer DFS download configuration shards Solr Shard 1 Solr Shard 2 Solr Shard 3 Solr Shard Solr Shard Solr Shard replication replication replication Collection
  • 15. Start  a  SolrCloud  node solr.xml solr persistent=false! ! !--! adminPath: RequestHandler path to manage cores.! If 'null' (or absent), cores will not be manageable via request handler! --! cores adminPath=/admin/cores defaultCoreName=collection1! core name=collection1 instanceDir=. shard=shard1/! /cores! /solr start solr instance to serve a shard l  java –Dbootstrap_confdir=./solr/conf ! l  –Dcollection.configName=myconf ! l  –DzkRun –jar start.jar
  • 16. Live Node Tracking Shards Replication
  • 17. Distributed  Requests  of  SolrCloud Explicitly  specify  the  address  of  shards •  shards=localhost:8983/solr,localhost:7574/solr Load  balance  and  fail  over   •  shards=localhost:8983/solr  |  localhost:8900/solr,localhost:7574/ solr  |  localhost:7500/solr   Query  all  shards  of  a  collecTon •  h_p://localhost:8983/solr/collecTon1/select?distrib=true Query  specific  shard  ids •  h_p://localhost:8983/solr/collecTon1/select? shards=shard_1,shard_2distrib=true
  • 18. More  about  SolrCloud STll  under  development.  Not  in   official  release     Get  SolrCloud:   h_ps://svn.apache.org/repos/asf/lucene/ dev/trunk   Wiki:   h_p://wiki.apache.org/solr/SolrCloud  
  • 19. Ka_a  –  Distributed  Lucene Master  Slave  Architecture Master  Fail-­‐Over Index    Shard  Management ReplicaTon Fault  Tolerance  
  • 20. Ka_a  Architecture fail over HDFS Secondary Master Master ZK ZK download shards assign shards Node Node Node Node shard replication multicast query Java client API Distributed ranking plug-able selection policy
  • 21. Ka_a  Index Katta Index lucene Index lucene Index lucene Index •  Folder  with  Lucene  indexes   •  Shard  Index  can  be  zipped
  • 22. The  different  between  Ka_a  and  SolrCloud •  Ka_a  is  based  on  lucene  not  solr   •  Ka_a  is  master-­‐slave  architecture.  Master  provide  management  command  and  API   •  Ka_a  will  automaTcally  assign  shards  and  replicaTon  for  ka_a  nodes.   •  Client  should  use  Ka_a  client  API  for  querying.   •  Ka_a  client  is  Java  API,  require  java  programming.   •  Ka_a  support  read  index  from  HDFS.  Plays  well  with  Hadoop  
  • 23. Configure  Ka_a conf/masters • Required  on  Master server0 conf/nodes • Required  fo  all  nodes server1! server2! server3 conf/ka_a.zk.properTes • zk  can  be  embedded  with  master  or  standalone • Standalone  zk  is  required  for  master  fail-­‐over zookeeper.server=server0:2181! Zookeeper.embedded=true! conf/ka_a-­‐env.sh • JAVA_HOME  required • KATTA_MASTER  opTonal export JAVA_HOME=/usr/lib/j2sdk1.5-sun! Export KATTA_MASTER=server0:/home/$USER/katta-distribution!
  • 24. Ka_a  CLI Search • search  index  name[,index  name,…]  “query”  [count] Cluster  Management • listIndexes • listNodes   • startMaster   • startNode   • showStructure   Index  Management   • check   • addIndex  index  name  path  to  index  lucene  analyzer  class  [replicaTon  level]   • removeIndex  index  name   • redeployIndex  index  name   • listErrors  index  name  
  • 25. Deploy  Ka_a  Index Deploying  Ka_a  index  on  local  filesystem bin/katta addIndex name of index ! ![file:///path to index] rep lv Deploying  Ka_a  index  on  HDFS bin/katta addIndex name of index ! ![hdfs://namenode:9000/path] rep lv
  • 26. Building  Lucene  Index  by  MapReduce Reduce Side Indexing map K: shard id V: doc to be indexed Index shard 1 map reduce Input HDFS Shuffle lucene Blocks / Sort Index shard 2 map reduce lucene map
  • 27. Building  Lucene  Index  by  MapReduce Map Side Indexing map K: shard id V: path of indices lucene Index shard 1 map reduce Input HDFS lucene Shuffle Merge Blocks / Sort Index shard 2 map reduce lucene Merge map lucene
  • 28. Mapper public class IndexMapper extends MapperObject, Text, Text, Text {! private String [] shardList;! int k=0;! ! protected void setup(Context context) {! !! !shardList=context.getConfiguration().getStrings(shardList);! }! ! public void map(Object key, Text value, Context context) ! !! !throws IOException, InterruptedException! !{! context.write(new Text(shardList[k++]), value);! ! !k %= shardList.length;! }!
  • 29. Reducer public class IndexReducer extends ReducerText, Text, Text, IntWritable {! !Configuration conf;! !public void reduce(Text key, IterableText values, Context context)! !! !throws Exception! !{! !! !conf = context.getConfiguration();! !! !String indexName = conf.get(indexName);! !! !String shard = key.toString();! !! !writeIndex(indexName, shard, values);! !} !! !private void writeIndex(String indexName, String shard, IterableText values)! !{! !! !IteratorText it=values.iterator();! !! !while(it.hasNext()){! use lucene here ! !! ! !//...index using lucene index writer...! !! !}! !}! }!
  • 30. Sharding  Algorithm   •  Good  document  distribuTon  across  shards  is  important   •  Simple  approach:   •  hash(id)  %  numShards   •  Fine  if  number  of  shards  doesn’t  change  or  easy  to  reindex   •  Be_er:   •  Consistent  Hashing   •  h_p://en.wikipedia.org/wiki/Consistent_hashing  
  • 31. Case Study – Web Log Categorization Customer Background •  Tire-1 telco in APAC •  Daily web log ~ 20TB •  Marketing guys would like to understand more about browsing behavior of their customers Requirements •  The logs collected from the network should be input through a “big data” solution. •  Provide dashboard of URL category vs. hit count •  Perform analysis based on the required attributes and algorithms
  • 32. Hardware Spec •  1 Master Node (Name Node, Job Tracker, Katta Master) 查询 索引 搜索 •  64 TB raw storage size (8 Data Node) 海量数据库 并行计算 Hbase / Hive MapReduce •  64 cores (8 Data Node) 分布式文件系统 Hadoop Distributed File System •  8 Task Tracker •  3 katta nodes •  5 zookeeper nodes Master  Node   •  CPU:  Intel®  Xeon®  Processor  E5620  2.4G •  RAM:  32G  ECC-­‐Registered •  HDD:  SAS  300G  @  15000  rpm  (RAID-­‐1)     Data  Node   •  CPU:  Intel®  Xeon®  Processor  E5620  2.4G •  RAM:  32G  ECC-­‐Registered •  HDD:  SATA  2Tx4,  Total  8T   32
  • 33. System Overview Dash Board Search web gateway Log aggregator XML response Web Service API Ka_a  Client  API 3rd-party Search Query URL DB Ka_a   Ka_a   Ka_a   Node Node Node MR Jobs URL categorization Build  Index Load Index Shard Name  Node   /raw_log /url_cat HDFS /shardA/url /shardB/url /shardN/url KaLa  Master
  • 34. Thank You We are hiring !! mailto: jameschen@systex.com.tw