"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
[HiC2011] Using Hadoop, Lucene and Solr for Large-Scale Search, by SYSTEX
1. Using Hadoop/MapReduce with Solr/Lucene for Large-Scale Distributed Search
Organization: SYSTEX Corp, Cloud Center / Big Data Division
Date: 2011.12.2
Hadoop in China 2011
Presenter: 陈昭宇 (James Chen)
3. Agenda
• Why search on big data?
• Lucene, Solr, ZooKeeper
• Distributed Searching
• Distributed Indexing with Hadoop
• Case Study: Web Log Categorization
4. Why Search on Big Data?
• Structured and unstructured data
• No schema limitation
• Ad-hoc analysis
• Keep everything
5. Lucene: a Java library for search
• Mature, feature-rich, and high performance
• Indexing, searching, highlighting, spell checking, etc.
Not a crawler
• Doesn't know anything about Adobe PDF, MS Word, etc.
Not fully distributed
• Does not support a distributed index
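To make the library's role concrete, here is a minimal, illustrative indexing-and-search sketch using the Lucene 3.x API of the talk's era (not from the slides; the field name and query text are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneHello {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // Index a single document with one analyzed field
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", "hadoop lucene solr distributed search",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search the index for the term "lucene"
        IndexSearcher searcher = new IndexSearcher(dir);
        TopDocs hits = searcher.search(
                new QueryParser(Version.LUCENE_35, "content", analyzer).parse("lucene"), 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
    }
}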
6. Solr
Ease of integration
• Without knowing Java!
• HTTP/XML interface
• Simply deploy in any web container
Easy setup and configuration
More features (on top of Lucene)
• Faceting
• Highlighting
• Caching
Distributed
• Replication
• Sharding
• Multi-Core
http://search.lucidimagination.com
8. Solr Default Parameters
Query arguments for HTTP GET/POST to /select:
param | default  | description
q     |          | The query
start | 0        | Offset into the list of matches
rows  | 10       | Number of documents to return
fl    | *        | Stored fields to return
qt    | standard | Query type; maps to a query handler
df    | (schema) | Default field to search
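For illustration only (host, port, query value, and field names are hypothetical), a request combining these parameters could look like:

http://localhost:8983/solr/select?q=hadoop&start=0&rows=10&fl=id,title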
10. ZooKeeper
Configuration
• Sharing configuration across nodes
• Maintaining status about shards
Group Service
• Master/Leader election
• Central management
• Replication
• Name space coordination
Synchronization
• Coordinating searches across shards / load balancing
11. ZooKeeper Architecture
• All servers store a copy of the data (in memory)
• A leader is elected at startup
• Followers service clients; all updates go through the leader
• Update responses are sent when a majority of servers have persisted the change
(Diagram: clients connect to a set of servers, one of which is the leader)
13. SolrCloud – Distributed Solr Cluster
Management
• Central configuration for the entire cluster
• Cluster state and layout stored in a central system
• Live node tracking
Zookeeper Integration
• Central repository for configuration and state
Remove external load balancing
• Automatic fail-over and load balancing
Clients don't need to know about shards at all
• Any node can serve any request
17. Distributed Requests in SolrCloud
Explicitly specify the addresses of shards
• shards=localhost:8983/solr,localhost:7574/solr
Load balance and fail over
• shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
Query all shards of a collection
• http://localhost:8983/solr/collection1/select?distrib=true
Query specific shard ids
• http://localhost:8983/solr/collection1/select?shards=shard_1,shard_2&distrib=true
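The same shard parameters can also be set from Java. Below is a minimal, illustrative SolrJ sketch (not from the slides; class names are from the SolrJ 3.x era, and the query string and shard list are placeholders taken from the examples above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQueryExample {
    public static void main(String[] args) throws Exception {
        // Talk to one Solr node; the shards parameter fans the query out
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("*:*");
        // Same effect as passing shards=... on the request URL
        query.set("shards", "localhost:8983/solr,localhost:7574/solr");
        QueryResponse response = server.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " documents");
    }
}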
18. More about SolrCloud
• Still under development; not in an official release
• Get SolrCloud: https://svn.apache.org/repos/asf/lucene/dev/trunk
• Wiki: http://wiki.apache.org/solr/SolrCloud
21. Katta Index
(Diagram: a Katta index containing several Lucene indexes)
• A folder with Lucene indexes
• Shard index can be zipped
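For illustration, a Katta index on disk is simply a directory containing Lucene index directories (the names below are made up); individual shard indexes may also be zipped:

weblog-index/
    shard-000/        (a regular Lucene index directory)
    shard-001/
    shard-002.zip     (a zipped shard index)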
22. The differences between Katta and SolrCloud
• Katta is based on Lucene, not Solr
• Katta has a master-slave architecture; the master provides management commands and an API
• Katta automatically assigns shards and replication to Katta nodes
• Clients should use the Katta client API for querying
• The Katta client is a Java API and requires Java programming
• Katta supports reading indexes from HDFS; plays well with Hadoop
23. Configure Katta
conf/masters
• Required on the master
server0
conf/nodes
• Required for all nodes
server1
server2
server3
conf/katta.zk.properties
• ZooKeeper can be embedded in the master or standalone
• A standalone ZooKeeper is required for master fail-over
zookeeper.server=server0:2181
zookeeper.embedded=true
conf/katta-env.sh
• JAVA_HOME required
• KATTA_MASTER optional
export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export KATTA_MASTER=server0:/home/$USER/katta-distribution
24. Katta CLI
Search
• search <index name>[,<index name>,…] "<query>" [count]
Cluster Management
• listIndexes
• listNodes
• startMaster
• startNode
• showStructure
Index Management
• check
• addIndex <index name> <path to index> <lucene analyzer class> [<replication level>]
• removeIndex <index name>
• redeployIndex <index name>
• listErrors <index name>
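For illustration, following the search syntax above (the index names and query string are hypothetical), a search across two indexes returning the top 10 hits could look like:

bin/katta search webLogs,webLogs2011 "category:news" 10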
25. Deploy Katta Index
Deploying a Katta index on the local filesystem
bin/katta addIndex <name of index> file:///<path to index> <replication level>
Deploying a Katta index on HDFS
bin/katta addIndex <name of index> hdfs://namenode:9000/<path> <replication level>
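As a concrete illustration of the HDFS form above (the index name, path, and replication level are made up):

bin/katta addIndex webLogs hdfs://namenode:9000/katta/webLogs 3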
26. Building a Lucene Index by MapReduce
Reduce-Side Indexing
• Map output: key = shard id, value = document to be indexed
(Diagram: input HDFS blocks → map → shuffle/sort → one reducer per shard builds a Lucene index shard)
27. Building a Lucene Index by MapReduce
Map-Side Indexing
• Map output: key = shard id, value = path of indices
(Diagram: input HDFS blocks → each map task builds a Lucene index → shuffle/sort → one reducer per shard merges the Lucene indices into an index shard)
28. Mapper
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexMapper extends Mapper<Object, Text, Text, Text> {
    private String[] shardList;
    int k = 0;

    protected void setup(Context context) {
        shardList = context.getConfiguration().getStrings("shardList");
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Distribute documents round-robin across the configured shards
        context.write(new Text(shardList[k++]), value);
        k %= shardList.length;
    }
}
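The slides do not show the reduce side. Below is a minimal, illustrative reducer sketch for the reduce-side indexing approach, using the Lucene 3.x API: the output directory, field name, and analyzer are assumptions, and a real job would copy the finished shard to HDFS (e.g. for Katta to deploy).

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text shardId, Iterable<Text> docs, Context context)
            throws IOException, InterruptedException {
        // One Lucene index per shard id, written to the local filesystem
        File indexDir = new File("/tmp/index-" + shardId.toString());
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_35), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (Text doc : docs) {
            // Index each incoming document under a single analyzed field
            Document d = new Document();
            d.add(new Field("content", doc.toString(), Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(d);
        }
        writer.close();
        // Emit the location of the finished shard index
        context.write(shardId, new Text(indexDir.getAbsolutePath()));
    }
}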
30. Sharding Algorithm
• Good document distribution across shards is important
• Simple approach: hash(id) % numShards
• Fine if the number of shards doesn't change, or if reindexing is easy
• Better: consistent hashing
• http://en.wikipedia.org/wiki/Consistent_hashing
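A minimal sketch of the simple hash(id) % numShards approach (the class and method names are illustrative):

public class SimpleSharder {
    // Mask the sign bit so negative hashCode values never yield a negative shard
    public static int shardFor(String docId, int numShards) {
        return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
    }

    public static void main(String[] args) {
        // Example: route a URL to one of 4 shards
        System.out.println(shardFor("http://example.com/page1", 4));
    }
}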
31. Case Study – Web Log Categorization
Customer Background
• Tier-1 telco in APAC
• Daily web logs ~ 20 TB
• Marketing would like to understand more about the
browsing behavior of their customers
Requirements
• Logs collected from the network should be ingested
through a "big data" solution
• Provide a dashboard of URL category vs. hit count
• Perform analysis based on the required attributes and
algorithms
32. Hardware Spec
• 1 Master Node (Name Node, Job Tracker, Katta Master)
• 64 TB raw storage (8 Data Nodes)
• 64 cores (8 Data Nodes)
• 8 Task Trackers
• 3 Katta nodes
• 5 ZooKeeper nodes
(Diagram: query/indexing/search on top of HBase/Hive for massive data storage and MapReduce for parallel computing, all over the Hadoop Distributed File System)
Master Node
• CPU: Intel® Xeon® Processor E5620 2.4 GHz
• RAM: 32 GB ECC-Registered
• HDD: SAS 300 GB @ 15,000 rpm (RAID-1)
Data Node
• CPU: Intel® Xeon® Processor E5620 2.4 GHz
• RAM: 32 GB ECC-Registered
• HDD: SATA 2 TB x 4, 8 TB total
33. System Overview
(Architecture diagram: a log aggregator behind the web gateway feeds raw logs into HDFS (/raw_log); MapReduce jobs perform URL categorization against a 3rd-party URL DB (/url_cat) and build the index shards (/shardA/url … /shardN/url); Katta nodes load the index shards under the Katta master; the dashboard and search front end issue search queries through a Web Service API using the Katta client API and receive XML responses)
34. Thank You
We are hiring !!
mailto: jameschen@systex.com.tw