Building a distributed search system with Hadoop and Lucene
1. Building a distributed search system with Apache Hadoop and Lucene
Academic Year 2012-2013
2. Outline
• Big Data Problem
• Map and Reduce approach: Apache Hadoop
• Distributing a Lucene index using Hadoop
• Measuring Performance
• Conclusion
Mirko Calvaresi, "Building a distributed search system with Apache Hadoop and Lucene"
3. “Big Data”
This work analyzes the technological challenge of
managing and administering information whose global
volume is on the order of terabytes (10^12 bytes) or
petabytes (10^15 bytes) and which grows at an
exponential rate.
• Facebook processes 2.5 billion content items per day.
• YouTube: 72 hours of video uploaded per minute.
• Twitter: 50 million tweets per day.
4. Multitier architecture vs. Cloud computing
[Diagram: multitier architecture (Client → Front End Servers → Database Servers, real-time processing) vs. cloud computing (Client → Front End Servers → Cloud, real-time processing plus asynchronous data analysis)]
5. Apache Hadoop architecture
A Hadoop cluster scales computation
capacity, storage capacity, and I/O bandwidth
by simply adding commodity servers.
6. HDFS: the distributed file system
• Files are stored as sets of (large) blocks
– Default block size: 64 MB (ext4 default is 4 KB)
– Blocks are replicated for durability and availability
• Namespace is managed by a single name node
– Actual data transfer is directly between client & data node
– Pros and cons of this decision?
[Diagram: the client asks the name node for the location of block #2 of foo.txt; the name node's metadata maps foo.txt to blocks 3, 9, 6 and bar.data to blocks 2, 4; the client then reads block 9 directly from a data node, each block being replicated across several data nodes]
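The storage arithmetic behind these blocks is simple. The sketch below (plain Python, not Hadoop code) assumes the default 64 MB block size and the default HDFS replication factor of 3; the 200 MB file size is an illustrative value.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size (64 MB)
REPLICATION = 3                 # default HDFS replication factor

def hdfs_footprint(file_size_bytes):
    """Return (number of blocks, total bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)   # last block may be partial
    stored = file_size_bytes * REPLICATION             # every byte is kept 3 times
    return blocks, stored

# A 200 MB file occupies 4 blocks (3 full + 1 partial), stored 3 times:
print(hdfs_footprint(200 * 1024 * 1024))  # (4, 629145600)
```

This is why replication gives durability and availability at the cost of raw capacity: the cluster stores three times the logical data volume.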
7. Map and Reduce
The computation takes a set of input key/value pairs, and
produces a set of output key/value pairs.
8. Recap: Map Reduce approach
[Diagram: input data → Mappers → intermediate (key, value) pairs → "The Shuffle" → Reducers → output data]
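The map → shuffle → reduce flow above can be sketched in a few lines. This is a minimal in-memory simulation using word count as the example job (illustrative only, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(record):
    # Emit an intermediate (key, value) pair for each word in one input line.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # "The Shuffle": group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts emitted for one key.
    return key, sum(values)

lines = ["hadoop lucene hadoop", "lucene index"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'hadoop': 2, 'lucene': 2, 'index': 1}
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves the intermediate pairs over the network; the dataflow, however, is exactly this one.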
9. Map and Reduce: where it is applicable
• Distributed “Grep”
• Count of URL Access Frequency
• Reverse Web-Link Graph
• Term-Vector per Host
• Reducing an n-level graph to a redundant hash
table
10. Implementation: distributing a Lucene index using Map and Reduce
The scope of the implementation is to:
1. populate a distributed Lucene index on the
HDFS cluster
2. distribute queries and retrieve results using Map
and Reduce
11. Apache Lucene: indexing
Apache Lucene is the de facto standard in the open source
community for textual search.
[Diagram: a Lucene Document is a set of Field(type) → Value entries]
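What indexing does with those field values can be illustrated with a toy inverted index in the spirit of Lucene's document model: every term is mapped to the ids of the documents that contain it. The documents and field names below are made-up examples, and real Lucene analysis (tokenization, stemming, per-field types) is reduced here to a lowercase split:

```python
from collections import defaultdict

index = defaultdict(set)   # term -> set of document ids

def add_document(doc_id, fields):
    """fields: dict of field name -> text value, like Field(type) -> Value."""
    for field, value in fields.items():
        for term in value.lower().split():
            index[term].add(doc_id)

add_document(1, {"title": "Distributed search", "body": "Hadoop and Lucene"})
add_document(2, {"title": "Lucene indexing", "body": "inverted index basics"})

print(sorted(index["lucene"]))  # [1, 2]
print(sorted(index["hadoop"]))  # [1]
```

Looking a term up is then a single dictionary access, which is what makes textual search fast regardless of corpus size.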
12. Apache Lucene: searching
In Lucene's vector space model, each document is a vector.
Relevance is measured by the angle θ between the
document vector and the query vector.
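The angle is usually evaluated through its cosine, which is 1 when the vectors coincide and falls toward 0 as θ grows. A minimal sketch, with illustrative term weights (Lucene's actual scoring adds tf-idf weighting and normalization factors on top of this):

```python
import math

def cosine(doc, query):
    # doc and query are term -> weight dictionaries (one axis per term)
    terms = set(doc) | set(query)
    dot = sum(doc.get(t, 0.0) * query.get(t, 0.0) for t in terms)
    norm = (math.sqrt(sum(w * w for w in doc.values()))
            * math.sqrt(sum(w * w for w in query.values())))
    return dot / norm if norm else 0.0

doc = {"hadoop": 2.0, "lucene": 1.0, "index": 1.0}
query = {"lucene": 1.0, "index": 1.0}
print(round(cosine(doc, query), 3))  # 0.577 -- the smaller θ, the more relevant
```

A document sharing no terms with the query scores 0 (θ = 90°), so it is never retrieved.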
13. Distributing Lucene indexes using Hadoop
Indexing (Lucene Indexer Job):
• Map phase: creates and populates each index (Index 1, 2, 3) on the HDFS cluster from the PDF document archive
• Reduce phase: none
Searching (Lucene Search Job):
• Map phase: queries the indexes on the HDFS cluster, given a search filter (a list of Lucene restrictions)
• Sort/Combine/Reduce phase: merges and orders the result set
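The search side is a scatter-gather pattern: each map task runs the same filter against one index shard, and the reduce side merges the partial hit lists by score. A stand-in sketch (the shard contents and the numeric score threshold are invented for illustration; the real job runs Lucene queries inside Hadoop tasks):

```python
# Hypothetical per-shard hit lists: (score, document id) pairs.
shard_results = [
    [(0.9, "doc3"), (0.4, "doc7")],   # hits from Index 1
    [(0.8, "doc1")],                  # hits from Index 2
    [(0.6, "doc5"), (0.5, "doc2")],   # hits from Index 3
]

def map_search(shard, min_score):
    # In the real job, each map task runs the Lucene query on one index.
    return [hit for hit in shard if hit[0] >= min_score]

def reduce_merge(partial_lists):
    # Merge all partial results and order them by descending score.
    return sorted((hit for hits in partial_lists for hit in hits), reverse=True)

partials = [map_search(shard, 0.5) for shard in shard_results]
print(reduce_merge(partials))
# [(0.9, 'doc3'), (0.8, 'doc1'), (0.6, 'doc5'), (0.5, 'doc2')]
```

Because scoring is local to each shard, the map tasks are fully independent; only the final merge needs to see all partial results.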
14. Measuring Performance
The entire execution time can be formally defined as: [formula not preserved in this export]
while a single Map (or Reduce) phase: [formula not preserved in this export]
where α is the percentage of reduce tasks still running after the map phase completes.
15. Measuring Performance
Mirko Calvaresi, "Building a distributed search system with Apache Hadoop and Lucene"
Data nodes    | CPU of the nodes  | RAM  | Name nodes | Files | Bytes read | Job submission cost | Total job time
2             | Intel i7 2.67 GHz | 4 GB | 1          | 1330  | 2.043 GB   | 1 min 5 sec         | 24 min 35 sec
3             | Intel i7 2.67 GHz | 4 GB | 1          | 1330  | 2.043 GB   | 1 min 21 sec        | 12 min 10 sec
4             | Intel i7 2.67 GHz | 4 GB | 1          | 1330  | 2.043 GB   | 1 min 40 sec        | 8 min 22 sec
1 (no Hadoop) | Intel i7 2.67 GHz | 4 GB | 0          | 1330  | 2.043 GB   | 0                   | 10 min 11 sec
With 4 or more data nodes, the setup cost of the Hadoop infrastructure is compensated.
16. Measuring Performance (Word Count)
Using a single big file speeds Hadoop up considerably: performance is determined not so much
by the quantity of data as by how many splits are added to HDFS.
Data nodes    | CPU of the nodes  | RAM  | Name nodes | Files | Bytes read | Job submission cost
3             | Intel i7 2.67 GHz | 4 GB | 1          | 1     | 942 MB     | 3 min 18 sec
4             | Intel i7 2.67 GHz | 4 GB | 1          | 1     | 942 MB     | 2 min 17 sec
1 (no Hadoop) | Intel i7 2.67 GHz | 4 GB | 1          | 1     | 942 MB     | 4 min 27 sec
17. Job Detail Page
[Screenshot: Hadoop job detail page showing the tasks queue and the tasks currently running]
18. Conclusion
What:
• Analysis of the current status of open source technologies
• Analysis of the potential applications for the web
• Implemented a fully working Hadoop architecture
• Designed a web portal based on that architecture
Objectives:
• Explore the Map and Reduce approach to analyze unstructured data
• Measure performance and understand the Apache Hadoop framework
Outcomes:
• Set up the entire architecture in my company environment (Innovation
Engineering)
• Main benefits in the indexing phase
• Little impact on the search side (for standard query formats)
• In general, major benefits when HDFS is populated by a relatively small
number of big (GB-sized) files