Building a distributed search system
with Apache Hadoop and Lucene
Academic Year 2012-2013
Outline
• Big Data Problem
• Map and Reduce approach: Apache Hadoop
• Distributing a Lucene index using Hadoop
• Measuring Performance
• Conclusion
Mirko Calvaresi, "Building a distributed search system with Apache Hadoop and Lucene"
“Big Data”
This work analyzes the technological challenge of managing and
administering quantities of information of global dimension, on the
order of terabytes (10^12 bytes) or petabytes (10^15 bytes), that
grow at an exponential rate.
• Facebook processes 2.5 billion pieces of content per day.
• YouTube: 72 hours of video uploaded per minute.
• Twitter: 50 million tweets per day.
Multitier architecture vs Cloud computing
[Figure: a multitier architecture (Client, Front End Servers, Database Servers) next to a cloud architecture (Client, Front End Servers, Cloud). Labels: real-time processing, asynchronous data analysis.]
Apache Hadoop architecture
A Hadoop cluster scales computation
capacity, storage capacity and IO bandwidth
by simply adding commodity servers
HDFS: the distributed file system
• Files are stored as sets of (large) blocks
– Default block size: 64 MB (ext4 default is 4kB)
– Blocks are replicated for durability and availability
• Namespace is managed by a single name node
– Actual data transfer is directly between client & data node
– Pros and cons of this decision?
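[Figure: the name node maps each file to its block IDs (foo.txt: 3, 9, 6; bar.data: 2, 4). The client asks the name node for block #2 of foo.txt, receives block ID 9, and then reads block 9 directly from a data node. Each block is replicated on several data nodes.]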
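The lookup in the figure above can be sketched in a few lines of Python. This is a toy model, not the HDFS API: the `namespace` table copies the block lists from the slide (foo.txt: 3, 9, 6; bar.data: 2, 4), while the data node names and the replica placement in `block_locations` are hypothetical, as is the helper `blocks_needed`.

```python
import math

# Name node metadata: file name -> ordered list of block IDs (values from the slide).
namespace = {"foo.txt": [3, 9, 6], "bar.data": [2, 4]}

# Hypothetical replica placement: block ID -> data nodes holding a copy.
block_locations = {
    2: ["dn1", "dn3", "dn4"],
    3: ["dn1", "dn2", "dn3"],
    4: ["dn2", "dn3", "dn4"],
    6: ["dn1", "dn2", "dn4"],
    9: ["dn1", "dn2", "dn3"],
}


def locate_block(path, block_number):
    """Name node step: resolve the 1-based block_number of a file to
    (block ID, data nodes). The client then reads from a data node directly."""
    block_id = namespace[path][block_number - 1]
    return block_id, block_locations[block_id]


def blocks_needed(file_size_mb, block_size_mb=64):
    """Number of blocks a file occupies with the 64 MB default block size."""
    return math.ceil(file_size_mb / block_size_mb)


block_id, nodes = locate_block("foo.txt", 2)
print(block_id, nodes)      # 9 ['dn1', 'dn2', 'dn3']
print(blocks_needed(200))   # 4
```

The name node only answers the metadata question; the bytes of block 9 never pass through it, which is why a single name node can serve many clients.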
Map and Reduce
The computation takes a set of input key/value pairs, and
produces a set of output key/value pairs.
Recap: Map Reduce approach
[Figure: the input data is split across four mappers; the intermediate (key, value) pairs pass through "the shuffle" to four reducers, which write the output data.]
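The mapper, shuffle and reducer stages can be illustrated with an in-memory word count. This is a minimal single-process sketch, not Hadoop code: the function names and the sample documents are made up for illustration, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit an intermediate (word, 1) pair for every word."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: collapse the values of one key into a single output pair."""
    return word, sum(counts)


def run_job(documents):
    intermediate = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


print(run_job({"a": "big data big", "b": "data"}))  # {'big': 2, 'data': 2}
```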
Map and Reduce: where it is applicable
• Distributed “Grep”
• Count of URL Access Frequency
• Reverse Web-Link Graph
• Term-Vector per Host
• Reduce an n-level graph into a redundant hash table
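As one example from the list, the reverse web-link graph fits the model directly: the map step emits a (target, source) pair for every link, and the reduce step collects all sources pointing at a target. The sketch below runs in a single process, and the page names are hypothetical.

```python
from collections import defaultdict


def map_links(source, targets):
    """Mapper: for each outgoing link, emit (target, source)."""
    for target in targets:
        yield target, source


def reduce_links(target, sources):
    """Reducer: list every page that links to the target."""
    return target, sorted(set(sources))


def reverse_link_graph(pages):
    grouped = defaultdict(list)  # stands in for the shuffle
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            grouped[target].append(src)
    return dict(reduce_links(t, s) for t, s in grouped.items())


pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
print(reverse_link_graph(pages))
# {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```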
Implementation: distributing a Lucene index using Map and Reduce
The scope of the implementation is to:
1. populate a distributed Lucene index using the HDFS cluster
2. distribute the search and retrieve the results using Map and Reduce
Apache Lucene: indexing
Apache Lucene is the de facto standard in the open source
community for textual search
Document
Field(type)->Value
Field(type)->Value
Field(type)->Value
Apache Lucene: searching
In Lucene each document is represented as a vector of term weights.
Relevance is measured by the angle θ between the document vector and
the query vector: the smaller the angle (the larger cos θ), the more
relevant the document.
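The angle between a document vector and a query vector is usually computed through its cosine. The sketch below uses raw term counts as vector components, which is an assumption made for brevity; Lucene's actual scoring applies additional term weighting and normalization that are not modeled here.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """cos(theta) between two sparse vectors; 1.0 means same direction."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("hadoop lucene")
doc_close = term_vector("distributing a lucene index with hadoop")
doc_far = term_vector("wind turbine blade workshop")

print(cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far))  # True
```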
Distributing Lucene indexes using Hadoop
Indexing (Lucene Indexer Job): PDF doc archive -> Index 1, Index 2, Index 3 on the HDFS cluster
– Map Phase: creates and populates each index
– Reduce Phase: none
Searching (Lucene Search Job): {Search Filter} (list of Lucene restrictions) -> map over Index 1, Index 2, Index 3 on the HDFS cluster -> combine -> sort -> reduce -> result set
– Map Phase: queries the indexes
– Reduce Phase: merges and orders the result set
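The search job described above (the map phase queries each index, the reduce phase merges and orders the result set) can be sketched as follows. This is not the author's implementation and does not call Lucene: each index is modeled as a plain dictionary, the scoring function simply counts query-term occurrences, and all names and sample documents are hypothetical.

```python
import heapq
from itertools import islice


def search_shard(index, query_terms):
    """Map step: score every document of one index, best hits first."""
    hits = []
    for doc_id, text in index.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score > 0:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)


def merge_results(per_shard_hits, top_n=10):
    """Reduce step: merge the already sorted per-index lists into one ranking."""
    merged = heapq.merge(*per_shard_hits, reverse=True)
    return list(islice(merged, top_n))


shards = [
    {"d1": "hadoop lucene hadoop", "d2": "spark"},
    {"d3": "lucene"},
]
query = ["hadoop", "lucene"]
print(merge_results([search_shard(shard, query) for shard in shards]))
# [(3, 'd1'), (1, 'd3')]
```

Because every shard returns its hits already ordered, the reducer only needs a streaming merge rather than a full sort of all results.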
Measuring Performance
The entire execution time can be formally defined as:
[equation not preserved in the transcript]
While for a single Map (or Reduce) phase:
[equation not preserved in the transcript]
where α is the percentage of reduce tasks still ongoing after the map phase completes.
Measuring Performance
Data nodes | CPU of the nodes | RAM available | Name nodes | Number of files | Total bytes read | Job submission cost | Total job time
2 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1330 | 2.043 GB | 1 min 5 sec | 24 min 35 sec
3 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1330 | 2.043 GB | 1 min 21 sec | 12 min 10 sec
4 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1330 | 2.043 GB | 1 min 40 sec | 8 min 22 sec
1 (No Hadoop) | Intel i7 CPU 2.67 GHz | 4 GB | 0 | 1330 | 2.043 GB | 0 | 10 min 11 sec
With 4 or more data nodes, the Hadoop infrastructure setup cost is compensated.
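The break-even point can be checked by converting the total job times in the table to seconds and dividing the single-machine time by each cluster time. The snippet below only restates the table's figures; the helper name `seconds` is illustrative.

```python
def seconds(minutes, secs):
    return minutes * 60 + secs


# Total job times taken from the table above.
baseline = seconds(10, 11)  # 1 machine, no Hadoop
total_job_time = {
    2: seconds(24, 35),
    3: seconds(12, 10),
    4: seconds(8, 22),
}

# Speedup > 1 means the cluster beats the single machine.
speedup = {nodes: baseline / t for nodes, t in total_job_time.items()}

for nodes, value in sorted(speedup.items()):
    print(f"{nodes} data nodes: {value:.2f}x")
# 2 data nodes: 0.41x
# 3 data nodes: 0.84x
# 4 data nodes: 1.22x
```

Only the 4-node run is faster than the machine without Hadoop, which matches the statement that the setup cost pays off from 4 data nodes onward.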
Measuring Performance (Word Count)
Having a single big file speeds up Hadoop considerably, so performance is determined
less by the quantity of data than by how many splits are added to HDFS.
Data nodes | CPU of the nodes | RAM available | Name nodes | Number of files | Total bytes read | Job submission cost
3 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1 | 942 MB | 3 min 18 sec
4 | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1 | 942 MB | 2 min 17 sec
1 (No Hadoop) | Intel i7 CPU 2.67 GHz | 4 GB | 1 | 1 | 942 MB | 4 min 27 sec
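The effect of the number of splits can be made concrete with a rough count. The sketch assumes one split per 64 MB block (the default block size quoted earlier) and that a split never spans two files. It also assumes, purely for illustration, that the 1330 files of the previous test are of equal size and that 2.043 GB is 2043 MB.

```python
import math

BLOCK_SIZE_MB = 64  # default HDFS block size quoted in the deck


def count_splits(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """One split per block, at least one split per file."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


single_big_file = count_splits([942])                  # word count test
many_small_files = count_splits([2043 / 1330] * 1330)  # indexing test, equal sizes assumed

print(single_big_file)   # 15
print(many_small_files)  # 1330
```

Under these assumptions the smaller data set produces about 15 map tasks, while the collection of small files produces at least 1330, each paying its own scheduling overhead.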
[Screenshot: Job Detail Page, showing the tasks queue and the tasks currently running]
Conclusion
What
• Analysis of the current status of open source technologies
• Analysis of the potential applications for the web
• Implementation of a fully working Hadoop architecture
• Design of a web portal based on that architecture
Objectives
• Explore the Map and Reduce approach to analyzing unstructured data
• Measure performance and understand the Apache Hadoop framework
Outcomes
• Setup of the entire architecture in my company environment (Innovation Engineering)
• Main benefits in the indexing phase
• Limited impact on the search side (for standard query formats)
• In general, major benefits when HDFS is populated by a relatively small
number of big (GB-sized) files
