Scalable high-dimensional
indexing with Hadoop
TEXMEX team, INRIA Rennes, France
Denis Shestakov, PhD
denis.shestakov at {aalto.fi,inria.fr}
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov
Denis Shestakov, Diana Moise,
Gylfi Gudmundsson, Laurent Amsaleg
Outline
● Motivation
● Approach overview: scaling indexing &
searching using Hadoop
● Experimental setup: datasets, resources,
configuration
● Results
● Observations & implications
● Things to share
● Future directions
Motivation
● Big data is here
○ Lots of multimedia content
○ Even setting aside 'big' companies, 1TB/day of
multimedia is now common for many parties
● Solution: apply more computational power
○ Luckily, easier access to such power via grid/cloud
resources
● Applications:
○ Large-scale image retrieval: e.g., detecting copyright
violations in huge image repositories
○ Google Goggles-like systems: annotating the scene
Our approach
● Index & search huge image collection using
MapReduce-based eCP algorithm
○ See our work at ICMR'13: Indexing and searching
100M images with MapReduce [7]
○ See Section II for quick overview
● Use the Grid5000 platform
○ Distributed infrastructure available to French
researchers & their partners
● Use the Hadoop framework
○ Most popular open-source implementation of
MapReduce model
○ Data stored in HDFS that splits it into chunks (64MB or
often bigger) and distributes it across nodes
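A minimal sketch of how cluster-based indexing can be phrased as a map step and a reduce step. This is an illustration under stated assumptions, not the authors' MapReduce-eCP code: the function names `map_assign` and `reduce_group` are hypothetical, the cluster leaders are scanned linearly instead of through an index tree, and everything runs in memory in plain Python rather than on Hadoop.

```python
from collections import defaultdict
from math import dist


def map_assign(descriptor_records, leaders):
    """Map step (hypothetical): emit (cluster_id, record) per descriptor.

    Each record is (image_id, vector). The nearest leader is found by a
    linear scan here; a real index would use a tree of leaders instead.
    """
    for image_id, vector in descriptor_records:
        cluster_id = min(range(len(leaders)),
                         key=lambda i: dist(vector, leaders[i]))
        yield cluster_id, (image_id, vector)


def reduce_group(pairs):
    """Reduce step (hypothetical): collect all records of each cluster."""
    clusters = defaultdict(list)
    for cluster_id, record in pairs:
        clusters[cluster_id].append(record)
    return dict(clusters)


# Toy data: two leaders and three 2-D "descriptors" (real SIFT vectors
# have 128 dimensions).
leaders = [(0.0, 0.0), (10.0, 10.0)]
records = [("img_a", (1.0, 1.0)), ("img_b", (9.0, 9.0)), ("img_a", (0.0, 2.0))]
index = reduce_group(map_assign(records, leaders))
```

In this toy run, both descriptors of `img_a` end up in cluster 0 and the descriptor of `img_b` in cluster 1, so a later search only has to scan the cluster nearest to each query descriptor.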
Our approach
● Hadoop used for both indexing and searching
● Our search scenario:
■ Searching for batch of images
● Thousands of images in one run
● Focus on throughput, not on response time
for individual image
■ Use case: copyright violation detection
● Note: indexed dataset can be searched on single
machine with adequate disk capacity if necessary
Experimental setup
● Used Grid5000 platform:
○ Nodes in Rennes site of Grid5000
■ Up to 110 nodes available
■ Nodes capacity/performance varied
● Heterogeneous, come from three clusters
● From 8 cores to 24 cores per node
● From 24GB to 48GB RAM per node
● Hadoop ver.1.0.1
○ (!) No changes in Hadoop internals
■ Pros: easy for others to migrate, try and compare
■ Cons: not top performance
Experimental setup
● Over 100 million images (~30 billion SIFT descriptors)
○ Collected from the Web and provided by one of the
partners in Quaero project
■ One of the largest reported in literature
○ Images resized to 150px on largest side
○ Worked with
■ The whole set (~4TB)
■ The subset, 20 million images (~1TB)
○ Used as distracting dataset
Experimental setup
● For evaluation of indexing quality:
○ Added to distracting datasets:
■ INRIA Copydays (127 images)
○ Queried for
■ Copydays batch (3055 images = 127 original
images and their associated variants incl. strong
distortions, e.g. print-crumple-scan )
■ 12k batch (12081 images = 245 random images
from dataset and their variants)
○ Checked if original images returned as top voted
search results
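The quality check above (is the original image the top-voted result?) could be sketched as follows. This is an assumption-laden illustration, not the evaluation code behind the slides: it assumes every nearest neighbour found for a query descriptor casts one vote for the dataset image it belongs to, and the names `top_voted` and `fraction_correct` are hypothetical.

```python
from collections import Counter


def top_voted(neighbor_image_ids):
    """Return the dataset image that received the most votes, or None."""
    votes = Counter(neighbor_image_ids)
    return votes.most_common(1)[0][0] if votes else None


def fraction_correct(results, ground_truth):
    """Share of query images whose top-voted result is the expected original.

    results:      query_id -> list of image ids voted for by its descriptors
    ground_truth: query_id -> id of the original image the query derives from
    """
    hits = sum(1 for query_id, voted in results.items()
               if top_voted(voted) == ground_truth[query_id])
    return hits / len(results)


# Toy example with two query variants of known originals.
results = {
    "q1_crop": ["orig_1", "orig_1", "noise_7"],
    "q2_scan": ["noise_3", "noise_3", "orig_2"],
}
ground_truth = {"q1_crop": "orig_1", "q2_scan": "orig_2"}
score = fraction_correct(results, ground_truth)
```

Here the first query recovers its original and the second does not, so `score` is 0.5.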
Results: workflow overview
● Experiment on indexing & searching 1TB took 5-6
hours
Results: indexing 1TB
Results: indexing 4TB
● 4TB
● 100 nodes
● Used tuned parameters
○ Except change in #mappers/#reducers per node
■ To fit bigger index tree (for 4TB) to RAM
■ 4 mappers/2 reducers
● Time: 507min
Results: search quality
Results: search scalability
Results: search execution
Search 12k batch over 1TB using 100 nodes
Results: searching 4TB
● 4TB
● 87 nodes
● Copydays query batch (3k images)
○ Throughput: 460ms per image
● 12k query batch
○ Throughput: 210ms per image
● Bigger batches improve throughput only marginally
○ bigger batch -> bigger lookup table -> more RAM per
mapper required -> fewer mappers per node
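A back-of-the-envelope sketch of the two points above. It assumes the reported per-image figure is total batch time divided by batch size, and it models the RAM constraint as a fixed cost per mapper plus the lookup table. The function names and the memory values in the example are hypothetical and were not measured in these experiments.

```python
def batch_minutes(num_images, ms_per_image):
    """Total batch time in minutes implied by a per-image cost."""
    return num_images * ms_per_image / 1000 / 60


def mappers_per_node(node_ram_gb, cores, fixed_gb_per_mapper, lookup_table_gb):
    """Map slots a node can host when each mapper holds its own lookup table.

    The slot count is capped by the core count and by how many copies of
    (fixed cost + lookup table) fit in RAM. All inputs are illustrative.
    """
    per_mapper_gb = fixed_gb_per_mapper + lookup_table_gb
    return min(cores, int(node_ram_gb // per_mapper_gb))


copydays_minutes = batch_minutes(3055, 460)   # Copydays batch over 4TB
batch_12k_minutes = batch_minutes(12081, 210)  # 12k batch over 4TB

# Hypothetical 24GB, 8-core node: a larger lookup table leaves room for
# fewer mappers.
small_table_slots = mappers_per_node(24, 8, 2.0, 1.0)
large_table_slots = mappers_per_node(24, 8, 2.0, 4.0)
```

Under these assumptions the two batches take roughly 23 and 42 minutes, and growing the lookup table from 1GB to 4GB halves the usable map slots on the example node, which is the effect the last bullet describes.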
Observations &
implications
● HDFS block size limits scalability
○ 1TB dataset => 1186 blocks of 1024MB size
○ Assuming 8-core nodes and reported searching
method: no scaling after 149 nodes (i.e.
8x149=1192)
○ Solutions:
■ Smaller HDFS blocks, e.g., scaling up to 280 nodes for
512MB blocks
■ Re-visit search process: e.g., partial-loading of lookup
table
● Big data is here, but the resources to process it are not
○ E.g., indexing & searching >10TB was not possible given the resources we had
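The block-size limit above can be checked with a short calculation. This is a simplified model that assumes one map task per HDFS block and one map slot per core; the function names are hypothetical.

```python
from math import ceil


def max_useful_nodes(num_blocks, map_slots_per_node):
    """Smallest node count at which all blocks are processed in one wave."""
    return ceil(num_blocks / map_slots_per_node)


def map_waves(num_blocks, nodes, map_slots_per_node):
    """Number of sequential waves of map tasks needed to cover all blocks."""
    return ceil(num_blocks / (nodes * map_slots_per_node))


limit = max_useful_nodes(1186, 8)        # 1TB dataset, 1024MB blocks
waves_at_100 = map_waves(1186, 100, 8)
waves_at_149 = map_waves(1186, 149, 8)
waves_at_300 = map_waves(1186, 300, 8)
```

With 1186 blocks and 8 slots per node, `limit` is 149: 100 nodes need two waves, 149 nodes need one, and 300 nodes still need one, so nodes beyond 149 stay idle. Halving the block size roughly doubles the block count and therefore the bound in this model; the 280-node figure quoted for 512MB blocks presumably reflects the block count the authors observed, which this sketch does not try to reproduce.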
Things to share
● Our methods/system can be applied to audio datasets
○ No major changes expected
○ Contact me if interested
● Code for MapReduce-eCP algorithm available on request
○ Should run smoothly on your Hadoop cluster
○ Interested in comparisons
● Hadoop job history logs behind our experiments (not only
for those reported at CBMI) available on request
○ Describe indexing/searching our dataset by giving details
on map/reduce tasks execution
○ Insights on better analysis/visualization are welcome
○ Job logs for CBMI'13 experiments: http://goo.gl/e06wE
Future directions
● Deal with big batches of query images
○ ~200k query images
● Share auxiliary data (index tree, lookup table) by
mappers
○ Multithreaded map tasks
● (environment-specific) Test scalability on more nodes
○ Use several sites of Grid5000 infrastructure
■ Rennes+Nancy sites (up to 300 nodes), in progress
Acknowledgements
● TEXMEX team, INRIA Rennes
http://www.irisa.fr/texmex/index_en.php
● Quaero project, http://www.quaero.org/
● Grid5000 infrastructure & its Rennes
maintenance team, https://www.grid5000.fr
Thank you!
Questions?
