Terabyte-scale image similarity search with Hadoop 
Denis Shestakov 
Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014
About me 
● Big Data researcher/engineer 
○ recent projects: large-scale image retrieval 
○ before: web crawling 
● Hadoop/MapReduce contractor 
○ design/development/tuning Hadoop applications 
Denis Shestakov 
denshe at gmail.com 
linkedin: linkedin.com/in/dshestakov
Talk Outline 
● Intro to image search 
● Image retrieval with MapReduce 
● Image indexing/searching workloads 
● Hadoop tools for large joins 
● Smart Hadoop configuration 
● Misc & conclusions 
Intro to Image Search 
● Finding images given a text query 
○ e.g., "dog" → images of dogs 
Intro to Image Search 
● Finding images given an image 
○ By content-similarity 
Image Search Applications 
● Regular image search 
○ Google Images, Bing Images, TinEye, etc. 
● Product search (by image) 
● Object recognition 
○ Face, logo, vehicle, etc. 
● Computer vision 
● Augmented reality 
● Medical imaging 
● Astrophysics 
Intro to Image Search 
How does it work? 
● Images are resized to a smaller size 
● Then transformed into a chosen feature-descriptor 
representation 
○ image → set of feature descriptors (=high-dimensional 
vectors) 
○ Many transformations exist 
■ SIFT (Scale-invariant feature transform) used by us 
Intro to Image Search 
How does it work? 
image_id   SIFT descriptor 
10011      21, 143, 5, …, 201, 186 
10011      121, 14, 75, …, 20, 109 
10011      37, 40, 0, …, 213, 96 
...        ... 
10011      81, 235, 67, …, 102, 63 
Typically several hundred feature descriptors per image 
Intro to Image Search 
How does it work? 
● Compare (e.g., by calculating Euclidean distance) 
feature descriptors of a query image with 
descriptors of images in the collection being searched 
● Images with the ‘closest’ descriptors are similar to the 
query image 
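A minimal sketch of this direct comparison, assuming toy 4-dimensional descriptors (real SIFT descriptors have 128 dimensions). The function names `euclidean` and `nearest_image` and the sample values are illustrative and not taken from the talk.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_image(query_desc, collection):
    """Return (image_id, distance) for the closest descriptor.

    `collection` is a list of (image_id, descriptor) pairs; every
    descriptor is compared with the query (brute force).
    """
    return min(
        ((image_id, euclidean(query_desc, desc)) for image_id, desc in collection),
        key=lambda pair: pair[1],
    )

# Hypothetical toy collection: two descriptors of image 10011, one of 10012.
collection = [
    (10011, [21, 143, 5, 201]),
    (10011, [121, 14, 75, 20]),
    (10012, [37, 40, 0, 213]),
]
best = nearest_image([36, 41, 1, 210], collection)
```

Every query descriptor is checked against every stored descriptor here, which is exactly the cost that an index is meant to avoid.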
Intro to Image Search 
Why MapReduce? 
● Direct comparisons of descriptors costly even for 
very small collections 
● Lots of approaches to ‘organize’ feature 
descriptors for fast search 
○ Build an index 
○ Index all the descriptors 
○ At search, check query descriptors only against 
certain groups of descriptors 
Image Retrieval with MapReduce 
Why MapReduce? 
● Existing index-based approaches scale poorly 
○ up to ~10-20 million images 
● But multimedia collections grow exponentially 
● Scaling is required … 
Image Retrieval with MapReduce 
Use case: 
● Copyright violation detection in a large image 
databank 
○ >100 million images 
● Searching for a batch of images 
○ Thousands of images in one query 
○ Focus on throughput, not on response time for 
individual image 
● SIFT features 
Image Retrieval with MapReduce 
Indexing images 
● Generating index tree 
● Clustering images into a large set of clusters 
(max cluster size = 5000) 
○ Mapper input: 
■ unsorted SIFT descriptors 
■ index tree (loaded by every mapper) 
○ Mapper output: 
■ (cluster_id, SIFT) 
○ Reducer output: 
■ SIFTs sorted by cluster_id 
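The indexing job above can be sketched in plain Python as a map step followed by a grouping step. This is a simplified illustration, not the talk's implementation: the index tree is replaced by a flat list of centroids, descriptors are 2-dimensional, and the names `index_map` and `index_reduce` are hypothetical.

```python
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance (enough for picking the nearest centroid)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def index_map(record, centroids):
    """Map step: emit (cluster_id, (image_id, descriptor)) for one record."""
    image_id, desc = record
    cluster_id = min(range(len(centroids)), key=lambda c: sq_dist(desc, centroids[c]))
    return cluster_id, (image_id, desc)

def index_reduce(pairs):
    """Shuffle and reduce step: group values by cluster_id, sorted by key."""
    clusters = defaultdict(list)
    for cluster_id, value in pairs:
        clusters[cluster_id].append(value)
    return sorted(clusters.items())

centroids = [[0, 0], [100, 100]]   # stand-in for the index tree (assumption)
records = [(1, [90, 95]), (2, [3, 4]), (1, [10, 0]), (3, [120, 80])]
index = index_reduce(index_map(r, centroids) for r in records)
```

Each descriptor is assigned once, to its nearest cluster representative, and the result comes out ordered by `cluster_id`, mirroring the reducer output listed above.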
Image Retrieval with MapReduce 
Searching 
● Generating lookup table 
○ indexing query SIFTs 
● Finding best matches for query SIFTs 
○ Mapper input: 
■ sorted SIFT descriptors 
■ lookup table (loaded by every mapper) 
○ Mapper output: 
■ (query-sift-id, knn of image-ids) 
○ Reducer output: 
■ Best votes (image-ids) for query-image-id 
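A rough Python sketch of the search step, under several assumptions that are not stated in the slides: the lookup table maps a cluster id to the query descriptors assigned to that cluster, a query descriptor id is a `(query_image_id, n)` pair, and voting simply counts how often an image id appears among the nearest neighbours. All names and data are hypothetical.

```python
from collections import Counter, defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_map(cluster_id, indexed, lookup, k=2):
    """Map step: emit (query_sift_id, k nearest image_ids) for one cluster."""
    for query_sift_id, q_desc in lookup.get(cluster_id, []):
        nearest = sorted(indexed, key=lambda item: sq_dist(q_desc, item[1]))[:k]
        yield query_sift_id, [image_id for image_id, _ in nearest]

def search_reduce(map_output):
    """Reduce step: count votes per query image, best-voted image first."""
    votes = defaultdict(Counter)
    for (query_image_id, _), image_ids in map_output:
        votes[query_image_id].update(image_ids)
    return {q: counter.most_common() for q, counter in votes.items()}

# Clustered descriptors: cluster_id -> [(image_id, descriptor), ...]
clustered = {
    0: [(2, [3, 4]), (1, [10, 0])],
    1: [(1, [90, 95]), (3, [120, 80]), (1, [88, 99])],
}
# Lookup table: cluster_id -> [((query_image_id, n), descriptor), ...]
lookup = {
    0: [(("q1", 0), [9, 1])],
    1: [(("q1", 1), [91, 96]), (("q1", 2), [89, 98])],
}
map_output = [out for cid, items in clustered.items()
              for out in search_map(cid, items, lookup)]
results = search_reduce(map_output)
```

Because each mapper only compares the descriptors of one cluster with the query descriptors routed to that same cluster, the work per mapper stays bounded by the cluster size.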
Image Retrieval with MapReduce 
In a nutshell: 
● Indexing phase 
○ Clustering SIFTs with one-pass k-means 
● Searching phase 
○ Map-side join of clustered SIFTs and lookup table 
(query SIFTs) 
Image search workloads 
Time to discuss Hadoop specifics: 
● Standard Apache Hadoop distribution, version 1.0.1 
○ (!) No changes in Hadoop internals 
■ Easy to migrate 
● Around 100 nodes from Grid5000 
○ 8/24 cores, 24/32/48GB RAM per node 
○ capacity/performance varied 
Image search workloads 
Dataset: 
● 110 million images 
○ ~30 billion SIFT descriptors 
○ 4TB 
○ Largest reported in literature 
○ Images resized to 150px on largest side 
○ Also worked with a 1TB subset 
○ Used as a distractor dataset 
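As a quick sanity check of these figures, the arithmetic below derives the average number of descriptors per image and the average storage per descriptor. It assumes decimal terabytes; the variable names are illustrative.

```python
images = 110e6            # 110 million images
descriptors = 30e9        # ~30 billion SIFT descriptors
dataset_bytes = 4e12      # 4TB, assuming 1TB = 10**12 bytes

descriptors_per_image = descriptors / images        # a few hundred per image
bytes_per_descriptor = dataset_bytes / descriptors  # a bit more than 128 bytes

print(round(descriptors_per_image), round(bytes_per_descriptor))
```

The result of roughly 273 descriptors per image agrees with the "several hundred" stated earlier, and about 133 bytes per descriptor is plausible for a 128-dimensional vector of one-byte values plus an identifier, although the exact record layout is not given in the slides.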
Image search workloads 
Queries: 
● Query batches 
○ Up to 250k query images in one batch 
○ Batch includes original images and their distorted 
variants 
■ Some variants are very hard to find 
● e.g., print-crumple-scan 
● Check if original images are returned as top votes 
○ State-of-the-art search quality (details out of scope) 
Image search workloads 
Indexing workload characteristics 
● computationally-intensive (map phase) 
● data-intensive (map and reduce phases) 
● large auxiliary data structure (i.e., index tree) 
○ grows as dataset grows 
○ e.g., 1.8GB for 110M images (4TB) 
● map input < map output 
● network is heavily utilized during shuffling 
Image search workloads 
Indexing workload 
Image search workloads 
Searching workload 
● large auxiliary data structure (i.e., lookup table) 
Image search workloads 
● Basic settings: 
○ 512MB HDFS block size 
○ 3 replicas 
○ 8 map slots 
○ 2 reduce slots 
● 4TB dataset: 
○ 4 map slots
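A back-of-the-envelope estimate of what these settings imply for the 4TB run. It assumes binary units, one map task per HDFS block, exactly 100 nodes, and that the slot counts are per node; none of these is stated explicitly in the slides.

```python
import math

dataset_mb = 4 * 1024 * 1024   # 4TB expressed in MB (binary units assumed)
block_mb = 512                 # HDFS block size from the settings above
nodes = 100                    # "around 100 nodes"
map_slots_per_node = 4         # setting used for the 4TB dataset

map_tasks = dataset_mb // block_mb               # one map task per block (assumed)
concurrent_maps = nodes * map_slots_per_node     # map tasks running at once
waves = math.ceil(map_tasks / concurrent_maps)   # rounds of map tasks

print(map_tasks, concurrent_maps, waves)
```

Under these assumptions the job runs about 8192 map tasks in roughly 21 waves, and each of those tasks has to load the auxiliary data structure.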
Hadoop tools for large joins 
● Some workloads require all mappers to load a 
large-size data structure 
○ Like image indexing/searching workloads 
● Spreading data file across all nodes 
○ Hadoop DistributedCache 
● Not efficient if the structure is gigabytes in size 
○ Partial solution: increase the HDFS block size → 
fewer mappers 
● Another approach: multithreaded mappers 
○ Not well documented 
Hadoop tools for large joins 
● Multithreaded mapper spawns a configured number 
of threads, each thread executes the map function 
● Mapper threads share the RAM 
● Downsides: 
○ synchronization when reading input 
○ synchronization when writing output 
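The idea can be illustrated with Python threads. This is an analogy only, not Hadoop's `MultithreadedMapper` class: several worker threads share one read-only auxiliary structure, and locks guard the shared input and output, which is where the synchronization cost comes from. The names `run_multithreaded_map`, `shared_index` and `map_fn` are hypothetical.

```python
import threading

def run_multithreaded_map(records, shared_index, map_fn, num_threads=2):
    """Apply map_fn to all records using threads that share one index."""
    it = iter(records)
    in_lock, out_lock = threading.Lock(), threading.Lock()
    output = []

    def worker():
        while True:
            with in_lock:                  # synchronized read of the next record
                record = next(it, None)
            if record is None:             # input exhausted
                return
            # No lock here: the shared index is only read. In CPython the GIL
            # still limits CPU parallelism; JVM threads do not have that limit.
            result = map_fn(record, shared_index)
            with out_lock:                 # synchronized write of the result
                output.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output

shared_index = {"a": 0, "b": 1}            # loaded once, shared by all threads
records = [("a", 1), ("b", 2), ("a", 3)]
out = run_multithreaded_map(records, shared_index,
                            lambda rec, idx: (idx[rec[0]], rec[1]))
```

The order of `out` depends on thread scheduling, so any consumer should not rely on it.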
Hadoop tools for large joins 
Indexing 4TB with 4 mapper slots, each running 
two threads 
● index tree size: 1.8GB 
Indexing time on 100 nodes 
● 8h27min → 6h8min 
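The numbers above can be put side by side with a little arithmetic. The memory part assumes that every mapper slot holds its own copy of the index tree and ignores all other JVM memory, so it is only a lower bound.

```python
tree_gb = 1.8                     # index tree size

before_min = 8 * 60 + 27          # 8h27min
after_min = 6 * 60 + 8            # 6h8min
speedup = before_min / after_min
saved_pct = 100 * (before_min - after_min) / before_min

# Memory spent on index-tree copies per node (assumption: one copy per slot).
tree_gb_4_slots = 4 * tree_gb     # 4 slots x 2 threads: 8 workers, 4 copies
tree_gb_8_slots = 8 * tree_gb     # 8 single-threaded slots: 8 workers, 8 copies

print(round(speedup, 2), round(saved_pct), tree_gb_4_slots, tree_gb_8_slots)
```

This gives a speedup of about 1.38 (roughly 27% less time), while eight worker threads need about 7.2GB for tree copies instead of the 14.4GB that eight separate slots would require.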
Hadoop tools for large joins 
● In some workloads mappers require only a part 
of auxiliary data structure 
○ I.e., the part relevant to the data block being processed 
○ E.g., the image searching workload 
● Approach: Hadoop MapFile 
○ Very efficient 
■ Big batches, >10000 query images 
■ ~2 times faster on batches including around 
25000 images 
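Hadoop's MapFile stores key-sorted records together with a smaller index of a fraction of the keys, so a reader can fetch one key without loading everything. The Python class below imitates that idea in memory; it is a conceptual sketch, not the Hadoop API, and the class name, `interval` parameter and sample data are hypothetical.

```python
import bisect

class SparseIndexedFile:
    """Sorted (key, value) records plus an index of every `interval`-th key."""

    def __init__(self, records, interval=2):
        self.interval = interval
        # Stands in for the sorted data file; keys are assumed unique.
        self.records = sorted(records, key=lambda kv: kv[0])
        # Sparse index: only every `interval`-th key and its position.
        self.index_pos = list(range(0, len(self.records), interval))
        self.index_keys = [self.records[p][0] for p in self.index_pos]

    def get(self, key):
        """Binary-search the sparse index, then scan at most `interval` records."""
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None                    # key is smaller than every stored key
        start = self.index_pos[i]
        for k, v in self.records[start:start + self.interval]:
            if k == key:
                return v
        return None

# Hypothetical lookup table: cluster_id -> query descriptors for that cluster.
table = SparseIndexedFile([(7, "queries-7"), (2, "queries-2"), (11, "queries-11"),
                           (5, "queries-5"), (9, "queries-9")])
part = table.get(9)
```

A mapper processing the descriptors of cluster 9 would then fetch only `table.get(9)` rather than the whole lookup table.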
Smart Hadoop configuration 
Here is the problem: 
● Apache Hadoop, v.1.0.1 
● Capacity/performance of nodes varied 
○ 8/24 cores, 24-48GB RAM, etc 
● One config file (#mappers, #reducers, max. 
map/reduce memory, ...) for all nodes 
● Issue for memory-intensive workloads! 
Smart Hadoop configuration 
Solution (hack): 
● deploy Hadoop on all nodes with settings addressing 
the least equipped nodes 
● create sub-cluster configuration files adjusted to better 
equipped nodes 
○ substitute original config file with the new one on better 
equipped nodes 
● restart tasktrackers with new configuration files on 
better equipped nodes 
Call it smart deployment 
● Or is it known under another name? 
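One way to derive the per-node settings for such a deployment is sketched below. The property names are standard Hadoop 1.x TaskTracker settings, but the sizing policy (4GB reserved for daemons, a 4GB heap per task, slots capped by core count) is a made-up example and not the configuration used in the talk.

```python
def tasktracker_settings(cores, ram_gb, task_heap_gb=4, reduce_slots=2):
    """Pick slot counts for one node so that all task heaps fit in RAM.

    Hypothetical policy: reserve some RAM for the OS and Hadoop daemons,
    give every task the same heap, and never use more map slots than cores.
    """
    reserved_gb = 4                                    # assumed headroom
    slots_by_memory = int((ram_gb - reserved_gb) // task_heap_gb)
    map_slots = max(1, min(cores, slots_by_memory - reduce_slots))
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        "mapred.child.java.opts": f"-Xmx{task_heap_gb * 1024}m",
    }

# Least equipped node type: deployed everywhere first.
small = tasktracker_settings(cores=8, ram_gb=24)
# Better equipped node type: gets its own config before the tasktracker restart.
large = tasktracker_settings(cores=24, ram_gb=48)
```

With this policy the smaller nodes get 3 map slots and the larger ones 9, so memory-intensive tasks do not overcommit the weaker machines.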
Smart Hadoop configuration 
Indexing 1TB on 106 nodes: 75min → 65min 
Conclusions 
● Several directions for further optimization 
● Presented techniques applicable to video and 
audio datasets 
○ Given a transformation into feature vectors 
○ Only small changes expected (e.g., a new Writable) 
● Hadoop smart deployment trick 
● (Wanted) Best practices for Hadoop job 
history log analysis 
Supporting publications 
Things to share 
Hadoop job history logs available on request: 
● Describe indexing/searching 4TB dataset 
● Insights on better analysis/visualization are welcome 
● Get cbmi13 example-set at http://goo.gl/e06wE 
Supporting publications 
Supporting Materials 
Check full-texts of our publications: 
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and 
searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 
2013. 
● D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional 
indexing with Hadoop. In Proc. CBMI'13, 2013. 
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale 
image similarity search: experience and best practice. In Proc. IEEE 
BigData'13, 2013. 
Acknowledgements 
● My colleagues at INRIA Rennes 
● Aalto University 
● Grid5000 infrastructure
That’s it! 
Thanks!
