Talk given at Hadoop Summit Europe 2014, Amsterdam, Netherlands on 02.04.2014
Talk abstract: In this talk I focus on a specific Hadoop application, image similarity search, and present our experience of designing, building and testing a Hadoop-based image similarity search system scalable to terabyte-sized image collections. I start with an overview of how to adapt image retrieval techniques to the MapReduce model. Second, I describe the image indexing and searching workloads and show that these workloads are rather atypical for Hadoop. Third, I explain how to tune Hadoop to fit such computational tasks and specify the parameters and values that deliver the best performance. Next, I present the Hadoop cluster heterogeneity problem and describe a solution to it: a platform-aware Hadoop configuration. Then I introduce the tools, provided by the standard Apache Hadoop framework, that are useful for a large class of application workloads similar to ours, where a large auxiliary data structure is required for processing the dataset. Finally, I give an overview of a series of experiments conducted on a four-terabyte image dataset (the biggest reported in the academic literature). The findings will be shared as best practices and recommendations for practitioners working with huge multimedia collections.
Speaker: Dr. Denis Shestakov is an experienced researcher in the area of big data engineering and, recently, a practitioner as a Hadoop/MapReduce consultant. Denis has been involved in various big data projects in web analytics and search, multimedia search and bioinformatics. See his profile at LinkedIn: http://fi.linkedin.com/in/dshestakov/
Terabyte-scale image similarity search with Hadoop
1. Terabyte-scale image similarity search with Hadoop
Denis Shestakov
Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014
2. About me
● Big Data researcher/engineer
○ recent projects: large-scale image retrieval
○ before: web crawling
● Hadoop/MapReduce contractor
○ design/development/tuning Hadoop applications
Denis Shestakov
denshe at gmail.com
linkedin: linkedin.com/in/dshestakov
3. Talk Outline
● Intro to image search
● Image retrieval with MapReduce
● Image indexing/searching workloads
● Hadoop tools for large joins
● Smart Hadoop configuration
● Misc & conclusions
4. Intro to Image Search
● Finding images given a text
○ dog →
5. Intro to Image Search
● Finding images given an image
○ By content-similarity
7. Intro to Image Search
8. Intro to Image Search
How does it work?
● Images resized to smaller size
● Then transformed into the chosen feature descriptor
representation
○ image → set of feature descriptors (=high-dimensional
vectors)
○ Many transformations exist
■ SIFT (Scale-invariant feature transform) used by us
9. Intro to Image Search
How does it work?
image_id | SIFT descriptor
10011    | 21, 143, 5, …, 201, 186
10011    | 121, 14, 75, …, 20, 109
10011    | 37, 40, 0, …, 213, 96
...      | ...
10011    | 81, 235, 67, …, 102, 63
Typically several hundred feature descriptors
per image
10. Intro to Image Search
How does it work?
● Compare (e.g., by calculating Euclidean distance)
feature descriptors of a query image with
descriptors of images in collection to search
● Images with ‘closest’ descriptors are similar to a
query image
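The comparison described above can be sketched as a brute-force nearest-neighbour search. This is only an illustration, not the talk's implementation: the function names, the toy 4-dimensional vectors and the image ids are made up for the example (real SIFT descriptors have 128 dimensions).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_descriptors(query, collection, k=2):
    """Return the k (distance, image_id) pairs closest to the query descriptor.

    `collection` is a list of (image_id, descriptor) pairs. Every descriptor
    is compared with the query, which is what makes direct comparison costly.
    """
    scored = [(euclidean(query, desc), image_id) for image_id, desc in collection]
    return sorted(scored)[:k]

# Toy data (hypothetical values, 4 dimensions instead of 128).
collection = [
    (10011, [21, 143, 5, 201]),
    (10011, [121, 14, 75, 20]),
    (10012, [20, 140, 8, 198]),
    (10013, [250, 3, 90, 44]),
]
query = [22, 141, 6, 200]
matches = nearest_descriptors(query, collection, k=2)
print(matches)
```

The images that own the closest descriptors (here 10011 and 10012) are the candidates for being similar to the query image.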
11. Intro to Image Search
Why MapReduce?
● Direct comparisons of descriptors costly even for
very small collections
● Lots of approaches to ‘organize’ feature
descriptors for fast search
○ Build an index
○ Index all the descriptors
○ At search, check query descriptors only against
certain groups of descriptors
12. Image Retrieval with MapReduce
Why MapReduce?
● Existing indexing approaches scale poorly
○ up to ~10-20 mln images
● But multimedia grows exponentially
● Scaling is required …
13. Image Retrieval with MapReduce
Use case:
● Copyright violation detection in large image
databank
○ >100mln images
● Searching for batch of images
○ Thousands of images in one query
○ Focus on throughput, not on response time for
individual image
● SIFT features
14. Image Retrieval with MapReduce
Indexing images
● Generating index tree
● Clustering images into a large set of clusters
(max cluster size = 5000), run as a MapReduce job
○ Mapper input:
■ unsorted SIFT descriptors
■ index tree (loaded by every mapper)
○ Mapper output:
■ (cluster_id, SIFT)
○ Reducer output:
■ SIFTs sorted by cluster_id
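A minimal local simulation of the indexing job above, written in plain Python rather than against the Hadoop API. It rests on simplifying assumptions: the index tree is replaced by a flat list of hypothetical centroids, and the shuffle is emulated with a dictionary. All names and values are illustrative.

```python
from collections import defaultdict

def squared_distance(a, b):
    """Squared Euclidean distance (enough for picking the nearest centroid)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def index_mapper(records, centroids):
    """Map step: emit (cluster_id, (image_id, descriptor)) for every descriptor."""
    for image_id, descriptor in records:
        cluster_id = min(
            range(len(centroids)),
            key=lambda c: squared_distance(descriptor, centroids[c]),
        )
        yield cluster_id, (image_id, descriptor)

def index_reducer(mapped):
    """Emulated shuffle + reduce: group descriptors, return them sorted by cluster_id."""
    clusters = defaultdict(list)
    for cluster_id, value in mapped:
        clusters[cluster_id].append(value)
    return sorted(clusters.items())

# Hypothetical flat stand-in for the index tree, and toy 2-d descriptors.
centroids = [[0, 0], [100, 100], [200, 0]]
records = [(1, [190, 10]), (2, [5, 3]), (1, [90, 110]), (3, [2, 8])]

index = index_reducer(index_mapper(records, centroids))
for cluster_id, members in index:
    print(cluster_id, members)
```

Each mapper needs the whole set of centroids to assign a descriptor, which is why the index tree has to be loaded by every mapper.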
15. Image Retrieval with MapReduce
Searching
● Generating lookup table (MapReduce job)
○ indexing query SIFTs
● Finding best matches for query SIFTs (MapReduce job)
○ Mapper input:
■ sorted SIFT descriptors
■ lookup table (loaded by every mapper)
○ Mapper output:
■ (query-sift-id, knn of image-ids)
○ Reducer output:
■ Best votes (image-ids) for query-image-id
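The search job above can be sketched in the same local, non-Hadoop style. The assumptions here are mine: the lookup table is a dictionary from cluster_id to the query descriptors assigned to that cluster, a query-sift-id is a (query_image_id, n) pair, and votes are simple counts.

```python
from collections import Counter, defaultdict

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_mapper(cluster_id, indexed, lookup_table, k=2):
    """Map step for one cluster: emit (query_sift_id, k nearest image_ids).

    indexed:      list of (image_id, descriptor) belonging to this cluster
    lookup_table: dict cluster_id -> list of (query_sift_id, descriptor)
    """
    for query_sift_id, q_desc in lookup_table.get(cluster_id, []):
        ranked = sorted(indexed, key=lambda rec: squared_distance(q_desc, rec[1]))
        yield query_sift_id, [image_id for image_id, _ in ranked[:k]]

def search_reducer(mapped):
    """Reduce step: aggregate votes per query image, best-voted image first."""
    votes = defaultdict(Counter)
    for (query_image_id, _), image_ids in mapped:
        votes[query_image_id].update(image_ids)
    return {q: counter.most_common() for q, counter in votes.items()}

# Toy clustered collection and lookup table (hypothetical values).
clustered = {
    0: [(2, [5, 3]), (3, [2, 8]), (2, [6, 4])],
    1: [(1, [90, 110]), (2, [95, 105])],
}
lookup_table = {
    0: [(("q1", 0), [5, 4])],
    1: [(("q1", 1), [94, 106])],
}

mapped = []
for cluster_id, indexed in clustered.items():
    mapped.extend(search_mapper(cluster_id, indexed, lookup_table))
results = search_reducer(mapped)
print(results)
```

Query descriptors are only compared with indexed descriptors from the same cluster, so the mapper never scans the whole collection for a single query descriptor.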
16. Image Retrieval with MapReduce
In a nutshell:
● Indexing phase
○ Clustering SIFTs with one-pass k-means
● Searching phase
○ Map-side join of clustered SIFTs and lookup table
(query SIFTs)
17. Image search workloads
Time to discuss Hadoop specifics:
● Standard Apache Hadoop distribution, ver.1.0.1
○ (!) No changes in Hadoop internals
■ Easy to migrate
● Around 100 nodes from Grid5000
○ 8/24 cores, 24/32/48GB RAM per node
○ capacity/performance varied
18. Image search workloads
Dataset:
● 110 mln images
○ ~30 billion SIFT descriptors
○ 4TB
○ Largest reported in literature
○ Images resized to 150px on largest side
○ Also worked with a 1TB subset
○ Used as a distractor dataset
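A quick back-of-the-envelope check of the dataset figures above. The only assumption is that 4TB is read as decimal terabytes; the variable names are illustrative.

```python
images = 110e6            # 110 mln images
descriptors = 30e9        # ~30 billion SIFT descriptors
dataset_bytes = 4e12      # 4TB, assumed here to mean decimal terabytes

descriptors_per_image = descriptors / images
bytes_per_descriptor = dataset_bytes / descriptors

print(f"~{descriptors_per_image:.0f} descriptors per image")
print(f"~{bytes_per_descriptor:.0f} bytes per stored descriptor")
```

Roughly 273 descriptors per image agrees with "several hundred feature descriptors per image", and roughly 133 bytes per record is plausible for a 128-dimensional SIFT vector stored with one byte per dimension plus an identifier, although the exact record layout is not given in the slides.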
19. Image search workloads
Queries:
● Query batches
○ Up to 250k query images in one batch
○ Batch includes original images and their distorted
variants
■ Some variants are very hard to find
● e.g., print-crumple-scan
● Check if original images returned as top votes
○ (out of scope) state-of-the-art search quality
20. Image search workloads
Indexing workload characteristics
● computationally-intensive (map phase)
● data-intensive (at map&reduce phases)
● large auxiliary data structure (i.e., index tree)
○ grows as dataset grows
○ e.g., 1.8GB for 110M images (4TB)
● map input < map output
● network is heavily utilized during shuffling
24. Hadoop tools for large joins
● Some workloads require all mappers to load a
large-size data structure
○ Like image indexing/searching workloads
● Spreading data file across all nodes
○ Hadoop DistributedCache
● Not efficient if the structure is gigabytes in size
○ Partial solution: increase HDFS block sizes →
decrease #mappers
● Another approach: multithreaded mappers
○ Not well documented
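Why larger HDFS blocks help can be seen with a rough calculation. This sketch assumes one map task per HDFS block and that every map task loads the full 1.8GB index tree once. The two block sizes are hypothetical comparison points, not the values used in the experiments, and 4TB is read as binary terabytes to keep the numbers round.

```python
import math

TIB = 2 ** 40
MIB = 2 ** 20
GB = 10 ** 9

def map_task_count(dataset_bytes, block_bytes):
    """Number of map tasks, assuming one input split per HDFS block."""
    return math.ceil(dataset_bytes / block_bytes)

dataset_bytes = 4 * TIB        # the 4TB collection (binary units assumed)
index_tree_bytes = 1.8 * GB    # index tree size quoted in the talk

for block_mib in (64, 512):    # hypothetical block sizes for comparison
    tasks = map_task_count(dataset_bytes, block_mib * MIB)
    loaded_tb = tasks * index_tree_bytes / 1e12
    print(f"{block_mib} MiB blocks: {tasks} map tasks, "
          f"~{loaded_tb:.0f} TB of index tree loads in total")
```

Under these assumptions, going from 64 MiB to 512 MiB blocks cuts the number of map tasks, and therefore the number of times the auxiliary structure is loaded, by a factor of eight.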
25. Hadoop tools for large joins
● Multithreaded mapper spawns a configured number
of threads, each thread executes the map function
● Mapper threads share the RAM
● Downsides:
○ synchronization when reading input
○ synchronization when writing output
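The idea can be illustrated with a small threading sketch. This is not Hadoop code and not its API: it only shows several worker threads sharing one in-memory structure while synchronizing on input reads and output writes. The function and variable names are hypothetical, and in CPython the threads illustrate the sharing pattern rather than a real CPU speedup.

```python
import threading

_DONE = object()  # sentinel marking the end of the input

def run_multithreaded_mapper(records, shared_index, map_fn, num_threads=2):
    """Apply map_fn to every record using threads that share one structure."""
    source = iter(records)
    read_lock = threading.Lock()    # synchronization when reading input
    write_lock = threading.Lock()   # synchronization when writing output
    output = []

    def worker():
        while True:
            with read_lock:
                record = next(source, _DONE)
            if record is _DONE:
                return
            result = map_fn(record, shared_index)  # shared_index is read-only
            with write_lock:
                output.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output

# Toy usage: one shared lookup structure, loaded once, used by both threads.
shared_index = {"a": 0, "b": 1}
records = ["a", "b", "a", "b"]
output = run_multithreaded_mapper(
    records, shared_index, lambda rec, idx: (idx[rec], rec), num_threads=2
)
print(sorted(output))
```

The shared structure is held in memory once per mapper instead of once per thread, at the price of the two locks listed as downsides above.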
26. Hadoop tools for large joins
Indexing 4TB with 4 mapper slots, each running
two threads
● index tree size: 1.8GB
Indexing time on 100 nodes
● 8h27min → 6h8min
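For reference, the reported improvement can be expressed as a speedup factor. The sketch only converts the two durations quoted above; nothing else is assumed.

```python
def to_minutes(hours, minutes):
    return hours * 60 + minutes

before = to_minutes(8, 27)   # 8h27min
after = to_minutes(6, 8)     # 6h8min

speedup = before / after
reduction_pct = 100 * (before - after) / before

print(f"{before} min -> {after} min")
print(f"speedup: {speedup:.2f}x, time saved: {reduction_pct:.1f}%")
```

That is about a 1.38x speedup, or roughly 27% less indexing time.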
27. Hadoop tools for large joins
● In some workloads mappers require only a part
of auxiliary data structure
○ I.e., relevant to data block processed
○ Image searching workflow
● Approach: Hadoop MapFile
○ Very efficient
■ Big batches, >10000 query images
■ ~2 times faster on batches including around
25000 images
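The lookup pattern behind this approach can be sketched as follows. A Hadoop MapFile is, as far as I understand it, a sorted key/value file accompanied by an index over a subset of its keys. The class below is only an in-memory stand-in for that idea with hypothetical names, not the Hadoop API.

```python
import bisect

class SortedLookupFile:
    """In-memory stand-in for a sorted key/value file with a sparse key index."""

    def __init__(self, entries, index_interval=2):
        # Entries are kept sorted by key, as they would be on disk.
        self.entries = sorted(entries, key=lambda kv: kv[0])
        # Sparse index: every index_interval-th key and its position.
        self.index_pos = list(range(0, len(self.entries), index_interval))
        self.index_keys = [self.entries[i][0] for i in self.index_pos]

    def get(self, key):
        """Return the value for key, or None, scanning from the nearest indexed key."""
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None
        for k, v in self.entries[self.index_pos[slot]:]:
            if k == key:
                return v
            if k > key:
                break
        return None

# Toy usage: cluster_id -> query descriptors assigned to that cluster.
lookup = SortedLookupFile([(7, ["q3"]), (2, ["q1", "q2"]), (5, ["q4"]), (9, ["q5"])])
print(lookup.get(5), lookup.get(6))
```

A mapper processing the data block for cluster 5 would fetch only the entry for key 5 instead of loading the whole lookup table.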
28. Smart Hadoop configuration
Here is the problem:
● Apache Hadoop, v.1.0.1
● Capacity/performance of nodes varied
○ 8/24 cores, 24-48GB RAM, etc
● One config file (#mappers, #reducers, max.
map/reduce memory, ...) for all nodes
● Issue for memory-intensive workloads!
29. Smart Hadoop configuration
Solution (hack):
● deploy Hadoop on all nodes with settings addressing
the least equipped nodes
● create sub-cluster configuration files adjusted to better
equipped nodes
○ substitute original config file with the new one on better
equipped nodes
● restart tasktrackers with new configuration files on
better equipped nodes
Call it smart deployment
● Or known under another name?
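A sketch of how per-node-class settings for such a deployment could be derived. The property names are standard Hadoop 1.x tasktracker settings. The sizing heuristic, the reserved memory, the heap per task and the pairing of cores with RAM are my own hypothetical choices, not the values used in the talk.

```python
def tasktracker_settings(cores, ram_gb, reserved_gb=4, heap_gb_per_task=3):
    """Derive slot counts for one node class (hypothetical heuristic)."""
    by_memory = (ram_gb - reserved_gb) // heap_gb_per_task
    slots = max(1, min(cores, by_memory))
    map_slots = max(1, (slots * 2) // 3)        # roughly 2/3 of slots for maps
    reduce_slots = max(1, slots - map_slots)
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        "mapred.child.java.opts": f"-Xmx{heap_gb_per_task * 1024}m",
    }

# Hypothetical node classes: (cores, RAM in GB).
node_classes = {"least_equipped": (8, 24), "better_equipped": (24, 48)}

# Step 1: settings for the least equipped nodes go to every node.
baseline = tasktracker_settings(*node_classes["least_equipped"])
# Step 2: better equipped nodes get their own file before tasktrackers restart.
override = tasktracker_settings(*node_classes["better_equipped"])

print("baseline:", baseline)
print("override:", override)
```

With these made-up numbers the better equipped nodes end up with more than twice as many task slots as the baseline configuration would give them.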
30. Smart Hadoop configuration
Indexing 1TB on 106 nodes: 75min → 65min
31. Conclusions
● Several directions for further optimization
● Presented techniques applicable to video and
audio datasets
○ Given a transformation into feature vectors
○ Only small changes expected (e.g., new Writable)
● Hadoop smart deployment trick
● (Wanted) Best practices for Hadoop job
history log analysis
32. Supporting publications
Things to share
Hadoop job history logs available on request:
● Describe indexing/searching 4TB dataset
● Insights on better analysis/visualization are welcome
● Get cbmi13 example-set at http://goo.gl/e06wE
33. Supporting publications
Supporting Materials
Check full-texts of our publications:
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and
searching 100M images with Map-Reduce. In Proc. ACM ICMR'13,
2013.
● D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional
indexing with Hadoop. In Proc. CBMI'13, 2013.
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale
image similarity search: experience and best practice. In Proc. IEEE
BigData'13, 2013.
34. Acknowledgements
● My colleagues at INRIA Rennes
● Aalto University
● Grid5000 infrastructure