Terabyte-scale image similarity search with Hadoop

Denis Shestakov
Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014

About me
● Big Data researcher/engineer
○ recent projects: large-scale image retrieval
○ before: web crawling
● Hadoop/MapReduce contractor
○ design/development/tuning Hadoop applications
Denis Shestakov
denshe at gmail.com
linkedin: linkedin.com/in/dshestakov

Talk Outline
● Intro to image search
● Image retrieval with MapReduce
● Image indexing/searching workloads
● Hadoop tools for large joins
● Smart Hadoop configuration
● Misc & conclusions

Intro to Image Search
● Finding images given a text
○ e.g., the text query "dog" → images of dogs

● Finding images given an image
○ By content-similarity

Image Search Applications
● Regular image search
○ Google Images, Bing Images, TinEye, etc
● Product search (by image)
● Object recognition
○ Face, logo, vehicle, etc.
● Computer vision
● Augmented reality
● Medical imaging
● Astrophysics

How does it work?
● Images are resized to a smaller size
● Then transformed into a chosen feature-descriptor representation
○ image → set of feature descriptors (= high-dimensional vectors)
○ Many transformations exist
■ We use SIFT (scale-invariant feature transform)

How does it work?
image_id SIFT descriptor
10011 21, 143, 5, …, 201, 186
10011 121, 14, 75, …, 20, 109
10011 37, 40, 0, …, 213, 96
... ...
10011 81, 235, 67, …, 102, 63
Typically: several hundred feature descriptors per image

How does it work?
● Compare (e.g., by calculating Euclidean distance) the feature descriptors of a query image with the descriptors of images in the collection being searched
● Images with the ‘closest’ descriptors are similar to the query image
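The comparison described above can be sketched as a brute-force search. This is only an illustration under simplifying assumptions: the function names are hypothetical, the descriptors are toy 3-dimensional vectors, and each query descriptor simply votes for the image that owns its nearest descriptor.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_images(query_descriptors, collection):
    """Rank collection images by the number of nearest-descriptor votes.

    collection is a list of (image_id, descriptor) pairs.
    """
    votes = Counter()
    for q in query_descriptors:
        # Brute force: scan every descriptor in the collection.
        image_id, _ = min(collection, key=lambda item: euclidean(q, item[1]))
        votes[image_id] += 1
    return votes.most_common()


# Toy data (hypothetical values, 3 dimensions instead of a real SIFT vector).
collection = [
    (10011, (21, 143, 5)),
    (10011, (121, 14, 75)),
    (10012, (200, 10, 30)),
]
query = [(20, 140, 6), (120, 15, 70)]
ranking = rank_images(query, collection)
```

Every query descriptor is compared against every collection descriptor here, which is exactly the cost that an index is meant to avoid.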

Why MapReduce?
● Direct comparison of descriptors is costly even for very small collections
● Many approaches exist to ‘organize’ feature descriptors for fast search
○ Build an index
○ Index all the descriptors
○ At search, check query descriptors only against
certain groups of descriptors

Image Retrieval with MapReduce
Why MapReduce?
● Existing indexing approaches scale poorly
○ up to ~10-20 million images
● But multimedia collections grow exponentially
● Scaling is required …

Use case:
● Copyright violation detection in a large image databank
○ >100 million images
● Searching for a batch of images
○ Thousands of images in one query
○ Focus on throughput, not on response time for an individual image
● SIFT features

Indexing images
● Generating index tree
● Clustering SIFT descriptors into a large set of clusters (max cluster size = 5000)
○ Mapper input:
■ unsorted SIFT descriptors
■ index tree (loaded by every mapper)
○ Mapper output:
■ (cluster_id, SIFT)
○ Reducer output:
■ SIFTs sorted by cluster_id
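The mapper/reducer contract above can be sketched in plain Python. This is a sketch under stated assumptions, not the actual job: the index tree is replaced by a flat list of cluster representatives, and `map_descriptor` and `reduce_by_cluster` are hypothetical names.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance (enough for nearest-neighbour ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_descriptor(record, representatives):
    """Map step: assign one descriptor to its nearest cluster.

    Assumption: a flat list of representatives stands in for the index tree
    that every mapper loads.
    """
    image_id, descriptor = record
    cluster_id = min(
        range(len(representatives)),
        key=lambda i: squared_distance(descriptor, representatives[i]),
    )
    return cluster_id, (image_id, descriptor)


def reduce_by_cluster(pairs):
    """Reduce step: group descriptors and return them sorted by cluster_id."""
    clusters = defaultdict(list)
    for cluster_id, value in pairs:
        clusters[cluster_id].append(value)
    return sorted(clusters.items())


# Toy 2-dimensional data (hypothetical values).
representatives = [(0, 0), (100, 100)]
records = [(1, (90, 95)), (2, (3, 4)), (1, (5, 1))]
mapped = [map_descriptor(r, representatives) for r in records]
index = reduce_by_cluster(mapped)
```

In the real job the grouping and sorting by key is done by the shuffle phase; the sketch only makes the input and output shapes of each step visible.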

Searching
● Generating lookup table (MapReduce job)
○ indexing query SIFTs
● Finding best matches for query SIFTs (MapReduce job)
○ Mapper input:
■ sorted SIFT descriptors
■ lookup table (loaded by every mapper)
○ Mapper output:
■ (query-sift-id, kNN of image-ids)
○ Reducer output:
■ Best votes (image-ids) for query-image-id
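A minimal sketch of the search step, assuming the lookup table maps a cluster id to the query descriptors that fall into that cluster. All names and the toy data are hypothetical; the voting rule (one vote per returned neighbour) is an assumption made for illustration.

```python
import heapq
from collections import Counter, defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_cluster(cluster_id, indexed, lookup, k=2):
    """Map step: for each query SIFT in this cluster, emit its k nearest image ids.

    indexed: list of (image_id, descriptor) stored in the cluster.
    lookup:  cluster_id -> list of (query_image_id, query_sift_id, descriptor).
    """
    for query_image_id, query_sift_id, q in lookup.get(cluster_id, []):
        nearest = heapq.nsmallest(
            k, indexed, key=lambda item: squared_distance(q, item[1])
        )
        yield (query_image_id, query_sift_id), [image_id for image_id, _ in nearest]


def reduce_votes(mapped):
    """Reduce step: count votes per query image and rank candidate image ids."""
    votes = defaultdict(Counter)
    for (query_image_id, _), image_ids in mapped:
        votes[query_image_id].update(image_ids)
    return {q: counter.most_common() for q, counter in votes.items()}


# Toy 2-dimensional data (hypothetical values).
clusters = {
    0: [(10011, (1, 1)), (10011, (2, 2)), (10012, (9, 9))],
    1: [(10013, (50, 50)), (10011, (52, 51))],
}
lookup = {0: [("q1", 0, (1, 2))], 1: [("q1", 1, (51, 51))]}

mapped = []
for cluster_id, indexed in clusters.items():
    mapped.extend(map_cluster(cluster_id, indexed, lookup))
best = reduce_votes(mapped)
```

Each mapper only compares query descriptors against the descriptors of the clusters it reads, which is what keeps the number of distance computations manageable.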

In a nutshell:
● Indexing phase
○ Clustering SIFTs with one-pass k-means
● Searching phase
○ Map-side join of clustered SIFTs and lookup table
(query SIFTs)

Image search workloads
Time to discuss Hadoop specifics:
● Standard Apache Hadoop distribution, version 1.0.1
○ (!) No changes in Hadoop internals
■ Easy to migrate
● Around 100 nodes from Grid5000
○ 8 or 24 cores; 24, 32 or 48GB RAM per node
○ capacity/performance varied across nodes

Dataset:
● 110 million images
○ ~30 billion SIFT descriptors
○ 4TB
○ Largest reported in the literature
○ Images resized to 150px on the largest side
○ Also worked with a 1TB subset
○ Used as a distractor dataset
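As a quick sanity check of these figures, the averages can be worked out directly. The only assumption is that 4TB is read as a decimal terabyte count; the storage layout of a descriptor is not stated in these slides.

```python
images = 110e6          # 110 million images
descriptors = 30e9      # ~30 billion SIFT descriptors
dataset_bytes = 4e12    # 4TB, assuming decimal terabytes

# Average number of descriptors extracted per image.
descriptors_per_image = descriptors / images

# Average storage per descriptor, including any identifiers stored with it.
bytes_per_descriptor = dataset_bytes / descriptors
```

The result is roughly 273 descriptors per image, in line with the "several hundred" mentioned earlier, and roughly 133 bytes per stored descriptor.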

Queries:
● Query batches
○ Up to 250k query images in one batch
○ Batch includes original images and their distorted
variants
■ Some variants are very hard to find
● e.g., print-crumple-scan
● Check whether the original images are returned as top votes
○ (out of scope) state-of-the-art search quality

Indexing workload characteristics
● computationally-intensive (map phase)
● data-intensive (at map&reduce phases)
● large auxiliary data structure (i.e., index tree)
○ grows as dataset grows
○ e.g., 1.8GB for 110M images (4TB)
● map input < map output
● network is heavily utilized during shuffling

Indexing workload

Searching workload
● large auxiliary data structure (i.e., lookup table)

● Basic settings:
○ 512MB HDFS block size
○ 3 replicas
○ 8 map slots
○ 2 reduce slots
● 4TB dataset:
○ 4 map slots

Hadoop tools for large joins
● Some workloads require all mappers to load a
large-size data structure
○ Like image indexing/searching workloads
● Spreading the data file across all nodes
○ Hadoop DistributedCache
● Not efficient if the structure is gigabytes in size
○ Partial solution: increase HDFS block size → decrease #mappers
● Another approach: multithreaded mappers
○ Not well documented

● A multithreaded mapper spawns a configured number of threads; each thread executes the map function
● Mapper threads share RAM, so the large data structure is loaded once per mapper
● Downsides:
○ synchronization when reading input
○ synchronization when writing output
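The idea and its two synchronization points can be illustrated with a small sketch. This is not Hadoop's implementation: `run_multithreaded_map` is a hypothetical helper, and the point is only that the threads share one in-memory structure while taking turns to read input and write output. (In CPython, threads would typically not speed up CPU-bound work; the sketch shows the structure, not the performance.)

```python
import threading

_DONE = object()  # sentinel marking the end of the input


def run_multithreaded_map(records, shared_structure, map_fn, num_threads=2):
    """Run map_fn over records with several threads sharing one structure."""
    records = iter(records)
    read_lock = threading.Lock()    # synchronization when reading input
    write_lock = threading.Lock()   # synchronization when writing output
    output = []

    def worker():
        while True:
            with read_lock:
                record = next(records, _DONE)
            if record is _DONE:
                return
            # The shared structure is only read here, so no lock is needed.
            result = map_fn(record, shared_structure)
            with write_lock:
                output.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output


# Toy usage: the "index tree" is a small dict loaded once and shared.
shared = {"a": 1, "b": 2}
results = run_multithreaded_map(["a", "b", "a"], shared, lambda r, s: (r, s[r]))
```

The order of `results` depends on thread scheduling, so callers should not rely on it.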

Indexing 4TB with 4 map slots, each running two threads
● index tree size: 1.8GB
Indexing time on 100 nodes
● 8h27min → 6h8min

● In some workloads mappers require only a part of the auxiliary data structure
○ i.e., the part relevant to the data block being processed
○ e.g., the image searching workload
● Approach: Hadoop MapFile
○ Very efficient
■ for big batches, >10000 query images
■ ~2 times faster on batches of around 25000 images
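A MapFile keeps its entries sorted by key and maintains a small index over a subset of the keys, so a reader can fetch one entry without loading everything. The sketch below imitates that lookup pattern in memory; `SortedKeyFile` is a hypothetical class for illustration and does not use the Hadoop API.

```python
import bisect


class SortedKeyFile:
    """In-memory imitation of a sorted key/value file with a sparse index."""

    def __init__(self, items, index_interval=2):
        # Entries are kept sorted by key; keys are assumed to be unique.
        self.entries = sorted(items, key=lambda kv: kv[0])
        self.index_interval = index_interval
        # Sparse index: every index_interval-th key.
        self.index_keys = [k for k, _ in self.entries[::index_interval]]

    def get(self, key):
        # Find the last indexed key that is <= the requested key.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None
        start = slot * self.index_interval
        # Scan only the short run of entries covered by that index slot.
        for k, value in self.entries[start:start + self.index_interval]:
            if k == key:
                return value
        return None


# Toy lookup table: cluster_id -> query descriptors for that cluster.
table = SortedKeyFile([(7, "queries-c"), (2, "queries-a"), (5, "queries-b")])
hit = table.get(5)
miss = table.get(3)
```

A mapper processing the data block of cluster 5 would then read only the entry for key 5 instead of the whole lookup table.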

Smart Hadoop configuration
Here is the problem:
● Apache Hadoop, v.1.0.1
● Capacity/performance of nodes varied
○ 8/24 cores, 24-48GB RAM, etc
● One config file (#mappers, #reducers, max. map/reduce memory, ...) for all nodes
● Issue for memory-intensive workloads!

Solution (hack):
● deploy Hadoop on all nodes with settings that suit the least equipped nodes
● create sub-cluster configuration files adjusted to the better equipped nodes
○ replace the original config file with the new one on the better equipped nodes
● restart tasktrackers with the new configuration files on the better equipped nodes
Call it smart deployment
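The per-node settings could be derived along these lines. The two property names are the Hadoop 1.x tasktracker slot settings; everything else (the function name, the sizing rule, the heap and reserve figures) is a hypothetical example, not the configuration used in these experiments.

```python
def slots_for_node(cores, ram_gb, task_heap_gb=3, reserved_gb=4, reduce_slots=2):
    """Pick tasktracker slot counts for one node class.

    Hypothetical sizing rule: leave reserved_gb for the OS and daemons, give
    each task task_heap_gb, and never exceed the number of cores.
    """
    by_memory = (ram_gb - reserved_gb) // task_heap_gb - reduce_slots
    map_slots = max(1, min(cores, by_memory))
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
    }


# Baseline for the least equipped nodes, override for the better equipped ones.
baseline = slots_for_node(cores=8, ram_gb=24)
better = slots_for_node(cores=24, ram_gb=48)
```

The baseline values go into the config file deployed everywhere; the second set goes into the sub-cluster config file that replaces it on the better equipped nodes before their tasktrackers are restarted.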
● Or is it known under another name?

Indexing 1TB on 106 nodes: 75min → 65min

Conclusions
● Several directions for further optimization
● Presented techniques applicable to video and
audio datasets
○ Given a transformation into feature vectors
○ Only small changes expected (e.g., a new Writable)
● Hadoop smart deployment trick
● (Wanted) Best practices for Hadoop job
history log analysis

Supporting publications
Things to share
Hadoop job history logs available on request:
● Describe indexing/searching 4TB dataset
● Insights on better analysis/visualization are welcome
● Get cbmi13 example-set at http://goo.gl/e06wE

Supporting Materials
Check the full texts of our publications:
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 2013.
● D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013.
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013.

Acknowledgements
● My colleagues at INRIA Rennes
● Aalto University
● Grid5000 infrastructure

That’s it!
Thanks!
