This document summarizes research on mobile augmented reality from the Stanford-Nokia collaboration. It describes a landmark recognition system based on "bag of words" feature matching and explores approaches to feature compression, including the CHoG (Compressed Histogram of Gradients) descriptor. It discusses scaling to large databases with vocabulary trees and forests, multi-view matching, 3D modeling from images, and streaming augmented reality with minimal latency. Future research directions include improved features, matching algorithms, and 3D modeling to enable large-scale urban landmark recognition.
2. Mobile Augmented Reality Team Radek Grzeszczuk, Bernd Girod, Vijay Chandrasekhar, Gabriel Takacs, Wei-Chao Chen, Natasha Gelfand, Yingen Xiong, Kari Pulli, Sam Tsai, David Chen, Jana Kosecka, Ramakrishna Vedantham, Mina Makar
3. Outline Review: landmark recognition system Architecture: location-based pre-fetching and matching on the phone Computer vision: “Bag of Words” matching Feature compression for server-side matching Approaches explored: Transform coding of features, patch compression Compressible descriptor: CHoG (Compressed Histogram of Gradients) Scalability for large databases From “Bags of Words” to “Vocabulary Trees” to “Vocabulary Forests” Accuracy vs. database size Towards 3D Multi-view vocabulary trees Matching against 3D models Summary and future directions
4. Outline Review: landmark recognition system Architecture: location-based pre-fetching and matching on the phone Computer vision: “Bag of Words” matching Feature compression for server-side matching Approaches explored: Transform coding of features, patch compression Compressible descriptor: CHoG (Compressed Histogram of Gradients) Scalability for large databases From “Bags of Words” to “Vocabulary Trees” to “Vocabulary Forests” Accuracy vs. database size Towards 3D Multi-view vocabulary trees Matching against 3D models Summary and future directions
11. Timing Analysis (Q2 2008) Nokia N95: 332 MHz ARM, 64 MB RAM; 100 kByte JPEG; 60 kbps uplink. [Chart comparing execution time for three configurations — “Extract Features on Phone”, “All on Phone”, “All on Server” — broken into download, upload, feature extraction, feature matching, and geometric consistency stages.]
12. Outline Review: landmark recognition system Architecture: location-based pre-fetching and matching on the phone Computer vision: “Bag of Words” matching Feature compression for server-side matching Approaches explored: Transform coding of features, patch compression Compressible descriptor: CHoG (Compressed Histogram of Gradients) Scalability for large databases From “Bags of Words” to “Vocabulary Trees” to “Vocabulary Forests” Accuracy vs. database size Towards 3D Multi-view vocabulary trees Matching against 3D models Summary and future directions
13. Advanced Feature Compression Transform coding of SIFT/SURF descriptors [Chandrasekhar et al., VCIP 09] Direct compression of oriented image patches [M. Makar et al., ICASSP 09] Descriptor designed for compressibility: CHoG [Chandrasekhar et al., CVPR 09] Tree-structured vector quantization / tree histogram coding [Chen et al., DCC 09] Compression of location information [Tsai et al., Mobimedia 09]
14. CHoG: Compressed Histogram of Gradients [Diagram: the gradients (dx, dy) of an image patch are pooled by spatial binning; a gradient distribution is kept for each spatial bin; histogram compression turns the per-bin distributions into short bit strings that together form the CHoG descriptor.]
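A minimal NumPy sketch of the pipeline on this slide — gradients, spatial binning, per-bin gradient histograms — with illustrative parameters; the bin counts and the entropy-coding stage of the published CHoG design are not reproduced here:

```python
import numpy as np

def chog_descriptor(patch, n_spatial=3, n_grad=5):
    """Sketch of a CHoG-style descriptor (illustrative, not the published design).

    patch: 2-D grayscale array. Returns one coarse gradient histogram per
    spatial bin; a real system would compress these histograms.
    """
    dy, dx = np.gradient(patch.astype(float))
    h, w = patch.shape
    # Assign each pixel to one of n_spatial x n_spatial spatial bins.
    ys = np.minimum(np.arange(h) * n_spatial // h, n_spatial - 1)
    xs = np.minimum(np.arange(w) * n_spatial // w, n_spatial - 1)
    sbin = ys[:, None] * n_spatial + xs[None, :]
    # Quantize gradient orientation into n_grad coarse bins per spatial cell,
    # weighting each vote by gradient magnitude.
    angle = np.arctan2(dy, dx)  # range (-pi, pi]
    gbin = ((angle + np.pi) / (2 * np.pi) * n_grad).astype(int) % n_grad
    hist = np.zeros((n_spatial * n_spatial, n_grad))
    np.add.at(hist, (sbin.ravel(), gbin.ravel()), np.hypot(dx, dy).ravel())
    # Normalize each cell's histogram; CHoG then entropy-codes these.
    hist /= hist.sum(axis=1, keepdims=True) + 1e-9
    return hist
```

The output is a small set of normalized distributions per spatial cell, which is what makes the descriptor amenable to histogram compression.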
22. Outline Review: landmark recognition system Architecture: location-based pre-fetching and matching on the phone Computer vision: “Bag of Words” matching Feature compression for server-side matching Approaches explored: Transform coding of features, patch compression Compressible descriptor: CHoG (Compressed Histogram of Gradients) Scalability for large databases From “Bags of Words” to “Vocabulary Trees” to “Vocabulary Forests” Accuracy vs. database size Towards 3D Multi-view vocabulary trees Matching against 3D models Summary and future directions
32. Real-time System: Send Image [Diagram: the client (camera) sends the image over the wireless network; the server runs feature extraction and VocTree image matching and returns information.]
33. Real-time System: Send Features [Diagram: the client (camera) runs feature extraction and coding, then sends features over the wireless network; the server runs VocTree image matching and returns information.]
34–36. Timing Analysis Nokia N95: 332 MHz ARM, 64 MB RAM. [Chart builds comparing execution time (sec): “Send Image” = upload a 40 kByte image + server delay; “Send Features” = extract features on the phone + upload 2.2 kByte + server delay.]
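The upload-time gap between the two configurations follows from back-of-the-envelope arithmetic using the slide's payload sizes and the 60 kbps uplink figure from the earlier Q2 2008 timing slide:

```python
# Upload time over the 60 kbit/s uplink assumed on the Q2 2008 timing slide.
UPLINK_KBPS = 60

def upload_seconds(kbytes):
    """Seconds to upload a payload of the given size in kilobytes."""
    return kbytes * 8 / UPLINK_KBPS

send_image = upload_seconds(40)      # 40 kByte image query
send_features = upload_seconds(2.2)  # 2.2 kByte compressed features

print(f"Send Image upload:    {send_image:.2f} s")    # ~5.33 s
print(f"Send Features upload: {send_features:.2f} s") # ~0.29 s
```

"Send Features" trades most of that upload time for feature-extraction compute on the phone, which is why the total latency comparison on the slide still includes an "Extract Features" segment on the client side.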
37. Streaming MAR [Timeline diagram of client and server: the client continuously tracks camera pose; during low motion it sends a query frame; the server extracts features, searches a k-d tree, and checks geometry, then sends back the ID and geometry; the client displays the ID (e.g., the CD cover “John Mayer – Inside Wants Out”), draws the boundary, and compensates camera pose during high motion.]
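The client side of this timeline can be sketched as a loop (hypothetical interfaces — `frames`, `query_server`, and the motion threshold are assumptions, not the system's actual API): query only when motion is low, and keep overlaying the last known result between server responses.

```python
def streaming_mar_client(frames, query_server, motion_threshold=1.0):
    """Sketch of the streaming MAR client loop (hypothetical interfaces).

    frames: iterable of (frame, motion) pairs, where motion is a scalar
    estimate from the pose tracker. query_server(frame) returns
    (object_id, boundary) or None. Yields the overlay shown per frame.
    """
    last = None
    for frame, motion in frames:
        # Low motion: send a query frame; the server extracts features,
        # searches its k-d tree, and runs the geometric consistency check.
        if motion < motion_threshold:
            result = query_server(frame)
            if result is not None:
                last = result
        # Between server responses the client keeps tracking camera pose
        # and motion-compensates the last known ID and boundary.
        yield last

# Usage: a fake server that recognizes the frame labeled "cd_cover".
hits = list(streaming_mar_client(
    [("blur", 5.0), ("cd_cover", 0.2), ("cd_cover", 3.0)],
    lambda f: ("Inside Wants Out", "bbox") if f == "cd_cover" else None))
```

Real clients would also warp the boundary by the tracked pose change; this sketch only shows the scheduling logic.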
38. Outline Review: landmark recognition system Architecture: location-based pre-fetching and matching on the phone Computer vision: “Bag of Words” matching Feature compression for server-side matching Approaches explored: Transform coding of features, patch compression Compressible descriptor: CHoG (Compressed Histogram of Gradients) Scalability for large databases From “Bags of Words” to “Vocabulary Trees” to “Vocabulary Forests” Accuracy vs. database size Towards 3D Multi-view vocabulary trees City-scale landmark recognition using view-invariant matching Summary and future directions
39. Multiview Database [Image sets: front, top, bottom, right, and left view images.]
40. Multiview Vocabulary Trees [Diagram: a query image is scored against five SVTs (left, front, top, bottom, right); the top matches from each tree are pooled, and a geometric consistency check selects the top match.]
41. Multiview Matching Performance [Charts: image recall and match rate per query view (front, left, right, top, bottom) for a single front SVT vs. multiview SVTs.]
42. Compact Architectural Models from Geo-Registered Image Collections [Pipeline: GPS-tagged images + building outline → camera pose estimation → robust map alignment → efficient view selection → 3D model of landmark. Unstructured image collections: Panoramio; structured image collections: Street View data (Navteq).] [Grzeszczuk, 3DIM 2009]
43. View-Invariant Matching Pipeline [Diagram: database images are rectified using a 3D model before feature extraction into the feature store; an oblique query image is rectified using vanishing points before feature extraction; matching the two feature sets yields the results.]
44. Outline Review: landmark recognition system Architecture: location-based pre-fetching and matching on the phone Computer vision: “Bag of Words” matching Feature compression for server-side matching Approaches explored: Transform coding of features, patch compression Compressible descriptor: CHoG (Compressed Histogram of Gradients) Scalability for large databases From “Bags of Words” to “Vocabulary Trees” to “Vocabulary Forests” Accuracy vs. database size Towards 3D Multi-view vocabulary trees Matching against 3D models Summary and future directions
45. Research Directions
Image features: keypoint detection optimized for CHoG, prioritization; comprehensive performance analysis of compressed feature matching; next-generation CHoG: soft kernels vs. hard binning, embedded, refinable bitstream; beyond RANSAC: advanced geometry matching and coding, incorporating scale and orientation.
Image database / vocabulary trees: optimum tree/forest growing, CHoG trees, incremental database updates; fast query, early termination, distance metrics, scoring, nearest-neighbor algorithms; trees for phone implementation, inverted file caching, tree histogram coding.
Streaming mobile augmented reality: camera pose estimation, feature tracking, temporally coherent feature extraction; continuous recognition strategies, scheduling, latency minimization; superposition of graphics information, motion compensation, occlusion handling.
3D modeling: image matching pipeline using 3D models; automatic image rectification, features from texture maps; methods for integrating heterogeneous image sources; demonstrate improved landmark recognition for large-scale urban scenes. Collaboration with Marc Pollefeys, ETH Zurich.
Editor's notes
There are only a limited number of different Huffman trees. The Catalan number gives the number of rooted binary trees (ordered leaves, no cross-overs); count unique permutations.
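The count referenced in this note can be checked directly: the number of rooted binary trees with n internal nodes (equivalently, n + 1 ordered leaves with no cross-overs) is the n-th Catalan number, C_n = C(2n, n) / (n + 1).

```python
from math import comb

def catalan(n):
    """n-th Catalan number: rooted binary trees with n internal nodes
    (equivalently n + 1 ordered leaves, no cross-overs)."""
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])
# -> [1, 1, 2, 5, 14, 42, 132, 429]
```

This bounds how many distinct Huffman tree shapes can occur for a given alphabet size, which is what makes enumerating (and coding) the tree shape itself feasible.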
Winder and Brown (Microsoft Research), “Learning Local Image Descriptors”: 64×64 patches; tourist photographs of the Trevi Fountain and of Yosemite Valley (920 images), and a test set consisting of images of Notre Dame (500 images). BoostSSC – Boosting Similarity Sensitive Coding: G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with parameter sensitive hashing,” in Proc. ICCV, 2003. Torralba et al., “Small Codes and Large Image Databases for Recognition,” CVPR 2008. Random projections: C. Yeo, P. Ahammad, and K. Ramchandran, “Rate-Efficient Visual Correspondences Using Random Projections,” 2008.
Most retrieval applications require NN search in some form. The descriptors for both SIFT and CHoG were computed from the same set of patches. The VQ-5 bin configuration, GLOH-9 cell configuration, and Huffman tree coding are used for CHoG, resulting in a 45-dimensional descriptor. We observe that exact nearest-neighbor searching is 10× faster for CHoG. Furthermore, CHoG is still 2× faster than SIFT with ANN (eps = 1), which incurs a small error rate of 0.30%. The speed-up results from the lower dimensionality of the CHoG descriptor and the use of lookup tables for fast distance computation.
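The "lookup tables for fast distance computation" can be illustrated as follows (a sketch under assumed structure, not the paper's exact method): once each histogram cell is quantized to one of K codebook entries, all K × K cell-to-cell distances can be precomputed, so comparing two descriptors reduces to index lookups and a sum.

```python
import numpy as np

def build_distance_table(codebook):
    """Precompute distances between all pairs of quantized histogram cells.

    codebook: (K, B) array of the K possible per-cell histograms (B bins).
    Returns a (K, K) table of L2 distances (a real system might use KL or
    another histogram divergence instead).
    """
    diff = codebook[:, None, :] - codebook[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def descriptor_distance(a_idx, b_idx, table):
    """Distance between two descriptors given as per-cell codebook indices."""
    return table[a_idx, b_idx].sum()

# Usage with a toy 3-entry codebook of one-hot "histograms".
table = build_distance_table(np.eye(3))
d = descriptor_distance(np.array([0, 1]), np.array([0, 2]), table)
```

No floating-point histogram arithmetic happens at query time; the per-cell cost is a single table read, which is where the speed-up over full-dimensional SIFT distances comes from.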
The scalable vocabulary tree is the data structure at the center of our recognition system. To construct an SVT, first we take every database CD cover and extract robust local features. These features can be SIFT, SURF, or your own favorite type. Then, all the feature descriptors from all the images are represented as vectors in a high-dimensional space. Here, they are shown as 2-dimensional vectors, but in reality, they can be 64-dimensional or 128-dimensional vectors.
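The construction described in these notes — extract descriptors, then recursively split the descriptor space with hierarchical k-means — can be sketched in plain NumPy (illustrative helper names; production systems use optimized k-means and much larger branching factors):

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

def build_svt(X, k=2, depth=3):
    """Hierarchical k-means: split the data into k clusters, then
    recursively split each cluster, until depth runs out or clusters
    become too small. Nodes are the cluster centroids; children are
    the subcluster centroids."""
    if depth == 0 or len(X) < k:
        return {"centroid": X.mean(axis=0), "children": []}
    _, labels = kmeans(X, k)
    return {"centroid": X.mean(axis=0),
            "children": [build_svt(X[labels == j], k, depth - 1)
                         for j in range(k) if (labels == j).any()]}
```

In a real system X would hold 64- or 128-dimensional SIFT/SURF (or CHoG) descriptors from every database image, with k around 10 and millions of rows; the 2-D case here is only for readability, mirroring the slides.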
To impose some structure on this space, we perform hierarchical k-means clustering, the first step of which is dividing the space into k clusters using regular k-means.
And then again, recursively splitting each large cluster into k smaller clusters. We repeat this process until the clusters become sufficiently small. What results from the hierarchical k-means algorithm is a tree structure, where tree nodes are the cluster centroids and their children are the subcluster centroids.
Here is the same tree as on the previous slide, except the tree structure is more apparent. Once we have constructed an SVT on a server, how to process an incoming query is straightforward. For every query descriptor, we classify it by traversing the SVT greedily from top to bottom. Suppose the first descriptor follows this nearest neighbor path. The SVT knows which database images have features associated with every node, so it votes for the two images found on this path. Both the blue nodes and green nodes vote, but since the blue nodes are more discriminative, their vote counts for more. Then, another query descriptor goes down a different path and votes for other images. And so on, until all the query descriptors are classified. The final vote tally is a histogram indicating how likely each database image is a match. We notice that when both the query and database images are fronto-parallel, the voting scheme works well and will select the correct database match. This is because similar features are extracted from the query image and the matching database image, leading to their descriptors visiting many of the same nodes in the SVT.
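The greedy traversal and weighted voting described here can be sketched as follows (assumed node structure: each node is a dict with a 'centroid', 'children', an inverted file 'images' listing database images whose features reached that node, and a discriminability 'weight' — more discriminative nodes count for more, as with the blue vs. green nodes on the slide):

```python
import numpy as np
from collections import Counter

def query_svt(root, descriptors):
    """Greedy SVT traversal with weighted voting (illustrative sketch).

    Returns a Counter of weighted vote tallies per database image; the
    tally is the histogram indicating how likely each image is a match.
    """
    votes = Counter()
    for d in descriptors:
        node = root
        while node["children"]:
            # Descend to the nearest child centroid, top to bottom.
            node = min(node["children"],
                       key=lambda c: np.sum((d - c["centroid"]) ** 2))
            # Every node on the path votes for its inverted-file images.
            for img in node.get("images", []):
                votes[img] += node.get("weight", 1.0)
    return votes

# Usage: a one-level toy tree with two leaves indexing images "A" and "B".
root = {"centroid": np.zeros(2), "children": [
    {"centroid": np.array([-1.0, 0.0]), "children": [], "images": ["A"], "weight": 2.0},
    {"centroid": np.array([1.0, 0.0]), "children": [], "images": ["B"], "weight": 2.0}]}
tally = query_svt(root, [np.array([-0.9, 0.1]),
                         np.array([-1.1, 0.0]),
                         np.array([0.8, 0.0])])
```

Two of the three query descriptors land in "A"'s cluster, so "A" accumulates the larger weighted tally, mirroring the vote histogram on the slide.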
Performance drops with a single tree, since nodes become less discriminative: fewer features are unique to a particular database image.
Feature extraction is robust against rotation and scale change, but NOT robust against foreshortening. This is overcome by putting multiple examples into the database that show the object from different angles.
One could put all these views into one vocabulary tree. Distributing views across parallel trees prevents competition among the features belonging to different views of the same object; views compete only once all the features are considered. Select the 25 top matches for each SVT based on bin-count similarity, then find the match with the best geometric consistency. The multiview SVT approach is attractive for multi-core servers, since the search through the different trees can run in parallel.
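This per-view query scheme can be sketched as independent tree queries whose candidates are pooled before a single geometric check (hypothetical function signatures — `score_fn` and `geometry_fn` stand in for the SVT scoring and RANSAC-style verification stages):

```python
from concurrent.futures import ThreadPoolExecutor

def multiview_match(query_features, view_svts, score_fn, geometry_fn, top_n=25):
    """Query one SVT per database view, pool each tree's top candidates,
    then pick the candidate with the best geometric consistency.

    view_svts: dict view_name -> tree; score_fn(tree, feats) -> {image: score};
    geometry_fn(feats, image) -> consistency score (higher is better).
    """
    def top_matches(tree):
        scores = score_fn(tree, query_features)
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    # Run the per-tree searches in parallel (one per core on a server);
    # views compete only after all candidates are collected, so features
    # from different views of the same object do not suppress each other.
    with ThreadPoolExecutor() as pool:
        candidate_lists = pool.map(top_matches, view_svts.values())
    candidates = set().union(*candidate_lists)
    return max(candidates, key=lambda img: geometry_fn(query_features, img))

# Usage with stub scoring/geometry functions for two views.
views = {"front": "front_tree", "left": "left_tree"}
scores = {"front_tree": {"img1": 1.0, "img2": 0.5},
          "left_tree": {"img2": 2.0, "img3": 0.1}}
best = multiview_match(None, views,
                       lambda tree, f: scores[tree],
                       lambda f, img: {"img1": 3, "img2": 9, "img3": 1}[img])
```

Deferring the cross-view competition to the geometric consistency stage is the key design choice the note describes; the thread pool is just one way to exploit the trees' independence.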