3D Interpretation from Single 2D Image III

  1. 3D Interpretation from Single 2D Image for Autonomous Driving III Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  2. Outline • Towards Generalization Across Depth for Monocular 3D Object Detection • RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving • Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation • Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Object-Aware Centroid Voting for Monocular 3D Object Detection • Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training • Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
  3. Towards Generalization Across Depth for Monocular 3D Object Detection • This work advances the state of the art by introducing MoVi-3D, a single-stage deep architecture for monocular 3D object detection. • MoVi-3D builds upon an approach which leverages geometrical information to generate, both at training and test time, virtual views where the object appearance is normalized with respect to distance. • These virtually generated views facilitate the detection task as they significantly reduce the visual appearance variability associated with objects placed at different distances from the camera. • As a consequence, the deep model is relieved from learning depth-specific representations and its complexity can be significantly reduced. • In particular, thanks to the virtual view generation process, a lightweight, single-stage architecture suffices to set new state-of-the-art results on the popular KITTI3D benchmark.
  4. Towards Generalization Across Depth for Monocular 3D Object Detection Aim at predicting a 3D bounding box for each object given a single image (left). In this image, the scale of an object heavily depends on its distance with respect to the camera. For this reason, the complexity of the detection increases as the distance grows. Instead of performing the detection on the original image, perform it on virtual images (middle). Each virtual image presents a cropped and scaled version of the original image that preserves the scale of objects as if the image was taken at a different, given depth.
  5. Towards Generalization Across Depth for Monocular 3D Object Detection Illustration of the Monocular 3D Object Detection task. Given an input image (left), the model predicts a 3D box for each object (middle). Each box has its 3D dimensions s = (W, H, L), 3D center c = (x, y, z) and rotation angle α.
  6. Towards Generalization Across Depth for Monocular 3D Object Detection • The goal is to devise a training/inference procedure that enables generalization across depth, by indirectly forcing the models to develop representations for objects that are less dependent on their actual depth in the scene. • The idea is to feed the model with transformed images that have been put into a canonical form that depends on some query depth. • After this transformation, no matter where the car is in space, the resulting image of the car is consistent in terms of the scale of the object. • Clearly, depth still influences the appearance, e.g. due to perspective deformations, but by removing the scale factor from the nuisance variables, the task that the model has to solve is simplified. • In order to apply the proposed transformation, the location of the 3D objects needs to be known in advance.
  7. Towards Generalization Across Depth for Monocular 3D Object Detection 3D viewport: compute the top-left and bottom-right corners of the viewport, namely (Xv, Yv, Zv) and (Xv + Wv, Yv − Hv, Zv) respectively, and project them to the image plane of the camera, yielding the top-left and bottom-right corners of a 2D viewport. Crop this region and rescale it to the desired resolution wv × hv to get the final output: a virtual image generated by the given 3D viewport.
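Below is a minimal sketch (not the released MoVi-3D code) of how a 3D viewport could be turned into a virtual view: the two 3D corners are projected with the camera intrinsics K, the resulting 2D rectangle is cropped, and the crop is rescaled to the fixed resolution wv × hv. The use of OpenCV for resizing and the omission of image-border clipping are assumptions.

```python
# Minimal sketch of the virtual-view generation described on this slide (not the
# authors' code). Assumptions: a pinhole intrinsic matrix K, the slide's corner
# convention, OpenCV for resizing, and no clipping at the image border.
import numpy as np
import cv2

def project(K, X, Y, Z):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    p = K @ np.array([X, Y, Z], dtype=np.float64)
    return p[:2] / p[2]

def virtual_view(image, K, Xv, Yv, Zv, Wv, Hv, wv, hv):
    # Project the two 3D viewport corners (top-left and bottom-right).
    c0 = project(K, Xv, Yv, Zv)
    c1 = project(K, Xv + Wv, Yv - Hv, Zv)
    # Form an axis-aligned 2D viewport regardless of the axis convention.
    u_min, v_min = np.minimum(c0, c1)
    u_max, v_max = np.maximum(c0, c1)
    crop = image[int(round(v_min)):int(round(v_max)),
                 int(round(u_min)):int(round(u_max))]
    # Rescale the crop to the fixed virtual-view resolution wv x hv.
    return cv2.resize(crop, (wv, hv))
```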
  8. Towards Generalization Across Depth for Monocular 3D Object Detection • The goal of the training procedure is to build a network that is able to make correct predictions within a limited depth range given an image generated from a 3D viewport. • A ground-truth-guided sampling procedure: repeatedly draw (without replacement) a ground-truth object and then sample a 3D viewport in a neighborhood thereof, so that the object is completely visible in the virtual image. • The location of the 3D viewport is perturbed with respect to the position of the target ground-truth object in order to obtain a model that is robust to depth ranges up to the predefined depth resolution Zres, which in turn plays an important role at inference time. • In addition, a small share of the virtual images is generated by 3D viewports randomly positioned in such a way that the corresponding virtual image is completely contained in the original image. • A class-uniform sampling strategy: yields an even number of virtual images for each class present in the original image.
  9. Towards Generalization Across Depth for Monocular 3D Object Detection Training virtual image creation. We randomly sample a target object (dark-red car). Given the input image, object position and camera parameters, compute a 3D viewport that we place at z = Zv. Then project the 3D viewport onto the image plane, resulting in a 2D viewport. Finally crop the corresponding region and rescale it to obtain the target virtual view (right).
  10. Towards Generalization Across Depth for Monocular 3D Object Detection • Since the network has been trained to make predictions within twice the depth step, one can be reasonably confident that no object is missed, in the sense that each object is covered by at least one virtual image. • Also, due to the convolutional nature of the architecture, the width of the virtual image is adjusted to cover the entire extent of the input image. • By doing so, the virtual images become wider as the depth increases, following a rule that depends on the width W of the input image. • Finally, NMS is performed over detections that have been generated from the same virtual image.
  11. Towards Generalization Across Depth for Monocular 3D Object Detection Inference pipeline. Given the input image, camera parameters and Zres, create a series of 3D viewports placed every Zres/2 meters along the Z axis. Then project these viewports onto the image, and crop and rescale the resulting regions to obtain distance-specific virtual views. Finally, use these views to perform the 3D detection.
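A minimal sketch of this inference-time viewport placement, reusing the hypothetical `virtual_view` helper from the earlier sketch; the depth range and the `detect`/`nms` placeholders are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the inference-time viewport placement (not the released code).
import numpy as np

def query_depths(z_min, z_max, z_res):
    """Place a 3D viewport every Zres/2 meters along the camera Z axis."""
    return np.arange(z_min, z_max, z_res / 2.0)

# Hypothetical usage:
# detections = []
# for z in query_depths(5.0, 50.0, z_res=5.0):
#     view = virtual_view(image, K, Xv, Yv, z, Wv, Hv, wv, hv)
#     detections.append(detect(view))          # 'detect' stands in for the network
# results = [nms(d) for d in detections]       # NMS within each virtual image
```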
  12. Towards Generalization Across Depth for Monocular 3D Object Detection It consists of two parallel branches: the top one provides confidences about the predicted 2D and 3D bounding boxes, while the bottom one regresses the actual bounding boxes. White rectangles denote 3×3 convolutions with 128 output channels followed by iABNsync.
  13. Towards Generalization Across Depth for Monocular 3D Object Detection
  14. Towards Generalization Across Depth for Monocular 3D Object Detection
  15. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving • It proposes an efficient and accurate single-shot monocular 3D detection framework. • This method predicts the nine perspective keypoints of a 3D bounding box in image space, and then utilizes the geometric relationship between the 3D and 2D perspectives to recover the dimension, location, and orientation in 3D space. • In this method, the properties of the object can be predicted stably even when the estimation of keypoints is very noisy, which enables fast detection with a small architecture. • Training uses the 3D properties of the object without the need for external networks or supervision data. • This method is the first real-time system for monocular image 3D detection while achieving state-of-the-art performance on the KITTI benchmark. • Code will be released at https://github.com/Banconxuan/RTM3D.
  16. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving Overview of the proposed method: first predict the nine ordinal keypoints, i.e. the eight vertices and the central point of the 3D object projected into image space; then reformulate the estimation of the 3D bounding box as the minimization of an energy function built from the geometric constraints of perspective projection.
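The position recovery could look roughly like the sketch below, which fits the translation that minimizes the reprojection error between the nine predicted keypoints and the projected corners of a box with the predicted dimensions and yaw. This is a simplified reading of the slide (the paper's full energy also weights the dimension and orientation terms); the corner parameterization, solver choice, and initialization are assumptions.

```python
# Minimal sketch of recovering the 3D position from nine predicted keypoints by
# minimizing a reprojection energy (simplified; not the authors' formulation).
import numpy as np
from scipy.optimize import least_squares

def box_keypoints_3d(dim, yaw):
    """8 corners + center of a 3D box, dim = (w, h, l), rotated by yaw about Y."""
    w, h, l = dim
    x = np.array([ l,  l,  l,  l, -l, -l, -l, -l]) / 2.0
    y = np.array([ h,  h, -h, -h,  h,  h, -h, -h]) / 2.0
    z = np.array([ w, -w,  w, -w,  w, -w,  w, -w]) / 2.0
    pts = np.concatenate([np.stack([x, y, z]), np.zeros((3, 1))], axis=1)  # 3 x 9
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return R @ pts

def reprojection_residual(t, kpts_2d, dim, yaw, K):
    pts = box_keypoints_3d(dim, yaw) + t.reshape(3, 1)   # translate into camera frame
    proj = K @ pts
    proj = proj[:2] / proj[2]                            # perspective division
    return (proj - kpts_2d.T).ravel()                    # kpts_2d: 9 x 2 predictions

def solve_position(kpts_2d, dim, yaw, K, z_init=20.0):
    t0 = np.array([0.0, 1.5, z_init])                    # rough initial translation
    return least_squares(reprojection_residual, t0, args=(kpts_2d, dim, yaw, K)).x
```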
  17. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving An overview of the proposed keypoint detection architecture: it takes only RGB images as input and outputs the main-center heatmap, vertex heatmaps, and vertex coordinates as the base module to estimate the 3D bounding box. It can also predict other optional priors to further improve 3D detection performance.
  18. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving Illustration of keypoint feature pyramid network (KFPN).
  19. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
  20. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
  21. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation • Since location recovery in 3D space is quite difficult owing to the absence of depth information, this work proposes a unified framework which decomposes the detection problem into a structured polygon prediction task and a depth recovery task. • Different from the widely studied 2D bounding boxes, the proposed structured polygon in the 2D image consists of several projected surfaces of the target object, serving as a better representation for 3D detection. • In order to inversely project the predicted 2D structured polygon to a cuboid in the 3D physical world, the subsequent depth recovery task uses the object height prior to complete the inverse projection transformation with the given camera projection matrix. • Moreover, a fine-grained 3D box refinement scheme is proposed to further rectify the 3D detection results.
  22. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation The overall framework (Decoupled-3D) decouples the monocular 3D object detection problem into sub-tasks. The overall network consists of three parts. (Top row) The 2D structured polygons are generated with a stacked hourglass network. (Middle row) The object depth stage utilizes the 3D object height as a prior to recover the missing depth of the object. (Bottom row) The 3D box refinement stage rectifies coarse 3D boxes using bird's eye view features in 3D-ROIs.
  23. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation Structured polygon estimation aims to estimate the 2D locations of the projected vertices.
  24. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation Height-guided depth estimation: combine the object height H and the corresponding pixel height h to estimate the object depth.
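For concreteness, a minimal sketch of the height-guided relation, assuming a standard pinhole camera: an object of physical height H whose projection spans h pixels lies at depth roughly Z = fy · H / h. Variable names are illustrative, not the paper's notation.

```python
# Minimal sketch of height-guided depth under a pinhole camera (assumed relation).
def height_guided_depth(H_meters, h_pixels, fy):
    """Estimate object depth from its 3D height prior and projected pixel height."""
    return fy * H_meters / h_pixels

# Example: a ~1.5 m tall car spanning 60 px under fy ~ 720 px lies at about 18 m.
```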
  25. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation 3D box refinement: rectify coarse boxes with the bird's eye view map. Note: the depth network is DORN ("Deep Ordinal Regression Network for Monocular Depth Estimation").
  26. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation
  27. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation
  28. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Recent deep learning methods show promising results in recovering depth information from single images by learning priors about the environment. • In addition to the network design, the major difference between these competing approaches lies in using a supervised or a self-supervised optimization loss function, which require different data and ground truth information. • This paper evaluates the performance of a 3D object detection pipeline which is parameterizable with different depth estimation configurations. • It implements a simple distance calculation approach based on camera intrinsics and 2D bounding box size, a self-supervised, and a supervised learning approach for depth estimation. • It evaluates the detection pipeline on simulator data and a real-world sequence from an autonomous vehicle on a race track. • Advantages and drawbacks of the different depth estimation strategies are discussed.
  29. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data 3D object detection pipeline with three alternative configurations
  30. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Distance calculation using the 2D bounding box height and the known height of the real-world race car as a geometric constraint ("known height assumption"). • Depth estimation for the whole image using the supervised DenseDepth network. The distance to each object is calculated as the median depth value in the bounding box crop. Explicit knowledge about the objects, like height information, is not required in this approach. • Depth estimation for the whole image using the self-supervised struct2depth network. The distance to each object is again calculated as the median depth value in the bounding box crop, without requiring explicit knowledge about the objects.
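A minimal sketch of how these configurations could be implemented, assuming a dense depth map from DenseDepth or struct2depth and standard pinhole intrinsics; the function names and box format are placeholders, not the paper's code.

```python
# Minimal sketch of the distance configurations above (assumed helpers, not the paper's).
import numpy as np

def distance_from_known_height(h_pixels, H_car, fy):
    """'Known height assumption': pinhole relation with the real race-car height."""
    return fy * H_car / h_pixels

def distance_from_depth_map(depth_map, box):
    """Median depth inside the 2D box crop; box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return float(np.median(depth_map[y1:y2, x1:x2]))
```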
  31. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data
  32. Object-Aware Centroid Voting for Monocular 3D Object Detection • This paper proposes an end-to-end trainable monocular 3D object detector that does not learn dense depth. • Specifically, the grid coordinates of a 2D box are first projected back to 3D space with the pinhole model as 3D centroid proposals. • Then, an object-aware voting approach is introduced, which considers both the region-wise appearance attention and the geometric projection distribution, to vote among the 3D centroid proposals for 3D object localization. • With the late fusion and the predicted 3D orientation and dimension, the 3D bounding boxes of objects can be detected from a single RGB image. • The method is straightforward yet significantly superior to other monocular-based methods.
  33. Object-Aware Centroid Voting for Monocular 3D Object Detection 3D Object Detection Pipeline. Given an image with predicted 2D region proposals (yellow box), the regions are divided into grids. Each grid point with (u, v) coordinates is projected back to 3D space by leveraging the pinhole model and the class-specific 3D height H, resulting in 3D box centroid proposals. With the voting method driven by both appearance and geometric cues, the 3D object location is predicted.
  34. Object-Aware Centroid Voting for Monocular 3D Object Detection The Architecture. 2D region proposals are first obtained from the RPN module. Then, with the 3D Center Reasoning (left), multiple 3D centroid proposals are estimated from the 2D RoI grid coordinates. This is followed by the Object-Aware Voting (right), which consists of the geometric projection distribution (GPD) and the appearance attention map (AAM), through which the 3D centroid proposals are voted for 3D localization. The 3D dimension and orientation are estimated together with the 2D object detection head.
  35. Object-Aware Centroid Voting for Monocular 3D Object Detection • Objects on the driving road are horizontally placed, without roll and pitch angles with respect to the camera. • Besides, the 3D dimension variance within each class of objects (such as Car) is quite small. • These constraints lead to the idea that the apparent heights of objects in the image are approximately invariant when the objects are at the same depth. • A recent survey also points out that the position and apparent size of an object in an image can be used to infer its depth on the KITTI dataset. • Therefore, the 3D object centroid can be roughly inferred with the simple pinhole camera model.
  36. Object-Aware Centroid Voting for Monocular 3D Object Detection • Specifically, divide each 2D region proposal into s × s grid cells and project the grid coordinates back into 3D space. • Since each grid point indicates a probable projection of the corresponding 3D object centroid, multiple 3D centroid proposals P3d are obtained, where the i-th centroid proposal P3d = (Xi, Yi, Z) is computed by back-projecting the grid point with the pinhole model. (Figure: examples and statistics on the KITTI training set.)
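A minimal sketch of this centroid-proposal step, assuming KITTI-style intrinsics and the class-specific height prior H mentioned on the earlier pipeline slide; this is one reading of the slides, not the paper's exact formula.

```python
# Minimal sketch of 3D centroid proposals via pinhole back-projection (assumed form).
import numpy as np

def centroid_proposals(box, H_class, K, s=7):
    """box = (x1, y1, x2, y2); returns s x s x 3 proposals (Xi, Yi, Z)."""
    x1, y1, x2, y2 = box
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    Z = fy * H_class / (y2 - y1)                 # depth from the 3D height prior
    uu, vv = np.meshgrid(np.linspace(x1, x2, s), np.linspace(y1, y2, s))
    X = (uu - cx) * Z / fx                       # pinhole back-projection
    Y = (vv - cy) * Z / fy
    return np.stack([X, Y, np.full_like(X, Z)], axis=-1)
```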
  37. Object-Aware Centroid Voting for Monocular 3D Object Detection • Specifically, use a single 1×1 convolution followed by sigmoid activation to generate the appearance attention map from the feature maps of the RoI pooling layer. • The activated convolutional feature map indicates the foreground semantic objects, owing to the classification supervision in 2D object detection, which leads to the object-aware voting. • The geometric voting component comes from the distribution of the offset between the projected 3D centroid and the 2D box center. • It has been demonstrated that the 2D box center can be modeled as a Gaussian distribution with the ground truth as its expectation. • To dynamically learn this distribution, the 2D grid coordinates and the RoI image features are concatenated as input to a fully-connected layer that predicts the offset, with the Kullback–Leibler (KL) divergence as the loss function to supervise the learning.
  38. Object-Aware Centroid Voting for Monocular 3D Object Detection The object-aware voting can be formulated as the element-wise multiplication of the two normalized probability maps Mapp and Mgeo. In the training stage, the 3D localization pipeline is trained with a smooth L1 loss.
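A minimal sketch of the voting step, assuming PyTorch tensors: the appearance map comes from a 1×1 convolution plus sigmoid on the RoI features, the geometric map is given, and the voted location is the proposal average weighted by their element-wise product. Shapes and names are assumptions.

```python
# Minimal sketch of object-aware voting (assumed shapes; not the paper's code).
import torch

def object_aware_voting(roi_feat, proposals, attn_conv, m_geo):
    """
    roi_feat:  (N, C, s, s) RoI-pooled features
    proposals: (N, s*s, 3) 3D centroid proposals
    attn_conv: an nn.Conv2d(C, 1, kernel_size=1) layer
    m_geo:     (N, s, s) geometric projection distribution
    """
    m_app = torch.sigmoid(attn_conv(roi_feat)).squeeze(1)   # (N, s, s) appearance map
    weights = (m_app * m_geo).flatten(1)                     # element-wise multiplication
    weights = weights / weights.sum(dim=1, keepdim=True)     # normalize per RoI
    return (weights.unsqueeze(-1) * proposals).sum(dim=1)    # (N, 3) voted 3D location

# Training (schematic): torch.nn.functional.smooth_l1_loss(voted_xyz, gt_xyz).
```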
  39. Object-Aware Centroid Voting for Monocular 3D Object Detection The 3D dimension prediction loss compares predictions and the ground truth in logarithm space through the smooth L1 loss. For 3D orientation estimation, Multi-Bin is used to disentangle it into residual angle prediction and angle bin classification, and the orientation loss is formed accordingly. The loss functions for the joint multi-task training of 2D and 3D object detection are combined into the overall objective.
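A minimal sketch of these loss terms, assuming PyTorch: log-space smooth L1 for the dimensions and a Multi-Bin split (bin classification plus in-bin residual regression) for the orientation. The bin layout and loss weighting are assumptions, not the paper's exact settings.

```python
# Minimal sketch of the dimension and Multi-Bin orientation losses (assumed form).
import torch
import torch.nn.functional as F

def dimension_loss(pred_dim, gt_dim):
    # Compare prediction and ground truth in logarithm space.
    return F.smooth_l1_loss(torch.log(pred_dim), torch.log(gt_dim))

def multibin_orientation_loss(bin_logits, residual_pred, gt_bin, gt_residual):
    # Classify the angle bin, then regress the residual angle inside the true bin.
    cls_loss = F.cross_entropy(bin_logits, gt_bin)
    res = residual_pred.gather(1, gt_bin.unsqueeze(1)).squeeze(1)
    reg_loss = F.smooth_l1_loss(res, gt_residual)
    return cls_loss + reg_loss

# Joint objective (schematic): L = L_2D + L_dim + L_orient + L_loc.
```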
  40. Object-Aware Centroid Voting for Monocular 3D Object Detection
  41. Object-Aware Centroid Voting for Monocular 3D Object Detection
  42. Object-Aware Centroid Voting for Monocular 3D Object Detection Qualitative results. Red: detected 3D boxes. Yellow: ground truth. Right: birds’ eye view (BEV) results.
  43. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training • KM3D-Net, a single-shot, keypoint-based framework for monocular 3D object detection. • Design a fully convolutional model to predict object keypoints, dimension, and orientation, and then combine these with perspective geometry constraints to compute the position. • Further, reformulate the geometric constraints as a differentiable version and embed them into the network, maintaining the consistency of model outputs in an E2E fashion. • Then propose a semi-supervised training strategy for cases where labeled training data is scarce. • In this strategy, enforce consensus between the predictions of two shared-weight KM3D-Nets for the same unlabeled image under different input augmentation conditions and network regularization. • In particular, unify the coordinate-dependent augmentations as affine transformations for differentiable position recovery, and propose a keypoint-dropout module for network regularization. • This model only requires RGB images, without synthetic data, instance segmentation, CAD models, or a depth generator.
  44. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training Overview of KM3D-Net, which outputs keypoints, object dimensions, local orientation, and 3D confidence, followed by differentiable geometric consistency constraints to predict the position.
  45. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training Overview of the unsupervised training. It leverages affine transformations to unify input augmentation and uses keypoint dropout for regularization. These two strategies make KM3D-Net output two stochastic predictions for the same input; penalizing their difference is the training goal.
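A minimal sketch of the consensus objective, assuming PyTorch: the same unlabeled image is passed twice through the shared-weight network under different random affine augmentations and keypoint dropout, and the disagreement between the two predictions is penalized. The augment/invert helpers and the MSE penalty are assumptions.

```python
# Minimal sketch of the consistency objective on unlabeled images (assumed form).
import torch
import torch.nn.functional as F

def consistency_loss(model, image, augment, invert):
    """augment applies a random affine transform; invert maps outputs back to the
    original image coordinates so the two predictions are directly comparable."""
    out_a = invert(model(augment(image)))   # first stochastic pass
    out_b = invert(model(augment(image)))   # second pass, different augmentation/dropout
    return F.mse_loss(out_a, out_b)         # penalize the disagreement
```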
  46. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
  47. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
  48. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels • The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision, which have to be generated by hand-labeling. • This work presents a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels. • By representing the objects as triangular meshes and employing differentiable shape rendering, loss functions are defined based on depth maps, segmentation masks, and ego- and object-motion, which are generated by pre-trained, off-the-shelf networks. • The approach achieves performance comparable to state-of-the-art methods that require 3D bounding box labels for training, and superior performance to conventional baseline methods.
  49. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels A monocular 3D vehicle detector that requires no 3D bounding box labels for training. The right image shows that the predicted vehicles (colored shapes) fit the GT bounding boxes (red). Despite the noisy input depth (lower left), the method is able to accurately predict the 3D poses of vehicles thanks to the proposed fully differentiable training scheme. The projections of the predicted bounding boxes are shown (colored boxes, upper left).
  50. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels The proposed model contains a single-image network and a multi-image network extension. The single-image network back-projects the input depth map into a point cloud. A Frustum PointNet encoder predicts the pose and shape, which are then decoded into a predicted 3D mesh and segmentation mask through differentiable rendering. The multi-image network takes three images as input, and the single-image network is applied individually to each image. This network predicts a depth map for the middle frame based on the vehicle's pose and shape. A pre-trained network predicts ego-motion and object-motion from the images. The reconstruction loss is computed by differentiably warping the images into the middle frame.
  51. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels In order to train without 3D bounding box labels, three losses are used: the segmentation loss Lseg, the chamfer distance Lcd, and the photometric reconstruction loss Lrec. The first two are defined on single images, while the photometric reconstruction loss relies on temporal photo-consistency across three consecutive frames.
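As an illustration of the chamfer term Lcd, a generic sketch assuming PyTorch point clouds (the back-projected, mask-restricted input depth versus points from the rendered vehicle mesh); this is a standard chamfer distance, not the authors' implementation.

```python
# Generic chamfer distance sketch (assumed inputs; not the authors' code).
import torch

def chamfer_distance(p, q):
    """p: (N, 3), q: (M, 3); symmetric average of nearest-neighbour distances."""
    d = torch.cdist(p, q)                       # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Overall label-free objective (schematic): L = Lseg + Lcd + Lrec.
```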
  52. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels Qualitative comparison of MonoGRNet (first row), Mono3D (second row), and this method (third row) with depth maps from BTS. GT bounding boxes for cars (red), predicted bounding boxes (green), and the back-projected point cloud are shown. In comparison to Mono3D, the prediction accuracy is increased, particularly for vehicles farther away. The performance of MonoGRNet and this model is comparable.
  53. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels