Single Image Depth Estimation
Time: 2023.09.13-2023.10.05
Paper Reading
Datasets
- HR-WSI: Structure-Guided Ranking Loss for Single Image Depth Prediction
- Holopix50k: A Large-Scale In-the-wild Stereo Image Dataset
- DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data
- ReDWeb V1: Monocular Relative Depth Perception with Web Stereo Data Supervision
- The Replica Dataset: A Digital Replica of Indoor Spaces
- Taskonomy: Disentangling Task Transfer Learning
Methods
authoritative recommendations
- ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth (arXiv 2023.02)
- Vision Transformers for Dense Prediction (ICCV 2021)
- Learning to Recover 3D Scene Shape from a Single Image (CVPR 2021)
lightweight SIDE research
- Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation (arXiv 2023.09)
- fully convolutional depth estimation network using contextual feature fusion
- uses both high-resolution and low-resolution features to preserve information on small targets and fast-moving objects, instead of long-range fusion
- employs lightweight convolution-based channel attention in the decoder stage
- RT-MonoDepth: Real-time Monocular Depth Estimation on Embedded Systems (arXiv 2023.08)
- Fast convolution-based inference: RT-MonoDepth and RT-MonoDepthS run at 18.4 and 30.5 FPS on the NVIDIA Jetson Nano and at 253.0 and 364.1 FPS on the NVIDIA Jetson AGX Orin, respectively, on a single RGB image of resolution 640×192, and achieve near state-of-the-art accuracy on the KITTI dataset.
- Encoder (downsamples inputs): 4-layer pyramid convolution encoder, with the normalization layers removed and standard convolutions used instead of depth-wise separable convolutions.
- Decoder (upsamples and fuses): upsampling -> 3 × 3 depth-wise separable convolution followed by nearest-neighbor interpolation with a scale factor of 2; fusion -> a mix of element-wise addition and concatenation; prediction -> convolutions + activation functions (LeakyReLU, sigmoid).
- Lightweight Monocular Depth Estimation via Token-Sharing Transformer (2023 IEEE International Conference on Robotics and Automation (ICRA), CCF-B)
- Token-Sharing Transformer (TST): On the NYU Depth v2 dataset, TST delivers depth maps at up to 63.4 FPS on the NVIDIA Jetson Nano and 142.6 FPS on the NVIDIA Jetson TX2.
- Design concept: hierarchy-focused architecture (gradually reduces the resolution of tokens) + bottleneck-focused architecture (reduces the resolution through a CNN and applies self-attention only to low-resolution tokens)
- Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation (CVPR 2023)
- efficient combination of CNNs and Transformers: Consecutive Dilated Convolutions (CDC) module -> shallow CNNs with dilated convolution to enhance local features; Local-Global Features Interaction (LGFI) module -> cross-covariance attention to compute the attention along the feature channels.
- Boosting LightWeight Depth Estimation via Knowledge Distillation (International Conference on Knowledge Science, Engineering and Management, KSEM 2023, CCF-C)
- lightweight network (MobileNet-v2 Encoder, Channel-wise attention) + Promoting KD with Auxiliary Data
- Lightweight Monocular Depth Estimation with an Edge Guided Network (2022 17th International Conference on Control, Automation, Robotics and Vision, ICARCV, CORE Computer Science Conference Rankings: A)
- Premise: edge information is an important cue for convolutional neural networks (CNNs) when estimating depth.
- Encoder-Decoder Architecture:
- Multi-scale Feature Extractor -> MobileNetV2 as the backbone
- Edge Guidance Branch -> guiding depth estimation
- Transformer-Based Feature Aggregation Module
- Lightweight Monocular Depth Estimation through Guided Decoding (2022 International Conference on Robotics and Automation (ICRA), CCF-B)
- lightweight encoder-decoder architecture for embedded platforms + Guided Upsampling Block
- inference:
- NYU Depth V2: 35.1 fps on the NVIDIA Jetson Nano and up to 144.5 fps on the NVIDIA Xavier NX
- KITTI: 23.7 fps on the Jetson Nano and 102.9 fps on the Xavier NX
- MobileXNet: An Efficient Convolutional Neural Network for Monocular Depth Estimation (IEEE Transactions on Intelligent Transportation Systems, 2022, CCF-B)
- Encoder-Decoder style CNN architecture: Conv, DWConv, DilatedConv, Bilinear Upsampling
- To penalize the errors around edges -> hybrid loss: the regular L1 loss + the image gradient-based L1 loss
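The hybrid loss noted for MobileXNet above (regular L1 loss plus an image-gradient-based L1 loss) can be sketched in NumPy as follows. This is only an illustration of the idea, not the paper's implementation: the forward-difference gradients, the `grad_weight` parameter, and the function names are assumptions made for this example.

```python
import numpy as np

def image_gradients(depth):
    """Forward differences along x (columns) and y (rows)."""
    gx = depth[:, 1:] - depth[:, :-1]
    gy = depth[1:, :] - depth[:-1, :]
    return gx, gy

def hybrid_l1_loss(pred, target, grad_weight=1.0):
    """L1 on depth values plus L1 on depth gradients.

    grad_weight is a hypothetical balancing factor, not a value from the paper.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    l1 = np.mean(np.abs(pred - target))
    pgx, pgy = image_gradients(pred)
    tgx, tgy = image_gradients(target)
    grad_l1 = np.mean(np.abs(pgx - tgx)) + np.mean(np.abs(pgy - tgy))
    return l1 + grad_weight * grad_l1

# Ground truth with a sharp depth edge between columns 1 and 2.
target = np.array([[1.0, 1.0, 3.0, 3.0],
                   [1.0, 1.0, 3.0, 3.0]])
# Two predictions with the same plain L1 error (0.25):
pred_offset = target + 0.25                     # edge kept, constant offset
pred_blur = np.array([[1.0, 1.5, 2.5, 3.0],
                      [1.0, 1.5, 2.5, 3.0]])    # edge smeared out

loss_offset = hybrid_l1_loss(pred_offset, target)
loss_blur = hybrid_l1_loss(pred_blur, target)
print(loss_offset, loss_blur)
```

Both predictions have the same plain L1 error, but the blurred one receives a larger hybrid loss because its gradients disagree with the ground truth around the edge.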
others
- DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation (arXiv 2023.08)
- Paradigm innovation: regression or classification -> denoising diffusion
- Edge-guided occlusion fading reduction for a light-weighted self-supervised monocular depth estimation (arXiv 2019.11)
- Atrous Spatial Pyramid Pooling (ASPP) -> dilated (atrous) convolutions reduce the computational cost
- Edge-guided post-processing -> reduces occlusion fading
Metrics
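Relative Error (REL):
- Relative error measures the relative difference between the depth estimated by the model and the ground-truth depth, averaged over the $N$ evaluated pixels.
- Formula: $REL = \frac{1}{N} \sum \frac{|D_{\text{est}} - D_{\text{gt}}|}{D_{\text{gt}}}$
Root Mean Square Error (RMSE):
- Root mean square error measures the absolute difference between the estimated and ground-truth values; it is the square root of the mean of the squared differences.
- Formula: $RMSE = \sqrt{\frac{1}{N} \sum (D_{\text{est}} - D_{\text{gt}})^2}$
Mean Absolute Error (MAE):
- Mean absolute error measures the average absolute difference between the estimated depth and the ground-truth depth.
- Formula: $MAE = \frac{1}{N} \sum |D_{\text{est}} - D_{\text{gt}}|$
Log Root Mean Square Error (Log-RMSE):
- Log root mean square error measures the root mean square difference between the estimated depth and the ground-truth depth on a logarithmic scale.
- Formula: $\text{Log-RMSE} = \sqrt{\frac{1}{N} \sum (\log(D_{\text{est}} + \epsilon) - \log(D_{\text{gt}} + \epsilon))^2}$
- Here $\epsilon$ is a small constant, typically used to avoid taking the logarithm of zero.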
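As a minimal sketch, the four metrics above can be computed with NumPy as shown below. Several choices here are assumptions made for illustration: the function name `depth_metrics`, the default `eps`, averaging REL over pixels, and treating non-positive ground-truth values as missing depth (a common convention, but benchmark-specific masks and depth caps differ).

```python
import numpy as np

def depth_metrics(d_est, d_gt, eps=1e-6):
    """Return REL, RMSE, MAE and Log-RMSE over pixels with valid ground truth."""
    d_est = np.asarray(d_est, dtype=np.float64)
    d_gt = np.asarray(d_gt, dtype=np.float64)
    valid = d_gt > 0  # assumption: non-positive ground truth marks missing depth
    est, gt = d_est[valid], d_gt[valid]
    diff = est - gt
    log_diff = np.log(est + eps) - np.log(gt + eps)
    return {
        "REL": float(np.mean(np.abs(diff) / gt)),
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),
        "MAE": float(np.mean(np.abs(diff))),
        "Log-RMSE": float(np.sqrt(np.mean(log_diff ** 2))),
    }

# Toy 2x2 depth maps; the 0.0 in the ground truth is a missing measurement.
gt = np.array([[1.0, 2.0], [4.0, 0.0]])
est = np.array([[1.5, 2.0], [3.0, 5.0]])
metrics = depth_metrics(est, gt)
print(metrics)
```

On this toy input only three pixels are evaluated, so REL is the mean of 0.5, 0 and 0.25, and MAE is the mean of 0.5, 0 and 1.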