Monocular depth estimation based on deep learning: An overview

ChaoQiang Zhao; QiYu Sun; ChongZhen Zhang; Yang Tang; Feng Qian

doi:10.1007/s11431-020-1582-8

journal article Jun 10, 2020

Science China Technological Sciences Vol. 63 No. 9 pp. 1612-1627 · Springer Science and Business Media LLC

References

119

[1]

Hu G, Huang S, Zhao L, et al. A robust RGB-D SLAM algorithm. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vilamoura: IEEE, 2012. 1714–1719 10.1109/iros.2012.6386103

[2]

Zhu Z S, Su A, Liu H B, et al. Vision navigation for aircrafts based on 3D reconstruction from real-time image sequences. Sci China Tech Sci, 2015, 58: 1196–1208 10.1007/s11431-015-5828-x

[3]

Chai X, Gao F, Qi C K, et al. Obstacle avoidance for a hexapod robot in unknown environment. Sci China Tech Sci, 2017, 60: 818–831 10.1007/s11431-016-9017-6

[4]

Park S J, Hong K S, Lee S. RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, 2017. 4980–4989

[5]

Ullman S. The interpretation of structure from motion. Proceedings of the Royal Society of London. Series..., 1979 10.1098/rspb.1979.0006

[6]

Mancini F, Dubbini M, Gattelli M, et al. Using unmanned aerial vehicles (UAV) for high-resolution reconstruction of topography: The structure from motion approach on coastal environments. Remote Sens, 2013, 5: 6880–6898 10.3390/rs5126880

[7]

Mur-Artal R, Montiel J M M, Tardos J D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015 10.1109/tro.2015.2463671

[8]

Szeliski R, Kang S B. Shape ambiguities in structure from motion. IEEE Trans Pattern Anal Machine Intell, 1997, 19: 506–512 10.1109/34.589211

[9]

Zou L, Li Y. A method of stereo vision matching based on OpenCV. In: 2010 International Conference on Audio, Language and Image Processing. Shanghai: IEEE, 2010. 185–190 10.1109/icalip.2010.5684978

[10]

Cao Z L, Yan Z H, Wang H. Summary of binocular stereo vision matching technology (in Chinese). J Chongqing Univ Tech (Nat Sci), 2015, 29: 70–75

[11]

Benosman R, Manière T, Devars J. Multidirectional stereovision sensor, calibration and scenes reconstruction. In: Proceedings of 13th International Conference on Pattern Recognition. Vienna: IEEE, 1996. 161–165 10.1109/icpr.1996.546011

[12]

Ramírez-Hernández L R, Rodríguez-Quiñonez J C, Castro-Toscano M J, et al. Improve three-dimensional point localization accuracy in stereo vision systems using a novel camera calibration method. Int J Adv Robot Syst, 2020, 17: 172988141989671 10.1177/1729881419896717

[13]

Tateno K, Tombari F, Laina I, et al. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017. 6243–6252 10.1109/cvpr.2017.695

[14]

Yoneda K, Tehrani H, Ogawa T, et al. Lidar scan feature for localization with highly precise 3D map. In: 2014 IEEE Intelligent Vehicles Symposium Proceedings. Dearborn: IEEE, 2014. 1345–1350 10.1109/ivs.2014.6856596

[15]

Zhang F, Zhu X, Ye M. Fast human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019. 3517–3526 10.1109/cvpr.2019.00363

[16]

Pang J, Chen K, Shi J, et al. Libra R-CNN: Towards balanced learning for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019. 821–830 10.1109/cvpr.2019.00091

[17]

Lyu H, Fu H, Hu X, et al. ESNet: Edge-based segmentation network for real-time semantic segmentation in traffic scenes. In: 2019 IEEE International Conference on Image Processing (ICIP). Taipei: IEEE, 2019. 1855–1859 10.1109/icip.2019.8803132

[18]

Zhao Z Q, Zheng P, Xu S T, et al. Object detection with deep learning: A review. IEEE Trans Neural Netw Learning Syst, 2019, 30: 3212–3232 10.1109/tnnls.2018.2876865

[19]

Ghosh S, Das N, Das I, et al. Understanding deep learning techniques for image segmentation. ACM Comput Surv, 2019, 52: 1–35 10.1145/3329784

[20]

Rawat W, Wang Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 2017 10.1162/neco_a_00990

[21]

Tang Y, Zhao C, Wang J, et al. An overview of perception and decision-making in autonomous systems in the era of learning. 2020, arXiv: 2001.02319

[22]

Facil J M, Ummenhofer B, Zhou H, et al. CAM-Convs: Camera-aware multi-scale convolutions for single-view depth. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019. 11826–11835 10.1109/cvpr.2019.01210

[23]

Garg R, Vijay Kumar B G, Carneiro G, et al. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: Leibe B, Matas J, Sebe N, et al., eds. Computer Vision-ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol 9912. Cham: Springer, 2016. 740–756 10.1007/978-3-319-46484-8_45

[24]

Wang R, Pizer S M, Frahm J M. Recurrent neural network for (un-)supervised learning of monocular video visual odometry and depth. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019. 5555–5564 10.1109/cvpr.2019.00570

[25]

Chakravarty P, Narayanan P, Roussel T. GEN-SLAM: Generative modeling for monocular simultaneous localization and mapping. In: 2019 International Conference on Robotics and Automation (ICRA). Montreal: IEEE, 2019. 147–153 10.1109/icra.2019.8793530

[26]

Aleotti F, Tosi F, Poggi M, et al. Generative adversarial networks for unsupervised monocular depth prediction. In: Leal-Taixe L, Roth S, eds. Computer Vision-ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science, vol 11129. Cham: Springer, 2018. 337–354 10.1007/978-3-030-11009-3_20

[27]

Godard C, Mac Aodha O, Brostow G J. Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017. 270–279 10.1109/cvpr.2017.699

[28]

Zhan H, Garg R, Saroj Weerasekera C, et al. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018. 340–349 10.1109/cvpr.2018.00043

[29]

Yin Z, Shi J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018. 1983–1992 10.1109/cvpr.2018.00212

[30]

Wang C, Miguel Buenaposada J, Zhu R, et al. Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018. 2022–2030 10.1109/cvpr.2018.00216

[31]

Fei X, Wong A, Soatto S. Geo-supervised visual depth prediction. IEEE Robot Autom Lett, 2019, 4: 1661–1668 10.1109/lra.2019.2896963

[32]

Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Patter..., 2012 10.1109/cvpr.2012.6248074

[33]

Mayer N, Ilg E, Hausser P, et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, 2016. 4040–4048 10.1109/cvpr.2016.438

[34]

Zhao C, Tang Y, Sun Q. Deep direct visual odometry. 2019, arXiv:1912.05101

[35]

Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems. 2014. 2366–2374

[36]

Chen X, Ma H, Wan J, et al. Multi-view 3D object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017. 1907–1915 10.1109/cvpr.2017.691

[37]

Wang P, Chen P, Yuan Y, et al. Understanding convolution for semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Com..., 2018 10.1109/wacv.2018.00163

[38]

Chang M F, Lambert J, Sangkloy P, et al. Argoverse: 3D tracking and forecasting with rich maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019. 8748–8757 10.1109/cvpr.2019.00895

[39]

Xue F, Wang X, Li S, et al. Beyond tracking: Selecting memory and refining poses for deep visual odometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019. 8575–8583 10.1109/cvpr.2019.00877

[40]

Clark R, Wang S, Wen H, et al. VINet: Visual-inertial odometry as a sequence-to-sequence learning problem. In: Thirty-First AAAI Conference on Artificial Intelligence, 2017 10.1609/aaai.v31i1.11215

[41]

Silberman N, Hoiem D, Kohli P, et al. Indoor segmentation and support inference from RGBD images. Lecture Notes in Computer Science, 2012 10.1007/978-3-642-33715-4_54

[42]

Cordts M, Omran M, Ramos S, et al. The Cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE Conference on Computer Vision and Patter... 10.1109/cvpr.2016.350

[43]

Zhou T, Brown M, Snavely N, et al. Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017. 1851–1858 10.1109/cvpr.2017.700

[44]

Bian J, Li Z, Wang N, et al. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In: Advances in Neural Information Processing Systems, 2019. 35–45

[45]

Saxena A, Sun M, Ng A Y. Make3D: Learning 3D scene structure from a single still image. IEEE Trans Pattern Anal Mach Intell, 2009, 31: 824–840 10.1109/tpami.2008.132

[46]

Hoiem D, Efros A A, Hebert M. Automatic photo pop-up. ACM Trans Graph, 2005, 24: 577–584 10.1145/1073204.1073232

[47]

van Dijk T, de Croon G. How do neural networks see depth in single images? In: Proceedings of the IEEE International Conference on Computer Vision. Seoul, 2019. 2183–2191 10.1109/iccv.2019.00227

[48]

Kuznietsov Y, Stuckler J, Leibe B. Semi-supervised deep learning for monocular depth map prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017. 6647–6655 10.1109/cvpr.2017.238

[49]

Kendall A, Martirosyan H, Dasgupta S, et al. End-to-end learning of geometry and context for deep stereo regression. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, 2017. 66–75 10.1109/iccv.2017.17

[50]

Mahjourian R, Wicke M, Angelova A. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018. 5667–5675 10.1109/cvpr.2018.00594

Showing 50 of 119 references

Cited By

260

A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation

Xin Zhang, Liangxiu Han · 2025

Neurocomputing

Vision-based safe autonomous UAV docking with panoramic sensors

Phuoc Thuan Nguyen, Tomi Westerlund · 2023

Frontiers in Robotics and AI

Details

Published: Jun 10, 2020
Vol/Issue: 63(9)
Pages: 1612-1627

Cite This Article

ChaoQiang Zhao, QiYu Sun, ChongZhen Zhang, et al. (2020). Monocular depth estimation based on deep learning: An overview. Science China Technological Sciences, 63(9), 1612-1627. https://doi.org/10.1007/s11431-020-1582-8
