Zuria Bauer

Postdoctoral Researcher at the Computer Vision and Geometry Group, ETH Zurich

THESIS

Monocular Depth Estimation

Datasets, Methods, and Applications

September 16, 2021

Thesis defended: 16th of September 2021.

Grade: Summa Cum Laude

Supervisors: Prof. Dr. Miguel Cazorla and Dr. Sergio Orts-Escolano

Abstract

The World Health Organization (WHO) stated in February 2021 at the Seventy-Third World Health Assembly that, globally, at least 2.2 billion people have a near or distance vision impairment. It also noted the severe impact that vision impairment has on the quality of life of affected individuals and how it affects their social well-being and economic independence, in some cases also becoming an additional burden for the people in their immediate surroundings. To minimize the cost and intrusiveness of assistive applications and maximize the autonomy of the individual, the natural solution is to use systems that rely on computer vision algorithms.

Systems that improve the quality of life of the visually impaired need to solve several problems, such as localization, path recognition, obstacle detection, environment description, and navigation. Each of these topics involves an additional set of problems that must be solved to address it. For example, object detection requires depth prediction to know the distance to the object, path recognition to know whether the user is on the road or on a pedestrian path, an alarm system to notify the user of danger, and trajectory prediction for approaching obstacles, and those are only the main points. Taking a closer look, all of these topics have one key component in common: depth estimation (or prediction). Each of them needs a correct estimate of the depth of the scene.

In this thesis, our main focus is on depth estimation in indoor and outdoor environments. Traditional depth estimation methods, such as structure from motion and stereo matching, are built on feature correspondences across multiple viewpoints. Despite their effectiveness, these approaches need a specific type of data to perform properly: multiple views of the same scene. Since our main goal is to provide systems with minimal cost and intrusiveness that are also easy to handle, we decided to infer depth from single images: monocular depth estimation.

Estimating the depth of a scene from a single image is a simple task for humans, but it is notoriously difficult for computational models to achieve high accuracy with low resource requirements. Monocular depth estimation is this very task of estimating depth from a single RGB image. Since only one image is needed, this approach is used in applications such as autonomous driving, scene understanding, and 3D modeling, where other types of information are not available.

This thesis presents contributions towards solving this task using deep learning as the main tool. The four main contributions of this thesis are: first, we carry out an extensive review of the state of the art in monocular depth estimation; second, we introduce a novel large-scale, high-resolution outdoor stereo dataset that provides enough image information to solve various common computer vision problems; third, we present a set of architectures able to predict monocular depth effectively; and, finally, we propose two real-life applications of those architectures, addressing the topic of enhancing perception for the visually impaired using low-cost wearable sensors.


BibTeX
    @article{bauer2021monocular,
      title={Monocular Depth Estimation: Datasets, Methods, and Applications},
      author={Bauer, Zuria},
      year={2021},
      publisher={Universidad de Alicante}
    }
PDF | Video

News

The Hoi! dataset will be presented as a Highlight paper during CVPR 2026 in Denver!

Excited to share that our papers Hoi! A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation and FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning have been accepted to #CVPR2026 🎉

I will be serving as a Poster Chair for ECCV 2026 in Malmö—looking forward to an exciting conference!

Our paper Video Perception Models for 3D Scene Synthesis has been accepted to #NeurIPS2025 🎉

Heading to #BMVC? Check out our paper MonoTracker: Monocular RGB-Only 6D Tracking of Unknown Objects!

Attending #ICCV? Come by our poster on 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection!

Going to #IROS2025 in China? Stop by our poster Lost and Found: Updating Dynamic 3D Scene Graphs from Egocentric Observations!

SpotLight: Robotic Scene Understanding through Interaction and Affordance Detection accepted to #Humanoids2025 in Korea—see you there!

Our work CroCoDL: Cross-device Collaborative Dataset for Localization will be presented at #CVPR2025 in Nashville—come say hi at our poster!