Real-time Modelling from Monocular Vision

MEng Computing Final Year Project

Continuing my interest in Computer Vision, I met up with Dr. Andrew Davison to discuss his recent work in the field of Robotic Vision. He had recently moved to Imperial College from Oxford, where he was a member of the Active Vision group. After establishing that we shared a strong common interest, Andy was generous enough to offer to supervise me through my final year project, later titled "Real-time Modelling from Monocular Vision".

Overview

The goal, essentially, was to generate an accurate scene model in real time from a monocular image sequence. I built on top of Andrew Davison's MonoSLAM framework, which provides a set of sparse probabilistic features on which to base the model, together with a real-time pose estimate of the camera. My main contribution is the use of this pose estimate to influence the constructed model so that it better reflects the true geometry of the scene.

Extract from my dissertation

Abstract
Current methodologies for generating models of scenes from sparse features are inadequate. The features themselves do not in general unambiguously specify a real-world surface, and attempts to use them without heavily constraining the problem result in poor models. We outline a novel method for obtaining more accurate and less constrained models from monocular vision by extending Davison's MonoSLAM framework [6] for real-time localisation and mapping. We construct models incrementally using context collected from MonoSLAM. Moreover, we demonstrate that the entire system can operate in real time to generate models on the fly.

Final model from video covering a simple multi-planar scene

Since our method approximates the real surface using a triangular mesh, single-planar and multi-planar scenes are the simplest to handle. One test video sequence focuses on modelling the corner of a room:

Modelling a multi-planar scene. Static image from sequence (top-left), overhead textured wireframe (bottom-left), novel views (right).
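As a rough illustration of the mesh construction, the sketch below shows one plausible way to triangulate sparse features: project the 3D points into the image plane, run a 2D Delaunay triangulation, and reuse that connectivity on the 3D points. The function name and the use of SciPy are assumptions for illustration, not the system's actual implementation.

```python
# A minimal sketch (not the dissertation's actual code) of meshing
# sparse 3D features: project them into the image plane, run a 2D
# Delaunay triangulation, and lift the triangles back to 3D.
import numpy as np
from scipy.spatial import Delaunay

def mesh_from_features(points_world, R, t, K):
    """Triangulate sparse 3D features as seen from one camera.

    points_world: (N, 3) feature positions, e.g. from MonoSLAM.
    R, t: camera rotation (3x3) and translation (3,), world -> camera.
    K: (3x3) pinhole intrinsics.
    Returns (vertices, triangles), triangles indexing into vertices.
    """
    # Transform to camera coordinates and project to pixel coordinates.
    points_cam = points_world @ R.T + t
    proj = points_cam @ K.T
    pixels = proj[:, :2] / proj[:, 2:3]

    # Triangulate in 2D; reusing the connectivity on the 3D points
    # yields a piecewise-planar approximation of the surface.
    tri = Delaunay(pixels)
    return points_world, tri.simplices
```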

Large Office video sequence

More complicated scenes, in which objects may occlude one another and surfaces can be very rugged, pose a greater challenge.

The following videos show the probabilistic sparse points initialised and tracked through the sequence by Davison's MonoSLAM, and the textured model generated using these points together with the pose estimate. Texturing is limited to areas in which SLAM features are found.

Office scene construction from monocular video. Video sequence (left), model (right).

Overview of office scene construction from monocular video. Orange cameras represent stored images used for rendering texture.
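To illustrate how a stored keyframe can supply texture, the sketch below projects mesh vertices into a keyframe camera to obtain per-vertex texture coordinates. The pose convention, names, and parameters are assumptions for illustration only.

```python
# A minimal sketch of the texturing idea: project mesh vertices into a
# stored keyframe to obtain per-vertex texture coordinates. The pose
# convention and names here are illustrative assumptions.
import numpy as np

def texture_coords(vertices_world, R, t, K, width, height):
    """Map 3D mesh vertices to normalised (u, v) keyframe coordinates.

    vertices_world: (N, 3) mesh vertex positions.
    R, t: keyframe pose (world -> camera); K: (3x3) intrinsics.
    Vertices behind the camera are assigned NaN coordinates.
    """
    points_cam = vertices_world @ R.T + t
    uv = np.full((len(vertices_world), 2), np.nan)

    # Only vertices in front of the camera project meaningfully.
    in_front = points_cam[:, 2] > 0
    proj = points_cam[in_front] @ K.T
    pixels = proj[:, :2] / proj[:, 2:3]

    # Normalise pixel coordinates by the image size.
    uv[in_front] = pixels / np.array([width, height])
    return uv
```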

Dynamic Textures

Given that our triangular mesh is only an approximation to the true surface, and that in general we require many stills from the video sequence to obtain textures for the individual patches, we can be left with texture inconsistencies between adjacent triangles.

To reduce this effect, we can increase view-consistency and realism by deferring texture selection until view time. Given the current virtual camera pose, we try to use textures acquired from video stills whose poses are similar to that of the virtual camera. For a 'well-sampled' space, this can create quite rich virtual playback.

Sampling image stills for delayed texture selection, or "Dynamic Textures". Orange cameras represent stored images used for rendering texture.
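As a sketch of how deferred selection might work, the snippet below scores each stored keyframe against the current virtual camera pose and textures from the closest match. The particular cost function (distance between camera centres plus a weighted viewing-direction angle) is an illustrative assumption, not necessarily the one used in the system.

```python
# A minimal sketch of deferred ("dynamic") texture selection: at view
# time, choose the stored keyframe whose pose best matches the virtual
# camera. The cost function here is an illustrative choice.
import numpy as np

def select_keyframe(virtual_pos, virtual_dir, keyframes, angle_weight=1.0):
    """Pick the keyframe image best suited to texture the current view.

    virtual_pos: (3,) virtual camera centre.
    virtual_dir: (3,) unit viewing direction of the virtual camera.
    keyframes: iterable of (position, unit_direction, image) tuples.
    """
    best_cost, best_image = np.inf, None
    for position, direction, image in keyframes:
        # Penalise separation of the camera centres plus the angle
        # between the two viewing directions.
        dist = np.linalg.norm(virtual_pos - position)
        angle = np.arccos(np.clip(np.dot(virtual_dir, direction), -1.0, 1.0))
        cost = dist + angle_weight * angle
        if cost < best_cost:
            best_cost, best_image = cost, image
    return best_image
```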

by Steven Lovegrove