Realtime Stereo Active Vision

The Active Vision Approach

An active vision system is one that is able to interact with its environment by altering its viewpoint rather than passively observing it, and by operating on sequences of images rather than on a single frame. Moreover, since its foveas can scan over the scene, the range of the visual scene is not restricted to that of the static view. The ability to physically track a target reduces motion blur, increasing target resolution for higher level tasks such as classification. Active Vision is close, in principle, to the biological systems that inspired it and so it seems intuitively acceptable that as a visual sensor (especially augmented with color) it is perfectly suited to human/robot interaction and autonomous robot navigation in human environments.

RSISE/NICTA works towards systems whose vision capabilities are similar to those of primates.


Generation 1: MAVis

MAVis was RSL's first attempt at an active vision system. It was designed as a monocular rig with the intention that it be extendable to a binocular configuration. Strongly inspired by biological vision systems, the group took the approach that the speed and precision of a vision system depend heavily on the performance of each individual "eye". With this in mind, MAVis was designed to minimise the rotational inertia of the moving parts of each "eye". This notion led to the use of cable drive transmission, which allowed the motors to be mounted away from the moving parts while also reducing the backlash and friction associated with gear-driven systems. MAVis achieved performance and precision similar to those of the human vision system.

Generation 2: HyDrA

HyDrA (or Hybrid Drive Active vision) is perhaps not the natural successor to MAVis with its tilt-axis mounted vergence motors, but its simple design allowed a prototype to be rapidly manufactured and early work on control and vision processing in the binocular domain to begin. Using HyDrA, progress was achieved in object tracking and object recognition. Its cable driven tilt axis also allowed for performance on par with the human vision system.

Generation 3: CeDAR

The true successor to MAVis is perhaps more suitably CeDAR (or Cable Drive Active Vision Robot). Because it is fully cable driven, only the cameras and their mounts contribute to the rotational inertia of the system's "eyes". Optimised in Pro-Engineer for maximum rigidity-to-weight ratio, the system is lightweight and incurs minimal deflection under high acceleration, which adds precision. Even with two fully motorised zoom digital cameras weighing 350 g each, the system still outperforms the human vision system (and most of the best high performance systems around the world that carry much smaller payloads). CeDAR incorporates a common tilt axis and two independent pan axes separated by a baseline of 30 cm (REF 2). All axes exhibit a range of motion of greater than 90 deg, a speed of greater than 600 deg/sec and an angular resolution of 0.01 deg. Synchronised images with a field of view of 45 deg are obtained from each camera at 30 fps at a resolution of 640x480 pixels, and distributed to a processing network. A dedicated motion control server reports the mechanical status of the viewing apparatus and accepts motion control commands.
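To give a feel for what these figures imply, the following is a minimal sketch of the basic geometry under a pinhole camera model. It assumes that the 45 deg field of view is horizontal and fixed (the cameras zoom, so in practice it varies) and that both cameras fixate a point straight ahead of the rig. The function names are hypothetical and are not part of CeDAR's software.

```python
import math

BASELINE_M = 0.30        # distance between the pan axes (from the text)
FOV_DEG = 45.0           # field of view (assumed horizontal, assumed fixed)
IMAGE_WIDTH_PX = 640     # image width (from the text)
ANGULAR_RES_DEG = 0.01   # axis angular resolution (from the text)


def focal_length_px(fov_deg=FOV_DEG, width_px=IMAGE_WIDTH_PX):
    """Pinhole focal length in pixels for a given horizontal field of view."""
    return (width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


def vergence_angle_deg(distance_m, baseline_m=BASELINE_M):
    """Total vergence angle when both cameras fixate a point straight ahead."""
    return math.degrees(2.0 * math.atan((baseline_m / 2.0) / distance_m))


def fixation_distance_m(vergence_deg, baseline_m=BASELINE_M):
    """Inverse of vergence_angle_deg: distance to the symmetric fixation point."""
    return (baseline_m / 2.0) / math.tan(math.radians(vergence_deg) / 2.0)


def axis_step_in_pixels(step_deg=ANGULAR_RES_DEG):
    """Image shift, near the image centre, caused by one axis resolution step."""
    return math.radians(step_deg) * focal_length_px()


if __name__ == "__main__":
    print("focal length (px):", focal_length_px())
    print("vergence at 1 m (deg):", vergence_angle_deg(1.0))
    print("pixels per axis step:", axis_step_in_pixels())
```

Under these assumptions a single 0.01 deg axis step moves the image by well under one pixel, so the quoted axis resolution is finer than the image sampling.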

Realtime Vision Processing

Active Mosaics

Because many useful algorithms have been developed for static stereo vision, one of the first milestones in the development of CeDAR's vision processing systems was the ability to use any static stereo algorithm on the active platform. To achieve this, a real-time active rectification technique was developed that gives the relationship between successive frames, and between left and right frames, such that global mosaics of the scene can be constructed (REF 4). Any static algorithm can then be converted to the active case simply by operating on the global mosaics.
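The idea of placing each frame into a global mosaic can be illustrated with a minimal sketch. It is not the rectification method of REF 4: it simply assumes a pinhole camera that rotates about its optical centre (ignoring any offset between the optical centre and the pan and tilt axes) and maps each pixel to a world azimuth and elevation, which then index an angular mosaic. All function names and parameters are hypothetical.

```python
import math


def pixel_to_gaze_angles(u, v, pan_deg, tilt_deg, focal_px, cx, cy):
    """Map pixel (u, v) of a panned and tilted camera to (azimuth, elevation) in degrees."""
    # Ray through the pixel in the camera frame (x right, y down, z forward).
    x, y, z = u - cx, v - cy, focal_px
    p, t = math.radians(pan_deg), math.radians(tilt_deg)
    # Pan: rotate about the camera's vertical (y) axis; positive pan looks right.
    x, z = x * math.cos(p) + z * math.sin(p), -x * math.sin(p) + z * math.cos(p)
    # Tilt: rotate about the shared horizontal (x) axis; positive tilt looks up.
    y, z = y * math.cos(t) - z * math.sin(t), y * math.sin(t) + z * math.cos(t)
    azimuth = math.degrees(math.atan2(x, z))
    elevation = math.degrees(math.atan2(-y, math.hypot(x, z)))
    return azimuth, elevation


def mosaic_cell(azimuth_deg, elevation_deg, cells_per_deg=10.0, span_deg=180.0):
    """Row and column in a mosaic covering +/- span_deg / 2 in both directions."""
    col = int(round((azimuth_deg + span_deg / 2.0) * cells_per_deg))
    row = int(round((span_deg / 2.0 - elevation_deg) * cells_per_deg))
    return row, col


if __name__ == "__main__":
    f, cx, cy = 772.5, 320.0, 240.0
    # The image centre of a camera panned by 20 deg lands at azimuth 20 deg.
    print(pixel_to_gaze_angles(cx, cy, 20.0, 0.0, f, cx, cy))
```

Because every pixel ends up in the same angular coordinates regardless of where the camera was pointing, an algorithm written for a static view can read from the mosaic instead of the raw frame.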

Bayesian Occupancy Grid

The next stage involved developing a 3D perception of where visual surfaces and free space are located in real dynamic scenes, and how these surfaces are moving, regardless of any deliberate motions of the viewing apparatus and variations in gaze fixation point. A Bayesian occupancy grid approach was developed to incorporate spatio-temporal information (REF 4).
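As an illustration of the general technique, the sketch below shows a textbook log-odds occupancy grid update. It is not CeDAR's implementation: the sensor model is reduced to a single per-observation occupancy probability, and the decay step is only an assumed, simple way of letting old evidence fade in a dynamic scene. The class and method names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def prob(log_odds):
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-log_odds))


class OccupancyGrid:
    """Sparse 3D occupancy grid storing one log-odds value per observed cell."""

    def __init__(self, prior=0.5):
        self.prior_log_odds = logit(prior)
        self.log_odds = {}  # maps (ix, iy, iz) to log-odds; missing cells hold the prior

    def update(self, cell, p_occupied_given_observation):
        """Binary Bayes filter: add the evidence of one observation in log-odds form."""
        current = self.log_odds.get(cell, self.prior_log_odds)
        evidence = logit(p_occupied_given_observation) - self.prior_log_odds
        self.log_odds[cell] = current + evidence

    def decay(self, rate=0.9):
        """Pull every cell back towards the prior (assumed model of forgetting)."""
        for cell, value in self.log_odds.items():
            self.log_odds[cell] = self.prior_log_odds + rate * (value - self.prior_log_odds)

    def probability(self, cell):
        return prob(self.log_odds.get(cell, self.prior_log_odds))


grid = OccupancyGrid()
grid.update((2, 3, 5), 0.7)  # a stereo match suggests a surface in this cell
grid.update((2, 3, 5), 0.7)  # a second, independent observation agrees
```

Two agreeing observations raise the belief above what either gives alone, while cells that have never been observed stay at the prior.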

Spatial and flow perceptions are peripheral responses; that is, they operate continually, over the entire visual field, regardless of the geometry of the active platform.

Figure: Spatial Perception

CeDAR can now cope with moving cameras, and has a perception of where free space and visual surfaces are in the scene. The next stage was to develop coordinated stereo fixation. Humans find it hard to fixate on free space; they are more interested in attending to visual surfaces. Accordingly, a robust Markov Random Field Zero Disparity Filter (MRF ZDF, REF 7) was developed to synthesize this property of the primate vision system. This is a foveal response, and permits CeDAR to focus on, track, and segment out objects of interest in the scene.
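The principle behind a zero disparity filter can be shown with a deliberately naive sketch: when both cameras fixate the same surface, that surface appears at the same image coordinates in the left and right views, so comparing the two images pixel by pixel picks it out. The code below does only this per-pixel comparison. It does not reproduce the MRF formulation of REF 7, and a test this simple also fires on noise-free uniform regions. The function name, threshold and synthetic images are assumptions made for illustration.

```python
def zero_disparity_mask(left, right, threshold=10):
    """Return 1 where left and right grey levels agree at the same pixel, else 0."""
    return [
        [1 if abs(l - r) <= threshold else 0 for l, r in zip(left_row, right_row)]
        for left_row, right_row in zip(left, right)
    ]


# Synthetic one-row example: a textured background seen with a 3 pixel disparity,
# and a fixated object occupying columns 6..9 in both views (zero disparity).
WIDTH = 16
left_row = [(i * 37) % 256 for i in range(WIDTH)]
right_row = [((i + 3) * 37) % 256 for i in range(WIDTH)]
for col in range(6, 10):
    left_row[col] = 200
    right_row[col] = 200

mask = zero_disparity_mask([left_row], [right_row])
```

In this example only the columns covered by the fixated object survive the filter, which is the segmentation behaviour the paragraph above describes.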

Figures: Saliency Scene, Saliency Map

Most recently, CeDAR has learnt to decide where to look all by itself. A saliency map is calculated that highlights the most interesting things in the scene. The approach incorporates much biological inspiration, such as the concept of active-dynamic Inhibition of Return (IOR - once we have considered an object, we are less likely to consider it again, at least in the short term - even if it moves), and a Task-Dependent Spatial Bias (TSB - biasing gaze fixation to help whatever task we are currently doing) (REF 8).
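How these three ingredients might combine can be sketched as follows. This is an assumed, simplified combination rule (saliency multiplied by the task bias and by one minus the inhibition), not the formulation of REF 8, and it does not move the inhibited region along with a moving object as the active-dynamic IOR in the text does. All names are hypothetical.

```python
def select_fixation(saliency, tsb, ior):
    """Return the (row, col) maximising saliency * task bias * (1 - inhibition)."""
    best_score, best_cell = -1.0, None
    for r, row in enumerate(saliency):
        for c, s in enumerate(row):
            score = s * tsb[r][c] * (1.0 - ior[r][c])
            if score > best_score:
                best_score, best_cell = score, (r, c)
    return best_cell


def inhibit(ior, cell, radius=1, strength=1.0):
    """Suppress a square neighbourhood around a location that was just attended."""
    r0, c0 = cell
    for r in range(max(0, r0 - radius), min(len(ior), r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(len(ior[0]), c0 + radius + 1)):
            ior[r][c] = max(ior[r][c], strength)


def decay(ior, rate=0.5):
    """Let inhibition fade so that attended locations become eligible again."""
    for row in ior:
        for c in range(len(row)):
            row[c] *= rate


# Two salient objects on a weakly salient background, with no task bias.
saliency = [[0.1] * 5 for _ in range(5)]
saliency[1][1] = 0.9
saliency[3][3] = 0.8
tsb = [[1.0] * 5 for _ in range(5)]
ior = [[0.0] * 5 for _ in range(5)]

first = select_fixation(saliency, tsb, ior)   # the most salient object
inhibit(ior, first)
second = select_fixation(saliency, tsb, ior)  # gaze moves to the other object
```

With inhibition applied after each fixation, gaze shifts from the most salient object to the next one instead of locking onto a single target, and the decay step lets it return later.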

As a whole, the system can perceive where free space and visual surfaces are in dynamic scenes while simultaneously saccading between regions and objects of interest and maintaining its gaze upon those regions as it sees fit, regardless of their motion. It also keeps a short-term memory of where such regions are and how they are moving, and identifies which pixels in its view are on the object of interest and which are not. Vision processing occurs on a small network of computers. Cue maps that are known to exist in the human brain (via neural correlates) are implemented on servers in the processing network. The vision processing network has been shown to exhibit similarity with the processing centres in the primate early visual brain.

Please refer to the demo footage (below) for a better understanding of the system.

Papers (chronological development)

REF 1 ACRA03.pdf - CeDAR with driver assistance systems
REF 2 ICVS03.pdf - CeDAR: agile vision mechanism
REF 3 MVA03.pdf - CeDAR: a real world vision system
REF 4 ACRA04.pdf - Active rectification: static algorithms on active platforms
REF 5 IVS05.pdf - Active spatio-temporal scene awareness
REF 6 FSR05.pdf - Active bimodal scene awareness: peripheral and foveal
REF 7 CVIU05.pdf - MRF ZDF active hand tracking and segmentation
REF 8 EPIROB06.pdf - Real-time bio-inspired active-dynamic coordinated fixation and segmentation

Demo Footage - Active Stereo


Andrew Dankers - Web Page
Feedback & Queries: Andrew Dankers
Date Last Modified: Monday, 3rd Oct 2006