A Gestural Interface to Virtual Environments

Three-dimensional virtual environments present new challenges for human-computer interaction.  Current input devices provide little more than 3D "point and click" interaction whilst tethering the user to the system by restrictive cabling or gloves.  In contrast, video-tracked hand gestures provide a natural and intuitive means of interacting with the environment in an accoutrement-free manner.
In this project, we are investigating the use of vision-based systems to track the hand in a gesture-based interface to 3D immersive environments with the aim of providing a more natural, less restrictive interface for manipulating objects in 3D.

System Overview

We have developed a stereo vision-based system for real-time tracking of the position and orientation of the user's hand and classification of gestures.  The system uses a combination of model- and feature-based methods to acquire, track and classify the hand within the video images.  Model-based template matching is used to track features of the hand in real time.  Skin colour detection is used to locate the hand blob within the image on startup and if tracking fails.  Features extracted from the hand blob are also used in classifying gestures.
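The template-matching step can be illustrated with a minimal normalised cross-correlation (NCC) search.  This is a generic sketch of the technique, not the system's actual real-time stereo implementation; the function name and brute-force search are our own.

```python
import numpy as np

def ncc_match(image, template):
    """Slide `template` over `image` and return the best-match position
    (row, col) and a normalised cross-correlation score in [-1, 1]."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt((t * t).sum())
    best_score, best_pos = -1.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            w = image[y:y + th, x:x + tw]
            wz = w - w.mean()
            denom = np.sqrt((wz * wz).sum()) * t_norm
            if denom == 0:
                continue  # flat window: correlation undefined, skip it
            score = (wz * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

In a tracker, the score doubles as a confidence measure: a match near 1.0 means the feature is being tracked reliably, while a low score signals that tracking may have failed.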

The gesture interface system was developed as an interface to virtual environments and has been used to control navigation and manipulation of 3D objects.  The system is used in conjunction with the joint CSIRO/ANU Virtual Environments lab.  The environment consists of a Barco Baron projection table for 3D graphics display with CrystalEyes stereo shutter glasses for stereoscopic viewing.  The environment is powered by an SGI Onyx2.  Polhemus FastTrak sensors and stylus are available for non-gestural input.

System setup

Robust & Real-Time Tracking in 3D

The tracking system is able to track multiple features on the user's hand at frame rate (30 Hz).  When tracking fails, the system can relocate the hand in the image within 2-3 frames.  The images below show examples of the system tracking a hand.  The white squares depict the tracked features, with the size of each square indicating the certainty of tracking for that feature: the larger the square, the more confident the system is in the tracking result.

Skin Colour Detection

In order to start tracking when the user's hand enters the working volume or when tracking fails, some method of locating the hand within the images is needed.  We use skin colour detection to locate skin coloured blobs within the images, and further image processing to detect the hand.  Once the hand is found, its location can be used to restart the tracking of the model.
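As an illustration of the blob-location step, the sketch below thresholds the image in normalised r-g chromaticity space and takes the centroid of the largest connected component.  The threshold ranges here are assumed, illustrative values, not the system's tuned ones.

```python
import numpy as np
from collections import deque

def skin_mask(rgb, r_range=(0.35, 0.55), g_range=(0.25, 0.40)):
    """Threshold an RGB float image (H, W, 3) in normalised r-g
    chromaticity space.  Ranges are illustrative, not tuned values."""
    s = rgb.sum(axis=2) + 1e-9
    r, g = rgb[..., 0] / s, rgb[..., 1] / s
    return ((r >= r_range[0]) & (r <= r_range[1]) &
            (g >= g_range[0]) & (g <= g_range[1]))

def largest_blob_centroid(mask):
    """Return the centroid (row, col) of the largest 4-connected blob,
    or None if the mask is empty."""
    visited = np.zeros_like(mask, dtype=bool)
    best = []
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not visited[sy, sx]:
                blob, q = [], deque([(sy, sx)])
                visited[sy, sx] = True
                while q:  # breadth-first flood fill of one blob
                    y, x = q.popleft()
                    blob.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            q.append((ny, nx))
                if len(blob) > len(best):
                    best = blob
    if not best:
        return None
    ys, xs = zip(*best)
    return sum(ys) / len(ys), sum(xs) / len(xs)
```

The centroid of the largest skin-coloured blob gives the tracker an initial hand position from which template tracking can be restarted.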


Gesture Classification

Classification of the hand shape is required to identify when the user is showing different gestures, and thus wishes to perform a different action.  Classification of gesture is possible using a variety of methods including hidden Markov models, neural networks and probabilistic models.  We use a statistical model to determine which gesture (if any) in the gesture set is being displayed.  Image features including moments are used to create a feature vector from which a classification is made.
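The text does not spell out the statistical model, so the sketch below substitutes a plain nearest-prototype classifier over normalised central moments of the hand silhouette.  Treat the particular feature set and the rejection threshold as illustrative assumptions.

```python
import numpy as np

def moment_features(mask):
    """Scale-invariant normalised central moments (eta_pq) of a binary
    hand silhouette, used as a simple shape descriptor."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(ys))
    cy, cx = ys.mean(), xs.mean()
    feats = []
    for p, q in [(2, 0), (0, 2), (1, 1), (3, 0), (0, 3), (2, 1), (1, 2)]:
        mu = ((ys - cy) ** p * (xs - cx) ** q).sum()
        feats.append(mu / m00 ** (1 + (p + q) / 2.0))
    return np.array(feats)

def classify(feats, prototypes, reject_dist=None):
    """Nearest-prototype classifier: compare the feature vector against a
    stored mean vector per gesture, optionally rejecting poor matches."""
    best_label, best_d = None, np.inf
    for label, proto in prototypes.items():
        d = np.linalg.norm(feats - proto)
        if d < best_d:
            best_label, best_d = label, d
    if reject_dist is not None and best_d > reject_dist:
        return None  # nothing in the gesture set matches well enough
    return best_label
```

The optional rejection threshold implements the "if any" case: when the hand shape is far from every stored prototype, no gesture is reported.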


Navigation Control - Terrain Flythrough

A common task in 3D interaction is user control of the viewpoint within a scene or virtual world.  The user should be able to move easily through the scene.  As a demonstration of gesture-based viewpoint control, we constructed a terrain flythrough.  The user steers by tilting the hand, as in the image below.  Forward and backward motion is controlled by the position of the hand in space: moving the hand forward moves the viewpoint forwards, and moving it back moves backwards.
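A control scheme like this reduces to a small mapping from hand pose to viewpoint velocity.  The sketch below is an illustrative mapping, not the published one; the gain, rest position and dead zone are our own assumptions.

```python
def flythrough_velocity(hand_pos, hand_tilt, rest_z=0.0, gain=2.0, dead_zone=0.02):
    """Map hand pose to a viewpoint update: tilt (roll, pitch in radians)
    steers the heading, and the hand's forward displacement from a rest
    position sets the speed.  Returns (yaw_rate, pitch_rate, speed)."""
    roll, pitch = hand_tilt
    yaw_rate = gain * roll      # tilt left/right to turn
    pitch_rate = gain * pitch   # tilt up/down to climb or dive
    dz = hand_pos[2] - rest_z   # forward displacement of the hand
    # a small dead zone keeps the viewpoint still when the hand is at rest
    speed = 0.0 if abs(dz) < dead_zone else gain * dz
    return yaw_rate, pitch_rate, speed
```

Each frame, the viewpoint is advanced by these rates, so holding the hand level and at the rest position leaves the view stationary.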

3D Object Manipulation - Blocks

Along with viewpoint control, object manipulation is a fundamental interaction requirement in 3D virtual environments.  Object manipulations include selection, translation, rotation and scaling.
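The text does not specify how hand motion maps onto these manipulations, but translation, rotation and scaling all reduce to composing standard 4x4 homogeneous transforms applied to the selected object.  A minimal sketch:

```python
import numpy as np

def translate(tx, ty, tz):
    T = np.eye(4)
    T[:3, 3] = (tx, ty, tz)
    return T

def scale(s):
    S = np.eye(4)
    S[0, 0] = S[1, 1] = S[2, 2] = s
    return S

def rotate_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(4)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

# Compose: scale the selected object by 2, rotate it 90 degrees about z,
# then translate it one unit along x (the rightmost transform applies first).
M = translate(1, 0, 0) @ rotate_z(np.pi / 2) @ scale(2)
v = np.array([1.0, 0.0, 0.0, 1.0])   # object vertex in homogeneous coordinates
moved = M @ v                        # ends up at approximately (1, 2, 0)
```

Keeping the manipulations as matrices means a grab-and-move interaction is just a per-frame update of one composed transform.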

Multidimensional Input - Sound Space Exploration

An advantage of gesture over other 3D trackers is the ability to provide multidimensional input.  While a Polhemus stylus or similar device provides position and orientation for a single point in space (the stylus tip), a gesture interface can input many positions simultaneously, since the system tracks multiple features.  To demonstrate this ability, we developed "HandSynth" - a tool for exploring multidimensional sound synthesis algorithms.  In HandSynth, the position of each fingertip is tied to a different parameter within the sound generator.  Moving the hand about in space and changing its orientation generates a variety of sounds, synthesised by changing up to 15 parameters at the same time.

We used HandSynth to simultaneously control five FM synths, each with three parameters: carrier frequency, modulation frequency and modulation depth.  Exploring the sound space this way is much quicker and easier than the conventional approach of twiddling individual knobs and sliders for many tedious hours to understand the complex perceptual interactions between parameters.

HandSynth was also used as an interface for non-linear navigation within a three-minute sampled sound file.  Movement from left to right acted as fast forward or rewind.  Front-back movement provided normal playback speed from the current position.  Up and down movement allowed slow motion forwards and backwards.  This interface allows the user to quickly hear an overview with left-right hand movement, to zoom in on detail, and to gain random access into the file based on the position of the hand.

The synthesis and navigation were combined to create a compositional tool in which three parameters of reverb and flanging effects were controlled by the spatial positions of two fingertips, while the other three fingertips accessed samples from the sound file to be processed through the effects.
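The fifteen-parameter mapping (five synths, three parameters each) can be sketched as below.  The parameter ranges, sample rate and function names are our own illustrative choices, not HandSynth's actual scaling.

```python
import numpy as np

SAMPLE_RATE = 8000  # illustrative rate, kept low for a short example

def fm_synth(dur, fc, fm, depth, sr=SAMPLE_RATE):
    """Basic two-operator FM: y(t) = sin(2*pi*fc*t + depth*sin(2*pi*fm*t))."""
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * fc * t + depth * np.sin(2 * np.pi * fm * t))

def fingertips_to_params(tips):
    """Map five fingertip positions (x, y, z in a unit working volume) to
    15 synthesis parameters: one (fc, fm, depth) triple per finger.
    The ranges are assumed values for illustration."""
    params = []
    for x, y, z in tips:
        fc = 100 + 900 * x   # carrier frequency, 100-1000 Hz
        fm = 1 + 99 * y      # modulation frequency, 1-100 Hz
        depth = 10 * z       # modulation index, 0-10
        params.append((fc, fm, depth))
    return params

# Hypothetical fingertip positions from the tracker, one (x, y, z) per finger.
tips = [(0.5, 0.2, 0.1), (0.6, 0.3, 0.2), (0.4, 0.5, 0.0),
        (0.7, 0.1, 0.4), (0.3, 0.6, 0.3)]
mix = sum(fm_synth(0.1, fc, fm, d) for fc, fm, d in fingertips_to_params(tips)) / 5
```

Re-evaluating the mapping every frame turns continuous hand motion into a continuous trajectory through the 15-dimensional parameter space.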
Finally, we used the sounds from HandSynth as input to a music visualisation based on a flock of 'boid' artificial lifeforms that respond to different frequencies in the sound.  The motion of the boids is rendered graphically, providing a visual representation of the auditory information.  With HandSynth providing the audio input, the boids respond to the user's hand movements.


Publications

R. O'Hagan, A. Zelinsky and S. Rougeaux, "Visual Gesture Interfaces to Virtual Environments", Interacting with Computers, special issue, to appear (invited paper), 2001.

R. O'Hagan and A. Zelinsky, "Vision-based Gesture Interfaces to Virtual Environments", Proceedings of the 1st Australasian User Interfaces Conference (AUIC2000), Canberra, Australia, pp. 73-80, January 2000.

R. O'Hagan and A. Zelinsky, "Finger Track - A Robust and Real-Time Gesture Interface", Advanced Topics in Artificial Intelligence: Proceedings of the Tenth Australian Joint Conference on Artificial Intelligence (AI'97), Perth, Australia, pp. 475-484, December 1997.


Feedback & Queries: Rochelle O'Hagan

Date Last Modified: Sunday, 8th Jul 2001