Emerging Technologies
of Augmented Reality:
Interfaces and Design
Michael Haller
Upper Austria University of Applied Sciences, Austria
Mark Billinghurst
Human Interface Technology Laboratory, New Zealand
Bruce H. Thomas
Wearable Computer Laboratory, University of South Australia, Australia
Hershey • London • Melbourne • Singapore
IDEA GROUP PUBLISHING
Acquisition Editor: Kristin Klinger
Senior Managing Editor: Jennifer Neidig
Managing Editor: Sara Reed
Assistant Managing Editor: Sharon Berger
Development Editor: Kristin Roth
Copy Editor: Larissa Vinci
Typesetter: Marko Primorac
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Published in the United States of America by
Idea Group Publishing (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
and in the United Kingdom by


Idea Group Publishing (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Copyright © 2007 by Idea Group Inc. All rights reserved. No part of this book may be reproduced in any
form or by any means, electronic or mechanical, including photocopying, without written permission from the
publisher.
Product or company names used in this book are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Emerging technologies of augmented reality : interfaces and design / Michael Haller, Mark Billinghurst, and
Bruce Thomas, editors.
p. cm.
Summary: "This book provides a good grounding in the main concepts and terminology for Augmented Reality (AR), with an emphasis on practical AR techniques (from tracking algorithms to design principles for AR interfaces). The targeted audience is computer-literate readers who wish to gain an initial understanding of this exciting and emerging technology"--Provided by publisher.
Includes bibliographical references and index.
ISBN 1-59904-066-2 (hardcover) -- ISBN 1-59904-067-0 (softcover) -- ISBN 1-59904-068-9 (ebook)
1. Human-computer interaction--Congresses. 2. Virtual reality--Congresses. 3. User interfaces (Computer systems) I. Haller, Michael, 1974- II. Billinghurst, Mark, 1967- III. Thomas, Bruce (Bruce H.)
QA76.9.H85E48 2007
004.01'9--dc22
2006027724
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Table of Contents
Preface vi
Section I: Introduction to Technologies that Support Augmented Reality
Chapter I
Vision Based 3D Tracking and Pose Estimation for Mixed Reality 1
Pascal Fua, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Vincent Lepetit, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Chapter II
Developing AR Systems in the Presence of Spatial Uncertainty 23
Cindy M. Robertson, Georgia Institute of Technology, TSRB, USA
Enylton Machado Coelho, Georgia Institute of Technology, TSRB, USA
Blair MacIntyre, Georgia Institute of Technology, TSRB, USA
Simon Julier, Naval Research Laboratory, USA
Chapter III
An Introduction to Head Mounted Displays for Augmented Reality 43
Kiyoshi Kiyokawa, Osaka University, Japan
Chapter IV
Projector-Based Augmentation 64
Oliver Bimber, Bauhaus University, Germany
Chapter V
Mobile Phone Based Augmented Reality 90
Anders Henrysson, Norrköping Visualisation and Interaction Studio, Sweden
Mark Ollila, Norrköping Visualisation and Interaction Studio, Sweden
Mark Billinghurst, Human Interface Technology Laboratory, New Zealand
Chapter VI
Representing and Processing Screen Space in Augmented Reality 110
Blaine Bell, Columbia University, USA
Steven Feiner, Columbia University, USA
Section II: Augmented Reality Development Environments
Chapter VII
Abstraction and Implementation Strategies for Augmented Reality Authoring 138
Florian Ledermann, Vienna University of Technology, Austria
István Barakonyi, Graz University of Technology, Austria
Dieter Schmalstieg, Vienna University of Technology, Austria
Chapter VIII
Supporting Early Design Activities for AR Experiences 160
Maribeth Gandy, Georgia Institute of Technology, USA
Blair MacIntyre, Georgia Institute of Technology, USA
Steven Dow, Georgia Institute of Technology, USA
Jay David Bolter, Georgia Institute of Technology, USA
Chapter IX
Real-Time 3D Design and Modelling of Outdoor Structures Using Mobile Augmented Reality Systems 181
Wayne Piekarski, University of South Australia, Australia
Chapter X
The Evolution of a Framework for Mixed Reality Experiences 198
Charles E. Hughes, University of Central Florida, USA
Christopher B. Stapleton, Simiosys LLC, USA
Matthew R. O’Connor, University of Central Florida, USA
Section III: Interface Design and Evaluation of Augmented Reality Applications
Chapter XI
Lessons Learned in Designing Ubiquitous Augmented Reality User Interfaces 218
Christian Sandor, Technische Universität München, Germany
Gudrun Klinker, Technische Universität München, Germany
Chapter XII
Human Communication in Collaborative Augmented Reality Systems 236
Kiyoshi Kiyokawa, Osaka University, Japan
Chapter XIII
Interaction Design for Tangible Augmented Reality Applications 261
Gun A. Lee, Electronics and Telecommunications Research Institute, Korea
Gerard J. Kim, Korea University, Korea
Mark Billinghurst, Human Interface Technology Laboratory, New Zealand
Section IV: Case Studies of Augmented Reality Applications
Chapter XIV
Industrial Augmented Reality Applications 283
Holger Regenbrecht, University of Otago, New Zealand
Chapter XV
Creating Augmented Virtual Environments 305
Ulrich Neumann, University of Southern California, USA
Suya You, University of Southern California, USA
Chapter XVI
Making Memories of a Lifetime 329
Christopher B. Stapleton, Simiosys LLC, USA
Charles E. Hughes, University of Central Florida, USA
Chapter XVII
Social and Physical Interactive Paradigms for Mixed Reality Entertainment 352
Adrian David Cheok, National University of Singapore, Singapore
Chapter XVIII
The Future of Augmented Reality Gaming 367
Bruce H. Thomas, Wearable Computer Laboratory, University of South Australia, Australia
About the Authors 384
Index 391

Preface
Motivation
Augmented reality (AR) research aims to develop technologies that allow the real-time fusion of computer-generated digital content with the real world. Unlike virtual reality (VR)
technology, which completely immerses users inside a synthetic environment, augmented
reality allows the user to see three-dimensional virtual objects superimposed upon the real
world. Both AR and VR are part of a broader reality–virtuality continuum termed “mixed
reality” (MR) by Milgram and Kishino (1994) (see Figure 1). In their view, a mixed reality
environment is “one in which real world and virtual world objects are presented together
within a single display anywhere between the extrema of the virtuality continuum.”
Figure 1. Reality-virtuality continuum (Milgram & Kishino, 1994): from the real environment, through augmented reality and augmented virtuality, to the purely virtual environment; everything between the two extrema is mixed reality
State of the Art
Mixed reality technology can enhance users’ perception and interaction with the real world
(Azuma et al., 2001), particularly through the use of augmented reality. Using Azuma’s
(1997) definition, an AR system has to fulfill the following three characteristics:

• It combines both the real and virtual content,
• The system is interactive and performs in real-time, and
• The virtual content is registered with the real world.
Previous research has shown that AR technology can be applied in a wide range of areas
including education, medicine, engineering, military, and entertainment. For example, virtual maps can be overlaid on the real world to help people navigate, medical imagery can appear on a real patient's body, and architects can see virtual buildings in place before they are built.
Analyzing the proceedings of the leading AR/MR research symposium (The International
Symposium on Mixed and Augmented Reality), we can identify several significant research
directions, including:
• Tracking techniques: How to achieve robust and accurate overlay of virtual imagery
on the real world
• Display technologies: Head mounted, handheld, and projection displays for AR
• Mobile augmented reality: Using mobile computers to develop AR applications that
can be used in outdoor settings
• Interaction techniques: Methods for interacting with AR content
• Novel augmented reality applications
Overview
Although the eld of mixed reality has grown signicantly over the last decade, there have
been few published books about augmented reality, particularly the interface design aspects.
Emerging Technologies of Augmented Reality: Interfaces and Design is written to address
this need. It provides a good grounding of the main concepts of augmented reality with a
particular emphasis on user interfaces and design and practical AR techniques (from track-
ing-algorithms to design principles for AR interfaces).
A wide range of experts from around the world have provided fully peer-reviewed chapters for this book. The targeted audience is computer-literate readers who wish to gain an initial understanding of this exciting and emerging technology. This book may be used as the basis for a graduate class or as an introduction for researchers who want to explore the field of user interfaces and design techniques for augmented reality.

viii
Book Structure and Use
This book is structured around the following four key topics:
• Technologies that support augmented reality
• Augmented reality development environments
• Interface design and evaluation of augmented reality applications
• Case studies of augmented reality applications
The rst section, Introduction to Technologies that Support Augmented Reality, provides
a concise overview of important AR technologies. These chapters examine a wide range of
technologies, balanced between established and emerging new technologies. This insight
provides the reader with a good grounding of the key technical concepts and challenges
developers face when building AR systems. The major focus of these chapters is on track-
ing, display, and presentation technologies.
Chapter I observes that mixed reality applications require accurate knowledge of the relative positions of the camera and the scene. Many technologies have tried to achieve this goal, and computer vision seems to be the only one that has the potential to yield non-invasive, accurate, and low-cost solutions to this problem. In this chapter, the authors discuss some of the most promising computer vision approaches, their strengths, and their weaknesses.
Chapter II introduces spatially adaptive augmented reality as an approach to dealing with the registration errors introduced by spatial uncertainty. The authors argue that if programmers are given simple estimates of registration error, they can create systems that adapt to dynamically changing amounts of spatial uncertainty, and that it is this ability to adapt to spatial uncertainty that will be the key to creating augmented reality systems that work in real-world environments.
Chapter III discusses the design and principles of head mounted displays (HMDs) for augmented reality, as well as state-of-the-art examples. After a brief history of head mounted displays, the human visual system, and application examples of see-through HMDs, the author describes the design and principles of HMDs, such as typical configurations of optics, typical display elements, and the major categories of HMDs. For researchers, students, and HMD developers, this chapter is a good starting point for learning the basics, state-of-the-art technologies, and future research directions for HMDs.
Chapter IV shows how, in contrast to HMD-based systems, projector-based augmentation approaches combine the advantages of well-established spatial virtual reality with those of spatial augmented reality. Immersive, semi-immersive, and augmented visualizations can be realized in everyday environments, without the need for special projection screens and dedicated display configurations. This chapter describes projector-camera methods and multi-projector techniques that aim at correcting geometric aberrations, compensating for local and global radiometric effects, and improving the focus properties of images projected onto everyday surfaces.
Mobile phones are evolving into the ideal platform for portable augmented reality. In
Chapter V, the authors describe how augmented reality applications can be developed
for mobile phones and the interaction metaphors that are ideally suited for this platform.
ix
Several sample applications are described which explore different interaction techniques.
The authors also present a user study showing that moving the phone to interact with virtual
content is an intuitive way to select and position virtual objects.
In Chapter VI, the authors describe how to compute a 2D screen-space representation that corresponds to the visible portions of the projections of 3D AR objects on the screen. They describe in detail two visible-surface determination algorithms that are used to generate these representations. They compare the performance and accuracy tradeoffs of these algorithms, and present examples of how to use their representation to satisfy visibility constraints that avoid unwanted occlusions, making it possible to label and annotate objects in 3D environments.
The second section, Augmented Reality Development Environments, examines frameworks, toolkits, and authoring tools that are the current state of the art for the development of AR applications. As has been stated in many disciplines, "Content is King!" For AR, this is indeed very true, and these chapters provide the reader with an insight into this important emerging area. The concepts covered vary from staging complete AR experiences to modeling 3D content for AR.
AR application development is still lacking advanced authoring tools—even the simple presentation of information, which should not require any programming, is not systematically addressed by development tools. In Chapter VII, the authors present APRIL, the Augmented Presentation and Interaction Language. APRIL is an authoring platform for AR applications that provides concepts and techniques that are independent of specific applications or target hardware platforms, and should be suitable for raising the level of abstraction at which AR content creators can operate.
Chapter VIII presents DART, the Designer's Augmented Reality Toolkit, an authoring environment for rapidly prototyping augmented reality experiences. The authors summarize the most significant problems faced by designers working with AR in the real world and use DART as an example to guide a discussion of the AR design process. DART is significant because it is one of the first tools designed to allow non-programmers to rapidly develop AR applications. If AR applications are to become mainstream, there will need to be more tools like this.
Augmented reality techniques can be used to construct virtual models in an outdoor environment. Chapter IX presents a series of new AR user interaction techniques to support the
capture and creation of 3D geometry of large outdoor structures. Current scanning technolo-
gies can be used to capture existing physical objects, while construction at a distance also
allows the creation of new models that exist only in the mind of the user. Using a single AR
interface, users can enter geometry and verify its accuracy in real-time. This chapter presents
a number of different construction-at-a-distance techniques, which are demonstrated with
examples of real objects that have been modeled in the real world.
Chapter X describes the evolution of a software system specifically designed to support the creation and delivery of mixed reality experiences. The authors first describe some of
the attributes required of such a system. They then present a series of MR experiences that
they have developed over the last four years, with companion sections on lessons learned
and lessons applied. The authors’ goals are to show the readers the unique challenges in
developing an MR system for multimodal, multi-sensory experiences, and to demonstrate
how developing MR applications informs the evolution of such a framework.
x
The next section, Interface Design and Evaluation of Augmented Reality Applications, describes current AR user interface technologies with a focus on design issues. AR is
an emerging technology; as such, it does not have a set of agreed design methodologies or
evaluation techniques. These chapters present the opinions of experts in the areas of design
and evaluation of AR technology, and provide a good starting point for the development of
your next AR system.
Ubiquitous augmented reality (UAR) is an emerging human-computer interaction technique,
arising from the convergence of augmented reality and ubiquitous computing. In UAR,
visualizations can augment the real world with digital information, and interaction with the
digital content can follow a tangible metaphor. Both the visualization and interaction should
adapt according to the user’s context and are distributed on a possibly changing set of devices.
Current research problems for user interfaces in UAR are software infrastructures, authoring tools, and a supporting design process. The authors in Chapter XI present case studies
of how they have used a systematic design space analysis to carefully narrow the range of available design options. The next step is to use interactive, possibly immersive tools to support interdisciplinary brainstorming sessions, and several such tools for UAR are presented.
The main goal of Chapter XII is to give characteristics, evaluation methodologies, and research examples of collaborative augmented reality systems from the perspective of human-to-human communication. Starting with a classification of conventional and 3D collaborative systems, the author discusses design considerations of collaborative AR systems from the perspective of human communication. Moreover, he presents different evaluation methodologies for human communication behaviors and shows a variety of collaborative AR systems with regard to the display devices used. The chapter is a good starting point for learning about existing collaborative AR systems, their advantages, and their limitations. It will also contribute to the selection of appropriate hardware configurations and software designs of a collaborative AR system for given conditions.
Chapter XIII describes the design of interaction methods for tangible augmented reality applications. First, the authors describe the general concept of a tangible augmented reality interface and review its various successful applications, focusing on their interaction designs. Next, they classify and consolidate these interaction methods into common tasks and interaction schemes. Finally, they present general design guidelines for interaction methods in tangible AR applications. The principles presented in this chapter will help developers design interaction methods for tangible AR applications in a more structured and efficient way, and bring tangible AR interfaces into more widespread use.
The nal section, Case Studies of Augmented Reality Applications, provides an explana-
tion of AR through one or more closely related real case studies. Through the examination of
a number of successful AR experiences, these chapters answer the question, “What makes
AR work?” The case studies cover a range of applications from industrial to entertainment,
and provide the reader with a rich understand of the process of developing successful AR
environments.
Chapter XIV explains and illustrates the different types of industrial augmented reality (IAR) applications and shows how they can be classified according to their purpose and degree of maturity. The information presented here provides valuable insights into the underlying principles and issues associated with bringing augmented reality applications from the laboratory into an industrial context.
xi
Augmented reality typically fuses computer graphics onto images or direct views of a scene. In Chapter XV, an alternative augmentation approach is described, in which a real scene is captured as video imagery from one or more cameras and these images are inserted into a corresponding 3D scene model or virtual environment. This arrangement is termed an augmented virtual environment (AVE), and it produces a powerful visualization of the dynamic activities observed by the cameras. This chapter describes the AVE concept and the major technologies needed to realize such systems. AVEs could be used in security and command-and-control applications to create an intuitive way to monitor remote environments.
Chapter XVI explores how mixed reality (MR) allows the magic of virtuality to escape the confines of the computer and enter our lives to potentially change the way we play, work, train, learn, and even shop. Case studies demonstrate how emerging functional capabilities will depend upon new artistic conventions to spark the imagination, enhance human experience, and lead to subsequent commercial success.
In Chapter XVII, the author explores the applications of mixed reality technology for future social and physical entertainment systems. A variety of case studies show the very broad and significant impacts of mixed reality technology on human interactivity with regard to entertainment. The MR entertainment systems described incorporate different technologies, ranging from current mainstream ones such as GPS tracking, Bluetooth, and RFID tags to pioneering research in vision-based tracking, augmented reality, tangible interaction techniques, and live 3D mixed reality capture systems.
Entertainment systems are one of the more successful uses of augmented reality technologies in real-world applications. Chapter XVIII provides insights into the future directions of the use of augmented reality in gaming applications. This chapter explores a number of advances in technologies that may enhance augmented reality gaming. The features of both indoor and outdoor augmented reality are examined in the context of their desired attributes for the gaming community. A set of concept games for outdoor augmented reality is presented to highlight novel features of this technology.
As can be seen from the four key focus areas, a number of different topics have been presented. Augmented reality encompasses many aspects, so it is impossible to cover all of the research and development activity in one book. This book is intended to support readers with different interests in augmented reality and to give them the foundation that will enable them to design the next generation of AR applications. It is not a traditional textbook that should be read from front to back; rather, the reader can pick and choose the topics of interest and use the material presented here as a springboard to further their knowledge in this fast-growing field.
As editors, it is our hope that this work will be the first of a number of books in the field that will help capture the existing knowledge and train new researchers in this exciting area.
References
Azuma, R. (1997). A survey of augmented reality. Presence: Teleoperators and Virtual Environments, 6(4), 355-385.
Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., & MacIntyre, B. (2001). Recent
advances in augmented reality. IEEE Computer Graphics and Applications, 21(6),
34-47.

Milgram, P., & Kishino, F. (1994, December). A taxonomy of mixed reality visual displays. IEICE Transactions on Information Systems, E77-D(12), 1321-1329.
Acknowledgments
First of all, we would like to thank our authors. It always takes more time than expected to write a chapter, and all of our authors did a great job. Special thanks to all the staff at Idea Group Inc. who were always there to help in the production process, and in particular to our development editor, Kristin Roth! The different chapters benefited from the patient attention of the anonymous reviewers. They include Blaine Bell, Oliver Bimber, Peter Brandl, Wilhelm Burger, Adrian D. Cheok, Ralf Dörner, Steven Feiner, Maribeth Gandy, Christian Geiger, Raphael Grasset, Tobias Höllerer, Hirokazu Kato, Kiyoshi Kiyokawa, Gudrun Klinker, Gun A. Lee, Ulrich Neumann, Volker Paelke, Wayne Piekarski, Holger Regenbrecht, Christian Sandor, Dieter Schmalstieg, and Jürgen Zauner. We thank them for providing constructive and comprehensive reviews.
Michael Haller, Austria
Mark Billinghurst, New Zealand
Bruce H. Thomas, Australia
June 2006
Section I:
Introduction to
Technologies that Support
Augmented Reality
Chapter I
Vision Based 3D Tracking
and Pose Estimation for
Mixed Reality

Pascal Fua, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Vincent Lepetit, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Abstract
Mixed reality applications require accurate knowledge of the relative positions of the camera and the scene. When either of them moves, this means keeping track in real-time of all six degrees of freedom that define the camera position and orientation relative to the scene, or equivalently, the 3D displacement of an object relative to the camera. Many technologies have tried to achieve this goal. However, computer vision is the only one that has the potential to yield non-invasive, accurate, and low-cost solutions to this problem, provided that one is willing to invest the effort required to develop sufficiently robust algorithms. In this chapter, we therefore discuss some of the most promising approaches, their strengths, and their weaknesses.
Introduction
Tracking an object in a video sequence means continuously identifying its location when either the object or the camera is moving. More specifically, 3D tracking aims at continuously recovering all six degrees of freedom that define the camera position and orientation relative to the scene, or equivalently, the 3D displacement of an object relative to the camera.
Many other technologies besides vision have been tried to achieve this goal, but they all have
their weaknesses. Mechanical trackers are accurate enough, although they tether the user to
a limited working volume. Magnetic trackers are vulnerable to distortions caused by metal in the environment, a common occurrence, and they also limit the range of displacements.
Ultrasonic trackers suffer from noise and tend to be inaccurate at long ranges because of
variations in the ambient temperature. Inertial trackers drift with time.
By contrast, vision has the potential to yield non-invasive, accurate, and low-cost solutions to this problem, provided that one is willing to invest the effort required to develop sufficiently robust algorithms. In some cases, it is acceptable to add fiducials, such as LEDs or special markers, to the scene or target object to ease the registration task. Of course, this assumes that one or more fiducials are visible at all times; otherwise, the registration falls apart. Moreover, it is not always possible to place fiducials. For example, augmented reality end-users do not like markers because they are visible in the scene, and it is not always possible to modify the environment before the application has to run.
It is therefore much more desirable to rely on naturally present features, such as edges, corners, or texture. Of course, this makes tracking far more difficult. Finding and following feature points or edges on many everyday objects is sometimes difficult because there may only be a few of them. Total, or even partial, occlusion of the tracked objects typically results in tracking failure. The camera can easily move too fast, so that the images are motion blurred; the lighting during a shot can change significantly; reflections and specularities may confuse the tracker. Even more importantly, an object may drastically change its aspect very quickly due to displacement. For example, this happens when a camera films a building and goes around the corner, causing one wall to disappear and a new one to appear. In such cases, the features to be followed always change, and the tracker must deal with features coming in and out of the picture. Next, we focus on solutions to these difficult problems and show how planar, non-planar, and even deformable objects can be handled.
For the sake of completeness, we provide a brief description of the camera models that all these techniques rely on, as well as pointers to useful implementations and more extensive descriptions, in the appendix at the end of this chapter.
Fiducial-Based Tracking
Vision-based 3D tracking can be decomposed into two main steps: first, image processing to extract some information from the images, and second, the pose estimation itself. The addition to the scene of fiducials, also called landmarks or markers, greatly helps both steps.
They constitute image features that are easy to extract, and they provide reliable, easy-to-exploit measurements for pose estimation.
Point-Like Fiducials
Fiducials have been used for many years by close-range photogrammetrists. They can be designed in such a way that they can be easily detected and identified with an ad hoc method. Their image locations can also be measured to a much higher accuracy than natural features. In particular, circular fiducials work best, because the appearance of circular patterns is relatively invariant to perspective distortion, and because their centroid provides a stable 2D position that can easily be determined with sub-pixel accuracy. The 3D positions of the fiducials in the world coordinate system are assumed to be precisely known. This can be achieved by hand, with a laser, or with a structure-from-motion algorithm. To facilitate their identification, the fiducials can be arranged in a distinctive geometric pattern. Once the fiducials are identified in the image, they provide a set of correspondences that can be used to retrieve the camera pose.
For high-end applications, companies such as Geodetic Services, Inc., Advanced Real-time Tracking GmbH, Metronor, ViconPeak, and AICON 3D Systems GmbH propose commercial products based on this approach. Lower-cost and lower-accuracy solutions have also been proposed by the computer vision community. For example, the concentric contrasting circle (CCC) fiducial (Hoff, Nguyen & Lyon, 1996) is formed by placing a black ring on a white background, or vice-versa. To detect these fiducials, the image is first thresholded, morphological operations are then applied to eliminate regions that are too small, and a connected component labeling operation is performed to find white and black regions, as well as their centroids. Along the same lines, State, Hirota, David, Garett, and Livingston (1996) use color-coded fiducials for more reliable identification. Each fiducial consists of an inner dot and a surrounding outer ring; four different colors are used, and thus 12 unique fiducials can be created and identified based on their two colors. Because the tracking range is constrained by the detectability of fiducials in input images, Cho, Lee, and Neumann (1998) introduce a system that uses several sizes of fiducials. They are composed of several colored concentric rings, where large fiducials have more rings than small ones, and the diameters of the rings are proportional to their distance to the fiducial center to facilitate their identification. When the camera is close to fiducials, only the small fiducials are detected. When it is far from them, only the large fiducials are detected.
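The detection pipeline just described maps naturally onto standard image-processing primitives. The following is a minimal sketch of a CCC-style detector using OpenCV; the function name, thresholds, and kernel size are assumptions chosen for illustration, not the original implementation of Hoff et al.

```python
# Minimal sketch of a CCC-style fiducial detector (illustrative only):
# threshold, clean up with morphology, then find candidate centroids
# via connected-component labeling. All parameter values are assumed.
import cv2
import numpy as np

def detect_ccc_centroids(gray, min_area=50):
    # Binarize; adaptive thresholding copes with uneven lighting.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 31, 5)
    # Morphological opening removes regions that are too small.
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    # Connected-component labeling yields candidate regions and centroids.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    candidates = [tuple(centroids[i]) for i in range(1, n)
                  if stats[i, cv2.CC_STAT_AREA] >= min_area]
    # A full CCC detector would now pair white and black regions: a ring
    # and its center dot are concentric, so their centroids coincide, and
    # only such coincident pairs are accepted as fiducials.
    return candidates
```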
While all of the previous methods for fiducial detection use ad hoc schemes, Claus and Fitzgibbon (2004) use a machine learning approach, which delivers significant improvements in reliability. The fiducials are made of black disks on a white background, and sample fiducial images are collected under varying perspective, scale, and lighting conditions, as well as negative training images. A cascade of classifiers is then trained on these data. The first step is a fast Bayes decision rule classification; the second is a powerful but slower nearest neighbor classifier applied to the subset passed by the first stage. At run-time, all the possible sub-windows in the image are classified using this cascade. This results in a remarkably reliable fiducial detection method.
Extended Fiducials
The ducials previously presented were all circular and only their center was used. By con-
trast, Koller et al. (1997) introduce squared, black on white ducials, which contain small
red squares for their identication. The corners are found by tting straight line segments
to the maximum gradient points on the border of the ducial. Each of the four corners of
such ducials provides one correspondence and the pose is estimated using an Extended
Kalman lter.
Planar rectangular fiducials are also used in Kato and Billinghurst (1999), Kato, Poupyrev, Imamoto, and Tachibana (2000), and Rekimoto (1998), and it is shown that a single fiducial is enough to estimate the pose. Figure 1 depicts their approach. It has become popular because it yields a robust, low-cost solution for real-time 3D tracking, and a software library called ARToolKit is publicly available (ARToolKit).
The whole process, the detection of the fiducials and the pose estimation, runs in real-time and can therefore be applied in every frame. The 3D tracking system does not require any initialization by hand and is robust to fiducial occlusion. In practice, under good lighting conditions, the recovered pose is also accurate enough for augmented reality applications. These characteristics make ARToolKit a good solution for 3D tracking whenever the engineering of the scene is possible.
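ARToolKit itself is a C library; as a hedged sketch of the same detect-then-estimate loop, the fragment below uses the ArUco square-marker module of OpenCV (4.7 or later) as a stand-in system, not ARToolKit's own API. The intrinsics, distortion coefficients, and marker size are assumed values.

```python
# Sketch of an ARToolKit-style per-frame loop using OpenCV's ArUco
# module as a stand-in square-marker system (not ARToolKit itself).
import cv2
import numpy as np

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])  # assumed intrinsics
dist = np.zeros(5)             # assumed zero lens distortion
marker_len = 0.08              # assumed marker side length in meters

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary)

def pose_from_frame(frame):
    corners, ids, _ = detector.detectMarkers(frame)
    if ids is None:
        return None  # no marker found; per-frame detection simply retries
    # One square marker gives four coplanar 3D-2D correspondences,
    # enough to solve for the full 6-DoF camera pose.
    obj = np.array([[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]],
                   np.float32) * (marker_len / 2)
    ok, rvec, tvec = cv2.solvePnP(obj, corners[0].reshape(4, 2), K, dist,
                                  flags=cv2.SOLVEPNP_IPPE_SQUARE)
    return (rvec, tvec) if ok else None
```

Because the loop is re-run from scratch on every frame, losing the marker for a few frames costs nothing; tracking resumes as soon as the marker is detected again, which is the property that made this style of system so practical.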
Figure 1. Processing flow of ARToolKit: the marker is detected in the thresholded image and then used to estimate the camera pose; the stages run from input image, through thresholding and marker detection, to pose and position estimation and virtual image overlay (Reproduced from Kato et al., 2000, © 2000 IEEE, used with permission)

Using Natural Features

Using markers to simplify the 3D tracking task requires engineering of the environment, which end-users of tracking technology do not like, or which is sometimes even impossible, for example, in outdoor environments. Whenever possible, it is therefore much better to be able to rely on features naturally present in the images. Of course, this approach makes tracking much more challenging, and some 3D knowledge is often required to make things easier. For MR applications, this is not an issue since 3D scene models are typically available, and we therefore focus here on model-based approaches.
Here we distinguish two families of approaches depending on the nature of the image features being used. The first is formed by edge-based methods that match the projections of the target object's 3D edges to areas of high image gradient. The second family includes all the techniques that rely on information provided by pixels inside the object's projection.
Edge-Based Methods
Historically, the early approaches to tracking were all edge-based, mostly because these methods are both computationally efficient and relatively easy to implement. They are also naturally stable to lighting changes, even for specular materials, which is not necessarily true of methods that consider the internal pixels, as will be discussed later. The most popular approach is to look for strong gradients in the image around a first estimation of the object pose, without explicitly extracting the contours (Armstrong & Zisserman, 1995; Comport, Marchand, & Chaumette, 2003; Drummond & Cipolla, 2002; Harris, 1992; Marchand, Bouthemy, & Chaumette, 2001; Vacchetti, Lepetit, & Fua, 2004a), which is fast and general.
RAPiD
Even though RAPiD (Harris, 1992) was one of the first 3D trackers to successfully run in real-time, and many improvements have been proposed since, many of its basic components have been retained in more recent systems. The key idea is to consider a set of 3D points on the object, called control points, which lie on high contrast edges in the images. As shown in Figure 2, the control points can be sampled along the 3D model edges and in the areas
of rapid albedo change. They can also be generated on the fly as points on the occluding contours of the object. The 3D motion of the object between two consecutive frames can be recovered from the 2D displacement of the control points.

Figure 2. In RAPiD-like approaches, control points are sampled along the model edges; the small white segments in the left image join the control points in the previous image to their found positions in the new image. The pose can be inferred from these matches, even in the presence of occlusions, by introducing robust estimators (Reproduced from Drummond & Cipolla, 2002, © 2002 IEEE, used with permission)
Once initialized, the system performs a simple loop. For each frame, the predicted pose,
which can simply be the pose estimated for the previous frame, is used to predict which
control points will be visible and what their new locations should be. The control points are
matched to the image contours, and the new pose is estimated from these correspondences via least-squares minimization.
In Harris (1992), some enhancements to this basic approach are proposed. When the edge response at a control point becomes too weak, it is not taken into account in the motion computation, as it may subsequently latch on incorrectly to a stronger nearby edge. As we will see next, this can also be handled using a robust estimator. An additional clue that can be used to reject incorrect edges is their polarity, that is, whether they correspond to a transition from dark to light or from light to dark. A way to use the occluding contours of the object is also given.
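The per-frame loop just described can be summarized schematically as follows. This is a pseudocode-style sketch: project, edge_normal, strongest_gradient_along, and solve_pose_least_squares are hypothetical placeholders standing in for projection, the 1D edge search, and the linear solve, not functions from Harris's implementation.

```python
# Schematic RAPiD-style tracking loop (all helpers are hypothetical
# placeholders; the structure, not the code, is the point).
def rapid_track(frames, model, pose):
    for image in frames:
        # Predict from the previous pose which control points are
        # visible and where they should project.
        visible = [p for p in model.control_points if p.visible_from(pose)]
        residuals = []
        for p in visible:
            u = project(p.xyz, pose)              # predicted 2D location
            n = edge_normal(p, pose)              # search along the edge normal
            u_meas = strongest_gradient_along(image, u, n, search_range=10)
            if u_meas is not None:                # weak edge responses are skipped
                residuals.append((p, u, u_meas))
        # The 1D displacements along the normals are linear in a small
        # pose update, so a least-squares solve recovers the 6-DoF motion.
        pose = solve_pose_least_squares(residuals, pose)
        yield pose
```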

Making RAPiD Robust
The main drawback of the original RAPiD formulation is its lack of robustness. The weak-contour heuristic is not enough to prevent incorrectly detected edges from disturbing the pose computation. In practice, such errors are frequent. They arise from occlusions, shadows, texture on the object itself, or background clutter.
Several methods have been proposed to make the RAPiD computation more robust. Drummond and Cipolla (2002) use a robust estimator and replace the least-squares estimation by an iteratively re-weighted least-squares solve. Similarly, Marchand et al. (2001) use a framework similar to RAPiD to estimate a 2D affine transformation between consecutive frames, but also replace standard least-squares by robust estimation.
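The substitution of iteratively re-weighted least squares for plain least squares can be sketched as follows. The Tukey weight function and the MAD-based scale estimate are one common choice rather than the specific one used in the papers above, and jacobian_and_residuals is an assumed placeholder that linearizes the control-point displacements around the current pose.

```python
# Minimal IRLS sketch: plain least squares is replaced by repeated
# weighted solves, with weights taken from a robust (here Tukey) estimator.
import numpy as np

def tukey_weights(r, c=4.685):
    w = np.zeros_like(r)
    inlier = np.abs(r) < c
    w[inlier] = (1 - (r[inlier] / c) ** 2) ** 2   # gross outliers get weight 0
    return w

def irls(jacobian_and_residuals, pose, iters=10):
    for _ in range(iters):
        J, r = jacobian_and_residuals(pose)       # assumed user-supplied linearization
        scale = 1.4826 * np.median(np.abs(r)) + 1e-9   # robust sigma via MAD
        w = tukey_weights(r / scale)
        # Weighted normal equations: (J^T W J) dx = -J^T W r
        JW = J * w[:, None]
        dx = np.linalg.solve(JW.T @ J, -JW.T @ r)
        pose = pose + dx       # additive update for the sketch; a real tracker
                               # would compose the increment on SE(3)
    return pose
```

Inlier measurements keep weights near one, while a control point that latched onto the wrong edge produces a large residual, receives weight zero, and drops out of the solve.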
In the approaches previously described, the control points were treated individually, without taking into account that several control points are often placed on the same edge, and hence that their measurements are correlated. By contrast, in Armstrong and Zisserman (1995) and Simon and Berger (1998), control points lying on the same object edge are grouped into primitives, and a whole primitive can be rejected from the pose estimation. In Armstrong and Zisserman (1995), a RANSAC methodology (Fischler & Bolles, 1981) is used to detect outliers among the control points forming a primitive. If the number of remaining control points falls below a threshold after elimination of the outliers, the primitive is ignored in the pose update. Using RANSAC implies that the primitives have an analytic expression, and precludes tracking free-form curves. By contrast, Simon and Berger (1998) use a robust estimator to compute a local residual for each primitive. The pose estimator then takes into account all the primitives using a robust estimation on the above residuals.
When the tracker finds multiple edges within its search range, it may end up choosing the wrong one. To overcome this problem, in Drummond and Cipolla (2002) the influence of a control point is made inversely proportional to the number of edge strength maxima visible within the search path. Vacchetti et al. (2004a) introduce another robust estimator to handle multiple hypotheses and retain all the maxima as possible correspondents in the pose estimation.

Texture-Based Methods
If the object is sufficiently textured, information can be derived from optical flow (Basu, Essa, & Pentland, 1996; DeCarlo & Metaxas, 2000; Li, Roivainen, & Forchheimer, 1993), template matching (Cascia, Sclaroff, & Athitsos, 2000; Hager & Belhumeur, 1998; Jurie & Dhome, 2001, 2002), or interest-point correspondences. The latter are probably the most effective for MR applications because they rely on matching local features. Given such correspondences, the pose can be estimated by least-squares minimization, or even better, by robust estimation. These methods are therefore relatively insensitive to partial occlusions or matching errors. Illumination invariance is also simple to achieve. And, unlike edge-based methods, they do not get confused by background clutter and exploit more of the image information, which tends to make them more dependable.
Interest Point Detection and 2D Matching
In interest point methods, instead of matching all pixels in an image, only some pixels are first selected with an "interest operator" before matching. This reduces the computation time while increasing the reliability, if the pixels are correctly chosen. Förstner (1986) presents the desired properties for such an interest operator. Selected points should be different from their neighbors, which eliminates edge points; the selection should be repeatable, that is, the same points should be selected in several images of the same scene, despite perspective distortion or image noise. In particular, the precision and the reliability of the matching directly depend on the invariance of the selected position. Pixels on repetitive patterns should also be rejected, or at least given less importance, to avoid confusion during matching.
Such an operator was already used in the 1970s for tracking purposes (Moravec, 1977, 1981). Numerous other methods have been proposed since, and Deriche and Giraudon (1993) and Smith and Brady (1995) give good surveys of them. Most of them involve second order derivatives, and their results can be strongly affected by noise. Several successful interest point detectors (Förstner, 1986; Harris & Stephens, 1988; Shi & Tomasi, 1994) rely on the auto-correlation matrix computed at each pixel location. It is a 2×2 matrix whose coefficients are sums, over a window, of the first derivatives of the image intensity with respect to the pixel coordinates, and it measures the local variations of the image. As discussed in Förstner (1986), the pixels can be classified from the behavior of the eigenvalues of the auto-correlation matrix. Pixels with two large, approximately equal eigenvalues are good candidates for selection. Shi and Tomasi (1994) show that locations with two large eigenvalues can be reliably tracked, especially under affine deformations, and consider locations where the smaller eigenvalue is higher than a threshold. Interest points can then be taken to be the locations that are local maxima of the chosen measure above a predefined threshold. The derivatives involved in the auto-correlation matrix can be weighted using a Gaussian kernel to increase robustness to noise (Schmid & Mohr, 1997). The derivatives should also be computed using a first order Gaussian kernel. This comes at a price, since it tends to degrade both the localization accuracy and the performance of the image patch correlation procedure used for matching purposes.
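The auto-correlation test lends itself to a compact sketch. The code below computes the Gaussian-weighted structure tensor per pixel, A = sum over a window of [[Ix², IxIy], [IxIy, Iy²]], and the Shi-Tomasi score (the smaller eigenvalue) in closed form; the derivative filter, window scale, and thresholding strategy are assumptions.

```python
# Sketch of the auto-correlation-matrix ("structure tensor") interest
# measure: per pixel, the smaller eigenvalue of the 2x2 matrix of
# Gaussian-weighted derivative products, as in Shi and Tomasi (1994).
import cv2
import numpy as np

def shi_tomasi_response(gray, sigma=1.5):
    g = gray.astype(np.float32)
    # First derivatives (Sobel as a cheap stand-in for Gaussian derivatives).
    Ix = cv2.Sobel(g, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(g, cv2.CV_32F, 0, 1, ksize=3)
    # Gaussian-weighted sums of the derivative products over the window.
    Ixx = cv2.GaussianBlur(Ix * Ix, (0, 0), sigma)
    Iyy = cv2.GaussianBlur(Iy * Iy, (0, 0), sigma)
    Ixy = cv2.GaussianBlur(Ix * Iy, (0, 0), sigma)
    # Closed-form smaller eigenvalue of [[Ixx, Ixy], [Ixy, Iyy]].
    trace_half = (Ixx + Iyy) / 2
    root = np.sqrt(((Ixx - Iyy) / 2) ** 2 + Ixy ** 2)
    return trace_half - root  # threshold + non-max suppression selects points
```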
For tracking purposes, it is then useful to match two sets of interest points extracted from two images taken from similar viewpoints. A classical procedure (Zhang, Deriche, Faugeras, & Luong, 1995) runs as follows: for each point in the first image, search in a region of the second image around its location for a corresponding point. The search is based on the similarity of the local image windows centered on the points, which strongly characterize the points when the images are sufficiently close. The similarity can be measured using the zero-normalized cross-correlation, which is invariant to affine changes of the local image intensities and makes the procedure robust to illumination changes. To obtain a more reliable set of matches, one can reverse the roles of the two images and repeat the procedure. Only the correspondences between points that chose each other are kept.
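A minimal sketch of this procedure, assuming grayscale NumPy images and interest points given as (x, y) tuples away from the image borders, might look as follows; the window half-size and search radius are illustrative values.

```python
# Sketch of ZNCC matching with the mutual-consistency check described
# above (window/search sizes assumed; points assumed away from borders).
import numpy as np

def zncc(a, b):
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-9
    return float((a * b).sum()) / denom   # invariant to affine intensity changes

def best_match(p, img1, img2, pts2, half=7, radius=20):
    x, y = p
    w1 = img1[y - half:y + half + 1, x - half:x + half + 1]
    scores = [(zncc(w1, img2[v - half:v + half + 1, u - half:u + half + 1]), (u, v))
              for (u, v) in pts2
              if abs(u - x) < radius and abs(v - y) < radius]
    return max(scores)[1] if scores else None

def mutual_matches(pts1, pts2, img1, img2):
    # Keep only pairs of points that choose each other in both directions.
    fwd = {p: best_match(p, img1, img2, pts2) for p in pts1}
    bwd = {q: best_match(q, img2, img1, pts1) for q in pts2}
    return [(p, q) for p, q in fwd.items() if q is not None and bwd.get(q) == p]
```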
Eliminating Drift
In the absence of points whose coordinates are known a priori, all methods are subject to error accumulation, which eventually results in tracking failure and precludes the handling of truly long sequences.
A solution to this problem is to introduce one or more keyframes, such as the one in the upper left corner of Figure 3, that is, images of the target object or scene for which the camera has been registered beforehand. At runtime, incoming images can be matched against the keyframes to provide a position estimate that is drift-free (Genc, Riedel, Souvannavong, & Navab, 2002; Ravela, Draper, Lim, & Weiss, 1995; Tordoff, Mayol, de Campos, & Murray, 2002). This, however, is more difficult than matching against immediately preceding frames, as the difference in viewpoint is likely to be much larger. The algorithm used to establish point correspondences must therefore be both fast and relatively insensitive to large perspective distortions, which is not usually the case for those used by algorithms that need only handle small distortions between consecutive frames.
Figure 3. Face tracking using interest points, with one reference image shown (top left) (Reproduced from Vacchetti et al., 2004b, © 2004 IEEE, used with permission)
In Vacchetti, Lepetit, and Fua (2004b), this is handled as follows. During a training stage, the system extracts interest points from each keyframe, back-projects them to the object surface to compute their 3D positions, and stores image patches centered around their locations. During tracking, for each new incoming image, the system picks the keyframe whose viewpoint is closest to the last known viewpoint. It synthesizes an intermediate image from that keyframe by warping the stored image patches to the last known viewpoint, which is typically the one corresponding to the previous image. The intermediate and the incoming images are now close enough that matching can be performed using simple, conventional, and fast correlation methods. Since the 3D positions in the keyframe have been precomputed, the pose can then be estimated by robustly minimizing the reprojection error. This approach handles perspective distortion, complex aspect changes, and self-occlusion. Furthermore, it is very efficient because it takes advantage of the large graphics capabilities of modern CPUs and GPUs.
However, as noticed by several authors (Chia, Cheok, & Prince, 2002; Ravela et al., 1995; Tordoff et al., 2002; Vacchetti et al., 2004b), matching only against keyframes does not, by itself, yield directly exploitable results. This has two main causes. First, wide-baseline matching as described in the previous paragraph is inherently less accurate than the short-baseline matching involved in frame-to-frame tracking, which is compounded by the fact that the number of correspondences that can be established is usually smaller. Second, if the pose is computed for each frame independently, no temporal consistency is enforced and the recovered motion can appear to be jerky. If it were used as-is by an MR application, the virtual objects inserted into the scene would appear to jitter, or to tremble, as opposed to remaining solidly attached to the scene.
Temporal consistency can be enforced by some dynamical smoothing using a motion model. Another way, proposed in Vacchetti et al. (2004b), is to combine the information provided by the keyframes, which provides robustness, with that coming from preceding frames, which enforces temporal consistency. This does not make assumptions about the camera motion and improves the accuracy of the recovered pose. It is still compatible with the use of dynamical smoothing, which can be useful in cases where the pose estimation remains unstable, for example when the object is essentially fronto-parallel.
Tracking by Detection
The recursive nature of traditional 3D tracking approaches provides a strong prior on the pose for each new frame and makes image feature identification relatively easy. However, it comes at a price. First, the system must either be initialized by hand or require the camera to be very close to a specified position. Second, it makes the system very fragile. If something goes wrong between two consecutive frames, for example due to a complete occlusion of the target object or a very fast motion, the system is lost and must be re-initialized in the same fashion. In practice, such weaknesses make purely recursive systems nearly unusable, and the popularity of ARToolKit (Kato et al., 2000) in the augmented reality community should come as no surprise. It is the first vision-based system to really overcome these limitations by being able to detect the markers in every frame without constraints on the camera pose.
However, achieving the same level of performance without having to engineer the environment remains a desirable goal. Since object pose and appearance are highly correlated, estimating both simultaneously increases the performance of object detection algorithms. Therefore, 3D pose estimation from natural features without a priori knowledge of the position and object detection are closely related problems. Detection has a long history in computer vision. It has often relied on 2D detection, even for 3D objects (Nayar, Nene, & Murase, 1996; Viola & Jones, 2001). However, there has been sustained interest in simultaneous object detection and 3D pose estimation. Early approaches were edge-based (Lowe, 1991; Jurie, 1998), but methods based on feature point matching have become popular since local invariants were shown to work better for that purpose (Schmid & Mohr, 1997).
Feature point-based approaches tend to be the most robust to scale, viewpoint, and illumination changes, as well as to partial occlusions. They typically operate on the following principle. During an offline training stage, one builds a database of interest points lying on the object and whose positions on the object surface can be computed. A few images in which the object has been manually registered are often used for this purpose. At runtime, feature points are first extracted from individual images and matched against the database. The object pose can then be estimated from such correspondences. RANSAC-like algorithms (Fischler & Bolles, 1981) or the Hough transform are very convenient for this task, since they eliminate spurious correspondences while avoiding combinatorial issues.
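The runtime half of this principle can be sketched with modern off-the-shelf components; ORB features and OpenCV's RANSAC-based PnP solver are stand-ins chosen for illustration (neither is used in the chapter), and the offline-built database of descriptors with associated 3D object points is assumed given.

```python
# Sketch of per-frame tracking-by-detection: match features against a
# database of points with known 3D object coordinates, then estimate the
# pose with RANSAC-PnP, which discards spurious correspondences.
import cv2
import numpy as np

orb = cv2.ORB_create(1000)                     # illustrative modern detector
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def detect_pose(frame, db_descriptors, db_points3d, K, dist):
    kps, descs = orb.detectAndCompute(frame, None)
    if descs is None:
        return None
    matches = matcher.match(descs, db_descriptors)
    if len(matches) < 6:
        return None                            # too few matches to trust a pose
    pts2d = np.float32([kps[m.queryIdx].pt for m in matches])
    pts3d = np.float32([db_points3d[m.trainIdx] for m in matches])
    # RANSAC rejects wrong matches while solving for the 6-DoF pose.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, dist, reprojectionError=4.0)
    return (rvec, tvec) if ok else None
```

Because no prior pose is needed, the same routine serves for initialization, for every subsequent frame, and for recovery after a complete occlusion.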
The difculty in implementing such approaches comes from the fact that the database images
and the input ones may have been acquired from very different viewpoints. As discussed
in this chapter, unless the motion is very quick, this problem does not arise in conventional
recursive tracking approaches because the images are close to each other. However, for
tracking-by-detection purposes, the so-called wide baseline matching problem becomes a
critical issue that must be addressed.
In the remainder of this section, we discuss in more detail the extraction and matching of
feature points in this context. We conclude by discussing the relative merits of tracking-by-
detection and recursive tracking.
Feature Point Extraction
To handle as wide a range of viewing conditions as possible, feature point extraction should be insensitive to scale, viewpoint, and illumination changes. Note that the stability of the extracted features is much more crucial here than for the techniques described earlier in this chapter, where only close frames were matched. Different techniques are therefore required, and we discuss them next.
As proposed in Lindeberg (1994), scale-invariant extraction can be achieved by taking feature points to be local extrema of a Laplacian-of-Gaussian pyramid in scale-space. To increase computational efficiency, the Laplacian can be approximated by a Difference-of-Gaussians (Lowe, 1999). Research has then focused on affine invariant region detection to handle larger perspective changes. Baumberg (2000), Schaffalitzky and Zisserman (2002), and Mikolajczyk and Schmid (2002) used an affine invariant point detector based on the Harris detector, where the affine transformation that makes equal the two eigenvalues of