Dynamic Vision for Perception
and Control of Motion
Ernst D. Dickmanns
Ernst D. Dickmanns, Dr Ing.
Institut für Systemdynamik und Flugmechanik
Fakultät für Luft- und Raumfahrttechnik
Universität der Bundeswehr München
Werner-Heisenberg-Weg 39
85579 Neubiberg
Germany
British Library Cataloguing in Publication Data
Dickmanns, Ernst Dieter
Dynamic vision for perception and control of motion
1. Computer vision - Industrial applications 2. Optical detectors
3. Motor vehicles - Automatic control 4. Adaptive control systems
I. Title
629’.046
ISBN-13: 9781846286377
Library of Congress Control Number: 2007922344
ISBN 978-1-84628-637-7 e-ISBN 978-1-84628-638-4 Printed on acid-free paper
© Springer-Verlag London Limited 2007
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.
The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.
9 8 7 6 5 4 3 2 1
Springer Science+Business Media
springer.com
Preface
During and after World War II, the principle of feedback control became well understood in biological systems and was applied in many technical disciplines to relieve humans from boring workloads in systems control. N. Wiener considered it universally applicable as a basis for building intelligent systems and called the new discipline “Cybernetics” (the science of systems control) [Wiener 1948]. Following many early successes, these arguments were soon oversold by enthusiastic followers; many people then realized that high-level decision-making could hardly be achieved on this basis alone. As a consequence, with the advent of sufficient digital computing power, computer scientists turned to quasi-steady descriptions of abstract knowledge and created the field of “Artificial Intelligence” (AI) [McCarthy 1955; Selfridge 1959; Miller et al. 1960; Newell, Simon 1963; Fikes, Nilsson 1971]. With respect to what was promised and what could actually be realized, a similar situation developed in the last quarter of the 20th century.
In the context of AI, the problem of computer vision was also tackled (see, e.g., [Selfridge, Neisser 1960; Rosenfeld, Kak 1976; Marr 1982]). The main paradigm initially was to recover 3-D object shape and orientation from single images (snapshots) or from a few viewpoints. In aerial or satellite remote sensing, another application of image evaluation, the task was, on the contrary, to classify areas on the ground and to detect special objects. For these purposes, snapshot images taken under carefully controlled conditions sufficed. “Computer vision” was a proper name for these activities, since humans took care of accommodating all side constraints to be observed by the vehicle carrying the cameras.
When technical vision was first applied to vehicle guidance [Nilsson 1969], separate viewing and motion phases with static image evaluation (lasting for minutes on remote stationary computers in the laboratory) were adopted initially. Even stereo effects with a single camera moving laterally on the vehicle between two shots from the same vehicle position were investigated [Moravec 1983]. In the early 1980s, digital microprocessors became sufficiently small and powerful that onboard image evaluation in near real time became possible. DARPA started its program on “Strategic Computing”, in which vision architectures and image sequence interpretation for ground vehicle guidance were to be developed (“Autonomous Land Vehicle”, ALV) [Roland, Shiman 2002]. These activities were also subsumed under the title “computer vision”, and this term became generally accepted for a broad spectrum of applications. This makes sense as long as dynamic aspects do not play an important role in sensor signal interpretation.
For autonomous vehicles moving under unconstrained natural conditions at higher speeds on nonflat ground or in turbulent air, it is no longer the computer which “sees” on its own. The entire body motion due to control actuation and to perturbations from the environment has to be analyzed based on information coming from many different types of sensors. Fast reactions to perturbations have to be derived from inertial measurements of accelerations and the onset of rotational rates, since vision has a rather long delay time (a few tenths of a second) until the enormous amounts of data in the image stream have been digested and interpreted sufficiently well. This is a well-proven concept in biological systems operating under similar conditions, such as the vestibular apparatus of vertebrates with its many cross-connections to ocular control.

This object-oriented sensor fusion task quite naturally introduces the notion of an extended presence, since data from different times (and from different sensors) have to be interpreted in conjunction, taking additional delay times for control application into account. Under these conditions, it no longer makes sense to talk about “computer vision”. It is the overall vehicle, with an integrated sensor and control system, that achieves a new level of performance and becomes able “to see”, also during dynamic maneuvering. The computer is the hardware substrate used for data and knowledge processing.
In this book, an introduction is given to an integrated approach to dynamic visual perception in which all these aspects are taken into account right from the beginning. It is based on two decades of experience of the author and his team at UniBw Munich with several autonomous vehicles on the ground (both indoors and especially outdoors) and in the air. The book deviates from usual texts on computer vision in that it integrates methods from control engineering/systems dynamics and artificial intelligence. Outstanding real-world performance has been demonstrated over two decades; some samples may be found on the accompanying DVD. Publications on the methods developed have been distributed over many contributions to conferences and journals as well as Ph.D. dissertations (marked “Diss.” in the references). This book is the first survey touching all aspects in sufficient detail for understanding the reasons for the successes achieved with real-world systems.
With gratitude, I acknowledge the contributions of the Ph.D. students S. Baten, R. Behringer, C. Brüdigam, S. Fürst, R. Gregor, C. Hock, U. Hofmann, W. Kinzel, M. Lützeler, M. Maurer, H.-G. Meissner, N. Mueller, B. Mysliwetz, M. Pellkofer, A. Rieder, J. Schick, K.-H. Siedersberger, J. Schiehlen, M. Schmid, F. Thomanek, V. von Holt, S. Werner, H.-J. Wünsche, and A. Zapp, as well as those of my colleague V. Graefe and his Ph.D. students. When there were no fitting multi-microprocessor systems on the market in the 1980s, they realized the window-oriented concept developed for dynamic vision, and together we have been able to compete with “Strategic Computing”. I thank my son Dirk for generalizing and porting the solution for efficient edge feature extraction in “Occam” to “Transputers” in the 1990s, and for his essential contributions to the general framework of the third-generation system EMS vision. The general support of our work in control theory and application by K.-D. Otto over three decades is appreciated, as is the infrastructure provided at the institute ISF by Madeleine Gabler.
Ernst D. Dickmanns
Acknowledgments
Support of the underlying research through funding by the Deutsche Forschungs-Gemeinschaft (DFG), the German Federal Ministry of Research and Technology (BMFT), the German Federal Ministry of Defense (BMVg), the Research branch of the European Union, and the industrial firms Daimler-Benz AG (now DaimlerChrysler), Dornier GmbH (now EADS Friedrichshafen), and VDO (Frankfurt, now part of Siemens Automotive) is appreciated. Through the German Federal Ministry of Defense, of which UniBw Munich is a part, cooperation in the European and the trans-Atlantic framework has been supported; the project “AutoNav”, as part of an American-German Memorandum of Understanding, has contributed to developing “expectation-based, multifocal, saccadic” (EMS) vision through fruitful exchanges of methods and hardware with the National Institute of Standards and Technology (NIST), Gaithersburg, and with Sarnoff Research of SRI, Princeton.
The experimental platforms have been developed and maintained over several generations of electronic hardware by Ingenieurbüro Zinkl (VaMoRs), Daimler-Benz AG (VaMP), and the staff of our electromechanical shop, especially J. Hollmayer, E. Oestereicher, and T. Hildebrandt. The first-generation vision systems were provided by the Institut für Messtechnik of UniBwM/LRT. Smooth operation of the general PC infrastructure is owed to H. Lex of the Institut für Systemdynamik und Flugmechanik (UniBwM/LRT/ISF).
Contents
1 Introduction 1
1.1 Different Types of Vision Tasks and Systems 1
1.2 Why Perception and Action? 3
1.3 Why Perception and Not Just Vision? 4
1.4 What Are Appropriate Interpretation Spaces? 5
1.4.1 Differential Models for Perception ‘Here and Now’ 8
1.4.2 Local Integrals as Central Elements for Perception 9
1.4.3 Global Integrals for Situation Assessment 11
1.5 What Type of Vision System Is Most Adequate? 11
1.6 Influence of the Material Substrate on System Design:
Technical vs. Biological Systems 14
1.7 What Is Intelligence? A Practical (Ecological) Definition 15
1.8 Structuring of Material Covered 18
2 Basic Relations: Image Sequences – “the World” 21
2.1 Three-dimensional (3-D) Space and Time 23
2.1.1 Homogeneous Coordinate Transformations in 3-D Space 25
2.1.2 Jacobian Matrices for Concatenations of HCMs 35
2.1.3 Time Representation 39
2.1.4 Multiple Scales 41
2.2 Objects 43
2.2.1 Generic 4-D Object Classes 44
2.2.2 Stationary Objects, Buildings 44
2.2.3 Mobile Objects in General 44
2.2.4 Shape and Feature Description 45

2.2.5 Representation of Motion 49
2.3 Points of Discontinuity in Time 53
2.3.1 Smooth Evolution of a Trajectory 53
2.3.2 Sudden Changes and Discontinuities 54
2.4 Spatiotemporal Embedding and First-order Approximations 54
2.4.1 Gain by Multiple Images in Space and/or Time for
Model Fitting 56
2.4.2 Role of Jacobian Matrix in the 4-D Approach to Vision 57
3 Subjects and Subject Classes 59
3.1 General Introduction: Perception – Action Cycles 60
3.2 A Framework for Capabilities 60
3.3 Perceptual Capabilities 63
3.3.1 Sensors for Ground Vehicle Guidance 64
3.3.2 Vision for Ground Vehicles 65
3.3.3 Knowledge Base for Perception Including Vision 72
3.4 Behavioral Capabilities for Locomotion 72
3.4.1 The General Model: Control Degrees of Freedom 73
3.4.2 Control Variables for Ground Vehicles 75
3.4.3 Basic Modes of Control Defining Skills 84
3.4.4 Dual Representation Scheme 88
3.4.5 Dynamic Effects in Road Vehicle Guidance 90
3.4.6 Phases of Smooth Evolution and Sudden Changes 104
3.5 Situation Assessment and Decision-Making 107
3.6 Growth Potential of the Concept, Outlook 107
3.6.1 Simple Model of Human Body as a Traffic Participant 108
3.6.2 Ground Animals and Birds 110
4 Application Domains, Missions, and Situations 111
4.1 Structuring of Application Domains 111
4.2 Goals and Their Relations to Capabilities 117

4.3 Situations as Precise Decision Scenarios 118
4.3.1 Environmental Background 118
4.3.2 Objects/Subjects of Relevance 119
4.3.3 Rule Systems for Decision-Making 120
4.4 List of Mission Elements 121
5 Extraction of Visual Features 123
5.1 Visual Features 125
5.1.1 Introduction to Feature Extraction 126
5.1.2 Fields of View, Multifocal Vision, and Scales 128
5.2 Efficient Extraction of Oriented Edge Features 131
5.2.1 Generic Types of Edge Extraction Templates 132
5.2.2 Search Paths and Subpixel Accuracy 137
5.2.3 Edge Candidate Selection 140
5.2.4 Template Scaling as a Function of the Overall Gestalt 141
5.3 The Unified Blob-edge-corner Method (UBM) 144
5.3.1 Segmentation of Stripes Through Corners, Edges, and Blobs 144
5.3.2 Fitting an Intensity Plane in a Mask Region 151
5.3.3 The Corner Detection Algorithm 167
5.3.4 Examples of Road Scenes 171
5.4 Statistics of Photometric Properties of Images 174
5.4.1 Intensity Corrections for Image Pairs 176
5.4.2 Finding Corresponding Features 177
5.4.3 Grouping of Edge Features to Extended Edges 178
5.5 Visual Features Characteristic of General Outdoor Situations 181
6 Recursive State Estimation 183
6.1 Introduction to the 4-D Approach for Spatiotemporal Perception 184
6.2 Basic Assumptions Underlying the 4-D Approach 187
6.3 Structural Survey of the 4-D Approach 190

6.4 Recursive Estimation Techniques for Dynamic Vision 191
6.4.1 Introduction to Recursive Estimation 191
6.4.2 General Procedure 192
6.4.3 The Stabilized Kalman Filter 196
6.4.4 Remarks on Kalman Filtering 196
6.4.5 Kalman Filter with Sequential Innovation 198
6.4.6 Square Root Filters 199
6.4.7 Conclusion of Recursive Estimation for Dynamic Vision 202
7 Beginnings of Spatiotemporal Road
and Ego-state Recognition 205
7.1 Road Model 206
7.2 Simple Lateral Motion Model for Road Vehicles 208
7.3 Mapping of Planar Road Boundary into an Image 209
7.3.1 Simple Beginnings in the Early 1980s 209
7.3.2 Overall Early Model for Spatiotemporal Road Perception 213
7.3.3 Some Experimental Results 214
7.3.4 A Look at Vertical Mapping Conditions 217
7.4 Multiple Edge Measurements for Road Recognition 218
7.4.1 Spreading the Discontinuity of the Clothoid Model 219
7.4.2 Window Placing and Edge Mapping 222
7.4.3 Resulting Measurement Model 224
7.4.4 Experimental Results 225
8 Initialization in Dynamic Scene Understanding 227
8.1 Introduction to Visual Integration for Road Recognition 227
8.2 Road Recognition and Hypothesis Generation 228
8.2.1 Starting from Zero Curvature for Near Range 229
8.2.2 Road Curvature from Look-ahead Regions Further Away 230
8.2.3 Simple Numerical Example of Initialization 231
8.3 Selection of Tuning Parameters for Recursive Estimation 233

8.3.1 Elements of the Measurement Covariance Matrix R 234
8.3.2 Elements of the System State Covariance Matrix Q 234
8.3.3 Initial Values of the Error Covariance Matrix P0 235
8.4 First Recursive Trials and Monitoring of Convergence 236
8.4.1 Jacobian Elements and Hypothesis Checking 237
8.4.2 Monitoring Residues 241
8.5 Road Elements To Be Initialized 241
8.6 Exploiting the Idea of Gestalt 243
8.6.1 The Extended Gestalt Idea for Dynamic Machine Vision 245
8.6.2 Traffic Circle as an Example of Gestalt Perception 251
8.7 Default Procedure for Objects of Unknown Classes 251
9 Recursive Estimation of Road Parameters
and Ego State While Cruising 253
9.1 Planar Roads with Minor Perturbations in Pitch 255
9.1.1 Discrete Models 255
9.1.2 Elements of the Jacobian Matrix 256
9.1.3 Data Fusion by Recursive Estimation 257
9.1.4 Experimental Results 258
9.2 Hilly Terrain, 3-D Road Recognition 259
9.2.1 Superposition of Differential Geometry Models 260
9.2.2 Vertical Mapping Geometry 261
9.2.3 The Overall 3-D Perception Model for Roads 262
9.2.4 Experimental Results 263
9.3 Perturbations in Pitch and Changing Lane Widths 268
9.3.1 Mapping of Lane Width and Pitch Angle 268
9.3.2 Ambiguity of Road Width in 3-D Interpretation 270

9.3.3 Dynamics of Pitch Movements: Damped Oscillations 271
9.3.4 Dynamic Model for Changes in Lane Width 273
9.3.5 Measurement Model Including Pitch Angle, Width Changes 275
9.4 Experimental Results 275
9.4.1 Simulations with Ground Truth Available 276
9.4.2 Evaluation of Video Scenes 278
9.5 High-precision Visual Perception 290
9.5.1 Edge Feature Extraction to Subpixel Accuracy for Tracking 290
9.5.2 Handling the Aperture Problem in Edge Perception 292
10 Perception of Crossroads 297
10.1 General Introduction 297
10.1.1 Geometry of Crossings and Types of Vision
Systems Required 298
10.1.2 Phases of Crossroad Perception and Turnoff 299
10.1.3 Hardware Bases and Real-world Effects 301
10.2 Theoretical Background 304
10.2.1 Motion Control and Trajectories 304
10.2.2 Gaze Control for Efficient Perception 310
10.2.3 Models for Recursive Estimation 313
10.3 System Integration and Realization 323
10.3.1 System Structure 324
10.3.2 Modes of Operation 325
10.4 Experimental Results 325
10.4.1 Turnoff to the Right 326
10.4.2 Turnoff to the Left 328
10.5 Outlook 329
11 Perception of Obstacles and Other Vehicles 331
11.1 Introduction to Detecting and Tracking Obstacles 331
11.1.1 What Kinds of Objects Are Obstacles for Road Vehicles? 332

11.1.2 At Which Range Do Obstacles Have To Be Detected? 333
11.1.3 How Can Obstacles Be Detected? 334
11.2 Detecting and Tracking Stationary Obstacles 336
11.2.1 Odometry as an Essential Component of Dynamic Vision 336
11.2.2 Attention Focusing on Sets of Features 337
11.2.3 Monocular Range Estimation (Motion Stereo) 338
11.2.4 Experimental Results 342
11.3 Detecting and Tracking Moving Obstacles on Roads 343
11.3.1 Feature Sets for Visual Vehicle Detection 345
11.3.2 Hypothesis Generation and Initialization 352
11.3.3 Recursive Estimation of Open Parameters and Relative State 361
11.3.4 Experimental Results 366
11.3.5 Outlook on Object Recognition 375
12 Sensor Requirements for Road Scenes 377
12.1 Structural Decomposition of the Vision Task 378
12.1.1 Hardware Base 378
12.1.2 Functional Structure 379
12.2 Vision under Conditions of Perturbation 380
12.2.1 Delay Time and High-frequency Perturbation 380
12.2.2 Visual Complexity and the Idea of Gestalt 382
12.3 Visual Range and Resolution Required for Road Traffic Applications 383
12.3.1 Large Simultaneous Field of View 384
12.3.2 Multifocal Design 384
12.3.3 View Fixation 385
12.3.4 Saccadic Control 386
12.3.5 Stereovision 387
12.3.6 Total Range of Fields of View 388
12.3.7 High Dynamic Performance 390
12.4 MarVEye as One of Many Possible Solutions 391
12.5 Experimental Result in Saccadic Sign Recognition 392

13 Integrated Knowledge Representations
for Dynamic Vision 395
13.1 Generic Object/Subject Classes 399
13.2 The Scene Tree 401
13.3 Total Network of Behavioral Capabilities 403
13.4 Task To Be Performed, Mission Decomposition 405
13.5 Situations and Adequate Behavior Decision 407
13.6 Performance Criteria and Monitoring Actual Behavior 409
13.7 Visualization of Hardware/Software Integration 411
14 Mission Performance, Experimental Results 413
14.1 Situational Aspects for Subtasks 414
14.1.1 Initialization 414
14.1.2 Classes of Capabilities 416
14.2 Applying Decision Rules Based on Behavioral Capabilities 420
14.3 Decision Levels and Competencies, Coordination Challenges 421
14.4 Control Flow in Object-oriented Programming 422
14.5 Hardware Realization of Third-generation EMS vision 426
14.6 Experimental Results of Mission Performance 427
14.6.1 Observing a Maneuver of Another Car 427
14.6.2 Mode Transitions Including Harsh Braking 429
14.6.3 Multisensor Adaptive Cruise Control 431
14.6.4 Lane Changes with Preceding Checks 432
14.6.5 Turning Off on Network of Minor Unsealed Roads 434
14.6.6 On- and Off-road Demonstration with
Complex Mission Elements 437
15 Conclusions and Outlook 439
Appendix A

Contributions to Ontology for Ground Vehicles 443
A.1 General environmental conditions 443
A.2 Roadways 443
A.3 Vehicles 444
A.4 Form, Appearance, and Function of Vehicles 444
A.5 Form, Appearance, and Function of Humans 446
A.6 Form, Appearance, and Likely Behavior of Animals 446
A.7 General Terms for Acting “Subjects” in Traffic 446
Appendix B
Lateral dynamics 449
B.1 Transition Matrix for Fourth-order Lateral Dynamics 449
B.2 Transfer Functions and Time Responses to an Idealized Doublet
in Fifth-order Lateral Dynamics 450
Appendix C
Recursive Least-squares Line Fit 453
C.1 Basic Approach 453
C.2 Extension of Segment by One Data Point 456
C.3 Stripe Segmentation with Linear Homogeneity Model 457
C.4 Dropping Initial Data Point 458
References 461
Index 473
1 Introduction
The field of “vision” is so diverse, and there are so many different approaches to its widespread realms of application, that it seems reasonable first to inspect the field and to specify the area to which this book intends to contribute. Many approaches to machine vision have started with the paradigm that easy things should be tackled first, like single-snapshot image interpretation in unlimited time; an extension to more complex applications may later build on the experience gained. Our approach, on the contrary, was to separate the field of dynamic vision from its (quasi-) static counterpart right from the beginning and to derive adequate methods for this specific domain. To prepare the ground for success, sufficiently capable methods and knowledge representations have to be introduced from the beginning.
1.1 Different Types of Vision Tasks and Systems
Figure 1.1 shows juxtapositions of several vision tasks occurring in everyday life. For humans, snapshot interpretation seems easy, in general, when the domain in which the image has been taken is well known. We tend to imagine the temporal context and the time when the image was shot. From motion smear and unusual poses, the embedding of the snapshot in a well-known maneuver is inferred. So in general, even single images require background knowledge about motion processes in space for more in-depth understanding; this is often overlooked in machine or computer vision. The approach discussed in this book (bold italic letters in Figure 1.1) takes motion processes in “3-D space and time” as the basic knowledge required for understanding image sequences, similar to our own way of image interpretation. This yields a natural framework for using language and terms in their common sense.
Another big difference in the methods and approaches required stems from whether the camera yielding the video stream is stationary or itself moving. If moving, linear and/or rotational motion may require special treatment. Surveillance is usually done from a stationary position, while the camera may pan (rotation around a vertical axis, often also called yaw) and tilt (rotation around a horizontal axis, also called pitch) to increase its total field of view. In this case, motion is introduced purposely and is well controlled, so that it can be taken into account during image evaluation. If egomotion is to be controlled based on vision, the body carrying the camera(s) may be subject to strong perturbations, which in general cannot be predicted.
[Figure 1.1 juxtaposes: pictorial vision (single-image interpretation) vs. motion vision; surveillance (detection, inspection; “prey”) vs. motion control (“predator”), with hybrid systems in between; monocular (motion stereo) vs. bin-/multiocular stereo; passive vs. active (fixation type, inertially stabilized, attention focused); 2-D shape vs. spatial interpretation; off-line vs. real-time; monochrome vs. color vision; intensity vs. range.]
Figure 1.1. Types of vision systems and vision tasks
In cases with large rotational rates, motion blur may prevent image evaluation altogether; also, due to the delay time introduced by handling and interpreting the large data rates in vision, stable control of the vehicle may no longer be possible. Biological systems have developed close cooperation between inertial and optical sensor data evaluation for handling this case; this will be discussed in some detail and applied to technical vision systems in several chapters of this book. From biologists also stems the differentiation of vision systems into “prey” and “predator” systems. The former strive to cover a large simultaneous field of view for detecting, sufficiently early, predators approaching from any possible direction. Predators move to find prey, and during the final approach as well as in pursuit, they have to estimate their position and speed relative to the dynamically moving prey quite accurately to succeed in a catch. Stereovision and high resolution in the direction of motion provide advantages, and nature succeeded in developing this combination in the vertebrate eye.

Once active gaze control is available, feedback of rotational rates measured by inertial sensors allows compensating for rotational disturbances of the body just by moving the eyes (reducing motion blur), thereby improving their range of applicability. Fast-moving targets may be tracked in smooth pursuit, also reducing motion blur for this special object of interest; the resulting deterioration in recognition and tracking of other, less interesting objects is accepted.
Since images have only two dimensions, the 2-D framework looks most natural for image interpretation. This may be true for almost planar objects viewed approximately normal to their plane of appearance, like a landscape in a bird’s-eye view. On the other hand, when a planar surface is viewed from an elevation slightly above the ground with the optical axis almost parallel to it, the situation is quite different. In this case, each line in the image corresponds to a different distance on the ground, and the same 3-D object on the surface looks quite different in size according to where it appears in the image. This is the reason why homogeneously distributed image processing by vector machines, for example, has a hard time showing its efficiency; locally adapted methods in image regions seem much more promising in this case and have proven their superiority. Interpreting image sequences in 3-D space with corresponding knowledge bases right from the beginning allows easy adaptation to range differences for single objects. Of course, the analysis of situations encompassing several objects at various distances now has to be done on a separate level, building on the results of all previous steps. This has been one of the driving factors in designing the architecture for the third-generation “expectation-based, multifocal, saccadic” (EMS) vision system described in this book. It corresponds to recent findings in well-developed biological systems, where different areas light up in magnetic resonance images for image processing and for action planning based on the results of visual perception [Talati, Hirsch 2005].
Understanding motion processes of 3-D objects in 3-D space, while the body carrying the cameras also moves in 3-D space, seems to be one of the most difficult tasks in real-time vision. Without the help of inertial sensing for separating egomotion from relative motion, this can hardly be done successfully, at least in dynamic situations.
Direct range measurement by special sensors such as radar or laser range finders (LRF) would alleviate the vision task. Because of their relative simplicity and low demand for computing power, these systems have found relatively widespread application in the automotive field. However, with respect to resolution and flexibility of data exploitation as well as hardware cost and installation volume required, they have much less potential than passive cameras in the long run, when computing power will be available in abundance. For this reason, these systems are not included in this book.
1.2 Why Perception and Action?
For technical systems intended to find their way on their own in an ever-changing world, it is impossible to foresee every possible event and to program all required capabilities for appropriate reactions into their software from the beginning. To be flexible in dealing with situations actually encountered, the system should have perceptual and behavioral capabilities which it may expand on its own in response to new requirements. This means that the system should be capable of judging the value of control outputs in response to measured data; however, since control outputs affect state variables over a certain amount of time, ensuing time histories have to be observed and a temporally deeper understanding has to be developed. This is exactly what is captured in the “dynamic models” of systems theory (and what biological systems may store in neuronal delay lines).
Also, through these time histories, the ground is prepared for more compact “frequency-domain” (integral) representations. In the large body of literature on linear systems theory, time constants T as the inverse of eigenvalues of first-order system components, as well as frequency, damping ratio, and relative phase as characteristic properties of second-order components, are well-known terms for describing the temporal characteristics of processes, e.g., [Kailath 1980]. In the physiological literature, the term “temporal Gestalt” may even be found [Ruhnau 1994a, b], indicating that temporal shape may be as important and characteristic as the well-known spatial shape.
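For readers less at home in control engineering, the following textbook relations (a summary added here for orientation; the gain K and the symbols used are standard notation, not taken from this book) connect these terms to the dynamic models used later. A first-order component with time constant T obeys

$$T\,\dot{x}(t) + x(t) = K\,u(t), \qquad \lambda = -1/T,$$

while a second-order component with natural frequency $\omega_n$ and damping ratio $\zeta$ obeys

$$\ddot{x}(t) + 2\,\zeta\,\omega_n\,\dot{x}(t) + \omega_n^2\,x(t) = \omega_n^2\,u(t), \qquad \lambda_{1,2} = -\zeta\,\omega_n \pm i\,\omega_n\sqrt{1-\zeta^2} \quad (\zeta < 1);$$

the imaginary part of these eigenvalues sets the frequency and the real part the decay of the damped oscillation.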
Usually, control is considered an output resulting from data analysis to achieve some goal. In a closed-loop system, one of whose goals is to adapt to new situations and to act autonomously, control outputs may be interpreted as questions asked with respect to real-world behavior. Dynamic reactions are then interpreted to better understand the behavior of the body in various states and under various environmental conditions. This opens up a new avenue for signal interpretation: besides its use for state control, a signal is now also interpreted for system identification and modeling, that is, for learning about the temporal behavioral characteristics of the system.
In an intelligent autonomous system, this capability of adaptation to new situations has to be available to reduce dependence on maintenance and adaptation by human intervention. While this is not yet state of the art in present systems, with the computing power becoming available in the future it is clearly within reach. The methods required have been developed in the fields of system identification and adaptive control.
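To indicate what such identification involves in its simplest form, the following sketch (a generic least-squares fit on synthetic data; the parameter values and noise level are assumptions for illustration, not results from this book) estimates the parameters of a first-order discrete model from recorded input/output behavior:

```python
import numpy as np

# Identify a first-order discrete model x[k+1] = a*x[k] + b*u[k]
# from observed behavior (generic least-squares sketch, synthetic data).
rng = np.random.default_rng(0)
a_true, b_true = 0.9, 0.5                  # "unknown" plant parameters
u = rng.standard_normal(200)               # excitation input
x = np.zeros(201)
for k in range(200):
    x[k + 1] = a_true * x[k] + b_true * u[k] + 0.01 * rng.standard_normal()

# Stack regressors and solve the least-squares problem for (a, b).
Phi = np.column_stack([x[:-1], u])
theta, *_ = np.linalg.lstsq(Phi, x[1:], rcond=None)
print(theta)                               # close to [0.9, 0.5]
```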
The sense of vision should yield sufficient information about the near and far-
ther environment to decide when state control is not so important and when more
emphasis may be put on system identification by using special control inputs for
this purpose. This approach also will play a role when it comes to defining the no-
tion of a “self” for the autonomous vehicle.
1.3 Why Perception and Not Just Vision?
Vision alone does not allow a well-founded decision on absolute inertial motion when another object is moving close to the ego-vehicle and no background known to be stationary can be seen in the field of view. Inertial sensors like accelerometers and angular rate sensors, on the contrary, yield the corresponding signals for the body on which they are mounted; they do this practically without any delay time and at high signal rates (up to the kHz range).
Vision needs time for the integration of light intensity in the sensor elements (33 1/3 ms or 40 ms, corresponding to the U.S. or the European video standard, respectively), for frame grabbing and communication of the (huge amount of) image data, as well as for feature extraction, hypothesis generation, and state estimation. Usually, three to five video cycles, that is, 100 to 200 ms, will have passed until a control output derived from vision hits the real world. For precise control of highly dynamic systems, this time delay has to be taken into account.
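To make this delay budget concrete, the small calculation below (an illustrative sketch; the cycle counts are the ones quoted above, not measured values for a particular system) shows how a few video cycles accumulate to the 100 to 200 ms mentioned:

```python
# Illustrative visual latency budget (assumed values, not measurements).
CYCLE_US = 1 / 30.0   # s per frame (~33 1/3 ms), U.S. video standard
CYCLE_EU = 0.040      # s per frame, European video standard (25 Hz)

def vision_latency(cycle_time, n_cycles):
    """Delay until a vision-derived control output takes effect, counted in
    whole video cycles (integration, frame grabbing, feature extraction,
    hypothesis generation/state estimation, control output)."""
    return n_cycles * cycle_time

for n in (3, 4, 5):
    print(f"{n} cycles: US {vision_latency(CYCLE_US, n)*1000:5.1f} ms, "
          f"EU {vision_latency(CYCLE_EU, n)*1000:5.1f} ms")
# 3 cycles of ~33 ms give ~100 ms; 5 cycles of 40 ms give 200 ms.
```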
Since perturbations should be counteracted as soon as possible, and since the visually measurable results of perturbations are the second integral of accelerations, with corresponding delay times, it is advisable to have inertial sensors in the system for early pickup of perturbations. Because long-term stabilization may be achieved using vision, it is not necessary to resort to expensive inertial sensors; on the contrary, when used jointly with vision, inexpensive inertial sensors with good properties in the medium- to high-frequency range are sufficient, as demonstrated by the vestibular systems of vertebrates.

Accelerometers are able to measure rather directly the effects of most control outputs; this eases system identification and finding the control outputs for reflex-like counteraction of perturbations. Cross-correlation of inertial signals with visually determined signals allows a temporally deeper understanding of what in the natural sciences are called “time integrals” of input functions.
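The sketch below (purely synthetic signals; sampling rate, delay, and smoothing are assumed values) illustrates one such use of cross-correlation: estimating the delay of a visually derived yaw-rate signal relative to the directly measured inertial one:

```python
import numpy as np

# Estimate the delay of a visually derived signal relative to an inertial one
# by cross-correlation (synthetic data; all numbers are illustrative).
fs = 100.0                                  # inertial sampling rate, Hz
n = 1000
rng = np.random.default_rng(1)
yaw_inertial = np.convolve(rng.standard_normal(n), np.ones(5) / 5, mode="same")

delay = 15                                  # 150 ms visual processing delay (assumed)
yaw_visual = np.roll(yaw_inertial, delay)   # same signal, seen later by vision
yaw_visual[:delay] = 0.0

corr = np.correlate(yaw_visual - yaw_visual.mean(),
                    yaw_inertial - yaw_inertial.mean(), mode="full")
lag = corr.argmax() - (n - 1)
print(lag / fs)                             # ~0.15 s estimated visual delay
```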
For all these reasons, the joint use of visual and inertial signals is considered
mandatory for achieving efficient autonomously mobile platforms. Similarly, if
special velocity components can be measured easily by conventional devices, it
does not make sense to try to recover these from vision in a “purist” approach.
These conventional signals may alleviate perception of the environment considera-
bly since the corresponding sensors are mounted onto the body in a fixed way,
while in vision the measured feature values have to be assigned to some object in
the environment according to just visual evidence. There is no constantly estab-
lished link for each measurement value in vision as is the case for conventional
sensors.

1.4 What Are Appropriate Interpretation Spaces?
Images are two-dimensional arrays of data; the usual array size today ranges from about 64 × 64 for special “vision” chips to about 770 × 580 for video cameras (larger sizes are available, but only at much higher cost, e.g., for space or military applications). A digitized video data stream is a fast sequence of these images with data rates up to ~11 MB/s for black-and-white and up to three times this amount for color. Frequently, only fields of 320 × 240 pixels (either only the odd or only the even lines, with a corresponding reduction of resolution within the lines) are evaluated for lack of computing power. This results in a data stream per camera of about 2 MB/s. Even at this reduced data rate, the processing power of a single microprocessor available today is not yet sufficient for interpreting several video signals in parallel in real time. High-definition TV signals of the future may have up to 1080 lines of 1920 pixels each at frame rates of up to 75 Hz; this corresponds to data rates of more than 155 MB/s. Machine vision with this type of resolution is far out in the future.
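The data rates quoted above follow directly from image size, frame rate, and bytes per pixel; the short sketch below (assuming 8-bit monochrome pixels and the frame rates named in the text) reproduces them:

```python
# Raw video data rates for the image formats mentioned in the text
# (1 byte per monochrome pixel assumed; color up to three times as much).
def data_rate_mb_per_s(width, height, frames_per_s, bytes_per_pixel=1):
    """Uncompressed data rate in MB/s (1 MB taken as 10**6 bytes)."""
    return width * height * frames_per_s * bytes_per_pixel / 1e6

print(data_rate_mb_per_s(770, 580, 25))     # ~11.2 MB/s, full video frames
print(data_rate_mb_per_s(320, 240, 25))     # ~1.9 MB/s, single fields
print(data_rate_mb_per_s(1920, 1080, 75))   # ~155.5 MB/s, future HDTV
```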
Maybe uniform processing of entire images is not desirable at all, since different objects will be seen in different parts of the images, usually requiring specific image processing algorithms for efficient evaluation. Very often, lines of discontinuity are encountered in images; these should be treated with special methods differing essentially from those used in homogeneous parts. Object- and situation-dependent methods and parameters should be used, controlled from higher evaluation levels.
The question thus is whether basic feature extraction should be applied uniformly over the entire image region. In biological vision systems, this seems to be the case, for example, in the striate cortex (V1) of vertebrates, where oriented edge elements are detected with the help of corresponding receptive fields. However, vertebrate vision has nonhomogeneous resolution over the field of view: foveal vision with high resolution at the center of the retina is surrounded by receptive fields of increasing spread and a lower density of receptors per unit area in the radial direction.
Vision in highly developed biological systems seems to ask three questions, each of which is treated by a specific subsystem:
1. Is there something of special interest in a wide field of view?
2. What is it, precisely, that attracted interest in question 1? Can the individual object be characterized and classified using background knowledge? What is its relative state “here and now”?
3. What is the situation around me, and how does it affect optimal behavior decisions for achieving my goals? For this purpose, a relevant collection of objects should be recognized and tracked, and their likely future behavior should be predicted.
To initialize the vision process at the beginning and to detect new objects later on, it is certainly an advantage to have a bottom-up detection component available over the entire wide field of view. Maybe just a few algorithms based on coarse resolution for detecting interesting groups of features will be sufficient to achieve this goal. The question is how much computing effort should be devoted to this bottom-up component compared to more elaborate, model-based top-down components for objects already detected and being tracked. Usually, single objects cover only a small area in an image of coarse resolution.
To answer question 2 above, biological vision systems direct the foveal area of
high resolution by so-called saccades, which are very fast gaze direction changes
with angular rates up to several hundred degrees per second, to the group of fea-
tures arousing most interest. Humans are able to perform up to five saccades per
second with intermediate phases of smooth pursuit (tracking) of these features, in-
dicating a very dynamic mode of perception (time-sliced parallel processing).
Tracking can be achieved much more efficiently with algorithms controlled by
prediction according to some model. Satisfactory solutions may be possible only in
special task domains for which experience is available from previous encounters.
Since prediction is a very powerful tool in a world with continuous processes, the question arises: What is the proper framework for formulating the continuity conditions? Is the readily available image plane the proper plane of reference? It is known, however, that the depth dimension has been lost completely in perspective mapping: all points on a ray are mapped into a single point in the image plane, irrespective of their distance. Would it be better to formulate all continuity conditions in 3-D physical space and time? The corresponding models have been available from the natural sciences since Newton and Leibniz found that differential equations are the proper tools for representing these continuity conditions in generic form; over the last decades, simulation technology has provided the methods for dealing with these representations on digital computers.
In communication technology and in the field of pattern recognition, video processing in the image plane may be the best way to go, since no understanding of the content of the scene is required. However, for orienting oneself in the real world through image sequence analysis, an early transition to the physical interpretation space is considered highly advantageous, because it is in this space that occlusions become easily understandable and motion continuity persists. Also, it is in this space that inertial signals have to be interpreted: integrals of accelerations yield 3-D velocity components, and integrals of these velocities yield the corresponding positions; likewise, integrals of angular rates yield the angular orientations for the rotational degrees of freedom.
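As a minimal illustration of this chain of integrals (a generic planar dead-reckoning sketch with assumed numbers, not the estimation scheme developed in this book), measured accelerations and angular rates can be integrated numerically to propagate velocity, position, and heading between visual updates:

```python
import numpy as np

def propagate(pos, vel, heading, accel_body, yaw_rate, dt=0.040):
    """One Euler step: acceleration -> velocity -> position; yaw rate -> heading.
    accel_body is given in the body-fixed frame and rotated into the world frame."""
    c, s = np.cos(heading), np.sin(heading)
    accel_world = np.array([c * accel_body[0] - s * accel_body[1],
                            s * accel_body[0] + c * accel_body[1]])
    new_vel = vel + accel_world * dt
    new_pos = pos + vel * dt + 0.5 * accel_world * dt**2
    new_heading = heading + yaw_rate * dt
    return new_pos, new_vel, new_heading

pos, vel, heading = np.zeros(2), np.array([20.0, 0.0]), 0.0    # 20 m/s straight ahead
pos, vel, heading = propagate(pos, vel, heading,
                              accel_body=np.array([0.5, 0.0]), # mild acceleration, m/s^2
                              yaw_rate=0.02)                   # rad/s
print(pos, vel, heading)
```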
Therefore, for visual dynamic scene understanding, images are considered inter-
mediate carriers of data containing information about the spatiotemporal environ-
ment. To recover this information most efficiently, all internal modeling in the in-
terpretation process is done in 3-D space and time, and the transition to this
representation should take place as early as possible. Knowledge for achieving this
goal is specific to single objects and the generic classes to which they belong.
Therefore, to answer question 2 above, specialist processes geared to classes of ob-
jects and individuals of these classes observed in the image sequence should be de-
signed for direct interpretation in 3-D space and time.
Only these spatiotemporal representations then allow answering question 3, by looking at the data of all relevant objects in the near environment over a more extended period of time. To understand motion processes of objects in our everyday environment more deeply, a distinction has to be made between classes of objects. Those obeying simple laws of motion from physics are the ones most easily handled (e.g., by some version of Newton’s law). Light objects, easily moved by stochastically appearing (even light) winds, are difficult to grasp because of the variable properties of wind fields and gusts.
Another large class of objects – with many different subclasses – is formed by those able to sense properties of their environment and to initiate movements on their own, based on a combination of the sensed data and internally stored background knowledge. These special objects will be called subjects; all animals, including humans, belong to this (super-)class, as do autonomous agents created by technical means (like robots or autonomous vehicles). The corresponding subclasses are formed by combinations of perceptual and behavioral capabilities and, of course, their shapes. Besides their shapes, individuals of subclasses may also be recognized by stereotypical motion patterns (like a hopping kangaroo or a winding snake).
Road vehicles (independent of control by a human driver or a technical subsystem) exhibit typical behaviors depending on the situation encountered. For example, they follow lanes and do convoy driving, perform lane changes, pass other vehicles, turn off onto a crossroad, or slow down for parking. All of the maneuvers mentioned are well known to human drivers, who recognize the intention of performing one of them from its typical onset of motion over a short period of time. For example, a car leaving the center of its lane and moving consistently toward the neighboring lane is assumed to initiate a lane change. If this occurs within the safety margin in front, egomotion should be adjusted to this (improper) behavior of other traffic participants. This shows that recognizing the intentions of other subjects is important for a defensive style of driving. Such intentions cannot be recognized without knowledge of temporally extended maneuvers and without observing behavioral patterns of subjects in the environment. Question 3 above, thus, is answered not by interpreting image patterns directly but by observing the symbolic representations resulting as answers to question 2 for a number of individual objects/subjects over an extended period of time.
Simultaneous interpretation of image sequences on multiple scales in 3-D space
and time is the way to satisfy all requirements for safe and goal-oriented behavior.
1.4.1 Differential Models for Perception “Here and Now”
Experience has shown that the simultaneous use of differential and integral models on different scales yields the most efficient way of data fusion and joint data interpretation. Figure 1.2 shows, in a systematic fashion, the interpretation scheme developed. Each of the axes is subdivided into four scale ranges. In the upper left corner, the point “here and now” is shown as the point where all interaction with the real world takes place. The second scale range encompasses the local (as opposed to global) environment, which allows introducing new differential concepts compared to the pointwise state. Typical examples are local embeddings with characteristic properties such as spatial or temporal change rates, spatial gradients, or directions of extreme values such as intensity gradients.

[Figure 1.2 is a matrix of interpretation scales. Rows (3-D space): point “here”; spatially local differential environment (differential geometry: edge angles, positions, curvatures); local space integrals (object state, feature distribution, shape); mission space (global). Columns (time): time point “now”; temporally local differentials; extended local time integrals; global time integrals. The intersection of local space and local time integrals, the object state history, forms the “central hub” of interpretation.]
Figure 1.2. Multiple interpretation scales in space and time for dynamic perception. Vertical axis: 3-D space; horizontal axis: time
These differentials have proven to be powerful concepts for representing knowledge about the physical properties of classes of objects. Differential equations represent the natural mathematical element for coding knowledge about motion processes in the real world. With the advent of the Kalman filter [Kalman 1960], they have become the key element for obtaining the best state estimate of the variables describing the system, based on recursive methods implementing a least-squares model fit. Real-time visual perception of moving objects is hardly possible without this very efficient approach.
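To indicate what such a recursive least-squares model fit looks like in its simplest form, the following sketch shows one predict/update cycle of a linear Kalman filter for a constant-velocity model (a generic textbook example with assumed noise values, not one of the specific filters developed in later chapters):

```python
import numpy as np

dt = 0.040                                   # video cycle time, s (assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])        # state transition (constant velocity)
H = np.array([[1.0, 0.0]])                   # measurement: position only
Q = np.diag([1e-4, 1e-2])                    # process noise covariance (assumed)
R = np.array([[0.25]])                       # measurement noise covariance (assumed)

def kalman_step(x, P, z):
    """Predict with the dynamic model, then update with measurement z."""
    x_pred = F @ x                           # state prediction
    P_pred = F @ P @ F.T + Q                 # covariance prediction
    y = z - H @ x_pred                       # innovation (residual)
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ y                   # least-squares blend of model and data
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.array([0.0, 18.0]), np.eye(2)      # initial guess: position 0 m, 18 m/s
x, P = kalman_step(x, P, z=np.array([0.9]))  # one noisy position measurement
print(x)
```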
1.4.2 Local Integrals as Central Elements for Perception
Note that the precise definition of what is local depends on the problem domain in-
vestigated and may vary in a wide range. The third column and row in Figure 1.2
are devoted to “local integrals”; this term again is rather fuzzy and will be defined
more precisely in the task context. On the timescale, it means the transition from
analog (continuous, differential) to digital (sampled, discrete) representations. In
the spatial domain, typical local integrals are rigid bodies, which may move as a
unit without changing their 3-D shape.
These elements are defined such that the intersection in field (3, 3) of Figure 1.2 becomes the central hub for data interpretation and data fusion: it contains the individual objects as units to which humans attach most of their knowledge about the real world. Abstraction of properties has led to generic classes, which allow subsuming a large variety of single cases under one generic concept, thereby leading to representational efficiency.
1.4.2.1 Where is the Information in an Image?
It is well known that the information in an image is contained in local intensity changes: a uniformly gray image carries only a few bits of information, namely, (1) the gray value and (2) the uniform distribution of this value over the entire image. The image may be described completely by three bytes, even though the amount of data may be about 400,000 bytes in a TV frame or even 4 MB (2k × 2k pixels). If there are certain areas of uniform gray values, the boundary lines of these areas plus the internal gray values contain all the information in the image. Such an object in the image plane may be described with far less data than the pixel values it encompasses.
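A small numerical sketch (with the image sizes named above; the three-byte description is the illustrative count used in the text) makes the disproportion explicit:

```python
# Raw data volume vs. information content of a uniform gray image.
tv_frame_bytes = 770 * 580        # ~446,600 pixel bytes in a full video frame
large_frame_bytes = 2048 * 2048   # ~4.2 million pixel bytes (2k x 2k image)
description_bytes = 3             # gray value + "uniform over whole image" code

print(tv_frame_bytes // description_bytes)     # ~150,000 : 1 ratio
print(large_frame_bytes // description_bytes)  # ~1,400,000 : 1 ratio
```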
In a more general form, image areas defined by a set of properties (shape, texture, color, joint motion, etc.) may be considered image objects, which originate from 3-D objects by perspective mapping. Due to the numerous aspect conditions that such an object may adopt relative to the camera, its potential appearances in the image plane are very diverse. Their exhaustive description would require orders of magnitude more data than the object’s representation in 3-D space plus the laws of perspective mapping, which are the same for all objects. Therefore, an object is defined by its 3-D shape, which may be considered a local spatial integral