Quality of Telephone-Based
Spoken Dialogue Systems
SEBASTIAN MÖLLER
Institut für Kommunikationsakustik (IKA)
Ruhr-Universität Bochum
Germany
Springer
eBook ISBN: 0-387-23186-2
Print ISBN: 0-387-23190-0
Print ©2005 Springer Science + Business Media, Inc.
All rights reserved
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic,
mechanical, recording, or otherwise, without written consent from the Publisher
Created in the United States of America
Boston
©2005 Springer Science + Business Media, Inc.
Contents
Preface
Acknowledgments
Definitions and Abbreviations
1. MOTIVATION AND INTRODUCTION
2. QUALITY OF HUMAN-MACHINE INTERACTION OVER THE PHONE
  2.1 Interaction Scenarios Involving Speech Transmission
    2.1.1 Speech Transmission Systems
    2.1.2 Room Acoustics
    2.1.3 Spoken Dialogue Systems
  2.2 Interaction with Spoken Dialogue Systems
    2.2.1 Language and Dialogue Structure in HMI
    2.2.2 Interactive Speech Theory
    2.2.3 Cooperativity Guidelines
  2.3 Quality of Telephone Services Based on SDSs
    2.3.1 QoS Taxonomy
    2.3.2 Quality Features
    2.3.3 Validation and Discussion
  2.4 System Specification, Design and Evaluation
    2.4.1 Transmission Network Planning
    2.4.2 SDS Specification
    2.4.3 SDS Design
    2.4.4 System Assessment and Evaluation
  2.5 Summary
3. ASSESSMENT AND EVALUATION METHODS
  3.1 Characterization
    3.1.1 Agent Factors
    3.1.2 Task Factors
    3.1.3 User Factors
    3.1.4 Environmental Factors
    3.1.5 Contextual Factors
  3.2 Reference
  3.3 Data Collection
  3.4 Speech Recognition Assessment
  3.5 Speech and Natural Language Understanding Assessment
  3.6 Speaker Recognition Assessment
  3.7 Speech Output Assessment
  3.8 SDS Assessment and Evaluation
    3.8.1 Experimental Set-Up
    3.8.2 Test Subjects
    3.8.3 Experimental Task
    3.8.4 Dialogue Analysis and Annotation
    3.8.5 Interaction Parameters
    3.8.6 Quality Judgments
    3.8.7 Usability Evaluation
    3.8.8 SDS Evaluation Examples
  3.9 Summary
4. SPEECH RECOGNITION PERFORMANCE OVER THE PHONE
  4.1 Impact of Transmission Impairments on ASR Performance
  4.2 Transmission Channel Simulation
  4.3 Recognizer and Test Set-Up
  4.4 Recognition Results
    4.4.1 Normalization
    4.4.2 Impact of Circuit Noise
    4.4.3 Impact of Signal-Correlated Noise
    4.4.4 Impact of Low Bit-Rate Coding
    4.4.5 Impact of Combined Impairments
  4.5 E-Model Modification for ASR Performance Prediction
  4.6 Conclusions from the Experiment
  4.7 Summary
5. QUALITY OF SYNTHESIZED SPEECH OVER THE PHONE
  5.1 Functional Testing of Synthesized Speech
  5.2 Intelligibility and Quality in Good and Telephonic Conditions
  5.3 Test- and System-Related Influences
  5.4 Transmission Channel Influences
    5.4.1 Experimental Set-Up
    5.4.2 Results
    5.4.3 Conclusions from the Experiment
  5.5 Summary
6. QUALITY OF SPOKEN DIALOGUE SYSTEMS
  6.1 Experimental Set-Up
    6.1.1 The BoRIS Restaurant Information System
    6.1.2 Speech Recognition Simulation
    6.1.3 Measurement of Interaction Parameters
    6.1.4 Questionnaire Design
    6.1.5 Test Design and Data Collection
  6.2 Analysis of Assessment and Evaluation Results
    6.2.1 Data Pre-Analysis
    6.2.2 Multidimensional Analysis of Quality Features
    6.2.3 Multidimensional Analysis of Interaction Parameters
    6.2.4 Analysis of the QoS Schematic
    6.2.5 Impact of System Characteristics
    6.2.6 Test-Related Issues
  6.3 Quality Prediction Modelling
    6.3.1 PARADISE Framework
    6.3.2 Quality Prediction Models for Experiment 6.3 Data
    6.3.3 Hierarchical Quality Prediction Models
    6.3.4 Conclusion of Modelling Approaches
  6.4 Summary
7. FINAL CONCLUSIONS AND OUTLOOK
Glossary
Appendices
  A Definition of Interaction Parameters
  B Template Sentences for Synthesis Evaluation, Exp. 5.1 and 5.2
  C BoRIS Dialogue Structure
  D Instructions and Scenarios
    D.1 Instructions to Test Subjects
    D.2 Scenarios
    D.3 Instructions for Expert Evaluation
  E Questionnaires
    E.1 Questionnaire for Experiment 6.2
    E.2 Questionnaire for Experiment 6.3
References
About the Author
Index
Preface
An increasing number of telephone services are offered in a fully automatic
way with the help of speech technology. The underlying systems, called spo-
ken dialogue systems (SDSs), possess speech recognition, speech understand-
ing, dialogue management, and speech generation capabilities, and enable a
more-or-less natural spoken interaction with the human user. Nevertheless, the
principles underlying this type of interaction are different from the ones which
govern telephone conversations between humans, because of the limitations of
the machine interaction partner. Users are normally able to cope with the limi-
tations and to reach the goal of the interaction, provided that both interlocutors
behave in a cooperative way.
The present book gives a systematic overview of assessment, evaluation,
and prediction methods for the quality of these innovative services. On the
basis of cooperativity considerations, a new taxonomy of quality of service
(QoS) aspects is developed. It identifies four types of factors influencing the
quality aspects perceived by the user: Environmental factors resulting from the
physical situation of use (transmission channels, ambient noise); factors directly
related to the machine interaction partner; task factors covering the interaction
goal; and non-physical contextual factors like the access conditions and the
involved costs. These factors are shown to be in a complex relationship to
different categories of perceived quality, like cooperativity, efficiency, usability,
user satisfaction, and acceptability. The taxonomy highlights the relationships
between the different factors and aspects. It is a very useful tool for classifying
assessment and evaluation methods, for planning and interpreting evaluation
experiments, and for estimating quality on the basis of system characteristics.
Quality is the result of a perception and a judgment process. Consequently,
assessment and evaluation methods involving human test subjects are necessary
in order to quantify the impact of system characteristics on perceived quality.
The system characteristics can be described with the help of interaction parameters,
i.e. parameters which are measured instrumentally or on the basis of expert
annotations. A number of parameters and evaluation methods are defined, both
on a system component level and for the fully integrated system. It is shown
that technology-centered component assessment has to go hand in hand with
user-centric evaluation, because both provide different types of information for
the system developer. The resulting information about quality is needed in all
phases of system specification, design, implementation, and operation, in order
to efficiently set up systems which offer a high quality to their users.
Three new experimental investigations illustrate the relationships between
system characteristics on the one side, and component performance or per-
ceived quality on the other. First, the effect of the transmission channel on
speech recognition and speech output is analyzed with the help of a network
simulation model. The results are compared to human communication scenar-
ios, and quality or performance estimations are obtained on the basis of system
characteristics, using quality prediction models. In a second step, interaction
experiments with a fully integrated system are carried out, and interaction pa-
rameters as well as user quality judgments are collected. The analysis of the
obtained data shows that the correlation between the two types of metrics is relatively
low. This supports the hypothesis that quality models for the
overall interaction with the SDS can cover only a part of the factors influencing
perceived quality. With the help of the QoS taxonomy, alternative modelling
approaches are proposed. Still, the predictive power is too limited to avoid
resource-demanding experiments with human test subjects. The reasons for
this finding are discussed, and necessary research directions to overcome the
limitations are pointed out.
The assessment, evaluation and prediction of quality require knowledge
from a number of disciplines which do not always share a common ground of
information. Although written from the perspective of an engineer in
telecommunications, the book is directed towards a wide audience, from experts
in telecommunications and signal processing, communication acoustics,
computational linguistics, speech and language sciences, up to psychophysics,
human factor design and ergonomics. It is hoped that this – admittedly very
ambitious – goal can at least partially be reached, and that the book may provide
useful information for designing systems and services which ultimately satisfy
the needs of their human users.
Bochum
SEBASTIAN MÖLLER
Acknowledgments
The present work was performed during my employment at the Institut für
Kommunikationsakustik (IKA), Ruhr-Universität Bochum. A number of per-
sons contributed in different ways to its finalization. Especially, I would like to
thank the following:
my colleague PD Dr. phil. Ute Jekosch for supporting this work over the
years, and for providing the scientific basis of quality assessment,
the former head of the institute, Prof. Dr.-Ing. Dr. techn. h.c. Jens Blauert,
for providing a scientific home, and for enabling and supporting the work
with interest and advice,
Prof. Dr.-Ing. Ulrich Heute (Christian-Albrechts-Universität zu Kiel, Germany)
and Prof. Dr. Rolf Carlsson (KTH Stockholm, Sweden) for their
interest in my work,
my colleagues Alexander Raake and Jan Krebber for taking over some of
my duties so that I had the time for writing, and for numerous fruitful
discussions,
the student Janto Skowronek for the huge amount of work performed during
his diploma thesis and his later employment at the institute,
the students Christine Dudda (now Pellegrini) and Andreea Niculescu for
their experimental work on dialogue system evaluation contributing to Chap-
ter 6,
the student co-workers Sven Bergmann, Sven Dyrbusch, Marc Hanisch,
Marius Hilckmann, Anders Krosch, Jörn Opretzka, Rosa Pegam, Sebastian
Rehmann and Joachim Riedel for their countless contributions during the
last years,
Dr. Ergina Kavallieratou for her work on speech recognition contributing
to Chapter 4,
Stefan Schaden and many other colleagues at IKA for discussions and sug-
gestions,
James Taylor and Dr.-Ing. Volker Kraft for reviewing and correcting the
manuscript,
Prof. Dr. Hervé Bourlard and his colleagues from the Institut dalle Molle
d’Intelligence Artificielle Perceptive (IDIAP) in Martigny, Switzerland, for
providing a scientific basis in early spring 2000,
Dr. Martin Rajman and his colleagues Alex Trutnev and Florian Seydoux at
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland, for their
support in developing the Swiss-French prototype of the BoRIS system,
Dr.-Ing. Jens Berger for his support with signal-based quality prediction
models,
numerous colleagues in Study Group 12 of the International Telecommuni-
cation Union (ITU-T) following and supporting my work with interest,
the system administrators of the institute’s computer network and the mem-
bers of the office for providing and maintaining their resources, and
my family and friends for strongly supporting me during the past five years.
A part of the work was supported by the EC-funded IST project INSPIRE
(“INfotainment management with SPeech Interaction via REmote-microphones
and telephone interfaces”, IST-2001-32746).
Definitions and Abbreviations
Definitions
# ASR REJECTIONS      average number of speech recognizer rejections in a dialogue
# BARGE-INS           average number of barge-in attempts from the user in a dialogue
# CANCEL ATTEMPTS     average number of cancel attempts from the user in a dialogue
# HELP REQUESTS       average number of help requests from the user in a dialogue
# SYSTEM ERROR MESS.  average number of system error messages in a dialogue
# SYSTEM QUESTIONS    average number of system questions in a dialogue
# SYSTEM TURNS        average number of system turns in a dialogue
# SYSTEM WORDS        average number of system words uttered in a dialogue
# TIME-OUT PROMPTS    average number of time-out prompts in a dialogue
# TURNS               average number of turns in a dialogue
# USER QUESTIONS      average number of user questions in a dialogue
# USER TURNS          average number of user turns in a dialogue
# USER WORDS          average number of user words uttered in a dialogue
# USER WORDS IV       average number of in-vocabulary user words in a dialogue
# WORDS               average number of words uttered in a dialogue
[...]                 false speaker rejection rate
[...]                 scaling factor for [...]
A                     expectation factor
[...]                 coefficients for [...]
AN:CO                 number of correct system answers
AN:FA                 number of failed system answers
AN:IC                 number of incorrect system answers
AN:PA                 number of partially correct system answers
[...]                 false speaker acceptance rate
Bpl                   packet loss robustness factor
[...]                 speaker misclassification rate
[...]                 recognition confusion matrix corresponding to [...]
[...]                 recognition confusion matrix corresponding to [...]
[...]                 number of correctly recognized attribute-value pairs
[...]                 cost measures of the PARADISE model
[...]                 number of correctly recognized sentences
[...]                 number of correctly recognized words
CA:AI                 percentage of system answers judged to be appropriate and inappropriate by different experts (%)
CA:AP                 percentage of appropriate system answers (%)
CA:IA                 percentage of inappropriate system answers (%)
CA:IC                 percentage of incomprehensible system answers (%)
CA:TF                 percentage of completely failed system answers (%)
CE                    concept efficiency (%)
CER                   concept error rate (%)
COMP                  subjective evaluation of task completion
CR                    correction rate (%)
D                     frequency-weighted difference in sensitivity between the direct and the diffuse sound (dB)
[...]                 number of deleted attribute-value pairs
[...]                 number of deleted sentences
[...]                 number of deleted words
[...]                 DARPA score
[...]                 DARPA error
[...]                 modified DARPA error
DD                    dialogue duration (s)
[...]                 D-value of the handset telephone, receive side (dB)
[...]                 D-value of the handset telephone, send side (dB)
[...]                 reference recognition rate of the simulated recognizer
[...]                 target recognition rate of the simulated recognizer
G                     signal-to-equivalent-continuous-circuit-noise ratio (dB)
%GoB                  percentage of users rating a connection good or better (%)
[...]                 recognition success metrics according to Kamm et al. (1997a)
I                     impairment factor
[...]                 number of inserted attribute-value pairs
[...]                 number of inserted sentences
[...]                 number of inserted words
IC                    information content (%)
Id                    impairment factor for impairments occurring delayed with respect to the speech signal
Ie                    equipment impairment factor
Ie,eff                effective equipment impairment factor, including transmission errors
Iq                    impairment factor for quantizing distortion
Iqo                   recognizer-specific impairment factor
IR                    implicit recovery (%)
Is                    impairment factor for impairments occurring simultaneously with the speech signal
[...]                 kappa coefficient (per configuration, per dialogue)
Le                    frequency-dependent loss of the talker echo path (dB)
Lst                   frequency-dependent loss of the sidetone path (dB)
LSTR                  listener sidetone rating (dB)
[...]                 understanding error confusion matrix
MRS                   mean recognition score
[...]                 total number of attribute-value pairs in an utterance
[...]                 number of correctly not set attribute-value pairs
[...]                 total number of concepts in the dialogue
[...]                 total number of dialogues
[...]                 total number of user queries in the dialogue
[...]                 number of unique concepts newly understood by the system in the dialogue
Nc                    circuit noise level (dBm0p)
NES                   number of errors per sentence
[...]                 mean number of errors per sentence (isolated word recognition)
Nfor                  noise floor level (dBmp)
No                    total equivalent circuit noise level (dBm0p)
Nor                   equivalent circuit noise caused by room noise at receive side (dBm0p)
Nos                   equivalent circuit noise caused by room noise at send side (dBm0p)
Nro                   recognizer-specific noise parameter (dBm0p)
OLR                   overall loudness rating between mouth and ear reference points (dB)
[...]                 probability for rejection of the null hypothesis
P(A)                  actual agreement rate
P(E)                  chance agreement rate
PA:CO                 number of correctly parsed user utterances
PA:FA                 number of user utterances which failed parsing
PA:PA                 number of partially parsed user utterances
[...]                 maximum performance value
[...]                 minimum performance value
%PoW                  percentage of users rating a connection poor or worse (%)
Ppl                   random packet loss probability (%)
Pr                    A-weighted sound pressure level of room noise at receive side (dB(A))
PRE                   proportion reduction in error
Ps                    A-weighted sound pressure level of room noise at send side (dB(A))
Q                     signal-to-quantizing-noise ratio (dB)
QComb                 system performance measure (Bonneau-Maynard et al., 2000)
QD                    query density (%)
qdu                   quantizing distortion unit
Qo                    recognizer-specific robustness factor (dB)
[...]                 Spearman rank order correlation coefficient
R, [...]              (normalized) transmission rating
[...]                 mean amount of variance covered by the regression analysis
[...]                 Pearson correlation coefficient
RD                    response delay
RLR                   receive loudness rating between the 0 dBr point in the network and the ear reference point (dB)
RLRset                receive loudness rating of the telephone handset (dB)
Ro                    basic signal-to-noise transmission rating factor
S                     total number of sentences in the reference
[...]                 number of substituted attribute-value pairs
[...]                 number of substituted sentences
[...]                 number of substituted words
SA                    sentence accuracy (%)
SAT                   mean overall system performance rating (Bonneau-Maynard et al., 2000)
SCR                   system correction rate (%)
SCT                   average number of system correction turns
SER                   sentence error rate (%)
SLR                   send loudness rating between the mouth reference point and the 0 dBr point in the network (dB)
SLRset                send loudness rating of the telephone handset (dB)
SRD                   system response delay (s)
STD                   system turn duration (s)
STMR                  sidetone masking rating (dB)
T                     mean one-way talker echo path delay (ms)
Ta                    overall delay between the mouth reference point of the talker and the ear reference point of the listener (ms)
TD                    turn duration (s)
TELR                  talker echo loudness rating (dB)
topline               topline performance value
[...]                 topline transmission rating value
Tr                    round-trip delay for listener echo (ms)
TS, [...]             task success measures (%)
UA                    understanding accuracy (%)
UCR                   user correction rate (%)
UCT                   average number of user correction turns
URD                   user response delay (s)
[...]                 user satisfaction rating according to Walker et al. (1998a)
[...]                 estimation of [...]
UTD                   user turn duration (s)
W                     total number of words in the reference
[...]                 weighting coefficients for [...]
WA                    word accuracy (%)
[...]                 word accuracy for isolated word recognition (%)
WEPL                  weighted echo path loss for listener echo (dB)
WER                   word error rate (%)
[...]                 word error rate for isolated word recognition (%)
WES                   word error per sentence
[...]                 mean word error per sentence (isolated word recognition)
WPST                  average number of words per system turn
WPUT                  average number of words per user turn
Abbreviations
ACR         absolute category rating
ADPCM       adaptive differential pulse code modulation
AM          amplitude modulation
AMR         adaptive multi-rate
ANN         artificial neural network
ANOVA       analysis of variance
AoS         acceptability of service
ASR         automatic speech recognition
ATIS        air travel information system
AVM         attribute-value matrix
AVP         attribute-value pair
BP          bandpass
BT          British Telecom
CART        classification and regression tree
CELP        code-excited linear prediction
CLID        cluster identification
CNET        Centre National d’Etudes des Télécommunications
CNRS        Centre National de la Recherche Scientifique
CP          cooperativity principle
CSELT       Centro Studi e Laboratori Telecomunicazioni
CSLU        Center for Spoken Language Understanding
CSR         continuous speech recognition
CVC         consonant-vowel-consonant
DARPA       Defense Advanced Research Projects Agency
DDL         dialogue description language
DP          dynamic programming
DR          design rationale
DRT         diagnostic rhyme test
DSD         design space development
DTMF        dual tone multiple frequency
EAGLES      European Advisory Group on Language Engineering Standards
EER         equal error rate
ELDA        Evaluation and Language Resources Distribution Agency
ELRA        European Language Resources Association
EPFL        Ecole Polytechnique Fédérale de Lausanne
ETSI        European Telecommunications Standards Institute
EURESCOM    European Institute for Research and Strategic Studies in Telecommunications
FUB         Fondazione Ugo Bordoni
GG          generic cooperativity guideline
GSM         global system for mobile communication
GSM-EFR     GSM enhanced full-rate
GSM-FR      GSM full-rate
GSM-HR      GSM half-rate
GUI         graphical user interface
HENR        human equivalent noise ratio
HHI         human-to-human interaction
HMI         human-machine interaction
HMM         hidden Markov model
HTK         hidden Markov model toolkit
HTML        hypertext markup language
IDIAP       Institut dalle Molle d’Intelligence Artificielle Perceptive
IEC         International Electrotechnical Commission
IKA         Institut für Kommunikationsakustik
INT, [...]  mean (normalized) rating on an intelligibility scale
IP          internet protocol
IRS         intermediate reference system
IRU         informally redundant utterance
ISCA        International Speech Communication Association
ISDN        integrated services digital network
ISO         International Organization for Standardization
ITU         International Telecommunication Union
ITU-T       Telecommunication Standardization Sector of the ITU
IVR         interactive voice response
LD-CELP     low-delay code-excited linear prediction
LDC         Linguistic Data Consortium
LPC         linear predictive coding
MAPSSWE     Matched-Pair-Sentence-Segment-Word-Error test
MFCC        mel-frequency cepstral coefficient
MIT         Massachusetts Institute of Technology
MLP         maximum likelihood process
MN          McNemar test
MNRU        modulated noise reference unit
MOS, [...]  mean opinion score (normalized)
[...]       mean opinion score on a listening-effort scale (normalized)
MRT         modified rhyme test
NIST        National Institute of Standards and Technology
NLP         natural language processing
OOV         out-of-vocabulary
PARADISE    paradigm for dialogue system evaluation
PBX         private branch exchange
PCM         pulse code modulation
PESQ        perceptual evaluation of speech quality
PIN         personal identification number
PLP         perceptual linear predictive
PNAMBIC     pay no attention to the man behind the curtain (see WoZ)
PSOLA       pitch-synchronous overlap and add
PSTN        public switched telephone network
QOC         questions-options-criteria rationale
QoS         quality of service
RAD         rapid application developer
RAMOS       recognizer assessment by manipulation of speech
RASTA       relative spectral
ROC         receiver operating curves
RPE-LTP     regular pulse excitation long term prediction
SALT        speech application language tags
SASSI       subjective assessment of speech system interfaces
SDL         specification and description language
SDS         spoken dialogue system
SG          specific cooperativity guideline
SI          speaker identification
SLDS        spoken language dialogue system
SNR         signal-to-noise ratio
SPL         sound pressure level
SPSS        Statistical Package for the Social Sciences
SQL         structured query language
SUS         semantically unpredictable sentence
SV          speaker verification
TC STQ      Technical Committee Speech Processing, Transmission and Quality Aspects
Tcl/Tk      tool command language and tool kit
TCP/IP      transmission control protocol / internet protocol
TCU         turn-constructional unit
TDMA        time division multiple access
TFW         time-and-frequency warping
TIPHON      Telecommunications and Internet Protocol Harmonization Over Networks
TMF         time-frequency-modulation
TOSQA       Telekom objective speech quality assessment
TRP         transition-relevant place
TTS         text-to-speech
UMTS        universal mobile telecommunications system
VoiceXML    voice extensible markup language
VoIP        voice over internet protocol
VSELP       vector sum excited linear prediction
VUI         voice user interface
WoZ         Wizard-of-Oz
WSR         Wilcoxon signed rank test
XML         extensible markup language
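Several of the quantities listed under Definitions are simple ratios of counts. The sketch below illustrates three of them using common textbook formulas, which are an assumption here: the definitions list does not spell the formulas out, and the book's own Appendix A may define them differently. The function and parameter names are hypothetical. Word error rate is taken as the sum of substituted, deleted and inserted words divided by the number of reference words W, word accuracy as its complement, and the kappa coefficient as the agreement rate P(A) corrected for the chance agreement rate P(E).

```python
def word_error_rate(substitutions, deletions, insertions, reference_words):
    """WER as a fraction: (S + D + I) / W, with W the number of reference words.

    Assumed standard formula; multiply by 100 for a percentage.
    """
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


def word_accuracy(substitutions, deletions, insertions, reference_words):
    """WA as the complement of WER; it turns negative when errors exceed W."""
    return 1.0 - word_error_rate(
        substitutions, deletions, insertions, reference_words
    )


def kappa(p_actual, p_chance):
    """Chance-corrected agreement: (P(A) - P(E)) / (1 - P(E))."""
    if p_chance >= 1.0:
        raise ValueError("p_chance must be below 1")
    return (p_actual - p_chance) / (1.0 - p_chance)
```

With 2 substitutions, 1 deletion and 1 insertion against 20 reference words, this gives a word error rate of 4/20 and a word accuracy of 16/20. Because insertions count as errors, the word accuracy computed this way is not bounded below by zero.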
Chapter 1
MOTIVATION AND INTRODUCTION
Modern telecommunication networks promise to provide ubiquitous access
to multimedia information and communication services. In order to increase
the number of their users, telephone network operators create new speech inter-
action services for communication, information, transaction and E-commerce,
via an interconnected global network of wireline and mobile trunks. For mobile
network operators, speech-based services are a key means of differentiating
themselves from other operators. Other companies are cutting costs by automating call
centers and customer-service operations, and can improve internal operations
via web- and telephone-based services, especially for mobile workers. The
Gartner Group expects about one third of automated telephone lines to be
equipped with automatic speech recognition capabilities by 2003 (Thyfault, 1999).
Apart from the significant advances which have been made in speech and
language technology during the last twenty years, the possible economic benefit
for the service operators has been a key driving force for this development.
Following the argumentation of Whittaker and Attwater (1995), speech-based
systems help to
enable market differentiation,
exploit revenue opportunities,
improve the quality of existing services,
improve the accessibility of services,
reduce the cost of service provision, and
free up people to concentrate on high-value tasks.
These reasons can be decisive for companies and service operators to integrate
speech and language technology into their services. Railway information can
be seen as a typical example: Based on a study of 130 information offices
in six countries (Billi and Lamel, 1997), over 100 million calls were handled
per year, with at least another 10 million calls remaining unanswered. About
91% of the callers solely asked for information, and only 9% performed a
reservation task. It was estimated that over 90% of the calls could be handled
by an automatic system with a recognition capability of 400 city names, and over
95% by a system with a capability of 500 city names. Thus, automatic services
seem to be a very economical solution for handling such tasks. They help to
reduce waiting time and extend opening hours. The negative impact on the
employment situation should however not be disregarded.
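The scale of these figures is easier to see when worked through. The following is a rough sketch only: it treats the quoted "over" and "at least" figures as exact lower bounds and assumes that the coverage percentages apply to the handled calls, which the study summary above does not state explicitly. All variable and function names are hypothetical.

```python
# Lower-bound figures quoted from the study (Billi and Lamel, 1997).
HANDLED_CALLS = 100_000_000      # calls handled per year
UNANSWERED_CALLS = 10_000_000    # calls remaining unanswered per year


def automatable_calls(calls, coverage_percent):
    """Number of calls an automatic system could handle at a given coverage."""
    return calls * coverage_percent // 100


# Coverage by vocabulary size, as estimated in the study.
with_400_names = automatable_calls(HANDLED_CALLS, 90)
with_500_names = automatable_calls(HANDLED_CALLS, 95)

# Calls gained by adding 100 city names to the recognizer vocabulary.
gain_from_larger_vocabulary = with_500_names - with_400_names

# Share of all call attempts that currently go unanswered.
unanswered_share = UNANSWERED_CALLS / (HANDLED_CALLS + UNANSWERED_CALLS)

print(with_400_names, with_500_names, gain_from_larger_vocabulary)
print(f"{unanswered_share:.1%} of call attempts unanswered")
```

Under these assumptions, extending the vocabulary from 400 to 500 city names would cover a further 5 million calls per year, and roughly one in eleven call attempts currently goes unanswered.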
Amongst all potential application areas of spoken dialogue systems, it is the
telecommunication sector which has provided the most powerful impetus for
research on practical systems to date (Fraser and Dalsgaard, 1996). From a tele-
com operator’s point of view, the new services differ in three relevant aspects
from traditional ones (Kamm et al., 1997b). On the service side, traditional
voice telephony has been supplemented by the integrated transmission of voice, audio,
image, video, text and data, in fixed and mobile application situations. On the
transmission technology side, analogue narrow-band wireline transmission has
been replaced by a mix of wireline and wireless networks, using analogue or
digital representations, different transmission bandwidths, and different media
such as copper, fiber, radio cells, satellite or power lines. On the communication
side, the model changed from a purely human-to-human communication to an
interaction partly between humans, and partly between humans and machines.
These changes have consequences for the developers of spoken dialogue sys-
tems, for transmission network operators, and of course for the end users.
Interactive speech systems are “computer systems with which humans inter-
act on a turn-by-turn basis” (Fraser, 1997, p. 564). They enable and support the
communication of information between two parties (agents), mostly between a
human user and a machine agent. Here, only those systems will be addressed in
which spoken language plays a significant role as an interaction modality. According
to Dybkjær and Bernsen (2000), p. 244, the most advanced commercial
systems
“have a vocabulary of several thousand words; understand speaker-independent spon-
taneous speech; do complex linguistic processing of the user’s input; handle shifts in
initiative; have quite complex dialogue management abilities including, e.g. reasoning
based on the user’s input, consultation of the recorded history of the dialogue so far, and
graceful degradation of the dialogue when faced with users who are difficult to under-
stand; carry out linguistic processing of the output to be generated; solve several tasks,
and not just one; and robustly carry out medium-length dialogues to provide the user
with, for instance, train timetable information on the departures and arrivals of trains
between hundreds of cities”.
Not all of these characteristics need to be satisfied. The following investigations
focus on systems which accept continuously spoken input
from different speakers, allow initiative to be taken by both the user and the
system, and which are capable of reasoning, correction, meta-communication
(communication about communication), anticipation, and prediction. These
systems are called ‘spoken dialogue systems’ (SDSs), in some literature also
‘spoken language dialogue systems’ (SLDSs). They have to be differentiated
from systems with more restricted capabilities, e.g. command systems or sys-
tems accepting only dialling tones as an input. A categorization of interactive
speech systems will be given in Section 2.1.3.
Most of the currently available systems enable a task-orientated dialogue, i.e.
the goal of the interaction is fixed to a specific task which can only be reached if
both interaction partners cooperate. This type of interaction is obviously very
restricted, and it should not be confused with a normal communication situation
between humans. In task-orientated dialogues, the structure of the task was
shown to carry a significant influence on the structure of the dialogue (Grosz,
1977), and this structure is a prerequisite for systems whose speech recognition
and understanding capabilities are still very limited. In practical cases, this
restriction is however not too severe, because task-orientated dialogues are
highly relevant for commercial applications.
Spoken dialogue systems can be seen as speech-based user interfaces (so-
called voice user interfaces, VUIs) to application system back-ends, and they
will thus compete with other types of interfaces, namely with graphical user
interfaces (GUIs). GUIs have the advantage of providing immediate feed-
back, reversible operations, incrementality, and of supporting rapid scanning
and browsing of information. Because the visual information may easily and
immediately indicate all options which are available at a specific point in the
interaction, GUIs are relatively easy to use for novices. Spoken language inter-
faces, on the other hand, show the inherent limitations of the sequential channel
for delivering information, of requiring the user to learn the language the sys-
tem can understand, of hiding available command options, and of leading to
unrealistic expectations as to their capabilities (Walker et al., 1998a). Speech
is perceptually transient rather than static. This implies that the user has to pick
up the information provided by the system immediately, or he/she will miss it
completely.
These arguments against SDSs are however only valid when such systems
just mimic GUIs. Human-to-human interaction via spoken dialogue shows that
humans are usually able to cope with the modality limitations very well. Even
better, spoken language is able to surpass several weaknesses which are inherent
to direct manipulation interfaces like GUIs (Cohen, 1992, p. 144):
“Merely allowing the users to select currently displayed entities provides them little
support for identifying objects not on the screen, for specifying temporal relations, for
identifying and operating on large sets and subsets of entities, and for using the context
of interaction. What is missing is a way for users to describe entities, by which it is
meant the use of an expression in a language (natural or artificial) to denote or pick out
an object, set, time period, and so forth.”
It seems that the limitations of speech-based interfaces can and have to be
addressed by an appropriate system design, and that in this way interfaces
offering a high utility and quality to their users can be set up. Some general
design principles are already well understood for GUIs, and should also be
taken into account in SDS design, e.g. to represent objects and actions continu-
ously, or to allow rapid, incremental, reversible operations on objects which are
immediately acknowledged (Shneiderman, 1992; Kamm and Walker, 1997).
These principles reflect to some extent the limitations of the human memory
and cognitive and sensory processing. In SDSs, a continuous representation
can be reached by using consistent vocabulary throughout the dialogue, or by
providing additional help information in case of time-outs. Immediate feedback
can be provided by explicit or implicit confirmation, and by allowing barge-in.
Summarization might be necessary at some points in the interaction in order to
respect the human auditory memory limitations.
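How these feedback principles might translate into a dialogue strategy can be sketched in a few lines. The example below is purely illustrative and is not taken from any system described in this book: the function name, the confidence score, the threshold of 0.7 and the prompt wordings are all assumptions. It returns a help prompt after a time-out, an explicit confirmation when the recognizer's confidence is low, and an implicit confirmation (echoing the understood value before the next question) otherwise.

```python
# Hypothetical help text; a real system would tailor it to the dialogue state.
HELP_PROMPT = "You can name a city, for example: Bochum."


def next_system_prompt(value, next_question, confidence,
                       timed_out=False, threshold=0.7):
    """Choose the next system prompt from the recognition outcome.

    value         -- the value the system believes it understood
    next_question -- the question the system wants to ask next
    confidence    -- assumed recognizer confidence in [0, 1]
    timed_out     -- True if the user did not answer in time
    threshold     -- illustrative boundary between explicit and implicit
    """
    if timed_out:
        # Time-out: give additional help information, then ask again.
        return f"{HELP_PROMPT} {next_question}"
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < threshold:
        # Explicit confirmation: a dedicated yes/no question.
        return f"Did you say {value}?"
    # Implicit confirmation: echo the value and move on.
    return f"{value}. {next_question}"


print(next_system_prompt("Bochum", "When do you want to travel?", 0.9))
print(next_system_prompt("Bochum", "When do you want to travel?", 0.5))
```

Explicit confirmation costs an extra turn but makes errors easy to correct, whereas implicit confirmation keeps the dialogue short at the risk of the user overlooking a misrecognition. Where the threshold belongs is an empirical question for each recognizer and task.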
Before developing a spoken dialogue system, it has to be decided whether
speech is the right modality for the application under consideration, and for
the individual tasks to be carried out. For example, users will not like to say
their PIN code aloud to a cash machine in the street, and long timetable
lists are better displayed visually. The decision on an appropriate modality can
be taken in a systematic way by using modality properties, as proposed
by Bernsen (1997) and Bernsen et al. (1998). If speech is not sufficient
as a unique modality, multimodal systems may be a better solution (Fellbaum
and Ketzmerick, 2002). Such systems are able to handle several input and
output devices addressing different media in parallel. A user may interact with
the system using different modalities of input and output, and combinations of
modalities are possible. For example, a user may point to a touchscreen device
and ask “How can I get there?”. Or a system may display a route on the screen
and inform the user: “You have to turn right at this point!”. Cohen (1992),
p. 143, pointed out that a major advantage of multimodal user interfaces is “to
use the strengths of one modality to overcome weaknesses of another”.
Still, the speech modality will remain an essential element in multimedia
communication services. This fact is underlined by the strong persistence of
unimodal narrow-band telephone services even in networks which would al-
low for wideband and audio-visual services. Remote access to information is
highly desirable (in privacy, but also for mobile workers), and the telephone is
a lightweight and ubiquitous form of access which is available to nearly every-
one. Speech is also the only modality to address devices which are physically
very small, or which are desired to be invisible. Thus, it can be expected that
speech will continue to persist as the main modality in human-to-human com-
