Speech and Audio Research Laboratory of the SAIVT program
Centre for Built Environment and Engineering Research
ACOUSTIC KEYWORD SPOTTING
IN SPEECH WITH APPLICATIONS
TO DATA MINING
A. J. Kishan Thambiratnam
BE(Electronics)/BInfTech
SUBMITTED AS A REQUIREMENT OF
THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
QUEENSLAND UNIVERSITY OF TECHNOLOGY
BRISBANE, QUEENSLAND
9 MARCH 2005
Keywords
Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification
Abstract
Keyword spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability.
This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.
The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high-level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, particularly for the problematic class of short-duration target words.
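
To make the cohort comparison concrete, the following minimal Python sketch (an illustration only, not the implementation described in this thesis) scores a putative keyword occurrence against the acoustic scores of its phonetically similar cohort words. The function name and all log-likelihood values are invented for the example.

    def cohort_confidence(keyword_loglik, cohort_logliks):
        """Compare the keyword's acoustic log-likelihood against the mean
        log-likelihood of its phonetically similar cohort words."""
        cohort_mean = sum(cohort_logliks) / len(cohort_logliks)
        return keyword_loglik - cohort_mean  # above a threshold => accept

    # A true hit should outscore words that merely sound similar
    # (all values here are fabricated for illustration).
    score = cohort_confidence(-1520.4, [-1570.2, -1555.8, -1602.1])
    print(f"confidence = {score:.1f}")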
The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains a significant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds.
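
As a rough illustration of the dynamic matching at the heart of this technique, the Python sketch below applies a Levenshtein-style minimum edit distance (MED) to phone sequences assumed to have already been read out of a recognition lattice. The unit costs and the threshold are placeholders; the thesis instead tunes per-phone substitution costs and a cost threshold Smax.

    def med(query, observed, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
        """Levenshtein-style minimum edit distance between two phone sequences."""
        m, n = len(query), len(observed)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = i * del_cost
        for j in range(1, n + 1):
            d[0][j] = j * ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0.0 if query[i - 1] == observed[j - 1] else sub_cost
                d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                              d[i - 1][j] + del_cost,  # phone dropped by recogniser
                              d[i][j - 1] + ins_cost)  # spurious phone inserted
        return d[m][n]

    # Accept lattice phone sequences whose distance to the query falls below
    # a threshold, so erroneous realisations of a keyword can still be found.
    S_MAX = 1.0
    query = ["s", "t", "aa", "k"]  # "stock"
    for seq in (["s", "t", "aa", "k"], ["s", "d", "aa", "k"], ["s", "ow", "p"]):
        print(seq, "hit" if med(query, seq) <= S_MAX else "miss")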
The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.
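
The following is a minimal sketch of score-level fusion, assuming each verifier emits a single confidence score per putative occurrence. The thesis fuses verifier outputs with a neural network; this logistic combiner with invented weights merely illustrates the principle.

    import math

    def fused_decision(sbm_score, cohort_score, w=(0.7, 1.3), bias=-0.2):
        """Map two verifier scores to a single accept/reject decision."""
        z = w[0] * sbm_score + w[1] * cohort_score + bias
        return 1.0 / (1.0 + math.exp(-z)) >= 0.5  # sigmoid plus fixed threshold

    print(fused_decision(0.4, 0.9))   # both verifiers fairly confident -> True
    print(fused_decision(-0.8, 0.1))  # conflicting evidence -> False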
The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.
Contents

Keywords
Abstract
List of Tables
List of Figures
List of Abbreviations
Authorship
Acknowledgments
1 Introduction
  1.1 Overview
    1.1.1 Aims and Objectives
    1.1.2 Research Scope
  1.2 Thesis Organisation
  1.3 Major Contributions of this Research
  1.4 List of Publications
2 A Review of Keyword Spotting
  2.1 Introduction
  2.2 The keyword spotting problem
  2.3 Applications of keyword spotting
    2.3.1 Keyword monitoring applications
    2.3.2 Audio document indexing
    2.3.3 Command controlled devices
    2.3.4 Dialogue systems
  2.4 The development of keyword spotting
    2.4.1 Sliding window approaches
    2.4.2 Non-keyword model approaches
    2.4.3 Hidden Markov Model approaches
    2.4.4 Further developments
  2.5 Performance Measures
    2.5.1 The reference and result sets
    2.5.2 The hit operator
    2.5.3 Miss rate
    2.5.4 False alarm rate
    2.5.5 False acceptance rate
    2.5.6 Execution time
    2.5.7 Figure of Merit
    2.5.8 Equal Error Rate
    2.5.9 Receiver Operating Characteristic Curves
    2.5.10 Detection Error Trade-off Plots
  2.6 Unconstrained vocabulary spotting
    2.6.1 HMM-based approach
    2.6.2 Neural Network Approaches
  2.7 Approaches to non-keyword modeling
    2.7.1 Speech background model
    2.7.2 Phone models
    2.7.3 Uniform distribution
    2.7.4 Online garbage model
  2.8 Constrained vocabulary spotting
    2.8.1 Language model approaches
    2.8.2 Event spotting
  2.9 Keyword verification
    2.9.1 A formal definition
    2.9.2 Combining keyword spotting and verification
    2.9.3 The problem of short duration keywords
    2.9.4 Likelihood ratio based approaches
    2.9.5 Alternate Information Sources
  2.10 Audio Document Indexing
    2.10.1 Limitations of the Speech-to-Text Transcription approach
    2.10.2 Reverse dictionary lookup searches
    2.10.3 Indexed reverse dictionary lookup searches
    2.10.4 Lattice based searches
3 HMM-based spotting and verification
  3.1 Introduction
  3.2 The confusability circle framework
  3.3 Analysis of non-keyword models
    3.3.1 All-speech models
    3.3.2 SBM methods
    3.3.3 Phone-set methods
    3.3.4 Target-word-excluding methods
  3.4 Evaluation of keyword spotting techniques
    3.4.1 Experiment setup
    3.4.2 Results
  3.5 Tuning the phone set non-keyword model
  3.6 Output score thresholding for SBM spotting
  3.7 Performance across keyword length
    3.7.1 Evaluation sets
    3.7.2 Results
  3.8 HMM-based keyword verification
    3.8.1 Evaluation set
    3.8.2 Evaluation procedure
    3.8.3 Results
  3.9 Discriminative background model KV
    3.9.1 System architecture
    3.9.2 Results
  3.10 Summary and Conclusions
4 Cohort word keyword verification
  4.1 Introduction
  4.2 Foundational concepts
    4.2.1 Cohort-based scoring
    4.2.2 The use of language information
  4.3 Overview of the cohort word technique
  4.4 Cohort word set construction
    4.4.1 The choice of dmin and dmax
    4.4.2 Cohort word set downsampling
    4.4.3 Distance function
  4.5 Classification approach
    4.5.1 2-class classification approach
    4.5.2 Hybrid N-class approach
  4.6 Summary of the cohort word algorithm
  4.7 Comparison of classifier approaches
    4.7.1 Evaluation set
    4.7.2 Recogniser parameters
    4.7.3 Cohort word selection
    4.7.4 Evaluation procedure
    4.7.5 Results
  4.8 Performance across target keyword length
    4.8.1 Evaluation set
    4.8.2 Recogniser parameters
    4.8.3 Results
    4.8.4 Analysis of poor 8-phone performance
    4.8.5 Conclusions
  4.9 Effects of selection parameters
    4.9.1 Cohort word set downsampling
    4.9.2 Cohort word selection range
    4.9.3 MED cost parameters
    4.9.4 Conclusions
  4.10 Fused cohort word systems
    4.10.1 Training dataset
    4.10.2 Neural network architecture
    4.10.3 Experimental procedure
    4.10.4 Baseline unfused results
    4.10.5 Fused SBM-CW experiments
    4.10.6 Fused CW-CW experiments
    4.10.7 Comparison of fused and unfused systems
  4.11 Conclusions and Summary
5 Dynamic Match Lattice Spotting
  5.1 Introduction
  5.2 Motivation
  5.3 Dynamic Match Lattice Spotting method
    5.3.1 Basic method
    5.3.2 Optimised Dynamic Match Lattice Search
  5.4 Evaluation of DMLS performance
    5.4.1 Evaluation set
    5.4.2 Recogniser parameters
    5.4.3 Lattice building
    5.4.4 Query-time processing
    5.4.5 Baseline systems
    5.4.6 Evaluation procedure
    5.4.7 Results
  5.5 Analysis of dynamic match rules
    5.5.1 System configurations
    5.5.2 Results
  5.6 Analysis of DMLS algorithm parameters
    5.6.1 Number of lattice generation tokens
    5.6.2 Pruning beamwidth
    5.6.3 Number of lattice traversal tokens
    5.6.4 MED cost threshold
    5.6.5 Tuned systems
    5.6.6 Conclusions
  5.7 Conversational telephone speech experiments
    5.7.1 Evaluation set
    5.7.2 Recogniser parameters
    5.7.3 Results
  5.8 Non-destructive optimisations
    5.8.1 Prefix sequence optimisation
    5.8.2 Early stopping optimisation
    5.8.3 Combining optimisations
  5.9 Optimised system timings
    5.9.1 Experimental procedure
    5.9.2 Results
  5.10 Summary
6 Non-English Spotting
  6.1 Introduction
  6.2 The issue of limited resources
  6.3 The role of keyword spotting
  6.4 Experiment setup
    6.4.1 Database design
    6.4.2 Model architectures
    6.4.3 Evaluation set design
    6.4.4 Evaluation procedure
  6.5 English and Spanish stage 1 evaluations
  6.6 English and Spanish post keyword verification
  6.7 Indonesian spotting and verification
  6.8 Extrapolating Indonesian performance
  6.9 Summary and Conclusions
7 Summary, Conclusions and Future Work
  7.1 HMM-based Spotting and Verification
    7.1.1 Conclusions
    7.1.2 Future Work
  7.2 Cohort Word Verification
    7.2.1 Conclusions
    7.2.2 Future Work
  7.3 Dynamic Match Lattice Spotting
    7.3.1 Conclusions
    7.3.2 Future Work
  7.4 Non-English Spotting
    7.4.1 Conclusions
  7.5 Final Comments
Bibliography
A The Levenshtein Distance
  A.1 Introduction
  A.2 Applications
  A.3 Algorithm
List of Tables

3.1 Keyword spotting performance of baseline systems on Switchboard 1 data
3.2 Effect of target word insertion penalty on PM-KS performance
3.3 Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS
3.4 Details of phone-length dependent evaluation sets
3.5 SBM-KS performance on Switchboard 1 data for different phone-length target words
3.6 Statistics for keyword verification evaluation sets
3.7 Equal error rates for SBM-based keyword verification
3.8 Equal error rates for SBM and MLP-SBM keyword verification
4.1 Evaluated cohort word selection parameters
4.2 Performance of selected cohort word KV systems on the TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format {dmin, dmax, ψd, ψi}.
4.3 Performance of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format {dmin, dmax, ψd, ψi}.
4.4 Mean and standard deviation of the number of cohort words used in the 3 best performing cohort word KV methods for the SWB1 evaluation set
4.5 Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets
4.6 Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets
4.7 Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets
4.8 Correlation analysis of fused EER and individual unfused EER
4.9 Summary of best performing systems
5.1 Phone substitution costs for DMLS
5.2 Baseline keyword spotting results evaluated on TIMIT
5.3 TIMIT performance when isolating various DP rules
5.4 Effect of adjusting number of lattice generation tokens
5.5 Effect of adjusting pruning beamwidth
5.6 Effect of adjusting number of traversal tokens
5.7 Effect of adjusting MED cost threshold Smax
5.8 Optimised DMLS configurations evaluated on TIMIT
5.9 Keyword spotting results on SWB1
5.10 Relative speeds of optimised DMLS systems
5.11 Performance of a fully optimised DMLS system on Switchboard data
5.12 Summary of key results
6.1 Summary of training data sets
6.2 Codes used to refer to model architectures
6.3 Summary of evaluation data sets
6.4 Stage 1 spotting rates for various model sets and database sizes
6.5 Equal error rates after keyword verification for various model sets and training database sizes
6.6 Stage 1 spotting and stage 2 post verification results for S1I experiments
List of Figures

2.1 An example of a Receiver Operating Characteristic curve
2.2 An example of a Detection Error Trade-off plot
2.3 Recognition grammar for HMM-based keyword spotting
2.4 Sample recognition grammar for small non-keyword vocabulary keyword spotting
2.5 System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model
2.6 System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models
2.7 Constructing a recognition network for constrained vocabulary keyword spotting
2.8 An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted)
2.9 An event spotting network for detecting occurrences of times [16]
2.10 Likelihood ratio based keyword occurrence verification with multiple verifier fusion
2.11 Applying reverse dictionary searches to the detection of the word ACQUIRE in a phone stream
2.12 Example of indexed reverse dictionary searching for the detection of the word ACQUIRE
2.13 Using lattice based searching to locate instances of the word ACQUIRE within a phone lattice
3.1 Confusability circle for the target word STOCK
3.2 Example of the shared subevent confusable acoustic region for the keyword STOCK
3.3 Incorporating target word insertion penalty into HMM-based keyword spotting
3.4 DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS
3.5 DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets
3.6 DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets
3.7 System architecture for MLP background model based KV
3.8 DET plots for SBM and MLP-SBM systems for 4-phone words
3.9 DET plots for SBM and MLP-SBM systems for 6-phone words
3.10 DET plots for SBM and MLP-SBM systems for 8-phone words
4.1 Controlling the degree of CAR region modeling via dmin and dmax tuning
4.2 An N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w)
4.3 DET plot for best cohort word and SBM-KV systems on SWB1 4-phone length evaluation set
4.4 DET plot for best cohort word and SBM-KV systems on SWB1 6-phone length evaluation set
4.5 Equal error rate versus mean number of cohort words
4.6 Trends in equal error rate with changes in cohort word set downsampling size
4.7 Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV
4.8 Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV
4.9 Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV
4.10 Trends in equal error rate with changes in MED cost parameters
4.11 Correlation between unfused system performances and fused system performances
4.12 Boxplot of EERs for all evaluated architectures and phone-lengths
4.13 Boxplot of log(EERs) for all evaluated architectures and phone-lengths
5.1 Segment of phone lattice for an instance of the word STOCK
5.2 Effect of lattice traversal token parameter
5.3 Trends in miss rate and FA/kw rate performance for various types of tuning
5.4 Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard
5.5 The relationship between cost matrices for subsequences
5.6 Demonstration of the MED prefix optimisation algorithm
6.1 Effect of training dataset size on speech recognition [24]
6.2 Trends in miss rate across training database size
6.3 Trends in FA/kw rate across training database size
6.4 DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S
6.5 DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S
6.6 DET plot for M32 experiments. 1=M32S3E, 2=M32S2E, 3=M32S1E, 4=M32S2S, 5=M32S1S
6.7 Trends in EER across training dataset size
6.8 DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S
6.9 DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I
6.10 Extrapolations of Indonesian keyword spotting performance using larger sized databases
A.1 Example of a cost matrix calculated using the Levenshtein algorithm for transforming "deranged" to "hanged". Costs of substitutions, deletions and insertions are all fixed at 1; the cost of a match is fixed at 0.
List of Abbreviations

ADI      Audio Document Indexing
CAR      Confusable Acoustic Region
CLS      Conventional Lattice-based Spotting
CMS      Cepstral Mean Subtraction
CW       Cohort Word
DAR      Disparate Acoustic Region
DET      Detection Error Trade-off
DMLS     Dynamic Match Lattice Spotting
EER      Equal Error Rate
FA       False Alarm
GMM      Gaussian Mixture Model
HMM      Hidden Markov Model
IRDL     Indexed Reverse Dictionary Lookup
KS       Keyword Spotting
KV       Keyword Verification
LVCSR    Large Vocabulary Continuous Speech Recognition
MED      Minimum Edit Distance
MLP      Multi-Layer Perceptron
PLP      Perceptual Linear Prediction
RDL      Reverse Dictionary Lookup
ROC      Receiver Operating Characteristic
SBM      Speech Background Model
SBM-KS   Speech Background Model based Keyword Spotting
SBM-KV   Speech Background Model based Keyword Verification
STT      Speech-to-Text Transcription
SWB1     Switchboard-1
TAR      Target Acoustic Region
WSJ1     Wall Street Journal 1
Authorship

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher educational institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.
Signed:
Date: