
Speech and Audio Research Laboratory of the SAIVT program
Centre for Built Environment and Engineering Research

ACOUSTIC KEYWORD SPOTTING
IN SPEECH WITH APPLICATIONS
TO DATA MINING

A. J. Kishan Thambiratnam
BE(Electronics)/BInfTech

SUBMITTED AS A REQUIREMENT OF
THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
QUEENSLAND UNIVERSITY OF TECHNOLOGY
BRISBANE, QUEENSLAND
9 MARCH 2005



Keywords
Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification


Abstract
Keyword Spotting is the task of detecting keywords of interest within continuous
speech. The applications of this technology range from call centre dialogue
systems to covert speech surveillance devices. Keyword spotting is particularly
well suited to data mining tasks such as real-time keyword monitoring and
unrestricted vocabulary audio document indexing. However, to date, many keyword
spotting approaches have suffered from poor detection rates, high false alarm
rates, or slow execution times, thus reducing their commercial viability.
This work investigates the application of keyword spotting to data mining
tasks. The thesis makes a number of major contributions to the field of keyword
spotting.
The first major contribution is the development of a novel keyword verification
method named Cohort Word Verification. This method combines high-level
linguistic information with cohort-based verification techniques to obtain
dramatic improvements in verification performance, in particular for the
problematic short-duration target word class.
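
The method itself is specified in Chapter 4; the sketch below is only a rough
illustration of the underlying cohort idea, under assumptions of our own: a
cohort of phonetically similar lexicon words is selected using an edit distance
over phone sequences (cf. the dmin/dmax selection range of Section 4.4 and the
Levenstein distance of Appendix A), and a putative occurrence is accepted only
if the keyword model outscores every cohort competitor. The lexicon, scores and
margin here are hypothetical placeholders, not values from the thesis.

    # Illustrative sketch only: cohort word selection and a 2-class
    # verification test. Scores are assumed to be log-likelihoods from
    # hypothetical HMM scoring; this is not the thesis implementation.

    def edit_distance(a, b):
        """Unit-cost Levenstein distance between two phone sequences."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution/match
        return d[m][n]

    def cohort_set(keyword_phones, lexicon, d_min=1, d_max=3):
        """Select lexicon words phonetically close to the target keyword."""
        return [w for w, phones in lexicon.items()
                if d_min <= edit_distance(keyword_phones, phones) <= d_max]

    def verify(keyword_score, cohort_scores, margin=0.0):
        """Accept iff the keyword model outscores its best cohort rival."""
        return keyword_score - max(cohort_scores) >= margin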
The second major contribution is the development of a novel audio document
indexing technique named Dynamic Match Lattice Spotting. This technique
augments lattice-based audio indexing principles with dynamic sequence matching
techniques to provide robustness to erroneous lattice realisations. The
resulting algorithm obtains a significant improvement in detection rate over
lattice-based audio document indexing while still maintaining extremely fast
search speeds.
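
Chapter 5 defines the algorithm precisely; as a loose sketch only, the dynamic
matching step can be pictured as scoring each phone sequence read from a
pre-built lattice against the query pronunciation with a minimum edit distance
(MED) whose substitution costs may be phone dependent, and reporting a hit when
the score falls below a threshold Smax. The cost function and threshold below
are invented placeholders, not the tuned values reported in Chapter 5.

    # Illustrative sketch only: MED-based matching of lattice phone
    # sequences against a query, in the spirit of DMLS. `sub_cost` and
    # `s_max` are hypothetical placeholders.

    def med(query, observed, sub_cost, del_cost=1.0, ins_cost=1.0):
        """Minimum edit distance with phone-dependent substitution costs."""
        m, n = len(query), len(observed)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + del_cost
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + del_cost,
                              d[i][j - 1] + ins_cost,
                              d[i - 1][j - 1] + sub_cost(query[i - 1],
                                                         observed[j - 1]))
        return d[m][n]

    def dmls_search(query_phones, lattice_sequences, sub_cost, s_max=2.0):
        """Return lattice phone sequences matching the query within s_max."""
        return [seq for seq in lattice_sequences
                if med(query_phones, seq, sub_cost) <= s_max]

    def unit_sub(a, b):
        # Equal unit cost for any mismatch; a confusability-derived cost
        # table would take this role in practice.
        return 0.0 if a == b else 1.0

In this picture the expensive lattice decoding is performed once, offline, and
query-time work reduces to cheap sequence comparisons, which is what permits
the fast search speeds claimed above.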
The third major contribution is the study of multiple verifier fusion for the task
of keyword verification. The reported experiments demonstrate that substantial
improvements in verification performance can be obtained through the fusion
of multiple keyword verifiers. The research focuses on combinations of speech
background model based verifiers and cohort word verifiers.
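
Chapter 4 performs this fusion with a multi-layer perceptron trained on
verifier output scores; the fragment below is merely a minimal stand-in showing
the shape of score-level fusion, with a single logistic unit and made-up
weights in place of a trained network.

    # Illustrative sketch only: score-level fusion of an SBM-based
    # verifier and a cohort word verifier. The weights are invented for
    # illustration, not trained values from the thesis.
    import math

    def fused_confidence(sbm_score, cohort_score, w=(1.0, 1.0), b=0.0):
        """Map two verifier scores to a single confidence in [0, 1]."""
        z = w[0] * sbm_score + w[1] * cohort_score + b
        return 1.0 / (1.0 + math.exp(-z))

    def accept(sbm_score, cohort_score, threshold=0.5):
        return fused_confidence(sbm_score, cohort_score) >= threshold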
The final major contribution is a comprehensive study of the effects of limited
training data for keyword spotting. This study is performed with consideration
as to how these effects impact the immediate development and deployment of
speech technologies for non-English languages.




Contents

Keywords
Abstract
List of Tables
List of Figures
List of Abbreviations
Authorship
Acknowledgments

1 Introduction
  1.1 Overview
    1.1.1 Aims and Objectives
    1.1.2 Research Scope
  1.2 Thesis Organisation
  1.3 Major Contributions of this Research
  1.4 List of Publications

2 A Review of Keyword Spotting
  2.1 Introduction
  2.2 The keyword spotting problem
  2.3 Applications of keyword spotting
    2.3.1 Keyword monitoring applications
    2.3.2 Audio document indexing
    2.3.3 Command controlled devices
    2.3.4 Dialogue systems
  2.4 The development of keyword spotting
    2.4.1 Sliding window approaches
    2.4.2 Non-keyword model approaches
    2.4.3 Hidden Markov Model approaches
    2.4.4 Further developments
  2.5 Performance Measures
    2.5.1 The reference and result sets
    2.5.2 The hit operator
    2.5.3 Miss rate
    2.5.4 False alarm rate
    2.5.5 False acceptance rate
    2.5.6 Execution time
    2.5.7 Figure of Merit
    2.5.8 Equal Error Rate
    2.5.9 Receiver Operating Characteristic Curves
    2.5.10 Detection Error Trade-off Plots
  2.6 Unconstrained vocabulary spotting
    2.6.1 HMM-based approach
    2.6.2 Neural Network Approaches
  2.7 Approaches to non-keyword modeling
    2.7.1 Speech background model
    2.7.2 Phone models
    2.7.3 Uniform distribution
    2.7.4 Online garbage model
  2.8 Constrained vocabulary spotting
    2.8.1 Language model approaches
    2.8.2 Event spotting
  2.9 Keyword verification
    2.9.1 A formal definition
    2.9.2 Combining keyword spotting and verification
    2.9.3 The problem of short duration keywords
    2.9.4 Likelihood ratio based approaches
    2.9.5 Alternate Information Sources
  2.10 Audio Document Indexing
    2.10.1 Limitations of the Speech-to-Text Transcription approach
    2.10.2 Reverse dictionary lookup searches
    2.10.3 Indexed reverse dictionary lookup searches
    2.10.4 Lattice based searches

3 HMM-based spotting and verification
  3.1 Introduction
  3.2 The confusability circle framework
  3.3 Analysis of non-keyword models
    3.3.1 All-speech models
    3.3.2 SBM methods
    3.3.3 Phone-set methods
    3.3.4 Target-word-excluding methods
  3.4 Evaluation of keyword spotting techniques
    3.4.1 Experiment setup
    3.4.2 Results
  3.5 Tuning the phone set non-keyword model
  3.6 Output score thresholding for SBM spotting
  3.7 Performance across keyword length
    3.7.1 Evaluation sets
    3.7.2 Results
  3.8 HMM-based keyword verification
    3.8.1 Evaluation set
    3.8.2 Evaluation procedure
    3.8.3 Results
  3.9 Discriminative background model KV
    3.9.1 System architecture
    3.9.2 Results
  3.10 Summary and Conclusions

4 Cohort word keyword verification
  4.1 Introduction
  4.2 Foundational concepts
    4.2.1 Cohort-based scoring
    4.2.2 The use of language information
  4.3 Overview of the cohort word technique
  4.4 Cohort word set construction
    4.4.1 The choice of dmin and dmax
    4.4.2 Cohort word set downsampling
    4.4.3 Distance function
  4.5 Classification approach
    4.5.1 2-class classification approach
    4.5.2 Hybrid N-class approach
  4.6 Summary of the cohort word algorithm
  4.7 Comparison of classifier approaches
    4.7.1 Evaluation set
    4.7.2 Recogniser parameters
    4.7.3 Cohort word selection
    4.7.4 Evaluation procedure
    4.7.5 Results
  4.8 Performance across target keyword length
    4.8.1 Evaluation set
    4.8.2 Recogniser parameters
    4.8.3 Results
    4.8.4 Analysis of poor 8-phone performance
    4.8.5 Conclusions
  4.9 Effects of selection parameters
    4.9.1 Cohort word set downsampling
    4.9.2 Cohort word selection range
    4.9.3 MED cost parameters
    4.9.4 Conclusions
  4.10 Fused cohort word systems
    4.10.1 Training dataset
    4.10.2 Neural network architecture
    4.10.3 Experimental procedure
    4.10.4 Baseline unfused results
    4.10.5 Fused SBM-CW experiments
    4.10.6 Fused CW-CW experiments
    4.10.7 Comparison of fused and unfused systems
  4.11 Conclusions and Summary

5 Dynamic Match Lattice Spotting
  5.1 Introduction
  5.2 Motivation
  5.3 Dynamic Match Lattice Spotting method
    5.3.1 Basic method
    5.3.2 Optimised Dynamic Match Lattice Search
  5.4 Evaluation of DMLS performance
    5.4.1 Evaluation set
    5.4.2 Recogniser parameters
    5.4.3 Lattice building
    5.4.4 Query-time processing
    5.4.5 Baseline systems
    5.4.6 Evaluation procedure
    5.4.7 Results
  5.5 Analysis of dynamic match rules
    5.5.1 System configurations
    5.5.2 Results
  5.6 Analysis of DMLS algorithm parameters
    5.6.1 Number of lattice generation tokens
    5.6.2 Pruning beamwidth
    5.6.3 Number of lattice traversal tokens
    5.6.4 MED cost threshold
    5.6.5 Tuned systems
    5.6.6 Conclusions
  5.7 Conversational telephone speech experiments
    5.7.1 Evaluation set
    5.7.2 Recogniser parameters
    5.7.3 Results
  5.8 Non-destructive optimisations
    5.8.1 Prefix sequence optimisation
    5.8.2 Early stopping optimisation
    5.8.3 Combining optimisations
  5.9 Optimised system timings
    5.9.1 Experimental procedure
    5.9.2 Results
  5.10 Summary

6 Non-English Spotting
  6.1 Introduction
  6.2 The issue of limited resources
  6.3 The role of keyword spotting
  6.4 Experiment setup
    6.4.1 Database design
    6.4.2 Model architectures
    6.4.3 Evaluation set design
    6.4.4 Evaluation procedure
  6.5 English and Spanish stage 1 evaluations
  6.6 English and Spanish post keyword verification
  6.7 Indonesian spotting and verification
  6.8 Extrapolating Indonesian performance
  6.9 Summary and Conclusions

7 Summary, Conclusions and Future Work
  7.1 HMM-based Spotting and Verification
    7.1.1 Conclusions
    7.1.2 Future Work
  7.2 Cohort Word Verification
    7.2.1 Conclusions
    7.2.2 Future Work
  7.3 Dynamic Match Lattice Spotting
    7.3.1 Conclusions
    7.3.2 Future Work
  7.4 Non-English Spotting
    7.4.1 Conclusions
  7.5 Final Comments

Bibliography

A The Levenstein Distance
  A.1 Introduction
  A.2 Applications
  A.3 Algorithm


List of Tables

3.1 Keyword spotting performance of baseline systems on Switchboard 1 data
3.2 Effect of target word insertion penalty on PM-KS performance
3.3 Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS
3.4 Details of phone-length dependent evaluation sets
3.5 SBM-KS performance on Switchboard 1 data for different phone-length target words
3.6 Statistics for keyword verification evaluation sets
3.7 Equal error rates for SBM-based keyword verification
3.8 Equal error rates for SBM and MLP-SBM keyword verification
4.1 Evaluated cohort word selection parameters
4.2 Performance of selected cohort word KV systems on the TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format {dmin, dmax, ψd, ψi}.
4.3 Performance of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format {dmin, dmax, ψd, ψi}.
4.4 Mean and standard deviation of the number of cohort words used in the 3 best performing cohort word KV methods for the SWB1 evaluation set
4.5 Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets
4.6 Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets
4.7 Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets
4.8 Correlation analysis of fused EER and individual unfused EER
4.9 Summary of best performing systems
5.1 Phone substitution costs for DMLS
5.2 Baseline keyword spotting results evaluated on TIMIT
5.3 TIMIT performance when isolating various DP rules
5.4 Effect of adjusting number of lattice generation tokens
5.5 Effect of adjusting pruning beamwidth
5.6 Effect of adjusting number of traversal tokens
5.7 Effect of adjusting MED cost threshold Smax
5.8 Optimised DMLS configurations evaluated on TIMIT
5.9 Keyword spotting results on SWB1
5.10 Relative speeds of optimised DMLS systems
5.11 Performance of a fully optimised DMLS system on Switchboard data
5.12 Summary of key results
6.1 Summary of training data sets
6.2 Codes used to refer to model architectures
6.3 Summary of evaluation data sets
6.4 Stage 1 spotting rates for various model sets and database sizes
6.5 Equal error rates after keyword verification for various model sets and training database sizes
6.6 Stage 1 spotting and stage 2 post verification results for S1I experiments


List of Figures

2.1 An example of a Receiver Operating Characteristic curve
2.2 An example of a Detection Error Trade-off plot
2.3 Recognition grammar for HMM-based keyword spotting
2.4 Sample recognition grammar for small non-keyword vocabulary keyword spotting
2.5 System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model
2.6 System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models
2.7 Constructing a recognition network for constrained vocabulary keyword spotting
2.8 An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted)
2.9 An event spotting network for detecting occurrences of times [16]
2.10 Likelihood ratio based keyword occurrence verification with multiple verifier fusion
2.11 Applying reverse dictionary searches to the detection of the word ACQUIRE in a phone stream
2.12 Example of indexed reverse dictionary searching for the detection of the word ACQUIRE
2.13 Using lattice based searching to locate instances of the word ACQUIRE within a phone lattice
3.1 Confusability circle for the target word STOCK
3.2 Example of the shared subevent confusable acoustic region for the keyword STOCK
3.3 Incorporating target word insertion penalty into HMM-based keyword spotting
3.4 DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS
3.5 DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets
3.6 DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets
3.7 System architecture for MLP background model based KV
3.8 DET plots for SBM and MLP-SBM systems for 4-phone words
3.9 DET plots for SBM and MLP-SBM systems for 6-phone words
3.10 DET plots for SBM and MLP-SBM systems for 8-phone words
4.1 Controlling the degree of CAR region modeling via dmin and dmax tuning
4.2 An N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w)
4.3 DET plot for best cohort word and SBM-KV systems on SWB1 4-phone length evaluation set
4.4 DET plot for best cohort word and SBM-KV systems on SWB1 6-phone length evaluation set
4.5 Equal error rate versus mean number of cohort words
4.6 Trends in equal error rate with changes in cohort word set downsampling size
4.7 Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV
4.8 Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV
4.9 Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV
4.10 Trends in equal error rate with changes in MED cost parameters
4.11 Correlation between unfused system performances and fused system performances
4.12 Boxplot of EERs for all evaluated architectures and phone-lengths
4.13 Boxplot of log(EERs) for all evaluated architectures and phone-lengths
5.1 Segment of phone lattice for an instance of the word STOCK
5.2 Effect of lattice traversal token parameter
5.3 Trends in miss rate and FA/kw rate performance for various types of tuning
5.4 Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard
5.5 The relationship between cost matrices for subsequences
5.6 Demonstration of the MED prefix optimisation algorithm
6.1 Effect of training dataset size on speech recognition [24]
6.2 Trends in miss rate across training database size
6.3 Trends in FA/kw rate across training database size
6.4 DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S
6.5 DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S
6.6 DET plot for M32 experiments. 1=M32S3E, 2=M32S2E, 3=M32S1E, 4=M32S2S, 5=M32S1S
6.7 Trends in EER across training dataset size
6.8 DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S
6.9 DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I
6.10 Extrapolations of Indonesian keyword spotting performance using larger sized databases
A.1 Example of cost matrix calculated using the Levenstein algorithm for transforming "deranged" to "hanged". Costs of substitutions, deletions and insertions all fixed at 1, cost of match fixed at 0.


List of Abbreviations

ADI       Audio Document Indexing
CAR       Confusable Acoustic Region
CLS       Conventional Lattice-based Spotting
CMS       Cepstral Mean Subtraction
CW        Cohort Word
DAR       Disparate Acoustic Region
DET       Detection Error Trade-off
DMLS      Dynamic Match Lattice Spotting
EER       Equal Error Rate
FA        False Alarm
GMM       Gaussian Mixture Model
HMM       Hidden Markov Model
IRDL      Indexed Reverse Dictionary Lookup
KS        Keyword Spotting
KV        Keyword Verification
LVCSR     Large Vocabulary Continuous Speech Recognition
MED       Minimum Edit Distance
MLP       Multi-Layer Perceptron
PLP       Perceptual Linear Prediction
RDL       Reverse Dictionary Lookup
ROC       Receiver Operating Characteristic
SBM       Speech Background Model
SBM-KS    Speech Background Model based Keyword Spotting
SBM-KV    Speech Background Model based Keyword Verification
STT       Speech-to-Text Transcription
SWB1      Switchboard-1
TAR       Target Acoustic Region
WSJ1      Wall Street Journal 1


Authorship
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher educational institution. To the best of my knowledge and belief, the thesis contains no material previously published or written
by another person except where due reference is made.


Signed:
Date:


