01/31/15
Automatic Speech Recognition
Quan. V, Ha. N
ATK-HTK
An Application ToolKit for HTK
• Basic Recognition System
[Diagram: basic recognition pipeline, wav → mfcc → phrase, processed by HTK]
The Hidden Markov Model Toolkit
• Data Preparation
• Creating Monophone HMMs
• Creating Tied-State Triphones
• Recognizer Evaluation
• Mixture Incrementing
• Adapting the HMMs
Training Strategy
[Flowchart:] Monophone Training → Making Triphones from Monophones → Unclustered Triphone Training → Making Tied-state Triphones → Clustered Triphone Training → Mixture Incrementing → Recognizing the Test Data → continue splitting? (Y: back to Mixture Incrementing; N: Final HMM set)
Data Preparation
• Step 1 - the Task Grammar
• Step 2 - the Dictionary
• Step 3 - Recording the Data
• Step 4 - Creating the Transcription Files
• Step 5 - Coding the Data
Step 1 - the Task Grammar
HParse.exe gram.txt wdnet.txt
Gram.txt
$digit = MOOJT | HAI | BA | BOOSN | NAWM |
SASU | BARY | TASM | CHISN | KHOONG;
$name = [ THAAFY ] QUAAN |
[ HOAFNG ] HAJ;
( SENT-START ( NOOSI [MASY] TOWSI [SOOS] <$digit> |
(LIEEN LAJC | GOJI) $name) SENT-END )
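In the HParse notation above, [ ] marks an optional item, | separates alternatives, and < > means one or more repetitions. As a sanity check of what the grammar accepts, here is a small Python sketch (not part of HTK, purely illustrative) that expands the same rules into random word strings, much as HSGen does later:

```python
import random

DIGITS = ["MOOJT", "HAI", "BA", "BOOSN", "NAWM",
          "SASU", "BARY", "TASM", "CHISN", "KHOONG"]
# $name = [ THAAFY ] QUAAN | [ HOAFNG ] HAJ
NAMES = [["THAAFY", "QUAAN"], ["QUAAN"], ["HOAFNG", "HAJ"], ["HAJ"]]

def gen_sentence(rng=random):
    """Expand the task grammar into one random word sequence."""
    if rng.random() < 0.5:
        # NOOSI [MASY] TOWSI [SOOS] <$digit>  (dial a number)
        words = ["NOOSI"]
        if rng.random() < 0.5:
            words.append("MASY")
        words.append("TOWSI")
        if rng.random() < 0.5:
            words.append("SOOS")
        # <...> means one or more repetitions in HParse notation
        words += [rng.choice(DIGITS) for _ in range(rng.randint(1, 7))]
    else:
        # (LIEEN LAJC | GOJI) $name  (call a name)
        words = ["LIEEN", "LAJC"] if rng.random() < 0.5 else ["GOJI"]
        words += rng.choice(NAMES)
    return words
```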
Step 1 - the Task Grammar
Wdnet.txt
VERSION=1.0
N=28 L=62
I=0 W=SENT-END
I=1 W=HAJ
I=2 W=!NULL
…
I=27 W=!NULL
J=0 S=2 E=0
J=1 S=11 E=0
…
J=61 S=0 E=26
[Diagram: the word network drawn as a graph of nodes I=0…I=27 (e.g. I=25 W=SENT-START, I=24 W=NOOSI, I=0 W=SENT-END) linked by arcs J=0…J=61]
Step 1 - the Task Grammar
Step 2 - the Dictionary
Dict.txt
BA B A sp
BOOSN B OO <S> N sp
LAJC L A <J> C sp
LIEEN L I EE N sp
MASY M A <S> Y sp
NOOSI N OO <S> I sp
SENT-START [] sil
SENT-END [] sil
THAAFY TH AA <F> Y
HDMan.exe -m -w wlist -n monophones1 -l dlog dict beep names
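HDMan merges the source dictionaries (beep, names) against the word list wlist and writes the distinct phone set to monophones1. A rough Python sketch of that phone-set extraction step (illustrative only, assuming dictionary lines formatted as in Dict.txt above):

```python
def phone_list(dict_lines):
    """Collect the distinct phones used in a pronunciation dictionary.

    Each line is WORD followed by its phone sequence, as in Dict.txt;
    the empty output marker [] is skipped.
    """
    phones = set()
    for line in dict_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        for p in fields[1:]:
            if p != "[]":          # "[]" marks an empty output symbol
                phones.add(p)
    return sorted(phones)
```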
Step 3 - Recording the Data
HSGen.exe -l -n 10 wdnet.txt dict.txt >> prompts.txt
Prompts.txt
S001 NOOSI MASY TOWSI TASM BA
S002 GOJI QUAAN
S003 GOJI THAAFY QUAAN
S004 NOOSI MASY TOWSI MOOJT TASM KHOONG
S005 LIEEN LAJC THAAFY QUAAN
S006 GOJI HOAFNG HAJ
S007 NOOSI TOWSI CHISN
S008 LIEEN LAJC THAAFY QUAAN
S009 LIEEN LAJC HOAFNG HAJ
S010 LIEEN LAJC QUAAN
S001.wav
S002.wav
S003.wav
S004.wav
S005.wav
S006.wav
S007.wav
S008.wav
S009.wav
S010.wav
Step 4 – Creating the Transcription Files
Words.mlf
#!MLF!#
"S001.lab"
NOOSI
MASY
TOWSI
TASM
BA
.
"S002.lab"
GOJI
QUAAN
.
etc
Perl.exe prompts2mlf words.mlf prompts.txt
Prompts.txt
S001 NOOSI MASY TOWSI TASM BA
S002 GOJI QUAAN
etc
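The prompts2mlf script does a simple line-by-line rewrite: each prompt becomes a ".lab" entry whose words are listed one per line and terminated by a period. A minimal Python sketch of the same conversion (a hypothetical helper, not the original Perl script):

```python
def prompts_to_mlf(prompt_lines):
    """Convert HSGen prompt lines to an HTK master label file (MLF).

    "S001 NOOSI MASY ..." becomes a "S001.lab" entry listing one
    word per line, terminated by ".", as in Words.mlf above.
    """
    out = ["#!MLF!#"]
    for line in prompt_lines:
        name, *words = line.split()
        out.append(f'"{name}.lab"')
        out.extend(words)
        out.append(".")
    return "\n".join(out)
```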
Step 4 – Creating the Transcription Files
HLEd.exe -l '*' -d dict.txt -i phones0.mlf mkphones0.led words.mlf
Phones0.mlf
#!MLF!#
"*/S001.lab"
sil
N
OO
<S>
I
M
A
<S>
Y
T
OW
<S>
I
…
Phones1.mlf
#!MLF!#
"*/S001.lab"
sil
N
OO
<S>
I
sp
M
A
<S>
Y
sp
T
OW
…
Step 5 - Coding the Data
Config_Hcopy.txt
#coding parameters - HCopy
SOURCEKIND = WAVEFORM
SOURCEFORMAT = WAV
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
HCopy.exe -T 1 -C config_HCopy -S wav2mfc.scp
wav2mfc.scp
S001.wav S001.mfc
S002.wav S002.mfc
S003.wav S003.mfc
S004.wav S004.mfc
…
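HTK time parameters such as TARGETRATE and WINDOWSIZE are given in units of 100 ns, so the configuration above asks for a 10 ms frame shift and a 25 ms Hamming window. A quick Python check of those conversions:

```python
HTK_UNIT_SEC = 100e-9          # HTK time unit: 100 nanoseconds

def htk_units_to_ms(value):
    """Convert an HTK time parameter (100 ns units) to milliseconds."""
    return value * HTK_UNIT_SEC * 1000.0

frame_shift_ms = htk_units_to_ms(100000.0)   # TARGETRATE  -> 10 ms
window_ms = htk_units_to_ms(250000.0)        # WINDOWSIZE  -> 25 ms
```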
Step 5 - Coding the Data
• WAVEFORM: sampled waveform
• LPC: linear prediction filter coefficients
• LPREFC: linear prediction reflection coefficients
• LPCEPSTRA: LPC cepstral coefficients
• LPDELCEP: LPC cepstra plus delta coefficients
• IREFC: LPC reflection coef in 16-bit integer format
• MFCC: mel-frequency cepstral coefficients
• FBANK: log mel-filter bank channel outputs
• MELSPEC: linear mel-filter bank channel outputs
• USER: user defined sample kind
• DISCRETE: vector quantised data
• _E: has energy
• _N: absolute energy suppressed
• _D: has delta coefficients
• _A: has acceleration coefficients
• _C: is compressed
• _Z: has zero mean static coef.
• _K: has CRC checksum
• _0: has 0'th cepstral coef.
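These qualifiers explain the <VecSize> 39 seen later: TARGETKIND = MFCC_0_D_A with NUMCEPS = 12 gives 13 static coefficients (12 cepstra plus C0), and the _D and _A qualifiers each append another 13. A small Python sketch of that arithmetic (the helper name is ours, not HTK's):

```python
def mfcc_vector_size(numceps, has_c0=False, has_energy=False,
                     deltas=False, accels=False):
    """Feature vector size for an HTK MFCC target kind.

    NUMCEPS static coefficients, plus C0 (_0) and/or energy (_E);
    _D appends deltas and _A appends accelerations, each adding
    another copy of the static block.
    """
    static = numceps + int(has_c0) + int(has_energy)
    blocks = 1 + int(deltas) + int(accels)
    return static * blocks

# MFCC_0_D_A with NUMCEPS = 12: (12 + 1) * 3 = 39
size = mfcc_vector_size(12, has_c0=True, deltas=True, accels=True)
```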
Creating Monophone HMMs
• Step 6 – Creating Flat Start Monophones
• Step 7 – Fixing the Silence Models
• Step 8 – Realigning the Training Data
Step 6 – Creating Flat Start Monophones
HCompV -C config_HCompV.txt -f 0.01 -m -S train.scp -M hmm0 proto.txt
Proto.txt
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 39
0 0 0
<Variance> 39
1 1 1
<State> 3
<Mean> 39
0 0 0
<Variance> 39
1 1 1
<State> 4
<Mean> 39
0 0 0
<Variance> 39
1 1 1
<TransP> 5
0.0 1.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0
<EndHMM>
Config_HCompV.txt
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
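Each row of <TransP> lists one state's outgoing transition probabilities, so every row except the final exit state must sum to 1; the entry state (row 1) always jumps to the first emitting state, and the exit row is all zeros. A quick Python check of the matrix above:

```python
trans_p = [
    [0.0, 1.0, 0.0, 0.0, 0.0],  # entry state: always moves to state 2
    [0.0, 0.6, 0.4, 0.0, 0.0],  # state 2: self-loop 0.6, advance 0.4
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],  # exit state: no outgoing transitions
]

def rows_are_stochastic(matrix):
    """Every row but the last must sum to 1 (within rounding)."""
    return all(abs(sum(row) - 1.0) < 1e-6 for row in matrix[:-1])
```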
Step 6 – Creating Flat Start Monophones
macro
~o
<VecSize> 39
<MFCC_0_D_A>
~v "varFloor1"
<Variance> 39
0.0012 0.0003
hmmdefs
~h "sil"
<BeginHMM>
<EndHMM>
~h "a"
<BeginHMM>
<EndHMM>
~h "b"
<BeginHMM>
<EndHMM>
hmmdefs
~h "proto"
<BeginHMM>
<EndHMM>
monophones0
sil
a
b
…
monophones1
sil
sp
a
b
…
A Re-Estimation Tool - HERest
HERest.exe [options] hmmList trainFile
The flat start monophones stored in the directory hmm0 are re-estimated using HERest:
HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0
Step 7 – Fixing the Silence Models
phones0.mlf (no sp):
"sil g o <j> i ch i <s> n m oo <j> t m oo <j> t sil"
phones1.mlf (sp between words):
"sil g o <j> i sp ch i <s> n sp m oo <j> t sp m oo <j> t sil"
Step 7 – Fixing the Silence Models
1. HERest x 2 for monophones0
2. add sp HMM
3. HERest x 2 for monophones1
~h "sil"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 39
-7.030658e+000 1.095834e+000
<VARIANCE> 39
9.946199e+000 1.149288e+001
<GCONST> 8.910428e+001
<STATE> 3
~s "sil_sp"
<STATE> 4
<MEAN> 39
-1.071942e+001 -3.000225e+000
<VARIANCE> 39
5.828240e+000 7.320161e+000
<GCONST> 8.172852e+001
<TRANSP> 5
<ENDHMM>
~s "sil_sp"
<MEAN> 39
-8.414185e+000 -2.211869e+000
<VARIANCE> 39
7.550930e+000 1.156416e+001
<GCONST> 1.045451e+002
~h "sp"
<BEGINHMM>
<NUMSTATES> 3
<STATE> 2
~s "sil_sp"
<TRANSP> 3
0 1 0
0 0.3 0.7
0 0 0
<ENDHMM>
Step 8 –
Realigning the Training Data
HVite.exe -l '*' -o SWT -b SILENCE -a -H hmm7/macros -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab -I words.mlf -S train.scp dict.txt monophones1
HERest x 2 for aligned.mlf
• multiple pronunciations
Step 8 –
Realigning the Training Data
Creating Tied-State Triphones
• Step 9 – Making Triphones from Monophones
• Step 10 – Making Tied-state Triphones
Step 9 – Making Triphones from Monophones
HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
monophones: "sil b i t sp b u t sil"
triphones within word: "sil b+i b-i+t i-t sp b+u b-u+t u-t sil"
triphones cross word: "sil sil-b+i b-i+t i-t+b sp t-b+u b-u+t u-t+sil sil"
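The mktri.led edit script tells HLEd to rewrite each word's phones as left-right context triphones while leaving sil and sp untouched. A Python sketch of the within-word variant shown above (illustrative; HLEd itself does this work):

```python
def expand_word(phones):
    """Within-word triphone labels for one word's phone list."""
    if len(phones) < 2:
        return list(phones)          # single phones keep their monophone label
    tri = []
    for i, p in enumerate(phones):
        if i == 0:
            tri.append(f"{p}+{phones[1]}")          # first: right context only
        elif i == len(phones) - 1:
            tri.append(f"{phones[i - 1]}-{p}")      # last: left context only
        else:
            tri.append(f"{phones[i - 1]}-{p}+{phones[i + 1]}")
    return tri

def to_word_internal_triphones(transcription):
    """Rewrite a monophone transcription as word-internal triphones.

    sil and sp mark word boundaries and pass through unchanged.
    """
    out, word = [], []
    for p in transcription.split():
        if p in ("sil", "sp"):
            out += expand_word(word) + [p]
            word = []
        else:
            word.append(p)
    return " ".join(out + expand_word(word))
```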
Word Network Expansion
• FORCECXTEXP = F, ALLOWXWRDEXP = F: no context expansion
• FORCECXTEXP = T, ALLOWXWRDEXP = F: word-internal triphone expansion
• FORCECXTEXP = T, ALLOWXWRDEXP = T: cross-word triphone expansion