Vietnamese Speech Recognition and Synthesis in
Embedded System Using T-Engine
Trinh Van Loan, La The Vinh
Department of Computer Engineering, Faculty of Information Technology, Hanoi University of Technology
1A DaiCoViet, Hanoi, Vietnam.
Abstract – In Vietnam, research on speech recognition and
synthesis has begun only in recent years. With the growing
trend toward human-computer interaction systems that use
speech, optimizing speech recognition and synthesis modules
for both speed and quality is an important problem, so that the
two modules can be combined into a single interactive product.
Starting from well-known methods for the recognition and
synthesis problems, we carry out experiments and enhancements
to improve both the speed and the quality of the speech engines.
Finally, we demonstrate human-computer interaction software
running on the T-Engine embedded system.
I. INTRODUCTION
In this paper, we are concerned with combining speech
recognition and synthesis engines and implementing them on the
T-Engine embedded system.
Based on previous research in Vietnamese speech recognition
and synthesis, we propose some enhancements to improve the
quality of the synthetic speech. In addition, the use of system
resources (memory, CPU, storage) in the implementation is a
significant concern, especially for an embedded system such as
the T-Engine in our case.
The paper is organized as follows. Section 2 gives a short
introduction to the T-Engine, Section 3 presents the proposed
method for speech recognition on the T-Engine, and Section 4
describes the Vietnamese speech synthesis method. The last two
sections present an application and concluding remarks.


II. T-ENGINE INTRODUCTION
T-Engine is a project to develop a standardized, open, real-time
computing platform and development environment. T-Engine
provides standardized hardware (the T-Engine board) as well as
a real-time operating system (T-Kernel). The hardware includes:
• CPU with built-in MMU
- SH7660
- Clock 16.6667 MHz
- Speed 200 MHz (x12)
• RAM
- EDS2516APTA
- 64 MB
• FLASH memory
- MBM29DL640
- 8 MB
- 256-pin BGA
• LCD touch-panel controller
- ADS7843
- 16-pin SSOP
• Real-time clock
- RV5C348B
- 10-pin SSOP
• CODEC
- UDA1342
- Minimum sampling frequency 44 kHz
- -51 dB/Pa
Fig.1. T-Engine layout

III. PROPOSED METHOD FOR SPEECH RECOGNITION IN
T-ENGINE

Fig.2. Speech recognition in T-Engine (block diagram: downsampling to 11025 Hz → feature extraction → model training → trained model → recognition output)
The UDA1342 audio codec in the T-Engine provides a minimum
sampling frequency (SF) of 44100 Hz. Such a high SF is
unnecessary for our recognition task, while it increases the
computational load considerably. Therefore, in order to improve
the recognition speed, we add a downsampling module with a
factor of four to pre-process the speech signal before the
feature extraction phase.
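As an illustration, the following C sketch shows one way such a factor-of-four downsampler could be structured as a filter-and-decimate loop. The FIR length and the (empty) coefficient table are placeholders to be filled by a filter-design tool, not the values of the actual T-Engine implementation.

#include <stddef.h>

#define DECIM 4   /* 44100 Hz -> 11025 Hz */
#define NTAPS 31  /* assumed FIR low-pass length */

/* Placeholder coefficients: a low-pass design with cutoff
   near 5.5 kHz (half the target Nyquist) would go here. */
static const float h[NTAPS] = { 0 };

/* Filter the input and keep every 4th sample.
   Returns the number of output samples written to y. */
size_t downsample4(const short *x, size_t n, short *y)
{
    size_t m = 0;
    for (size_t i = NTAPS - 1; i < n; i += DECIM) {
        float acc = 0.0f;
        for (size_t k = 0; k < NTAPS; ++k)
            acc += h[k] * (float)x[i - k];
        y[m++] = (short)acc;
    }
    return m;
}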
In Figure 2, the feature extraction module and the recognition
model are the most important components. In our system we use
MFCC features of the speech signal, since they have proved to
be a good representation of speech. To form a feature vector,
we first divide the signal into frames; then, for each frame, a
feature vector is calculated consisting of 13 MFCC values
together with their first and second derivatives. Assuming that
x[0..L-1] is the speech signal, the k-th frame of the speech is
constructed as:
s[n] = x[k*N+n] for n = 0...K-1
where K is the frame length and N is the frame shift. From the
13 MFCC values m[0..12], the first and second derivatives are
calculated as follows:
m1[0] = 0
m1[k] = m[k] - m[k-1] for k = 1...12
m2[0] = 0
m2[k] = m1[k] - m1[k-1] for k = 1...12
where m1 and m2 are the first and the second derivative,
respectively. In the training phase, feature vectors are used
to adjust the HMM parameters, such as the number of Gaussian
mixtures, the transition probability matrix, and the observation
probability matrix, so that the models best fit the input data.
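As a concrete sketch of the framing and derivative formulas above, consider the following C code. The buffer types and the NCEP constant name are our own assumptions for illustration; the arithmetic follows the text exactly.

#include <string.h>

#define NCEP 13  /* MFCC coefficients per frame, m[0..12] */

/* Extract the k-th frame: s[n] = x[k*N + n], n = 0..K-1 */
void get_frame(const short *x, int k, int N, int K, short *s)
{
    memcpy(s, x + (size_t)k * N, (size_t)K * sizeof(short));
}

/* First and second derivatives of the 13 MFCC values,
   computed as simple backward differences, per the text. */
void mfcc_deltas(const float m[NCEP], float m1[NCEP], float m2[NCEP])
{
    m1[0] = 0.0f;
    m2[0] = 0.0f;
    for (int k = 1; k < NCEP; ++k)
        m1[k] = m[k] - m[k - 1];
    for (int k = 1; k < NCEP; ++k)
        m2[k] = m1[k] - m1[k - 1];
}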
Table I shows the experimental results of our recognition
system with MFCC features and hidden Markov models with twenty
Gaussian mixtures and six left-to-right states.
Table I
Speech recognition results
(Training data: 1100 files; HMM: 6 states, 20 Gaussian mixtures)

No.  Speaker  Trained  Test number  Accuracy
1    N.M.C    No       250          92.00%
2    N.M.T    No       250          94.80%
3    N.T.C    No       200          93.50%
4    N.V.H    No       200          92.00%
5    N.T.L    No       100          91.00%
6    C.T.H    Yes      100          96.00%
7    N.T.P    Yes      100          94.00%

Note that, in our recognition engine, we separate the training
phase from the recognition phase in order to reduce the system
resources used on the T-Engine. The training phase is carried
out on a PC with speech data from the T-Engine; only the
recognition module is implemented on the T-Engine.
IV. VIETNAMESE SPEECH SYNTHESIS IN T-ENGINE
Previous research in Vietnamese speech synthesis has indicated
that PSOLA is an effective method to synthesize speech, based on
the concatenation of diphones with amplitude and tone balancing.
PSOLA is not only a good-quality method but also a speed-optimal
one. Hence, PSOLA is very suitable for implementation in
embedded systems, which is why we use it in our speech-based
human-machine interaction product on the T-Engine. However,
Vietnamese has some specific characteristics that make speech
synthesis a little different from other languages. The following
traits in particular must be considered in a Vietnamese TTS
system: Vietnamese is a monosyllabic and tonal language, with
six tones corresponding to different variation patterns of the
fundamental frequency (f0) of speech. Because of these features,
there are two common ways of synthesizing Vietnamese tones. The
first method is to change f0 to obtain the corresponding tone;
this reduces the size of the diphone database, and the
complexity of f0 balancing, considerably. But the quality of the
speech is not very good, especially with the "~" and "?" tones,
as in "bão" and "bảo", because the amplitude changes in a very
complicated way together with f0.
The second method is to concatenate diphones with already
recorded tones. In this approach the size of the data increases
noticeably, but the tones are reproduced exactly as in natural
speech, so the speech quality is quite good. However, f0
balancing when concatenating recorded-tone diphones is a little
more difficult. To solve this problem we cut the diphones into
frames, where each frame is one speech period.
Fig.3. Speech signal frames
Then, the frames are multiplied by a Hamming window.
Fig.4. Speech signal frames multiplied by Hamming window
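As a small illustration, applying a Hamming window to one frame of length K (one speech period, per the framing above) could look like the following C sketch; the in-place float buffer is an assumption for clarity.

#include <math.h>

/* Multiply one frame of length K in place by a Hamming window. */
void apply_hamming(float *frame, int K)
{
    const float PI = 3.14159265f;
    for (int n = 0; n < K; ++n)
        frame[n] *= 0.54f - 0.46f * cosf(2.0f * PI * n / (float)(K - 1));
}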
To keep the f0 contour smooth, the frames are overlapped at the
desired period.
The two frames at the point of contact are used for power
balancing between the two diphones. Assuming that x(n) and y(n)
are the contact frames, each of length N, we compute a power
factor p by:

Ex = Σ_{n=0}^{N-1} x²(n)
Ey = Σ_{n=0}^{N-1} y²(n)
p = Ex / Ey
Then the overlap-add is performed with the second diphone's
frames multiplied by p. Assume that x[0..L-1] is the current
speech signal, y[0..N-1] is the next frame, and K is the period.
The synthesized signal s[] is calculated by:
s[n] = x[n] for n = 0,1...L-N/2-1
s[n] = x[n] + p*y[n-L+N/2] for n = L-N/2...L-N/2+K-1
s[n] = p*y[n-L+N/2] for n = L-N/2+K...L+K
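To make the above concrete, here is a minimal C sketch of the power balancing and overlap-add, assuming float samples and a caller-provided output buffer of at least L+K+1 samples. The index guard in the tail loop is a defensive assumption, since the text does not fix the relation between N and K.

/* Power factor p = Ex / Ey over the two contact frames of length N. */
float power_factor(const float *x, const float *y, int N)
{
    float ex = 0.0f, ey = 0.0f;
    for (int n = 0; n < N; ++n) {
        ex += x[n] * x[n];
        ey += y[n] * y[n];
    }
    return (ey > 0.0f) ? ex / ey : 1.0f;  /* avoid division by zero */
}

/* Overlap-add of the next frame y (length N) onto the current
   signal x (length L) at period K, following the formulas above. */
void overlap_add(const float *x, int L, const float *y, int N,
                 int K, float p, float *s)
{
    int n;
    for (n = 0; n < L - N / 2; ++n)        /* unchanged samples */
        s[n] = x[n];
    for (; n < L - N / 2 + K; ++n)         /* overlap region */
        s[n] = x[n] + p * y[n - L + N / 2];
    for (; n <= L + K; ++n) {              /* tail of the new frame */
        int j = n - L + N / 2;
        s[n] = (j < N) ? p * y[j] : 0.0f;  /* guard: stay inside y */
    }
}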
To reduce the memory needed when implementing the above
algorithm on the T-Engine embedded system, we store each diphone
in a separate data file together with an index file. In this
way, only two diphones are loaded in memory at any time, so
memory usage is reduced considerably. Table II shows the
structure of the database index file.
Table II
Diphone index file structure

Length   Information
2 BYTE   End-point of the first frame, which is also the
         start-point of a vowel period (this field is present
         for the first diphone only)
2 BYTE   End-point of period n
2 BYTE   End-point of period n+1
2 BYTE   End-point of period n+2
…        …
All the data in the table above are computed manually to ensure
the quality of the synthesized speech. To speed up the creation
of the database, we have built a database tool that supports
automatic frame detection. This tool is very useful for creating
the database with less effort: it can produce a pitch contour
automatically from a wave data file with high accuracy, then
save the contour to a database index file as described in Table
II.
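For illustration, the sketch below shows how such an index file could be parsed, assuming the 2-byte end-points of Table II are stored consecutively in little-endian order; the endianness, the function name, and the example file name are assumptions, not details from the paper.

#include <stdio.h>
#include <stdint.h>

/* Read up to max 2-byte period end-points from a diphone index
   file. Returns the number read, or -1 if the file cannot open. */
int load_diphone_index(const char *path, uint16_t *endpoints, int max)
{
    FILE *f = fopen(path, "rb");
    unsigned char b[2];
    int count = 0;
    if (!f)
        return -1;
    while (count < max && fread(b, 1, 2, f) == 2)
        endpoints[count++] = (uint16_t)(b[0] | (b[1] << 8));
    fclose(f);
    return count;
}

/* Example use (hypothetical file name):
   uint16_t ep[1024];
   int n = load_diphone_index("a_b.idx", ep, 1024); */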
Fig.7. Screen shot of the software waiting for speech commands from the user
V. APPLICATION OF VIETNAMESE RECOGNITION AND SYNTHESIS
Vietnamese speech recognition and synthesis have a wide range
of applications, especially in human-computer interaction
(HCI).
Fig.6. Screen shot of the Ho Chi Minh Museum introduction
To demonstrate the use of speech in HCI, we have combined
speech recognition and speech synthesis in our software running
on the T-Engine. This software allows users to issue speech
commands to query information about places in Ha Noi. Figure 7
illustrates the main screen of the software; on this screen, the
user sees a map of Ha Noi with some places shown with bold
titles. When a user reads out a title, for example "Bao tang Ho
Chi Minh", the software tells the user some information about
that place. Figure 6 is a screen shot of the Ho Chi Minh Museum
introduction.
VI. CONCLUSIONS
This paper has been concerned with an advanced method for
Vietnamese speech synthesis from a database of diphones with
pre-recorded tones. We have carried out experiments with the
implementation of a human-computer interaction system on the
T-Engine embedded system; in addition, our enhancements and
optimizations allow the system to be implemented on low-resource
embedded systems. The complete experiment consists of two parts,
recognition and synthesis; Table I shows some recognition
results with different voices.
REFERENCES
[1] Dang Ngoc Duc, John-Paul Hosom and Luong Chi Mai, "HMM/ANN
System for Vietnamese Continuous Digit Recognition".
[2] Dang Ngoc Duc, Luong Chi Mai, "Improve the Vietnamese Speech
Recognition System Using Neural Network".
[3] Nguyen Van Giap, Tran Hong Viet, "Kỹ thuật nhận dạng tiếng
nói và ứng dụng trong điều khiển" (Speech recognition techniques
and their application in control).
[4] Giang Tang, Jessica Barlow, "Characteristics of the sound
systems of monolingual Vietnamese-speaking children with
phonological impairment".
[5] Le Hong Minh, Quach Tuan Ngoc, "Some Results in Phonetic
Analysis to Vietnamese Text-to-Speech Synthesis Based on Rules".
[6] Le Hong Minh, "Some Results in Phonetic Analysis to
Vietnamese Text-to-Speech Synthesis Based on Rules", ICT.rda 2003.
[7] Wesley Mattheyses, Werner Verhelst and Piet Verhoeve, "Robust
pitch marking for prosodic modification of speech using
TD-PSOLA", 2006.
