Lecture Notes in Computer Science- P75 pdf

A New Chinese Speech Synthesis Method Apply in Chinese Poetry Learning 359
constituted by the oral cavity, the nasal cavity, etc., and is then turned into voice. Depending on the shape of the vocal tract, the airflow becomes different speech sounds. Thus the vocal tract parameters determine a particular voice.
For example, when a syllable spelled “yi1” is read on its own, the vocal tract keeps still while the phoneme “i” is pronounced. In a word spelled “yi1ge4”, however, the vocal tract is already preparing for the phoneme “g” at the end of the phoneme “i”. So the pronunciation of “i” followed by another syllable in a word differs from that of the single “i”.
The new speech synthesis method presented in this paper makes the synthesized speech reflect the effect that coarticulation produces within a word.
3.3 The Sentence Prosody
First we look at the prosody of words in a sentence. Here we introduce a concept called “pitch resetting”. It contrasts with “pitch continuity”, which means that within a word the start pitch of a syllable equals the end pitch of the previous one.
A sentence can be divided into several words. The start pitch of the first syllable of each word is reset to a certain pitch, while inside the word the pitch varies continuously from syllable to syllable. We call this “pitch resetting”. Pitch resetting often happens when we take a breath during reading; we usually take a word, a phrase or a short sentence as one breath unit. As shown in Figure 2, the example sentence is segmented into the spelling sequence “xi4lie4bao4dao4/ gan3shou4er4ling2ling2si4/ jin1tian1bo1chu1/”. The sentence is composed of three phrases, and Figure 2 shows that at the first syllable of each phrase the pitch is reset to a certain value. We can determine the pitch-resetting prosody boundaries when we perform Chinese word segmentation during text processing.

Fig. 2. Chinese sentence prosody
Another feature of sentence prosody is the pitch trend of the whole sentence. In statement sentences the pitch trend is declining. This trend is superimposed on every syllable in the sentence. So when pitch resetting occurs in a statement sentence, the “definite pitch” is a little lower than it was at the previous reset.
Considering the way Chinese poetry is read, we assume that pitch resetting happens at the end of a single syllable or at the end of a single word.
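The interaction of pitch continuity, pitch resetting and the declining trend can be sketched as follows. This is only an illustration: the function name `pitch_track`, the reset pitch of 220 Hz and the drop of 10 Hz per reset are hypothetical values chosen for the example and are not taken from our system.

```python
def pitch_track(phrases, reset_hz=220.0, drop_hz=10.0):
    """Return (start, end) pitch in Hz for every syllable.

    phrases: list of prosodic units; each unit is a list of per-syllable
    pitch changes in Hz (end pitch minus start pitch).
    reset_hz and drop_hz are hypothetical illustration values.
    """
    track = []
    for i, phrase in enumerate(phrases):
        # Pitch resetting: each unit restarts at the reset pitch,
        # lowered a little each time by the declining sentence trend.
        start = reset_hz - i * drop_hz
        for delta in phrase:
            end = start + delta
            track.append((start, end))
            # Pitch continuity: the next syllable starts where this one ended.
            start = end
    return track


print(pitch_track([[-20.0, -10.0], [5.0]]))
```

In the usage line the second unit starts at 210 Hz rather than at the 190 Hz on which the first unit ended, which is the reset, and 210 Hz is lower than the first reset value, which is the declining trend.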
360 C. Zhu and Y. Zhu
4 The Speech Synthesis Method
The TTS system mainly consists of three parts: a text processing module, a prosody module and a speech synthesis module. In the speech synthesis module, the choice of speech synthesis algorithm matters most. As it is an important part of the TTS system, we take a close look at it.
4.1 Speech Synthesis Algorithm
This paper presents a new speech synthesis method that takes the time-domain waveform editing algorithm as the basic speech synthesis algorithm and blends the vocal tract cepstrum parameters, obtained by homomorphic analysis, of adjacent syllables in a word to smooth the speech transitions. It combines waveform editing synthesis, whose advantage is fast processing, with vocal tract parametric synthesis, whose advantage is flexible adjustment because it considers the essence of the sounds.
4.1.1 The Voice Database
Because the waveform editing algorithm is our basic algorithm, a voice database is needed to store all the elementary waveforms, that is, the synthesis elements.
The choice of the base synthesis element not only decides the quality of the final speech but is also constrained by the storage capacity of the hardware. Many Chinese TTS systems choose syllables, words, phrases or even sentences as the base synthesis element, which leads to a large voice database. Our approach takes the initial consonant and the simple/compound vowel as basic elements, according to reference [3]. The voice database is thus cut down to several hundred KB while maintaining a fairly equal level of voice quality.
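As an illustration of this decomposition, the sketch below splits a numbered-pinyin spelling into initial consonant, vowel part and tone. The table `INITIALS` and the function names are assumptions made for this example; in particular, treating “y” and “w” as initials is a simplification, and the actual element inventory follows reference [3].

```python
import re

# Hypothetical table of initials; two-letter initials come first so that
# "zh" is matched before "z". Treating "y" and "w" as initials is an assumption.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_phrase(phrase):
    """Split 'xi4lie4' into ['xi4', 'lie4'] (letters followed by a tone digit)."""
    return re.findall(r"[a-z]+[1-5]", phrase)


def split_syllable(syllable):
    """Split 'bao4' into ('b', 'ao', 4); the initial is '' if there is none."""
    tone = int(syllable[-1]) if syllable[-1].isdigit() else 0
    body = syllable.rstrip("0123456789")
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone


for s in split_phrase("xi4lie4bao4dao4"):
    print(split_syllable(s))
```

Storing only the few dozen initials and vowel parts, instead of every syllable or word, is what keeps the database small.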
4.1.2 PSOLA Algorithm

E. Moulines and F. Charpentier proposed a speech synthesis algorithm based on time-domain waveform modification called PSOLA (Pitch Synchronous Overlap Add) [4]. It is widely used nowadays. For more detail about the PSOLA algorithm, please see references [5] and [6].
The PSOLA algorithm ensures that the waveform and the spectrum remain smooth and continuous while the speech signal is being modified. It works in three steps, as shown in Figure 3. First, a transform is applied to a small segment of the original time-domain waveform whose duration is about two pitch periods; we call the transformed speech signal the short-time temporary signal. Then the temporary signal is modified. Finally, the time-domain waveform is rebuilt from the modified temporary signal. The modifications in step 2 are therefore where we shape the speech we require.
[Figure 3: original time-domain waveform -> (step 1) temporary signal -> (step 2) modified temporary signal -> (step 3) modified time-domain waveform]

Fig. 3. Main steps in PSOLA algorithm
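Steps 1 and 3 can be sketched as below, under strong simplifying assumptions: the pitch period is constant and known, pitch marks are placed every `period` samples, and the transform of step 1 is taken to be a Hann window two periods long. The function names are hypothetical, and step 2 is left as the identity so that the round trip can be checked.

```python
import math


def hann(length):
    """Periodic Hann window; two copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def analyse(signal, period):
    """Step 1: cut one windowed segment, two periods long, around each pitch mark."""
    window = hann(2 * period)
    marks = range(period, len(signal) - period + 1, period)
    return [[signal[m - period + n] * window[n] for n in range(2 * period)]
            for m in marks]


def overlap_add(segments, period):
    """Step 3: put segment j back at offset j * period and sum the overlaps."""
    out = [0.0] * ((len(segments) + 1) * period)
    for j, segment in enumerate(segments):
        for n, value in enumerate(segment):
            out[j * period + n] += value
    return out


period = 10
signal = [math.sin(2.0 * math.pi * n / period) for n in range(8 * period)]
rebuilt = overlap_add(analyse(signal, period), period)
```

With no modification in step 2, `rebuilt` matches `signal` everywhere except in the first and last period, where only one window covers each sample. A real implementation would use detected pitch marks instead of a fixed spacing.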
For example, if we want to synthesize a syllable of 400 ms but the corresponding syllable in the voice database lasts 200 ms, we can process it as shown in Figure 4.
[Figure 4: the original waveform (duration 200 ms) is transformed into temporary signals 1, 2, 3; the modified temporary signals are repeated according to the pitch information and overlapped to form the synthesized waveform (duration 400 ms)]
Fig. 4. PSOLA synthesis procedures
First we calculate how many temporary signals are needed for the 400 ms duration and compute all the temporary signals of the original syllable's waveform. Then, according to the pitch information, we determine which original temporary signal each temporary signal to be synthesized should equal and arrange them along the time axis. Finally we overlap them to produce the synthesized speech.
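The assignment of original temporary signals to synthesized ones can be sketched as a simple index mapping. The function name `frame_mapping` and the rule of taking the original index as the integer part of `j / scale` are assumptions for illustration; they ignore any pitch change and only handle the duration.

```python
def frame_mapping(num_original, scale):
    """For each synthesized temporary signal, return the index of the
    original temporary signal it copies.

    scale is the target duration divided by the original duration,
    e.g. 400 ms / 200 ms = 2.0.
    """
    num_target = round(num_original * scale)
    return [min(int(j / scale), num_original - 1) for j in range(num_target)]


print(frame_mapping(3, 2.0))
```

With a scale of 2.0 every original temporary signal is used twice in a row, so the duration doubles while the spacing between temporary signals, and hence the pitch, stays the same.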
4.1.3 Concept of Cepstrum
We call the time-domain sequence $\hat{x}(n)$ the complex cepstrum of the signal sequence $x(n)$. $\hat{x}(n)$ is calculated by formula 1.
$$\hat{x}(n) = Z^{-1}[\ln[Z[x(n)]]], \tag{1}$$
Keeping only the real part of the logarithm, $\ln|Z[x(n)]|$, gives $c(n)$. We call $c(n)$ the cepstrum, and it is calculated by formula 2.
$$c(n) = Z^{-1}[\ln|Z[x(n)]|]. \tag{2}$$
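Formula 2 can be evaluated numerically as sketched below. As an assumption of this example, the Z transform is replaced by the discrete Fourier transform of one finite frame, which is the Z transform sampled on the unit circle. The helper names and the `floor` guard against taking the logarithm of zero are hypothetical, and the direct summation is written for clarity rather than speed.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform by direct summation."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse discrete Fourier transform by direct summation."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(x, floor=1e-12):
    """c(n): inverse transform of the log magnitude spectrum of x(n)."""
    log_magnitude = [math.log(max(abs(v), floor)) for v in dft(x)]
    return [v.real for v in idft(log_magnitude)]


c = real_cepstrum([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

For a scaled unit impulse the magnitude spectrum is constant, so the whole cepstrum collapses into `c[0]`, which equals the logarithm of the scale factor.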
4.1.4 Homomorphism Analysis to Get the Vocal Tract Cepstrum Parameters
In a simple digital speech model, the time-domain speech signal $x(n)$ is the convolution of the speech source signal $e(n)$ and the vocal tract signal $v(n)$. Since the vocal tract carries the most important information of the speech, we want to separate the vocal tract signal and modify it in order to produce the speech we need.
There is no good way to separate $v(n)$ from $x(n)$ in the time domain, but homomorphic analysis helps. In homomorphic analysis, we apply the Z transform to both sides of equation 3.
$$x(n) = e(n) * v(n). \tag{3}$$
The convolution becomes a product and we get equation 4.
$$X(k) = E(k)V(k). \tag{4}$$
Taking the logarithm of both sides turns the product into a sum and gives equation 5.
$$\ln(X(k)) = \ln(E(k)) + \ln(V(k)). \tag{5}$$
We write this as equation 6.
$$\hat{X}(k) = \hat{E}(k) + \hat{V}(k). \tag{6}$$
Applying the $Z^{-1}$ transform turns it into equation 7.
$$\hat{x}(n) = \hat{e}(n) + \hat{v}(n). \tag{7}$$
Now we can obtain the vocal tract cepstrum parameter $\hat{v}(n)$ with a linear filter. After modifying the vocal tract cepstrum parameter, the inverse operations can be used to bring the cepstrum-domain signal $\hat{v}(n)$ back to the time-domain signal $v(n)$.
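The additivity that the derivation relies on, and one possible form of the linear filter, can be sketched as follows. Several assumptions are made for this example: the discrete Fourier transform stands in for the Z transform, the convolution is circular so that the product relation holds exactly on a finite frame, the real cepstrum is used instead of the complex one, and the filter is a low-time lifter that keeps only the first few cepstrum coefficients. All names and the toy signals are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum (assumes no spectral zeros)."""
    n = len(x)
    log_magnitude = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_magnitude[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def circular_convolve(e, v):
    n = len(e)
    return [sum(e[m] * v[(t - m) % n] for m in range(n)) for t in range(n)]


def low_time_lifter(cepstrum, cutoff):
    """Keep coefficients below cutoff and their mirror images, zero the rest."""
    n = len(cepstrum)
    return [c if (q < cutoff or q > n - cutoff) else 0.0
            for q, c in enumerate(cepstrum)]


e = [1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]   # toy excitation
v = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # toy vocal tract response
x = circular_convolve(e, v)
cx, ce, cv = real_cepstrum(x), real_cepstrum(e), real_cepstrum(v)
vocal_tract_part = low_time_lifter(cx, 3)
```

Here `cx` equals `ce + cv` coefficient by coefficient, which is equation 7 for the real cepstrum. The lifter works because the slowly varying vocal tract spectrum concentrates in the low coefficients, while the pitch excitation appears at higher ones; the cutoff of 3 is an arbitrary value for this toy frame.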
4.1.5 Vocal Tract Cepstrum Parameter Speech Synthesis
When dealing with adjacent syllables in one word during speech synthesis, we can synthesize the speech by adding the latter syllable's vocal tract cepstrum parameters into the former syllable.
In step 2 of the PSOLA algorithm, after the temporary signals to be synthesized have been calculated, we combine the vocal tract cepstrum parameters of the last k temporary signals of the first syllable with those of the first k temporary signals of the second syllable by a linear operation. The result becomes the new vocal tract cepstrum parameters of the last k temporary signals of the first syllable. Finally we transform the cepstrum parameters and temporary signals back to the time domain to obtain the synthesized speech. The linear operation is shown in formula 8. The linear coefficient is determined according to reference [7].
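$$\begin{bmatrix}\tilde{v}_{f1}\\ \tilde{v}_{f2}\\ \vdots\\ \tilde{v}_{f25}\end{bmatrix} = \begin{bmatrix}v_{f1}\\ v_{f2}\\ \vdots\\ v_{f25}\end{bmatrix}\times\left(1-\sin\left(\frac{k\pi}{2K}\right)\right) + \begin{bmatrix}v_{b1}\\ v_{b2}\\ \vdots\\ v_{b25}\end{bmatrix}\times\sin\left(\frac{k\pi}{2K}\right), \tag{8}$$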
In formula 8, $\tilde{v}_{fi}$ is the first syllable's modified vocal tract cepstrum, $v_{fi}$ is the first syllable's original vocal tract cepstrum, and $v_{bi}$ is the second syllable's original vocal tract cepstrum.
In this way we account for the influence between adjacent syllables.
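The linear operation can be sketched as a sine-weighted blend of two cepstrum vectors. This sketch assumes that the weight sin(kπ/2K) is applied to the second syllable and its complement to the first, with k counting temporary signals inside the transition region of K temporary signals; the function name and the three-coefficient vectors are hypothetical, and the actual coefficient follows reference [7].

```python
import math


def blend_cepstra(first, second, k, K):
    """Blend two vocal tract cepstrum vectors for position k of K.

    first:  cepstrum of a temporary signal from the first syllable
    second: cepstrum of the matching temporary signal from the second syllable
    Assumes the weight sin(k*pi/(2K)) belongs to the second syllable.
    """
    weight = math.sin(math.pi * k / (2.0 * K))
    return [(1.0 - weight) * f + weight * b for f, b in zip(first, second)]


v_f = [1.0, 0.5, 0.2]   # hypothetical coefficients, first syllable
v_b = [0.0, 0.1, 0.4]   # hypothetical coefficients, second syllable
for k in range(0, 5):
    print(k, blend_cepstra(v_f, v_b, k, 4))
```

Under this assumption the blend leaves the first syllable untouched at k = 0 and reaches the second syllable's parameters at k = K, so the vocal tract moves gradually toward the shape of the following syllable.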
4.2 The Programming Implementation
The method presented in this paper is implemented in C++ under the VC.net framework.
Figure 5 shows the logic flow of the Chinese TTS system. When the Chinese poetry text is input into the TTS system, we first predict the basic duration and pitch of each syllable and the sentence mood, and then perform word segmentation to mark the boundaries of pitch resetting. The next step synthesizes the speech with the PSOLA algorithm from the consonants, vowels and tones that have already been analyzed, while adjusting the prosody of adjacent syllables in one word with the vocal tract cepstrum parameter synthesis algorithm. Finally we get the synthesized speech.
[Figure 5 flow: Chinese poetry text -> (words segmentation) Chinese words -> Chinese syllables; predicted duration, predicted pitch, pitch resetting boundary; consonant, vowel, tone; modify vocal tract parameter, changed tone type; synthesized by PSOLA algorithm and cepstrum parameters algorithm -> sound]

Fig. 5. The logic flow of TTS system
The final user interface, including the waveform synthesized by the system, is shown in Figure 6.

Fig. 6. User interface
