MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY
Thesis for the degree of
MASTER OF SCIENCE
Modeling the prosody of Vietnamese language for speech synthesis
Speciality: “Information processing and Communication”
Code:23.04.3898
MẠC ĐĂNG KHOA
Supervisor:
Prof. PHẠM THỊ NGỌC YẾN
Hanoi, 2007
Faculty of Information Technology
International research center of
Multimedia Information, Communication and Application
Master thesis
Mạc Đăng Khoa
Acknowledgment
Many people gave me generous help and inspiration during my time as a master's
student.
First, I would like to express my deep respect and gratitude to my supervisors,
Dr. Eric Castelli and Prof. Phạm Thị Ngọc Yến. Thank you very much for orienting
and guiding my research in the speech processing domain, and for all your useful
advice, your honest criticism and your patience during my master's research.
Special thanks also go to Mrs. Geneviève Caelen-Haumont, to PhD students Trần
Đỗ Đạt and Vũ Minh Quang, and to all members of MICA’s speech group. I could not
have completed this thesis without your support. Thank you all for your
suggestions and your sincere remarks on every part of my research.
I would like to thank Ms. Đoàn Thị Ngọc Hiền, who guided me in recording the
corpus. I would also like to thank the many MICA members who spent much of their
time recording and testing for my research.
I am grateful to Prof. Nguyễn Trọng Giảng and MICA’s directorate for providing me
with the best possible working conditions during my time at the International
Research Center MICA.
Finally, I owe a great deal to my parents and my sister for their continued support. I
also give very special thanks to my girlfriend for her constant encouragement,
which gave me strength and motivation in my work and in my life.
Abstract
A Text-To-Speech (TTS) system is a computer system that produces speech from
text. In a TTS system, the naturalness of the produced speech depends greatly on
the variation of pitch, duration and energy during speaking. We call this the
“prosody controlling ability”. A TTS system with good prosody controlling ability
can simulate human speech prosody appropriate to the speaking context.
In tonal languages such as Vietnamese, the prosody of an utterance results from
the combination of two components: "micro-prosody", corresponding to the tone of
each syllable in a sentence, and "macro-prosody", corresponding to the whole
sentence.
The main goal of this thesis is to model the characteristics of Vietnamese prosody
for speech synthesis. It focuses on the influences of the macro-prosody on the
micro-prosody, in three types of sentence: assertive, interrogative and imperative.
The first task is to set up a “prosody corpus” and extract all possible prosody
parameters. Based on the extracted data, we defined seventy-two simple prosody
patterns for Vietnamese syllables in three types of sentence. These patterns were
then applied to synthesize some simple sentences. Finally, perception experiments
were carried out to evaluate the synthesized sentences. The results showed that
the proposed patterns can be applied successfully to generate the prosody of
simple sentences.
This is our preliminary work on Vietnamese prosody, concerning only the sentence
type and the position of the syllable in a sentence. In the future, we expect to
continue this research with more factors of Vietnamese prosody, improve our
patterns and apply them to a Vietnamese TTS system.
List of Figures
Figure 1-1: Category of methods for predicting syllable duration [6]
Figure 2-1: Example of the contours of six tones, as described in [21]
Figure 2-2: The shape of Tone 1 with female and male voice [18]
Figure 2-3: The shape of Tone 2 with female and male voice [18]
Figure 2-4: The shape of Tone 3 with female and male voice [18]
Figure 2-5: The shape of Tone 4 with female and male voice [18]
Figure 2-6: The shape of Tone 5 with female and male voice [18]
Figure 2-7: The shape of Tone 5b with female and male voice [18]
Figure 2-8: The shape of Tone 6 with female and male voice [18]
Figure 2-9: The shape of Tone 6b with female and male voice [18]
Figure 2-10: Sentence classification by structure [20]
Figure 2-11: The sentences “Lan thích ăn cơm không” in ...
Figure 2-12: The sentences “Bảo cố gắng tập đi” in ...
Figure 2-13: The sentences “Tân bỏ đi chứ” in ...
Figure 2-14: The differences of F0 contour between Assertive and Interrogative sentence [16]
Figure 3-1: A general function diagram of TTS system [13]
Figure 3-2: Fujisaki model
Figure 3-3: Fujisaki model for tonal language [19]
Figure 3-4: Function diagram of proposal TTS system
Figure 3-5: Prosody generation module
Figure 4-1: Key-syllable segmentation
Figure 4-2: Extracting F0 contour using PRAAT
Figure 4-3: An example of prosody pattern
Figure 5-1: An example of synthesized non-sense phrase
Figure 5-2: Perception test 1
Figure 5-3: An example of synthesized multi-type sentences
Figure 5-4: Interface for Perception test 2
Figure 5-5: Correct recognition rate with 8 tones of last syllable
Figure 5-6: Correct recognition rate (%) with other types of sentences
Figure 5-7: Result comparison of three experiments
List of Tables
Table 1.1: Prosody functions
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Table 1.3: Intonation model classification
Table 2.1: Vietnamese vowels
Table 2.2: Vietnamese consonants
Table 2.3: Arrangement of Vietnamese consonants
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14]
Table 2.5: The six Vietnamese tones
Table 3.1: Comparison between direct pattern and model pattern
Table 4.1: Prosody corpus structure
Table 4.2: Prosody corpus text information
Table 4.3: Recording information of Prosody corpus
Table 5.1: Confusion matrix (in %) for 8 tones with male voice
Table 5.2: Confusion matrix (in %) for 8 tones with female voice
Table 5.3: Confusion matrix (%) of sentence types with male voice
Table 5.4: Confusion matrix (%) of sentence types with female voice
Table 5.5: Test data for Experiment 2
Table 5.6: Confusion matrix (in %) of sentence types (with male voice)
Table 5.7: Confusion matrix (in %) of sentence types (with female voice)
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)
Table 5.9: Correct recognition rate (%) with other types of sentences
Table 5.10: Result of three experiments
Table of contents
Acknowledgment
Abstract
List of Figures
List of Tables
Table of contents
0. INTRODUCTION
1. PROSODY AND PROSODIC MODEL
  1.1. Overview of prosody
    1.1.1. The concept of prosody
    1.1.2. Major components of prosody
    1.1.3. The functions of prosody
    1.1.4. Levels of representation of prosodic phenomena
  1.2. Prosody modeling
    1.2.1. Intonation models
    1.2.2. Duration modeling
    1.2.3. This thesis work approach
2. VIETNAMESE LANGUAGE AND PROSODY
  2.1. Vietnamese language
    2.1.1. Vietnamese characteristics
    2.1.2. Vietnamese phoneme system
    2.1.3. Syllable structure
  2.2. Vietnamese prosody
    2.2.1. Micro-prosody and tones system in Vietnamese
    2.2.2. Macro-prosody and sentence types in Vietnamese
    2.2.3. Some special phenomena in Vietnamese prosody
3. TTS SYSTEM AND PROSODY GENERATION
  3.1. An overview of TTS system
  3.2. Prosody generation
    3.2.1. Overview of prosody generation
    3.2.2. From text to prosody
  3.3. Other researches and our proposal
4. PROSODY PATTERNS EXTRACTION
  4.1. Prosody corpus
    4.1.1. Objectives
    4.1.2. Define the corpus text
    4.1.3. Recording
    4.1.4. Sentence segmentation
  4.2. Analysis and extracting prosody parameters
    4.2.1. Segmentation
    4.2.2. Extracting prosody parameters of key-syllable
  4.3. Proposal the patterns for Vietnamese prosody
    4.3.1. Methodology
    4.3.2. Prosody patterns
    4.3.3. Some visual remarks on extracted patterns
5. EXPERIMENTS AND EVALUATION
  5.1. Experiment 1: Tone and non-sense phrase
    5.1.1. Objectives
    5.1.2. Method and Implementation
    5.1.3. Results and discussion
  5.2. Experiment 2: Multi-type sentences
    5.2.1. Objectives
    5.2.2. Method and Implementation
    5.2.3. Results and discussion
  5.3. Comparison and conclusion
6. CONCLUSION AND PERSPECTIVES
REFERENCES
APPENDIX
  A. Text for prosody corpus
  B. Datasheet of prosody patterns
Chapter 0: Introduction
Speech is the primary means of communication between people. Speech synthesis,
the automatic generation of speech waveforms, has been under development for
several decades. Recent progress in speech synthesis has produced synthesizers
with very high intelligibility, but sound quality and naturalness remain a major
problem. Most recent research attempts to improve the naturalness of synthesized
speech so that it approaches human speech.
In Vietnam, there are currently some Vietnamese synthesis systems such as VnVoice
(developed by the Institute of Information Technology) and HoaSung (developed by
the International Research Center MICA). These projects have obtained some
encouraging results. However, before the systems can be released to the market,
the quality of the produced speech still has to be improved, especially the
naturalness of its prosody.
Thus, this thesis aims to study the characteristics of Vietnamese prosody in order
to apply them to speech synthesis. This work was carried out at the International
research center of Multimedia Information, Communication and Application (MICA)
and is part of MICA’s project VN-Synthesis.
Together with the research of PhD student Tran Do Dat at MICA, we have already
developed a speech synthesis system using sound sample concatenation techniques.
The first version can now produce sound from a detailed text description, which
consists of:
• The sequence of phonemes composing the utterance: this can be obtained
automatically from the raw text using a “phonetization” module, whose
development is currently underway.
• All information related to voice modulations: mostly the pitch, energy and
duration variations that constitute the intonation or prosody of the uttered
statement. We call it the “prosody description”.
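As an illustration only, the two parts of this input could be held in a structure
like the following Python sketch. The class names, field names, units and values
are hypothetical; the actual input format of the MICA synthesizer is not
specified in this chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyllableProsody:
    """Prosody targets for one syllable (hypothetical field names and units)."""
    f0_hz: List[float]   # sampled pitch contour, in Hz
    duration_ms: float   # syllable duration, in milliseconds
    energy_db: float     # overall energy level, in dB


@dataclass
class UtteranceDescription:
    """Detailed text description: phoneme sequence plus prosody description."""
    phonemes: List[str]
    prosody: List[SyllableProsody] = field(default_factory=list)


# Illustrative values only, not measured data.
utterance = UtteranceDescription(
    phonemes=["t", "a"],
    prosody=[SyllableProsody(f0_hz=[220.0, 215.0, 210.0],
                             duration_ms=250.0,
                             energy_db=65.0)],
)
print(utterance)
```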
For tonal languages such as Vietnamese, the prosody of speech is composed of two
components, which we call “micro-prosody” and “macro-prosody”:
• Micro-prosody is the variation of pitch, duration and intensity of an
individual word or syllable. In a tonal language, micro-prosody is very
important for distinguishing the syllable’s tone. Thus, the meaning of the
synthesized sound depends greatly on the quality of the micro-prosody.
• Macro-prosody is the application of prosody to a whole phrase or sentence.
It depends on the type of sentence, the speaker's intentions, emotions, etc.
Therefore, the "naturalness" of synthesized speech depends on the ability
to control macro-prosody during the speech synthesis process.
Objectives and Tasks
This thesis is part of MICA speech synthesis research and its main goal is to extract
characteristics of Vietnamese prosody to generate the “prosody description” for
speech synthesis.
In this thesis, we just focus on the differences of Vietnamese tones in different
positions in the sentence and in different types of sentences. In other words, these
are the influences of macro-prosody on micro-prosody.
The first task is to set up a corpus for studying Vietnamese prosody. With this
corpus, we extract and analyze the fundamental frequency, duration and intensity
parameters of syllables carrying each of the eight Vietnamese tones, in three
positions and in three types of sentence.
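The size of this design can be checked with a short sketch. Assuming one pattern
per combination of tone, position and sentence type, eight tones in three
positions and three sentence types give 8 × 3 × 3 = 72 cases, which is consistent
with the seventy-two patterns mentioned in the abstract. The tone labels follow
the List of Figures; the position labels are hypothetical placeholders, since
this chapter only states that there are three positions.

```python
from itertools import product

# Tone labels as they appear in the List of Figures.
TONES = ["1", "2", "3", "4", "5", "5b", "6", "6b"]
# Hypothetical labels: the text only says "three positions".
POSITIONS = ["initial", "medial", "final"]
SENTENCE_TYPES = ["assertive", "interrogative", "imperative"]

# One key per (tone, position, sentence type) combination.
pattern_keys = list(product(TONES, POSITIONS, SENTENCE_TYPES))
print(len(pattern_keys))  # 72
```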
After that, using these prosody parameters, we define simple prosody patterns
for Vietnamese tones, corresponding to the cases of a syllable in three types of
sentence: assertive, interrogative and imperative. By applying these patterns to
re-synthesize some simple sentences and carrying out perception experiments, we
can examine the appropriateness of these prosody patterns.
Thesis outline
This thesis is structured as follows:
• Chapter 1 starts with Section 1.1, which gives some background on prosody
as well as definitions of the terms used in this thesis. Section 1.2
briefly presents prosody modeling and some prosodic models.
• Chapter 2 gives an overview of the Vietnamese language and Vietnamese
prosody.
• Chapter 3 starts with an introduction to Text-to-Speech systems, the
general structure of a TTS system and prosody generation. In the last
section of this chapter, we present some related work and propose a
simple structure for the prosody generation module of a TTS system.
• Chapter 4: Sections 4.1 and 4.2 describe our work on setting up and
analyzing the Vietnamese prosody corpus. In Section 4.3, we propose a set
of prosody patterns for Vietnamese syllables.
• Chapter 5 presents a series of perception experiments for evaluating our
proposed patterns.
• Chapter 6 concludes with the findings of the work presented in the
thesis and suggestions for further work.
Chapter 1: Prosody and Prosodic model
In this chapter, we give an overview of prosody and explain some terms we use in
this thesis. The concept of modeling prosody and some prosodic models are also
briefly presented after that.
1.1. Overview of prosody
1.1.1. The concept of prosody
There is no exact definition of the term “prosody”. We can use the term
"prosody" broadly, meaning “a time series of speech-related information that is not
predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to
the phoneme sequence” [1].
Viewed in the large, prosody is a parallel channel for communication, carrying
some information that cannot be simply deduced from the lexical channel. All
aspects of prosody are transmitted by muscle motions, and in most of them, the
recipient can perceive, fairly directly, the motions of the speaker.
Clearly, under that broad definition of prosody, hand gestures and eyebrow and
face motions can be considered prosody, because they carry information that
modifies and can even reverse the meaning of the lexical channel. However, in the
domain of speech processing, we concentrate on the speech aspect of prosody. Thus,
prosody could include “Pitch”, “Duration” and “Stress”. At the level of the speech
signal, prosody is represented by three components: “Fundamental frequency
(F0)”, “Duration” and “Intensity”.
“Prosody” and “Intonation”
The term prosody refers to certain properties of the speech signal such as audible
changes in pitch, loudness, and syllable length. For some authors the set of prosodic
features also includes other aspects related to speech timing such as rhythm and
speech rate. [13]
Some authors use the term intonation as a synonym for prosody; others restrict it
to the tonal (melodic) aspects of prosody. In this thesis, intonation refers to
pitch variation in speech production and is part of prosody [13]. In other words,
we have:
Prosody = Intonation + Duration
1.1.2. Major components of prosody
As discussed above, prosody consists of:
• Pitch (Fundamental frequency): Among prosodic events, the most overt are
changes in pitch, which together constitute the pitch contour of the
utterance (the F0 contour of the speech signal). Analyses of sentence-level
pitch contours show that the pitch contour of longer utterances can be
broken down into a sequence of elementary contours, which can further be
divided into syllabic contours. [13]
• Duration: duration in prosody concerns the length of a sentence, phrase,
word, syllable, voiced part of a syllable, syllabic nucleus, and so on.
The duration of syllables and speech sounds depends on several
(dependent or interdependent) factors such as speech rate, rhythm,
phonetic nature, etc. In most cases, the absolute duration of an event is
easily measured. However, it is sometimes not obvious how to define the
boundary of an event.
• Stress (Intensity): stress is a prosodic property that has been described
since the very first work on prosody in phonetics. It was said to be related
to loudness and phonology force. Both these characterizations refer to the
perceptual form of prosody: the syllable carrying stress is prominent with
respect to the surrounding syllables, either due to its loudness or to its
dynamic properties.
1.1.3. The functions of prosody
Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-
linguistic information. Linguistic functions such as stress and tone tend to be
expressed as local excursions of pitch movement. Intonation types and para-
linguistic functions may affect the global pitch setting, in addition to characteristic
local pitch excursion near the edge of the sentence (i.e. boundary tones). [1]
Prosody used to convey lexical meaning: Stress, accentual and tone languages.
• Stress language: English is an example of a stress language. Stress
location is part of the lexical entry of each English word. For example,
"apple" and "orange" both have stress on the first syllable, while
"banana" has stress on the second syllable. When an English word is
spoken in isolation in declarative intonation, f0 typically peaks on the
stressed syllable.
• Accentual language: Japanese is an example of an accentual language. A
word is lexically marked as accented (on a particular syllable) or un-
accented. A simplified description is that pitch rises near the beginning of
an accentual phrase and falls on the accented syllable. For detailed
analysis, see Beckman and Pierrehumbert (1988).
• Tone language: Mandarin and Vietnamese are examples of lexical tone
languages. Each syllable is lexically marked with one lexical tone.
Tones have distinctive pitch contours. Altering the pitch contour may
have the consequence of changing the lexical meaning of a word, and
perhaps the meaning of a sentence. For example in Vietnamese, the
meaning of syllables “ta” (we), “tà” (lap of dress), “tã” (nappy), “tả” (to
describe), “tá” (twelve), “tạ” (quintal) are different.
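The six syllables in this example share the same letters and differ only in
their tone mark. The following small sketch makes this visible by decomposing
each syllable with Python's standard unicodedata module. The helper name
split_diacritics is ours, and it relies on the fact that, for these six
syllables, the tone mark is the only diacritic (other Vietnamese syllables may
also carry vowel-quality diacritics).

```python
import unicodedata

# Syllables and glosses exactly as listed in the text above.
SYLLABLES = {
    "ta": "we",
    "tà": "lap of dress",
    "tã": "nappy",
    "tả": "to describe",
    "tá": "twelve",
    "tạ": "quintal",
}


def split_diacritics(syllable):
    """Return (base letters, names of the combining marks) for a syllable."""
    decomposed = unicodedata.normalize("NFD", syllable)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]
    return base, marks


for syllable, gloss in SYLLABLES.items():
    base, marks = split_diacritics(syllable)
    print(syllable, base, marks or ["(no mark)"], gloss)
```

Every syllable reduces to the same base "ta", while the set of marks is
different for each one.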
Prosody used to convey non-lexical information: Intonation type (Question vs.
declarative sentences).
Languages may employ prosody in different ways to differentiate declarative
sentences from questions. A general trend is that questions are associated with
higher pitch somewhere in the sentence, most commonly near the end. This may be
manifested as a final rising contour, or higher/expanded pitch range near the end of
the sentence. In English, declarative intonation is marked by a falling ending while
yes-no question intonation is marked by a rising one, as shown on the last digit
"one" in the English examples. Russian questions, on the other hand, use strong
emphasis on a key word instead of a rising tail. Chinese questions are manifested by
an expanded pitch range near the end of the sentence; however, the speaker
preserves the lexical tone shapes. [1]
Prosody used to convey discourse functions: Focus, prominence, discourse
segments, etc.
Topic initialization is typically associated with high pitch. Pitch is typically raised
in the discourse initial section and lowered in the discourse final section. Also, new
information in the discourse structure is typically accented while old information
is de-accented. [1]
Prosody used to convey emotion.
Most experiments studying emotional speech study stylized emotion, as delivered
by actors and actresses. In these acted-out emotions, a few categories of emotions
can be reliably identified by listeners, and one can find consistent acoustic
correlates of these categories. For example, excitement is expressed by high pitch
and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is
characterized by over-articulation, fast, downward pitch movement, and overall
elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range
is set lower.
The study of emotion in natural speech is a lot more complicated. It is generally
recognized that speakers show mixed feelings and ambiguous states of mind, and
the emotions do not fall into clear cut categories.[1]
We have the summary of prosody functions in Table 1.1:
Table 1.1: Prosody functions
Modifying meaning                                  Not modifying meaning
Linguistic             Paralinguistic              Discourse function    Extra-linguistic
(lexicon information)  (non-lexicon information)
---------------------  --------------------------  --------------------  ----------------
- Tone                 Sentence type:              - Focus               - Emotion
- Accent               - Assertive                 - Prominence          - Sex of speaker
                       - Interrogative             - …                   - …
                       - Imperative
In this thesis work, we just focus on studying the functions of prosody which
modify meaning, namely tones and sentence types in Vietnamese prosody.
1.1.4. Levels of representation of prosodic phenomena
As for other properties of the speech signal, prosodic events can be studied at
various levels of representation (see Table 1.2) [13]
• First, the acoustic level: the acoustic manifestation of prosody
(fundamental frequency, amplitude, and duration) can be measured
directly, using specialized hardware or algorithms (such as pitch
determination algorithms).
• Second, the perceptual level represents the prosodic events as heard by
the listener. As for spectral properties of speech sounds, acoustic
characteristics that can be measured are not always perceptible. The
perceptual representation is accessible to the individual listener, but this
mental representation can hardly be measured. Alternatively it can be
computed with a fair amount of precision on the basis of our knowledge
about psychoacoustics.
• Finally, the linguistic level represents the prosody of an utterance as a
sequence of abstract units (signs, symbols), some of which have a
communicative function in speech, while others may just fulfill syntactic
requirements. The linguistic structure of prosody is not some hidden code
that simply can be revealed using some standard procedure.
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Acoustic                     Perceptual   Linguistic
---------------------------  -----------  ----------------------------------
Fundamental frequency (F0)   Pitch        Tone, intonation, aspect of stress
Intensity                    Loudness     Aspect of stress
Duration                     Length       Aspect of stress
Given the different nature of these representations, it is important to keep them
apart. It can be helpful to have the terminology reflect the level of representation.
For instance, measuring loudness does not equal measuring signal energy. It is
obvious that the perception of loudness is not exclusively related to the amplitude at
one point of the signal, but also dependent on the duration of a speech fragment (the
loudness of which we are measuring), and relative to the loudness of other parts in
the signal.
As one moves away from acoustic level towards the perceptual and/or linguistic
levels, the measurement of some given prosodic property will progressively involve
segmentation (for example, into syllables), context (such as relative prominence),
and structural information (the linguistic interpretation of a syllabic tone, for
example, often depends on whether the related syllable is stressed or not, which
requires a prior analysis of the segmental layer).
1.2. Prosody modeling
Prosodic models serve two purposes: On one hand, they can be scientific
hypotheses that explain how we communicate with each other, and what we
communicate. On the other hand, they can be engineered software systems that are
part of a dialog system or speech synthesizer. To a lesser extent - and this is mostly
potential - a prosodic model can be the background for a system to recognize
prosody in human speech.
In general, a prosodic model combines two components: an intonation model and a
duration model. In this section, we give an overview of some methods for
predicting intonation (F0 contours) and duration that have actually been applied
in speech synthesis.
1.2.1. Intonation models
1.2.1.1 Intonation model classification
The primary goal of intonation research is to model natural f0 contours of speech,
preferably in relation to a transcription and a description of the prosodic intent of
the speaker. The starting point of intonation research is the time series of F0. But
the interpretation of the F0 information diverges widely among intonation models.
Table 1.3 shows one way to classify the various intonation models.
Table 1.3: Intonation model classification
Intonation models classified by the way they describe prosody.
                      Under-specified   -           -           Fully specified
--------------------  ----------------  ----------  ----------  -----------------------
Single component      INTSINT           ToBI, Xu    Tilt, IPO   Olive, Machine learning
Two components        Grønnum           -           Fujisaki    -
Multiple components   -                 -           -           Van Santen
Under-specified or Fully specified
The shape of an accent may be fully-specified (i.e. defined without gaps) or under-
specified (defined by disconnected regions or isolated points). Along another
dimension, f0 values at any given time may be treated as a single component or as
the combination of multiple components.
The advantage of using an under-specified accent shape is that it allows sufficient
distance between specified accent targets to allow a smooth f0 transition, typically
by way of interpolation. The drawback is that it ignores changes of shape between
specified targets.
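A minimal sketch of the under-specified approach, assuming plain linear
interpolation between a few isolated targets (the models discussed below use
their own interpolation schemes), could look as follows. The function name and
the target values are illustrative and are not taken from any of the cited
systems.

```python
def interpolate_f0(targets, times):
    """Linearly interpolate F0 (Hz) between sparse (time_s, f0_hz) targets.

    Assumes at least two targets with distinct times. Times outside the
    target range take the value of the nearest target.
    """
    targets = sorted(targets)
    contour = []
    for t in times:
        if t <= targets[0][0]:
            contour.append(targets[0][1])
        elif t >= targets[-1][0]:
            contour.append(targets[-1][1])
        else:
            # Find the pair of neighbouring targets that brackets t.
            for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
                if t0 <= t <= t1:
                    weight = (t - t0) / (t1 - t0)
                    contour.append(f0 + weight * (f1 - f0))
                    break
    return contour


# Three isolated accent targets; everything between them is left unspecified.
targets = [(0.0, 200.0), (0.5, 260.0), (1.0, 180.0)]
times = [i * 0.25 for i in range(5)]
print(interpolate_f0(targets, times))  # [200.0, 230.0, 260.0, 220.0, 180.0]
```

The transition between targets is smooth by construction, but any change of
shape between two targets is lost, which is the drawback noted above.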
On the other hand, a system with fully specified accents leaves little room to resolve
conflicting targets. A simple concatenation of fully-specified accents will result in a
pitch curve with unnatural jumps at the concatenation joints. Many systems, such as
Fujisaki (1983, 1988), use filters to smooth out abrupt changes in F0. Alternatively,
van Santen (1997, 2000) requires each accent to begin and end at zero to ensure
smooth connections between accents.
Single component or many components?
Many intonation models treat surface intonation contours as the superposition of a
phrase component and an accent component. Grønnum (1992) and Fujisaki (1983,
1988) are representatives of this view.
A well-defined model that fully specifies accent shape and uses multiple components
is van Santen's model (van Santen and Möbius, 1997, 2000; van Santen et al.,
1998), where accents are represented by densely populated points, providing a
mechanism to describe highly complex accent shapes in detail. We characterize van
Santen's system as having multiple components, because in addition to the phrase
component, each accent in the phrase also adds a phrase-length component that
contributes to the surface f0 contour.
The advantage of multiple components is that it provides a mechanism to separate
individual accents from long-term effects. However, if one allows multiple
components, then one necessarily faces the problem that there is no unique solution
in the decomposition of a single f0 time series into multiple components [1]. Any
such decomposition depends on a model of the speech process, and is only as good
as the underlying model.
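The non-uniqueness can be seen in a small numerical sketch. It assumes a simple
additive superposition in Hz and uses made-up values; real models may combine
components differently, for example in the logarithmic domain.

```python
# Illustrative surface F0 values in Hz, not measured data.
f0 = [200.0, 230.0, 250.0, 220.0, 190.0]

# Decomposition A: a flat phrase component; the accent component carries the rest.
phrase_a = [190.0] * 5
accent_a = [f - p for f, p in zip(f0, phrase_a)]

# Decomposition B: a declining phrase component; the accent component differs.
phrase_b = [200.0, 197.5, 195.0, 192.5, 190.0]
accent_b = [f - p for f, p in zip(f0, phrase_b)]

# Both decompositions rebuild exactly the same surface contour.
rebuilt_a = [p + a for p, a in zip(phrase_a, accent_a)]
rebuilt_b = [p + a for p, a in zip(phrase_b, accent_b)]
print(rebuilt_a == f0, rebuilt_b == f0, accent_a == accent_b)
```

Both pairs of components reproduce the same F0 series although their accent
components differ, so only the assumptions of the underlying model can decide
between them.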
In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a
phrase curve and represent intonation contours as a single component. The
advantage of representing f0 information as a single component is that the
representation of accent heights will then be transparent, which lends itself to
convenient automatic labeling. [1]
1.2.1.2 Some prosody models
The following gives an overview of the intonation models in Table 1.3:
• INTSINT (Hirst et al., 2000) is an underspecified intonation system that
defines an accent by a single point. Fitting quadratic spline curves
through these points generates surface f0.
• ToBI: The most widely used under-specified accent shape is represented
by the ToBI model (Beckman and Ayers, 1997; Silverman et al., 1992),
which developed from earlier works such as Pierrehumbert (1980),
Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman
(1988). Each accent is represented by no more than two points, which
specify abstractly the relative contrast of high (H) and low (L). One goal
of the ToBI system is to specify a minimal set of categorical labels for
intonation. These labels are usually interpreted as phonological
distinction between accent types.
• Xu: Xu et al. (1999) represents Chinese tones with under-specified, static
or dynamic targets. The surface f0 contours are generated with a model
that approaches these targets asymptotically within the domain of a
syllable.
• Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near
the peak of an accent and leaves the other regions unspecified, hence its
status half way to a fully specified system. Tilt considers all accent types
to be continuous variations of a single class. Surface variations are
accounted for by changes in the continuous parameters.
• IPO (de Pijper, 1983) prepares a piecewise-linear approximation to the
pitch contour. The slope and height of these lines are then associated
with various types of accents.
• Olive: Olive (1975) described a very early fully-specified system,
following work by Levitt and Rabiner (1970). His model stored the
surface pitch vs. time contour as a function of the grammatical structure
of the sentence. The contour was then approximated by polynomial splines
attached to words, to allow for duration variations.
• Machine-learning: Several works using machine-learning techniques
generate densely sampled f0 values, including Chen et al. (1992) and
Malfrère et al. (1998). We classify these works as fully specified systems
even though in some cases the concept of accent may not be clear. Ross
and Ostendorf (1999) described an interesting machine learning system
where a discrete learning system would predict vectors attached to
phonemes and syllables, and these vectors would in turn drive a (learned)
dynamical system to predict f0.
• Fujisaki: Fujisaki’s phonetic intonation model (Fujisaki and Kawai,
1982) was developed from the filter method first proposed by Öhman
(1967). Fujisaki states that intonation contours are composed of two
types of components, the phrase and the accent. The production process
is represented by a glottal oscillation mechanism which takes phrase and
accent information as input and produces a continuous F0 contour as
output. The input to the mechanism takes the form of impulses, used to
produce phrase shapes, and step functions, which produce accent shapes
[10]. The Fujisaki model has been successfully applied to decompose F0
contours in many languages such as Japanese, German and Finnish, and in
some tonal languages such as Chinese and Thai. Research on applying the
Fujisaki model to Vietnamese is currently under way [11]. We will return
to this model in Chapter 3.
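To make the Fujisaki description more concrete, the sketch below follows the
formulation commonly given in the literature, in which impulse-driven phrase
components and step-driven accent components are added to a base value in the
log-F0 domain. It is an illustration rather than the implementation used in
this thesis: the function names are ours, and the constants alpha = 3.0,
beta = 20.0 and gamma = 0.9 are assumed typical values that would normally be
fitted per speaker.

```python
import math


def phrase_response(t, alpha=3.0):
    """Response of the phrase control mechanism to an impulse at t = 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Response of the accent control mechanism to a step at t = 0, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (s).

    phrase_cmds: list of (onset_time, magnitude) impulses.
    accent_cmds: list of (onset_time, offset_time, amplitude) steps.
    """
    log_f0 = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (accent_response(t - onset) - accent_response(t - offset))
    return math.exp(log_f0)


# Illustrative commands: one phrase impulse and one accent step (made-up values).
contour = [fujisaki_f0(i * 0.05, 120.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
           for i in range(30)]
print([round(value, 1) for value in contour[:5]])
```

Because the components are summed before exponentiation, the generated contour
stays positive and changes smoothly even though the input commands are abrupt.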
1.2.2. Duration modeling
We now give a general overview of modeling the duration component of prosody.
Common methods to predict duration in speech synthesis differ in the following
aspects: [6]
• Durational Unit Predicted: the temporal unit predicted by most current
systems is either the phone (phoneme), often referred to as the “segment”,
or the syllable. Since phone durations are eventually required for
acoustic synthesis, all syllable-based models include some kind of
mechanism for calculating segment durations from the syllable duration.
For example, in Barbosa and Bailly’s model, the basic unit is
delimited by the onset of the nuclear vowel and the onset of the following
vowel. They are computed by a sequential network constrained by an
internal clock (basically the speaking rate).
• Predictor factors: Every model uses a particular vector of input features,
which are extracted on the linguistic and phonetic levels. The most commonly
employed factors include:
  - on the syllabic level: the degree of accentuation and the position in
    a higher-level unit, such as the foot or accent group;
  - on the segmental level: the properties of the phone to be
    synthesized and its neighboring phones;
  - on the phrase level: the location of a segment with respect to a
    minor or major boundary and the position of the phrase in a
    sentence.
• The Prediction Method: The algorithms used for calculating a numerical
duration value from the vector of input features can be roughly divided
into rule systems and statistical approaches. In Figure 1-1, the
statistical approaches are subdivided into parametric and non-parametric
regression models. Whereas the structure of a parametric regression model,
in terms of how it processes the input factors, is determined a priori,
non-parametric regression models are developed by unsupervised training
and the model structure is determined automatically (multi-layer
perceptrons, CARTs). The main difference between rule-based and
statistical models is that a rule system can be built