MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Pham Van Dong

SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGES
BASED ON ADAPTATION APPROACH: APPLICATION TO
MUONG LANGUAGE

DOCTORAL DISSERTATION IN
COMPUTER SCIENCE

Ha Noi – 2023


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Pham Van Dong

SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGES
BASED ON ADAPTATION APPROACH: APPLICATION TO
MUONG LANGUAGE

Major: Computer science
Code: 9480101

DOCTORAL DISSERTATION IN
COMPUTER SCIENCE

ADVISORS:



1. Dr. MAC DANG KHOA
2. Assoc. Prof. TRAN DO DAT

Ha Noi – 2023


DECLARATION OF AUTHORSHIP
I, Pham Van Dong, declare that the dissertation titled “Speech Synthesis for Low-Resourced Languages based on Adaptation Approach: Application to Muong Language” has
been entirely composed by myself. I affirm the following points:
• This work was done wholly or mainly while in candidature for a Ph.D.
research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualifications at
Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation, where
reference has been made to the published work of others.
• The dissertation submitted is my own, except where work done in
collaboration has been included; such collaborative contributions have
been clearly indicated.
Hanoi, September 19, 2023
Ph.D. Student

Pham Van Dong
ADVISORS
1. Dr. Mac Dang Khoa

2. Assoc. Prof. Tran Do Dat




ACKNOWLEDGMENT
Foremost, I would like to express my most sincere and deepest gratitude to my thesis
advisors, Dr. Mạc Đăng Khoa (Speech Communication Department, MultiLab at MICA) and
Assoc. Prof. Trần Đỗ Đạt (The Ministry of Science and Technology, Vietnam), for their continuous
support and guidance during my Ph.D. program, and for providing me with such a rigorous and
inspiring research environment. I am grateful to Dr. Mạc Đăng Khoa for his excellent
mentorship, caring, patience, and immense Text-To-Speech (TTS) knowledge. His advice
helped me in all the research and writing of this thesis. I am very thankful to Assoc. Prof. Trần Đỗ
Đạt for shaping my thesis at the beginning and for his enthusiasm and encouragement. He
substantially facilitated my Ph.D. research, especially when I was new to speech
processing and TTS, with his valuable comments on Vietnamese and Muong TTS.
I thank all MICA members for their help during my Ph.D. study. My sincere thanks to Dr.
Nguyen Viet Son, Assoc. Prof. Dao Trung Kien and Dr. Do Thi Ngoc Diep for giving me much
support and valuable advice. Thanks to Nguyen Van Thinh, Nguyen Tien Thanh, Dang Thanh
Mai, and Vu Thi Hai Ha for their help. I want to thank my Hanoi University of Mining and
Geology colleagues for all their support during my Ph.D. study. Special thanks to my family
for understanding my hours glued to the computer screen.
Hanoi, September 19, 2023
Ph.D. Student



ABSTRACT
Text-to-speech (TTS) synthesis is the automatic conversion of text into speech. Typically,
building a high-quality voice requires collecting tens of hours of speech from a professional
speaker recorded with a high-quality microphone. About 7,000 languages are spoken worldwide,
but only a few, such as English, Spanish, Mandarin, and Japanese, enjoy high-quality
TTS. So-called "low-resourced languages", and even languages that are not yet written, have
no TTS at all. Thus, to apply TTS technology to low-resourced languages, it is
necessary to study other TTS methods.
In Vietnam, Vietnamese is the mother tongue and the most widely used language. Muong
is a group of languages spoken by the Muong people of Vietnam. It belongs to the Austroasiatic
language family and is closely related to Vietnamese, and the Muong are among the five most
populous ethnic groups in the country. However, Muong still lacks an official script, making it
a typical representative of low-resourced languages in Vietnam. Researching TTS
technologies to create a TTS system for the Muong language is therefore challenging.
In the first part of this thesis, we present an overview of TTS. While studying the phonetics of
the Vietnamese and Muong languages, the thesis has also developed and published several tools to
support TTS technology for both languages. In the rest of the thesis, we
conduct various experiments on creating TTS for a low-resourced language; specifically, we
experiment with the Muong language. We focus on two main low-resourced language groups:
• Written: We use input emulation to simulate the reading of the Muong language
with a Vietnamese TTS, and apply cross-lingual transfer-learning adaptation.
• Unwritten: We experiment with adaptation in two directions. The first is to
create Muong speech synthesis directly from Vietnamese text and Muong
voice. The second is to create Muong speech synthesis from translation
through an intermediate representation.
We hope our findings can serve as an impetus for developing speech synthesis for low-resourced
languages worldwide and contribute to the foundation for speech synthesis development for the
53 ethnic minority languages in Vietnam.
Hanoi, September 19, 2023
Ph.D. Student



CONTENT
DECLARATION OF AUTHORSHIP.................................................................................I

ACKNOWLEDGMENT .................................................................................................... II
ABSTRACT .........................................................................................................................III
CONTENT..............................................................................................................................IV
ABBREVIATIONS ........................................................................................................... VIII
LIST OF TABLES ................................................................................................................IX
LIST OF FIGURES ..............................................................................................................XI
INTRODUCTION ................................................................................................................. 1
PART 1 : BACKGROUND AND RELATED WORKS ............................................ 5
CHAPTER 1. OVERVIEW OF SPEECH SYNTHESIS AND SPEECH
SYNTHESIS FOR LOW-RESOURCED LANGUAGE...................................................... 6
1.1. Overview of speech synthesis .................................................................................... 6
1.1.1. Overview...................................................................................................................... 6
1.1.2. TTS architecture .......................................................................................................... 8
1.1.3. Evolution of TTS methods over time ........................................................................ 9
1.1.3.1. TTS using unit-selection method...................................................................... 10
1.1.3.2. Statistical parameter speech synthesis.............................................................. 11
1.1.3.3. Speech synthesis using deep neural networks ................................................. 13
1.1.3.4. Neural speech synthesis .................................................................................... 14
1.2. Speech synthesis for low-resourced languages..................................................... 19
1.2.1. TTS using emulating input approach....................................................................... 20
1.2.2. TTS using the polyglot approach ............................................................................. 22
1.2.3. Speech synthesis for low-resourced language using the adaptation approach...... 25
1.3. Machine translation.................................................................................................. 27
1.3.1. Neural translation model........................................................................................... 28
1.3.2. Attention in neural machine translation ................................................................... 29
1.3.3. Statistical machine translation based on phrase ...................................................... 30
1.3.3.1. Statistical machine translation problem based on phrase................................ 30
1.3.3.2. Translation model and language model ........................................................... 31
1.3.3.3. Decode the input sentence in the translation system ....................................... 32
1.3.3.4. Model for building a statistical translation system .......................................... 34

1.3.4. Machine translation through intermediate representation ...................................... 34
1.3.5. Speech translation for unwritten low-resourced languages.................................... 36
1.4. Speech synthesis evaluation metrics ...................................................................... 38
1.4.1. Mean Opinion Score (MOS) .................................................................................... 38
1.4.1.1. Definition ........................................................................................................... 38
1.4.1.2. Formula .............................................................................................................. 38
1.4.1.3. Significance........................................................................................................ 38
1.4.1.4. Confidence Interval (CI) ................................................................................... 39
1.4.2. Mel Cepstral Distortion (MCD) ............................................................................... 39


1.4.2.1. Concept .............................................................................................................. 39
1.4.2.2. Formula .............................................................................................................. 39
1.4.2.3. Significance........................................................................................................ 40
1.4.2.4. MCD with Dynamic Time Warping (MCD – DTW) .................................... 40
1.4.3. Analysis of variance (Anova) ................................................................................... 40
1.4.4. Intelligibility .............................................................................................................. 42
1.5. Conclusion.................................................................................................................. 42
CHAPTER 2. VIETNAMESE AND MUONG LANGUAGE ..................................... 44
2.1. Vietnamese language ................................................................................................ 44
2.1.1. History of Vietnamese .............................................................................................. 44
2.1.2. Vietnamese phonetic system .................................................................................... 45
2.1.2.1. Vietnamese syllable structure .......................................................... 46
2.1.2.2. Vietnamese phonetic system............................................................................. 47
2.1.2.3. Vietnamese tone system.................................................................................... 49
2.2. Muong language........................................................................................................ 50
2.2.1. Overview of Muong people and Muong language ................................................. 50
2.2.1.1. Muong history.................................................................................................... 50
2.2.1.2. Viet Muong group ............................................................................................. 51

2.2.1.3. Muong dialects................................................................................................... 53
2.2.1.4. Muong written script ......................................................................................... 54
2.2.2. Muong phonetics system .......................................................................................... 55
2.2.2.1. Muong syllable structure................................................................................... 55
2.2.2.2. Muong phoneme system ................................................................................... 55
2.2.2.3. Muong tone system ........................................................................................... 57
2.3. Comparison between Vietnamese and Muong.................................................... 57
2.4. Discussion and proposed approach .......................................................... 60
PART 2 : SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
........................................................................................................................................................ 61
CHAPTER 3. EMULATING OF THE MUONG TTS BASED ON INPUT
TRANSFORMATION OF THE VIETNAMESE TTS ...................................................... 62
3.1. Proposed method ...................................................................................................... 63
3.1.1. Muong G2P module.................................................................................................. 64
3.1.2. Muong emulating IPA module................................................................................. 65
3.2. Experiment................................................................................................................. 65
3.2.1. Testing materials ....................................................................................................... 66
3.2.2. Experiment protocol.................................................................................................. 67
3.2.3. Results ........................................................................................................................ 68
3.2.4. Analysis by ANOVA method .................................................................................. 72
3.2.4.1. MOS analysis by ANOVA ............................................................................... 72
3.2.4.2. Intelligibility analysis by ANOVA................................................................... 75
3.3. Conclusion.................................................................................................................. 77



CHAPTER 4. CROSS-LINGUAL TRANSFER LEARNING FOR MUONG
SPEECH SYNTHESIS.............................................................................................................. 78
4.1. Proposed method ...................................................................................................... 78

4.2. Experiment................................................................................................................. 82
4.2.1. Dataset........................................................................................................................ 82
4.2.1.1. Vietnamese data................................................................................................. 82
4.2.1.2. Muong Project's data ........................................................................ 84
4.2.1.3. Muong fine-tuning data..................................................................................... 84
4.2.2. Graphemes to phonemes .......................................................................................... 85
4.2.3. Training the pretrained model using Vietnamese dataset....................................... 86
4.2.4. Finetuned TTS model on Muong datasets .............................................................. 87
4.3. Evaluation .................................................................................................................. 88
4.4. MOS analysis by ANOVA....................................................................................... 91
4.5. Conclusion.................................................................................................................. 94
PART 3 : SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN
LANGUAGE.............................................................................................................................. 96
CHAPTER 5. GENERATE UNWRITTEN LOW-RESOURCED LANGUAGE’S
SPEECH DIRECTLY FROM RICH-RESOURCE LANGUAGE’S TEXT ................. 97
5.1. Introduction ............................................................................................................... 97
5.2. Proposed method ...................................................................................................... 98
5.2.1. Model architecture .................................................................................................... 98
5.2.2. Database..................................................................................................................... 99
5.2.3. Training the speech synthesis system .................................................................... 100
5.2.4. Evaluation ................................................................................................................ 100
5.2.5. MOS analysis by ANOVA..................................................................................... 105
5.2.5.1. ANOVA analysis in Muong Bi speech synthesis ......................................... 105
5.2.5.2. ANOVA analysis in Muong Tan Son speech synthesis ............................... 108
5.3. Conclusion................................................................................................................ 111
CHAPTER 6. SPEECH SYNTHESIS FOR UNWRITTEN LOW-RESOURCED
LANGUAGE USING INTERMEDIATE REPRESENTATION .................................. 112
6.1. Proposal Method..................................................................................................... 112
6.2. Experiment............................................................................................................... 114
6.2.1. Database building .................................................................................................... 114

6.2.2. System development ............................................................................................... 114
6.2.2.1. Text to phone translation................................................................................. 115
6.2.2.2. Phone to Sound Conversion............................................................................ 117
6.3. Evaluation ................................................................................................................ 119
6.3.1. Evaluation in Muong Bi and Muong Tan Son ...................................................... 119
6.3.2. MOS analysis by ANOVA..................................................................................... 122
6.3.2.1. ANOVA analysis in Muong Bi speech synthesis ......................................... 122
6.3.2.2. ANOVA analysis in Muong Tan Son speech synthesis ............................... 125
6.4. Conclusion and comparison.................................................................................. 128
CONCLUSION AND FUTURE WORKS ............................................................... 135


Conclusions ........................................................................................................................... 135
Future work.......................................................................................................................... 136
PUBLICATIONS............................................................................................................... 138
BIBLIOGRAPHY ............................................................................................................... 139
APPENDIX A........................................................................................................................... 1
A.1. Vietnamese and Muong phonetic............................................................................. 1
A.2. Muong G2P.................................................................................................................. 4
A.3. Muong Vietnamese phone mapping........................................................................ 6
A.4. Information of Muong volunteers who participated in the assessment ............ 9
A.5. Speech signal samples of the Muong TTS in chapter 5...................................... 12



ABBREVIATIONS
Abbreviation: Expansion (explanation, where given)

CART: Classification And Regression Tree
F0: Fundamental Frequency
HMM: Hidden Markov Model
HTK: Hidden Markov model ToolKit (a portable toolkit for building and manipulating hidden Markov models)
HTS: HMM-based speech synthesis
IPA: International Phonetic Alphabet
MARY (TTS): Modular Architecture for Research on speech sYnthesis
MFCC: Mel Frequency Cepstral Coefficients
ML: Maximum Likelihood
MLSA: Mel Log Spectrum Approximation
MOS: Mean Opinion Score
MSDHMM: Multi-Space probability Distribution HMM
NLP: Natural Language Processing
OCR: Optical Character Recognition
POS: Part-Of-Speech (word class or lexical category)
PP: Prepositional Phrase
PSOLA: Pitch Synchronous OverLap and Add
SAMPA: Speech Assessment Methods Phonetic Alphabet
SPTK: Speech signal Processing ToolKit
SSML: Speech Synthesis Markup Language
TDPSOLA: Time-Domain Pitch Synchronous OverLap and Add
TTS: Text-To-Speech
VNSP: VNSpeechCorpus for synthesis
WEKA: Waikato Environment for Knowledge Analysis (a collection of machine learning algorithms for data mining tasks)
X-SAMPA: Extended Speech Assessment Methods Phonetic Alphabet
XML: eXtensible Markup Language
p(e | f): Conditional probability
Pi: Product of a sequence of numbers
Sigma: Factor


LIST OF TABLES
Table 2.1 Vietnamese syllable structure [94] ..........................................................46
Table 2.2 Vietnamese syllable structure [96] ..........................................................46
Table 2.3 Vietnamese syllables based on structure ..................................................47
Table 2.4 Hanoi Vietnamese initial consonants .........................................................48
Table 2.5 The letter of initial consonant ...................................................................48
Table 2.6 Hanoi Vietnamese final consonant ...........................................................49
Table 2.7 Tone of Hanoi Vietnamese [108] ..............................................................49
Table 2.8 Muong syllabic structure ..........................................................................55
Table 2.9 Muong final sound system .........................................................................56
Table 2.10 Muong Hoa Binh tone system [115] .......................................................57
Table 2.11 Muong Bi and Muong Tan Son Tone ......................................................57
Table 2.12 Muong and Vietnamese phonetic comparison (orthography in normal,
IPA in italic; Vi: Vietnamese; Mb: Muong Bi; Mts: Muong Tan Son) .......................59
Table 2.13 Comparing the tone of Vietnamese with Muong Tan Son and Muong Bi
.......................................................................................................................................60
Table 3.1 Muong G2P Result Sample .......................................................................64
Table 3.2 Examples of applying transformation rules to convert the Muong text into
input text for Vietnamese TTS........................................................................................65
Table 3.3. Testing material for emulating tone .........................................................66
Table 3.4. Testing material for emulating phone (the concerning phonemes in bold)
.......................................................................................................................................67
Table 3.5. Testing material for remaining phonemes ...............................................67
Table 3.6 ANOVA Results for MOS Test ...................................................................73
Table 3.7 ANOVA Results for Intelligibility Test ......................................................75
Table 4.1 Parameters of acoustic model ...................................................................80

Table 4.2 Vietnamese dataset information ................................................................83
Table 4.3 Muong recorded data ................................................................................85
Table 4.4 The Muong split data set ...........................................................................85
Table 4.5 Parameter for optimizer ............................................................................86
Table 4.6 Value of parameters when training Hifigan model ..................................86
Table 4.7 The specifications of the in-domain and out-domain test sets ..................89
Table 4.8 Test set samples .........................................................................................89
Table 4.9 Evaluation results .....................................................................................90
Table 4.10 ANOVA Results for in-domain MOS Test ...............................................92
Table 4.11 ANOVA Results for out-domain MOS Test .............................................93
Table 4.12 ANOVA Results for in/out domain MOS Test .........................................94
Table 5.1 Evaluation Score .....................................................................................102
Table 5.2 TTS evaluation with in-domain test set ...................................................103
Table 5.3 TTS evaluation with out-domain test set .................................................104
Table 5.4 ANOVA Results for in-domain MOS Test for Muong Bi ........................106
Table 5.5 ANOVA Results for out-domain MOS Test for Muong Bi ......................107
Table 5.6 ANOVA Results for Muong Bi in/out domain MOS Test ........................107
Table 5.7 ANOVA Results for in-domain MOS Test for Muong Tan Son ...............109
Table 5.8 ANOVA Results for out-domain MOS Test for Muong Tan Son .............110
Table 5.9 ANOVA Results for Muong Tan Son in/out domain MOS Test...............110
Table 6.1 Examples of labeling Vietnamese text into an intermediate representation
of Muong Bi and Muong Tan Son phonemes. .............................................................117
Table 6.2 Text information of Muong language datasets .......................................118
Table 6.3 TTS evaluation with in-domain test set ...................................................119
Table 6.4 TTS evaluation with out-domain test set .................................................120


Table 6.5 ANOVA Results for in-domain MOS Test for Muong Bi ........................123
Table 6.6 ANOVA Results for out-domain MOS Test for Muong Bi ......................124

Table 6.7 ANOVA Results for Muong Bi in/out domain MOS Test ........................124
Table 6.8 ANOVA Results for in-domain MOS Test for Muong Tan Son ...............126
Table 6.9 ANOVA Results for out-domain MOS Test for Muong Tan Son .............127
Table 6.10 ANOVA Results for Muong Tan Son in/out domain MOS Test.............127
Table A.1 Vietnamese vowels ......................................................................................1
Table A.2 The Muong initial consonant ......................................................................1
Table A.3 Muong vowels system .................................................................................2
Table A.4 The correspondences between Vietnamese and Muong in 12 words refer to
the human body parts [137] ............................................................................................4
Table A.7 Muong G2P.................................................................................................4
Table A.8 Muong Vietnamese phone mapping............................................................7
Table A.9 Muong Hoa Binh volunteers .......................................................................9
Table A.10 Muong Phu Tho volunteers ....................................................................10



LIST OF FIGURES
Figure 1.1. Basic system architecture of a TTS system [22] ......................................8
Figure 1.2 Neural TTS architecture [3] ......................................................................9
Figure 1.3. General and clustering-based unit-selection scheme: Solid lines represent
target costs and dashed lines represent concatenation costs [13] ................................10
Figure 1.4. Core architecture of HMM-based speech synthesis system [25] ...........11
Figure 1.5. General HMM-based synthesis scheme [13, p. 5] .................................12
Figure 1.6. A speech synthesis framework based on a DNN [29] ............................13
Figure 1.7 Encoder and Decoder diagram in Seq2Seq model .................................14
Figure 1.8 Char2Wav model [23] .............................................................................17
Figure 1.9 Model of the Tacotron synthesis system [24] ..........................................18
Figure 1.10 Block diagram of the Tacotron 2 system architecture [25] ..................19
Figure 1.11 Scheme of a HMM-based polyglot synthesizer [48] .............................23

Figure 1.12 Approaches to transfer TTS model from source language to target
language [32] ................................................................................................................26
Figure 1.13 Examples of sequence to sequence transformation [55] ......................28
Figure 1.14 Describe the location of the Attention model in neural machine
translation ......................................................................................................................29
Figure 1.15 Example of translating an English input sentence into Chinese based on
the phrase ......................................................................................................................31
Figure 1.16 Illustrate the process of translating a Spanish sentence into an English
sentence [63] .................................................................................................................33
Figure 1.17 Deploying a statistical translation system [67] ....................................34
Figure 1.18 An ordinary voice translation system [11]............................................35
Figure 1.19 Model of the speech-to-speech machine translation system using
intermediate representation for unwritten language.....................................................36
Figure 1.20 Voice-to-text translation system [83] ....................................................37
Figure 2.1. Mon-Khmer branch of the Austroasiatic family [109, pp. 175–176] ....51
Figure 2.2 Viet-Muong Group [110] ........................................................................52
Figure 2.3 The distribution of the Muong dialects [114, p. 299] .............................53
Figure 3.1 Emulating TTS for Muong .......................................................................63
Figure 3.2 Muong G2P Module ................................................................................64
Figure 3.3 Intelligibility Results for Muong emulating tones ..................................69
Figure 3.4 Intelligibility Test Result for emulating close phonemes ........................70
Figure 3.5 Intelligibility Test Result for Equivalent phonemes ................................71
Figure 3.6 MOS Emulating Test Result ....................................................................72
Figure 4.1 Low-resourced L2 TTS transfer learning from rich resource L1 ...........79
Figure 4.2 Block diagram of the speech synthesis system architecture ...................80
Figure 4.3 Duration histogram .................................................................................83
Figure 4.4 Duration distribution across the M_15m, M_30m, and M_60m datasets.
.......................................................................................................................................85
Figure 4.5 Training loss and validation loss of pretrained TTS model ....................87
Figure 4.6 Training loss and validation error of Hifigan model ..............................87

Figure 4.7 Training loss and validation loss of M_15m ...........................................88
Figure 4.8 Training loss and validation loss of M_30m and M_60m ......................88
Figure 5.1 System architecture .................................................................................99
Figure 5.2 WaveGlow model architecture [136] ......................................................99
Figure 5.3 Muong Phu Tho training loss and validation loss after training acoustic
model ...........................................................................................................................100
Figure 5.4 Muong Hoa Binh training loss and validation loss after training acoustic
model ...........................................................................................................................100


Figure 5.5 Testing interface ....................................................................................102
Figure 6.1 Training phase TTS L1 text to L2 speech system uses intermediate
representation of phoneme level ..................................................................................113
Figure 6.2 Decoding phase TTS L1 text to L2 speech system uses intermediate
representation of phoneme level ..................................................................................113
Figure 6.3 The result after manual annotation. ......................................................114
Figure 6.4 Phone to sound module, as a speech synthesis from phone sequence ..117
Figure 6.5 Muong Hoa Binh Training loss and validation loss after training acoustic
model ...........................................................................................................................118
Figure 6.6 Muong Phu Tho Training loss and validation loss after training acoustic
model ...........................................................................................................................118
Figure 6.7 Testing interface ....................................................................................119
Figure 6.8 Comparing the synthesized speech results on Muong Hoa Binh using three
methods ........................................................................................................................129
Figure 6.9 Comparing the synthesized speech results on Muong Hoa Binh using three
methods ........................................................................................................................130
Figure 6.10 Comparing the synthesized speech results on Muong Phu Tho using two
methods ........................................................................................................................131
Figure 6.11 Comparing the synthesized speech results on Muong Phu Tho using two

methods ........................................................................................................................132
Figure 6.12 Summary of directions for low-resourced language speech synthesis ....133
Figure A.1 Raw Muong Hoa Binh: ban vận động thành lập hội trí thức tỉnh ra mắt
.......................................................................................................................................13
Figure A.2 Muong Hoa Binh synthesis: ban vận động thành lập hội trí thức tỉnh ra
mắt .................................................................................................................................13
Figure A.3 Muong Phu Tho raw: ban vận động thành lập hội trí thức tỉnh ra mắt .14
Figure A.4 Muong Phu Tho synthesis: ban vận động thành lập hội trí thức tỉnh ra
mắt .................................................................................................................................14
Figure A.5 Muong Hoa Binh raw - Bố cháu ở nhà hay đi đâu .................................15
Figure A.6 Muong Hoa Binh synthesis: Bố cháu ở nhà hay đi đâu .........................15
Figure A.7 Muong Phu Tho raw: Bố cháu ở nhà hay đi đâu ...................................16
Figure A.8 Muong Phu Tho synthesis: Bố cháu ở nhà hay đi đâu ...........................16



INTRODUCTION
Motivation
Today's speech-processing technology is essential in many aspects of human-machine
interaction. Many recent voice interaction systems have been introduced, allowing users to
communicate with devices on various platforms, such as smartphones (Apple Siri, Google
Cloud, Amazon Alexa, etc.), intelligent cars (BMW, Ford, etc.), and smart homes. In these
systems, one of the essential components is speech synthesis or Text-to-Speech (TTS), which
can convert input text into speech. Developing a TTS system for a language is not only the
implementation of speech processing techniques but also requires linguistic studies such as
phonetics, phonology, syntax, and grammar.
According to statistics in the 25th edition of Ethnologue1 (regarded as the most
comprehensive source of information on linguistic statistics), there are 7,151 living languages
in the world, belonging to 141 language families, of which 2,982 languages are not written.

Some languages have not been described in academic literature, such as dialects of ethnic
minorities. Machine learning methods based on big data do not readily apply to low-resourced
languages, especially unwritten ones. The field of low-resourced/unwritten language
processing has begun to attract attention in the past few years but has yet to produce many
results. Nevertheless, research results in this field are essential: besides bringing voice
communication technologies and the products built on them to ethnic minority communities,
they also contribute to the conservation of endangered languages.
Regarding the Vietnamese language and speech processing field, domestic research units
have given it comprehensive attention and addressed various aspects, ranging from natural
language processing problems such as text processing, syntactic component separation, and
semantics to speech processing problems such as synthesis and recognition. However, the
problem of language and speech processing in general, including TTS systems for minority
languages without a writing system in Vietnam, has not received much attention due to the
scarcity of data sources such as bilingual text data and speech data, as well as a lack of related
linguistic studies.
The Muong language presents unique linguistic characteristics that make it challenging to
develop a TTS system, such as tonality and complex phonetic structures. Therefore, this thesis
aims to fill this gap by focusing on developing a TTS system for the Muong language, a
minority language spoken in Vietnam that does not have a writing system (only the Muong
Hoa Binh dialect has had a writing system, introduced in 2016). This research area is novel not only in Vietnam
but also worldwide, and the development of a Muong TTS system can contribute to preserving
and promoting this endangered language.
Context and constraints
This thesis will classify low-resourced languages into two categories: written and
unwritten. The Muong language will be the object of study in both cases:
• Written: The Muong dialect of Hoa Binh will be examined, as it possesses a
written form.
• Unwritten: The Muong dialect of Phu Tho will be investigated, as it lacks a
written form.
In other regions, the Muong people currently do not use written language. They often
read directly from Vietnamese text and convert it into Muong speech for broadcasting and


1 https://www.ethnologue.com


communication purposes. This research aims to address these challenges and improve the
accessibility of TTS technology for both written and unwritten Muong dialects.
Moreover, this thesis is conducted within the scope of, and in collaboration with, the
project DLCN.20/17: "Research and development automatic translation system from
Vietnamese text to Muong speech, apply to unwritten minority languages in Vietnam"
(Nghiên cứu xây dựng hệ dịch tự động văn bản tiếng Việt ra tiếng nói tiếng Mường, hướng
đến áp dụng cho các ngôn ngữ dân tộc thiểu số chưa có chữ viết ở Việt Nam). Specific
components of this project include:
• Recorded speech from both Muong Hoa Binh and Muong Phu Tho dialects.
• A machine translation tool that converts Vietnamese text to an intermediate
representation of the Muong language.
In turn, the research findings of this thesis have been successfully applied and
integrated into the project above, demonstrating the practical value of the work undertaken
in this thesis.
Challenges
Current research in this area faces the following challenges:
• Data Scarcity: The foremost challenge is the paucity of training data.
TTS models demand substantial text-speech pairs for effective training.
However, for low-resourced languages, acquiring such data can be
exceedingly difficult, if not impossible.
• Limited Linguistic Knowledge: Inadequate linguistic knowledge hinders
TTS system development. Understanding language structure,
vocabulary, and prosody is crucial, but this knowledge is frequently
absent for low-resourced languages.

• Lack of Linguistic Studies: Linguistic research serves as the backbone
for building TTS systems. Unfortunately, languages with limited
resources often lack comprehensive linguistic studies, making it arduous
to capture essential linguistic characteristics.
To address these challenges, this work proposes an adaptive TTS approach that
efficiently utilizes limited resources to synthesize high-quality speech for Muong, a
low-resourced language. The approach leverages transfer learning techniques from related
languages and applies unsupervised learning methods to reduce the need for extensive
labelled data. In addition, emulating the input of a rich-resource TTS is a promising option
for a written low-resourced language. For an unwritten low-resourced language, one
adaptation is to use the text, or an intermediate representation, of another language to help
build a better TTS.
The proposed approach demonstrates the effectiveness of adaptive TTS in synthesizing
low-resourced languages. However, further research and investment in linguistic studies
for low-resourced languages are necessary to improve the quality of TTS systems. With
continued efforts, we can develop more robust TTS systems that provide access to speech
synthesis for all languages, regardless of their resource availability.
Objectives & approaches
This thesis aims to develop a Text-to-Speech (TTS) system for low-resourced languages,
focusing on the Muong language, by utilizing adaptation techniques. We categorize low-resourced languages into two groups, and for each group, we aim to employ suitable methods
to generate TTS:


• Written low-resourced languages: Using input emulation and an adaptive
approach to enhance the available linguistic resources.
• Unwritten low-resourced languages: Employing intermediate
representations or leveraging text from rich-resourced languages to bridge
the gap in linguistic resources.
In this way, the thesis aims to make TTS technology more accessible to low-resourced
languages, thus expanding its applications and fostering communication across diverse
linguistic communities. By focusing on the Muong language as a specific case study, this research
not only contributes to the broader field of low-resourced languages but also opens doors for
practical applications. For instance, it paves the way for the development of applications
catering to the Muong community, including Muong radio broadcasts and Muong-speaking
newspapers, all generated from Vietnamese text. This demonstrates the real-world impact of
the research, showcasing its potential to empower minority languages like Muong and preserve
their cultural heritage.
Contributions
The thesis presents the following key contributions:
• First contribution: A method for synthesizing speech from written
text for a language with limited data, using the Muong language as a specific
application case. This includes (1) an adaptation technique that utilizes input
from a Vietnamese speech synthesis system (without requiring training data)
and (2) fine-tuning the Vietnamese speech synthesis model with a small
amount of Muong language data.
• Second contribution: A method for synthesizing speech for an unwritten
language using a closely related language with available resources
(generating Muong speech from Vietnamese text). This approach treats the
Muong language as if it were unwritten. The two proposed methods are: (1)
employing an intermediate representation and (2) directly converting
Vietnamese text into Muong speech.
In addition to the two main contributions mentioned above, we also conducted a
comparative study of the Vietnamese and Muong languages, drawing several conclusions
valuable for phonetic studies and natural language processing. We have also published various
educational materials and tools for processing text and vocabulary in Vietnamese and Muong.

Dissertation outline
The dissertation is composed of three parts and six chapters, organized as follows:
PART 1: BACKGROUND AND RELATED WORK
• Chapter 1, titled "Overview of speech synthesis and speech synthesis for
Low-Resourced Languages": This chapter concisely reviews the existing
literature to gain a comprehensive understanding of TTS. Research
directions for low-resourced TTS are also detailed in this chapter.
• Chapter 2, titled "Vietnamese and Muong Language": This chapter
presents research on the phonology of Vietnamese and Muong
languages. Computational linguistic resources for Vietnamese speech
processing are described in detail as applied in Vietnamese TTS.
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
• Chapter 3, titled "Emulating Muong TTS Based on Input Transformation
of Vietnamese TTS," presents the proposal to synthesize Muong speech
by adapting existing Vietnamese TTS systems. This approach can be


experimentally applied to create TTS systems for other Vietnamese
ethnic minority languages quickly.
• Chapter 4, titled "Cross-Lingual Transfer Learning for Muong Speech
Synthesis": In this chapter, we use and experiment with approaches for
Muong TTS that leverage Vietnamese resources. We focus on transfer
learning by creating Vietnamese TTS, further training it with different
Muong datasets, and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN
LANGUAGE
• Chapter 5, titled "Generating Unwritten Low-Resourced Language's
Speech Directly from Rich-resource Language's Text," presents our
approach for addressing speech synthesis challenges for unwritten low-resourced languages by synthesizing L2 speech directly from L1 text.

The proposed system is built using end-to-end neural network
technology for text-to-speech. We use Vietnamese as L1 and Muong as
L2 in our experiments.
• Chapter 6, titled "Speech synthesis for Unwritten Low-Resourced
Languages Using Intermediate Representation": This chapter proposes
using phoneme representation due to its close relationship with speech
within a single language. The proposed method is applied to the
Vietnamese and Muong language pair. Vietnamese text is translated into
an intermediate representation of two unwritten dialects of the Muong
language: Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho. The
evaluation reveals relatively high translation quality for both dialects.
In conclusion, speech synthesis for low-resourced languages is a significant research
area with the potential to positively impact the lives of speakers of these languages. Despite
challenges posed by limited data and linguistic knowledge, advancements in speech
synthesis technology and innovative approaches enable the development of high-quality
speech synthesis systems for low-resourced languages. The work presented in this
dissertation contributes to this field by exploring novel methods and techniques for speech
synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech
synthesis for low-resourced languages, particularly in response to the growing demand for
accessible technology. This can be achieved through ongoing research in transfer learning,
unsupervised learning, and data augmentation. Additionally, there is a need for further
investment in collecting and preserving linguistic data for low-resourced languages and
developing phonological studies for these languages. With these efforts, we can ensure
that speech synthesis technology is accessible to everyone, regardless of their language.



PART 1 : BACKGROUND AND RELATED WORKS




Chapter 1. Overview of speech synthesis and speech
synthesis for low-resourced language
This section presents a concise overview of Text-to-Speech (TTS) synthesis and its
application to low-resourced languages. It highlights the challenges faced in developing TTS
systems for languages with limited resources and data. Additionally, it introduces various
approaches and techniques to address these challenges and improve TTS quality for low-resourced languages.
1.1. Overview of speech synthesis
This section offers a brief introduction to the field of speech synthesis. It highlights the
key concepts and techniques in converting written text into spoken language. It also provides
a foundation for understanding the complexities and challenges of developing speech
synthesis systems.
1.1.1. Overview
Speech synthesis is the artificial generation of human speech using technology. A
computer system designed for this purpose, known as a speech computer or speech
synthesizer, can be realized through software or hardware implementations. A text-to-speech (TTS) system specifically converts standard written language text into audible
speech, whereas other systems transform symbolic linguistic representations, such as
phonetic transcriptions, into speech [1]. TTS technology has evolved significantly over the
years, incorporating advanced algorithms and machine learning techniques to produce more
natural-sounding and intelligible speech output. By simulating various aspects of human
speech, including pitch, tone, and intonation, TTS systems strive to provide a seamless and
user-friendly listening experience.
The development of TTS technology has undergone remarkable progress over time:
• In the 1950s, pioneers like Homer Dudley with his "VODER" and
Franklin S. Cooper's "Pattern Playback" initiated the foundation for
modern TTS systems.
• The 1960s brought forth formant-based synthesis, utilizing models of

vocal tract resonances to produce speech sounds.
• The 1970s introduced linear predictive coding (LPC), enhancing speech
signal modeling and producing more natural synthesized speech.
• The 1980s saw the emergence of concatenative synthesis, a method that
combined pre-recorded speech segments for the final output.
• During the 1990s, unit selection synthesis became popular, using
extensive databases to select the best-fitting speech units for more natural
output.



• The 2000s experienced the rise of statistical parametric synthesis
techniques, such as Hidden Markov Models (HMMs), providing a data-driven and adaptable approach to TTS.
• The 2010s marked the beginning of deep learning-based TTS with models
like Google's WaveNet, revolutionizing speech synthesis by generating
raw audio waveforms instead of relying on traditional signal processing.
• End-to-end neural TTS systems like Tacotron streamlined the TTS
process by directly converting text to speech without intermediate stages.
• Transfer learning and multilingual TTS models have recently enabled the
development of high-quality TTS systems for low-resourced languages,
expanding the reach of TTS technology.
• Today, TTS plays a vital role in everyday life, powering virtual assistants,
accessibility tools, and various digital content types.
Some current applications of text-to-speech (TTS) technology include:
• Assistive technology for the visually impaired: TTS systems help blind
and visually impaired individuals by reading text from books, websites,
and other sources, converting it into audible speech.
• Learning tools: TTS systems are used in computer-aided learning
programs, aiding language learners and students with reading difficulties

or dyslexia by providing auditory reinforcement.
• Voice output communication aids: TTS technology assists individuals
with severe speech impairments by enabling them to communicate
through synthesized speech.
• Public transportation announcements: TTS provides automated
announcements for passengers on buses, trains, and other public
transportation systems.
• E-books and audiobooks: TTS systems can read electronic books and
generate audiobooks, making content accessible to a broader audience.
• Entertainment: TTS technology is utilized in video games, animations,
and other forms of multimedia entertainment to create realistic and
engaging voiceovers.
• Email and messaging: TTS systems can read emails, text messages, and
other written content aloud, helping users stay connected and informed.
• Call center automation: TTS is employed in automated phone systems,
allowing users to interact with voice-activated menus and complete
transactions through spoken commands.
• Virtual assistants: TTS is a crucial component of popular voice-activated
virtual assistants like Apple's Siri, Google Assistant, and Amazon's
Alexa, enabling them to provide spoken responses to user queries.



• Voice search applications: By integrating TTS with speech recognition,
users can use speech as a natural input method for searching and
retrieving information through voice search apps.
In conclusion, TTS technology has come a long way since its inception, with continuous
advancements in algorithms, machine learning, and deep learning techniques. As a result,
TTS systems now provide more natural-sounding and intelligible speech, enhancing the user

experience across various applications such as assistive technology, learning tools,
entertainment, virtual assistants, and voice search. The ongoing development and integration
of TTS into our daily lives will continue to shape the future of human-computer interaction
and digital accessibility.
1.1.2. TTS architecture
Figure 1.1. Basic system architecture of a TTS system [22]

The architecture of a TTS system is generally composed of several components, as
depicted in Figure 1.1. The Text Processing component is responsible for preparing the input
text for speech synthesis. The G2P Conversion component converts the written words into
their corresponding phonetic representations. The Prosody Modeling component adds
appropriate intonation, duration, and other prosodic features to the phonetic sequence.
Lastly, the Speech Synthesis component generates the speech waveform based on the
parameters derived from the fully tagged phonetic sequence [2].
Text processing is crucial for identifying and interpreting all textual or linguistic
information that falls outside the realms of phonetics and prosody. Its primary function is to
transform non-orthographic elements into words that can be spoken aloud. Through text
normalization, symbols, numbers, dates, abbreviations, and other non-orthographic text
elements are converted into a standard orthographic transcription, facilitating subsequent
phonetic conversion. Additionally, analyzing whitespace, punctuation, and other delimiters
is vital for determining document structure and providing context for all subsequent steps.
Certain text structure elements may also directly impact prosody. Advanced syntactic and
semantic analysis can be achieved through effective text-processing techniques [2, p. 682].
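
To make the normalization step concrete, here is a minimal sketch in Python. The rule tables are hypothetical stand-ins invented for illustration, not the rules of any system described in this thesis; a production front end would cover numbers, dates, currency, and many more abbreviation classes.

```python
import re

# Minimal, illustrative text-normalization pass (hypothetical rules):
# expand a few abbreviations and spell out digits so that everything
# left over is speakable text ready for phonetic conversion.
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
                "4": "four", "5": "five", "6": "six", "7": "seven",
                "8": "eight", "9": "nine"}
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

def normalize(text: str) -> str:
    # Expand abbreviations first, so their periods are not later
    # mistaken for sentence boundaries.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell out each digit; a full system would handle cardinals,
    # ordinals, dates, and currency with dedicated rules.
    text = re.sub(r"\d", lambda m: " " + NUMBER_WORDS[m.group()] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Dr. Smith lives at 221B Baker St."))
# -> Doctor Smith lives at two two one B Baker Street
```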
The phonetic analysis aims to transform orthographic symbols of words into phonetic
representations, complete with any diacritic information or lexical tones present in tonal
languages. Although future TTS systems might rely on word-sounding units and possess
increased storage capacity, homograph disambiguation and grapheme-to-phoneme (G2P)
conversion for new words remain essential for accurate pronunciation of every word. G2P
conversion is relatively straightforward in languages with a clear relationship between
written and spoken forms. A small set of rules can effectively describe this direct correlation,
which is characteristic of phonetic languages such as Spanish and Finnish. Conversely,
English is not a phonetic language due to its diverse origins, resulting in less predictable
letter-to-sound relationships. In these cases, employing general letter-to-sound rules and
dictionary lookups can facilitate the conversion of letters to sounds, enabling the correct
pronunciation of any word [2, p. 683].
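As an illustration of the dictionary-plus-rules strategy just described, here is a deliberately tiny Python sketch; the lexicon, the one-letter-per-phone rules, and the phone symbols are all hypothetical simplifications.

# Hypothetical miniature lexicon and letter-to-sound rules.
LEXICON = {"cat": ["K", "AE", "T"], "read": ["R", "IY", "D"]}
LETTER_TO_SOUND = {"c": "K", "a": "AE", "t": "T", "d": "D",
                   "o": "OW", "g": "G", "r": "R", "e": "EH"}

def g2p(word):
    # Dictionary lookup first: known words get their exact pronunciation.
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Fallback: naive one-letter rules (skips letters without a rule).
    return [LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND]

print(g2p("cat"))  # ['K', 'AE', 'T'] (lexicon hit)
print(g2p("dog"))  # ['D', 'OW', 'G'] (rule fallback)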
In TTS systems, prosodic analysis involves examining prosodic features within the text
input, such as stress, duration, pitch, and intensity. This information is then utilized to
generate more natural and expressive speech. Prosodic analysis helps determine the
appropriate stress, intonation, and rhythm for the synthesized speech, resulting in a more
human-like output. Predicting prosodic features can be achieved through rule-based or
machine-learning methods, including acoustic modeling and statistical parametric speech
synthesis. By adjusting the synthesized speech, TTS systems can convey various emotions
or speaking styles, enhancing their versatility and effectiveness across diverse applications.
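A minimal rule-based sketch of this prediction step is shown below; the base duration, the lengthening factor for stressed positions, and the linear F0 declination are invented values used only to illustrate the idea.

BASE_DURATION_MS = 80.0

def predict_prosody(phones, stressed, f0_start=220.0, f0_end=140.0):
    # Assign each phone a duration and an F0 target: stressed positions
    # are lengthened, and F0 declines linearly across the utterance.
    n = max(len(phones) - 1, 1)
    prosody = []
    for i, phone in enumerate(phones):
        duration = BASE_DURATION_MS * (1.5 if i in stressed else 1.0)
        f0 = f0_start + (f0_end - f0_start) * i / n
        prosody.append((phone, duration, round(f0, 1)))
    return prosody

print(predict_prosody(["K", "AE", "T"], stressed={1}))
# [('K', 80.0, 220.0), ('AE', 120.0, 180.0), ('T', 80.0, 140.0)]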
Speech synthesis uses the information encoded in the fully tagged phonetic
sequence to generate the corresponding speech waveform. Broadly, the two traditional speech
synthesis techniques are concatenative and source/filter synthesizers. Concatenative
synthesizers assemble pre-recorded human speech components to produce the desired
utterance. In contrast, source/filter synthesizers create synthetic voices using a source/filter
model based on a parametric description of speech. The source/filter method requires
algorithms capable of generating high-quality speech from the parametric representation of
the input text and the speech parameters. Meanwhile, the concatenative approach requires a
combination of algorithms and signal processing adjustments to ensure smooth and
continuous speech, particularly at the junctures between units.
Several improvements have been proposed for high-quality text-to-speech (TTS)
systems, drawing from these two fundamental speech synthesis techniques. Among the most
prominent state-of-the-art methods are statistical parametric speech synthesis and unit-selection
techniques, which have been the subject of extensive debate among researchers in
the field.

Figure 1.2. Neural TTS architecture [3]

With the advancement of deep learning, neural network-based TTS (neural TTS) systems
have been proposed, utilizing (deep) neural networks as the core model for speech synthesis.
A neural TTS system comprises three fundamental components: a text analysis module, an
acoustic model, and a vocoder. As illustrated in Figure 1.2, the text analysis module
transforms a text sequence into linguistic features. The acoustic model then generates
acoustic features from these linguistic features, and finally, the vocoder synthesizes the
waveform from the acoustic features.
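This three-component pipeline can be sketched as plain function interfaces, as below; the stage bodies are random stand-ins for trained networks, and all shapes (80 mel bins, 5 frames per symbol, 256 samples per frame) are arbitrary assumptions chosen only to make the data flow visible.

import numpy as np

def text_analysis(text):
    # Text -> linguistic features (here: a toy symbol sequence).
    return [ch for ch in text.lower() if ch.isalpha()]

def acoustic_model(linguistic_features):
    # Linguistic features -> acoustic features (e.g. a mel-spectrogram).
    n_frames = 5 * len(linguistic_features)
    return np.random.rand(n_frames, 80)

def vocoder(acoustic_features):
    # Acoustic features -> waveform samples.
    return np.random.uniform(-1.0, 1.0, size=len(acoustic_features) * 256)

waveform = vocoder(acoustic_model(text_analysis("hello")))
print(waveform.shape)  # (6400,): 5 symbols x 5 frames x 256 samples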
1.1.3. Evolution of TTS methods over time
The evolution of TTS methods has progressed significantly over time, with
advancements in technology and research contributing to more natural and intelligible
speech synthesis. Early TTS systems relied on rule-based methods and simple concatenation
techniques, which have since evolved into sophisticated machine learning approaches,
including neural network-based TTS systems. These modern systems offer improved speech
quality, prosody, and adaptability, resulting in more versatile applications across various
industries.
1.1.3.1. TTS using unit-selection method
The unit-selection approach allows new, natural-sounding utterances to be created
by picking relevant sub-word units from a natural speech database [4], based on how well a
chosen unit matches a specification (a target unit) and how well two chosen units join
together. During synthesis, an algorithm chooses one unit from the available options to
discover the best overall sequence of units that meets the specification [1]. The specification
and the units are described by a feature set that includes linguistic and speech elements. This
feature set is used in a Viterbi-style search to determine the sequence of units with the
lowest total cost.
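The search itself can be written compactly; the sketch below runs a Viterbi-style pass over hypothetical candidate units, where target_cost and concat_cost stand in for the real feature-based cost functions.

def select_units(candidates, target_cost, concat_cost):
    # candidates[i] lists the unit ids available for target position i.
    # best[i][u] = (lowest cost of a path ending in u at position i, backpointer)
    best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            prev, cost = min(
                ((p, best[i - 1][p][0] + concat_cost(p, u))
                 for p in candidates[i - 1]),
                key=lambda pair: pair[1])
            layer[u] = (cost + target_cost(i, u), prev)
        best.append(layer)
    # Trace back from the cheapest final unit.
    u = min(best[-1], key=lambda x: best[-1][x][0])
    path = [u]
    for i in range(len(candidates) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return path[::-1]

# Tiny example with made-up costs: the optimal sequence is ['u1', 'v2'].
cands = [["u1", "u2"], ["v1", "v2"]]
tcost = lambda i, u: {"u1": 0.1, "u2": 0.4, "v1": 0.5, "v2": 0.2}[u]
ccost = lambda a, b: 0.0 if (a, b) == ("u1", "v2") else 0.3
print(select_units(cands, tcost, ccost))  # ['u1', 'v2']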
Although the two are theoretically quite similar, the review of Zen [4] distinguishes
two fundamental methods in unit-selection synthesis: (i) the selection model [5], shown in
Figure 1.3a; and (ii) the clustering approach [6], shown in Figure 1.3b, which effectively enables
the target cost to be pre-calculated. The latter method asks questions about features
available at the time of synthesis and groups units of the same type into a decision tree.

Figure 1.3. General and clustering-based unit-selection scheme: Solid lines represent
target costs and dashed lines represent concatenation costs [13]

In the selection model for TTS synthesis, speech units are chosen based on a cost function
calculated in real time during the synthesis process. This cost function considers the acoustic
and linguistic similarity between the target text and available speech units in the database,
selecting the unit with the lowest cost for synthesis. Conversely, the clustering approach pre-calculates the cost for each speech unit, grouping similar units into a decision tree. This tree
allows for rapid speech unit selection during synthesis based on available features, reducing
the real-time computation and resulting in faster, more efficient TTS synthesis. Both
methods have their advantages and disadvantages, with the selection model offering greater
flexibility for adapting to different languages and voices and the clustering approach
providing enhanced speed and efficiency. The choice between these methods depends on
the specific needs of the TTS system being developed.
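The decision-tree lookup that makes the clustering approach fast can be illustrated with a toy tree, as below; the questions, unit identifiers, and tree shape are all invented for the example.

# Questions about features known at synthesis time route each target
# to a pre-built cluster of candidate units.
TREE = {
    "question": "is_vowel",
    "yes": {"question": "is_stressed",
            "yes": {"cluster": ["V_str_01", "V_str_02"]},
            "no": {"cluster": ["V_uns_01"]}},
    "no": {"cluster": ["C_01", "C_02", "C_03"]},
}

def lookup(node, features):
    # Descend the tree by answering its questions; return the leaf cluster.
    while "cluster" not in node:
        node = node["yes"] if features.get(node["question"]) else node["no"]
    return node["cluster"]

print(lookup(TREE, {"is_vowel": True, "is_stressed": False}))  # ['V_uns_01']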
1.1.3.2. Statistical parametric speech synthesis
In a typical statistical parametric speech synthesis system, a set of generative models is
used to model the parametric speech representations extracted from a speech database,
including spectral and excitation parameters (also known as vocoder parameters, since they
serve as the inputs of the vocoder). The model parameters are frequently estimated using the
Maximum Likelihood (ML) criterion. Speech parameters are then generated from the
estimated models for a specific word sequence to be synthesized, so as to maximize their
output probabilities. Finally, a speech waveform is built from the parametric representations
of speech [4].
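In symbols, this two-step scheme can be sketched as follows (a standard formulation consistent with [4]; here O and W denote the speech parameters and labels of the training data, \lambda the model parameters, o the speech parameters to be generated, and w the labels of the text to be synthesized):

\hat{\lambda} = \arg\max_{\lambda} \, p(O \mid W, \lambda)

\hat{o} = \arg\max_{o} \, p(o \mid w, \hat{\lambda})

The first equation is the ML training step; the second is the parameter generation step, whose output \hat{o} is handed to the vocoder to build the waveform.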
Any generative model can be employed; however, HMMs are by far the most widely used.
In HMM-based speech synthesis (HTS) [7], context-dependent HMMs statistically model and
produce the speech parameters of a speech unit, such as the spectrum and excitation
parameters (for example, the fundamental frequency, F0). A typical HMM-based speech
synthesis system's core architecture, as shown in Figure 1.4 [8], consists of two main
processes: training and synthesis.

Figure 1.4. Core architecture of HMM-based speech synthesis system [25]
During training, ML estimation (MLE) is carried out with the Expectation-Maximization
(EM) algorithm, much as in HMM-based speech recognition. The primary distinction is that
both excitation and spectrum parameters are extracted from a database of natural speech and
modeled by a collection of multi-stream context-dependent HMMs. The excitation parameters
include log F0 and its dynamic properties.
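As a small illustration of such excitation features, the sketch below stacks log F0 with its first- and second-order dynamic (delta) features; the F0 values are made up, and np.gradient is used here only as a convenient stand-in for the usual delta-coefficient regression windows.

import numpy as np

f0 = np.array([120.0, 125.0, 130.0, 128.0])  # a short voiced F0 track in Hz
log_f0 = np.log(f0)
delta = np.gradient(log_f0)    # first-order dynamics
delta2 = np.gradient(delta)    # second-order dynamics
features = np.column_stack([log_f0, delta, delta2])
print(features.shape)  # (4, 3): static + two dynamic features per frame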
Another distinction is the addition of prosodic and linguistic contexts to the phonetic
contexts (together called contextual features). The state-duration distribution of each HMM
is also used to describe the temporal structure of speech. The Gamma distribution and the Gaussian