Research on speech synthesis for low-resourced languages using an adaptation approach, applied to the Muong language

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Pham Van Dong

SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGES
BASED ON ADAPTATION APPROACH: APPLICATION TO
MUONG LANGUAGE

DOCTORAL DISSERTATION IN
COMPUTER SCIENCE

Major: Computer science
Code: 9480101

ADVISORS:
1. Dr. MAC DANG KHOA
2. Assoc. Prof. TRAN DO DAT

Ha Noi – 2023

DECLARATION OF AUTHORSHIP
I, Pham Van Dong, declare that the dissertation titled “Speech Synthesis for
Low-Resourced Languages based on Adaptation Approach: Application to Muong
Language” has been entirely composed by myself. I confirm the following points:
• This work was done wholly or mainly while in candidature for a Ph.D.
research degree at Hanoi University of Science and Technology.
• The work has not been submitted for any other degree or qualification at
Hanoi University of Science and Technology or any other institution.
• Appropriate acknowledgment has been given within this dissertation
where reference has been made to the published work of others.
• The dissertation submitted is my own, except where collaborative work
has been included. The collaborative contributions have been
indicated.
Hanoi, September 19, 2023
Ph.D. Student

Pham Van Dong
ADVISORS
1. Dr. Mac Dang Khoa

2. Assoc. Prof. Tran Do Dat

ACKNOWLEDGMENT
Foremost, I would like to express my most sincere and deepest gratitude to my
thesis advisors, Dr. Mạc Đăng Khoa (Speech Communication Department, MultiLab at
MICA) and Assoc. Prof. Trần Đỗ Đạt (The Ministry of Science and Technology, Vietnam),
for their continuous support and guidance during my Ph.D. program, and for providing
me with such a rigorous and inspiring research environment. I am grateful to Dr. Mạc
Đăng Khoa for his excellent mentorship, care, patience, and immense knowledge of
text-to-speech (TTS). His advice helped me throughout the research and writing of
this thesis. I am very thankful to Prof. Đạt for shaping my thesis at the beginning
and for his enthusiasm and encouragement. He substantially facilitated my Ph.D.
research, especially when I was new to speech processing and TTS, with his
valuable comments on Vietnamese and Muong TTS.
I thank all MICA members for their help during my Ph.D. study. My sincere thanks go
to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Dr. Do Thi Ngoc Diep for
giving me much support and valuable advice. Thanks to Nguyen Van Thinh, Nguyen
Tien Thanh, Dang Thanh Mai, and Vu Thi Hai Ha for their help. I want to thank my
colleagues at Hanoi University of Mining and Geology for all their support during my
Ph.D. study. Special thanks to my family for understanding my hours glued to the
computer screen.
Hanoi, September 19, 2023
Ph.D. Student


ABSTRACT
Text-to-speech (TTS) synthesis is the automatic conversion of text into speech. Typically,
building a high-quality voice requires collecting tens of hours of speech from a professional
speaker, recorded with a high-quality microphone. About 7,000 languages are spoken worldwide,
but only a few, such as English, Spanish, Mandarin, and Japanese, have high-quality TTS
systems. So-called "low-resourced languages", and languages that do not yet have a writing
system, do not have TTS. Thus, to apply TTS technology to low-resourced languages, it is
necessary to study other TTS methods.
In Vietnam, Vietnamese is the mother tongue of most people and the most widely used language.
Muong is a group of dialects spoken by the Muong people of Vietnam. These dialects belong to
the Austroasiatic language family and are closely related to Vietnamese. The Muong are also
one of the five most populous ethnic groups in Vietnam. However, Muong still lacks an official
script, which makes it a typical representative of low-resourced languages in Vietnam.
Therefore, building TTS for the Muong language is challenging.
In the first part of this thesis, we give an overview of TTS and study the phonetics of the
Vietnamese and Muong languages. We have also developed and published some tools to support
TTS technology for both languages. In the rest of the thesis, we conduct various experiments
in creating TTS for low-resourced languages, specifically for the Muong language. We consider
two main groups of low-resourced languages:
• Written: We emulate the reading of Muong text using a Vietnamese TTS system, and we
apply cross-lingual transfer learning for adaptation.
• Unwritten: We experiment with adaptation in two directions. The first creates
Muong speech synthesis directly from Vietnamese text and Muong voice. The
second creates Muong speech synthesis from translation through an intermediate
representation.
We hope our findings can serve as an impetus to develop speech synthesis for low-resourced
languages worldwide and contribute to the basis for speech synthesis development for the 53
ethnic minority languages in Viet Nam.
Hanoi, September 19, 2023
Ph.D. Student


CONTENTS
DECLARATION OF AUTHORSHIP
ACKNOWLEDGMENT
ABSTRACT
CONTENTS
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
PART 1: BACKGROUND AND RELATED WORKS
CHAPTER 1. OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGE
1.1. Overview of speech synthesis
1.1.1. Overview
1.1.2. TTS architecture
1.1.3. Evolution of TTS methods over time
1.1.3.1. TTS using unit-selection method
1.1.3.2. Statistical parametric speech synthesis
1.1.3.3. Speech synthesis using deep neural networks
1.1.3.4. Neural speech synthesis
1.2. Speech synthesis for low-resourced languages
1.2.1. TTS using emulating input approach
1.2.2. TTS using the polyglot approach
1.2.3. Speech synthesis for low-resourced language using the adaptation approach
1.3. Machine translation
1.3.1. Neural translation model
1.3.2. Attention in neural machine translation
1.3.3. Statistical machine translation based on phrase
1.3.3.1. Statistical machine translation problem based on phrase
1.3.3.2. Translation model and language model
1.3.3.3. Decode the input sentence in the translation system
1.3.3.4. Model for building a statistical translation system
1.3.4. Machine translation through intermediate representation
1.3.5. Speech translation for unwritten low-resourced languages
1.4. Speech synthesis evaluation metrics
1.4.1. Mean Opinion Score (MOS)
1.4.1.1. Definition
1.4.1.2. Formula
1.4.1.3. Significance
1.4.1.4. Confidence Interval (CI)
1.4.2. Mel Cepstral Distortion (MCD)
1.4.2.1. Concept
1.4.2.2. Formula
1.4.2.3. Significance
1.4.2.4. MCD with Dynamic Time Warping (MCD-DTW)
1.4.3. Analysis of variance (ANOVA)
1.4.4. Intelligibility
1.5. Conclusion
CHAPTER 2. VIETNAMESE AND MUONG LANGUAGE
2.1. Vietnamese language
2.1.1. History of Vietnamese
2.1.2. Vietnamese phonetic system
2.1.2.1. Vietnamese syllable structure
2.1.2.2. Vietnamese phonetic system
2.1.2.3. Vietnamese tone system
2.2. Muong language
2.2.1. Overview of Muong people and Muong language
2.2.1.1. Muong history
2.2.1.2. Viet Muong group
2.2.1.3. Muong dialects
2.2.1.4. Muong written script
2.2.2. Muong phonetics system
2.2.2.1. Muong syllable structure
2.2.2.2. Muong phoneme system
2.2.2.3. Muong tone system
2.3. Comparison between Vietnamese and Muong
2.4. Discussion and proposed approach
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
CHAPTER 3. EMULATING OF THE MUONG TTS BASED ON INPUT TRANSFORMATION OF THE VIETNAMESE TTS
3.1. Proposed method
3.1.1. Muong G2P module
3.1.2. Muong emulating IPA module
3.2. Experiment
3.2.1. Testing materials
3.2.2. Experiment protocol
3.2.3. Results
3.2.4. Analysis by ANOVA method
3.2.4.1. MOS analysis by ANOVA
3.2.4.2. Intelligibility analysis by ANOVA
3.3. Conclusion
CHAPTER 4. CROSS-LINGUAL TRANSFER LEARNING FOR MUONG SPEECH SYNTHESIS
4.1. Proposed method
4.2. Experiment
4.2.1. Dataset
4.2.1.1. Vietnamese data
4.2.1.2. Muong Project's data
4.2.1.3. Muong fine-tuning data
4.2.2. Graphemes to phonemes
4.2.3. Training the pretrained model using Vietnamese dataset
4.2.4. Finetuned TTS model on Muong datasets
4.3. Evaluation
4.4. MOS analysis by ANOVA
4.5. Conclusion
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
CHAPTER 5. GENERATE UNWRITTEN LOW-RESOURCED LANGUAGE’S SPEECH DIRECTLY FROM RICH-RESOURCE LANGUAGE’S TEXT
5.1. Introduction
5.2. Proposed method
5.2.1. Model architecture
5.2.2. Database
5.2.3. Training the speech synthesis system
5.2.4. Evaluation
5.2.5. MOS analysis by ANOVA
5.2.5.1. ANOVA analysis in Muong Bi speech synthesis
5.2.5.2. ANOVA analysis in Muong Tan Son speech synthesis
5.3. Conclusion
CHAPTER 6. SPEECH SYNTHESIS FOR UNWRITTEN LOW-RESOURCED LANGUAGE USING INTERMEDIATE REPRESENTATION
6.1. Proposed method
6.2. Experiment
6.2.1. Database building
6.2.2. System development
6.2.2.1. Text to phone translation
6.2.2.2. Phone to Sound Conversion
6.3. Evaluation
6.3.1. Evaluation in Muong Bi and Muong Tan Son
6.3.2. MOS analysis by ANOVA
6.3.2.1. ANOVA analysis in Muong Bi speech synthesis
6.3.2.2. ANOVA analysis in Muong Tan Son speech synthesis
6.4. Conclusion and comparison
CONCLUSION AND FUTURE WORKS
Conclusions
Future work
PUBLICATIONS
BIBLIOGRAPHY
APPENDIX A
A.1. Vietnamese and Muong phonetics
A.2. Muong G2P
A.3. Muong Vietnamese phone mapping
A.4. Information of Muong volunteers who participated in the assessment
A.5. Speech signal samples of the Muong TTS in chapter 5

ABBREVIATIONS
Abbreviation  Expansion                                               Explanation
CART          Classification And Regression Tree
F0            Fundamental Frequency
HMM           Hidden Markov Model
HTK           Hidden Markov Model ToolKit                             A portable toolkit for building and manipulating hidden Markov models
HTS           HMM-based speech synthesis
IPA           International Phonetic Alphabet
MARY (TTS)    Modular Architecture for Research on speech sYnthesis
MFCC          Mel Frequency Cepstral Coefficients
ML            Maximum Likelihood
MLSA          Mel Log Spectrum Approximation
MOS           Mean Opinion Score
MSDHMM        Multi-Space probability Distribution HMM
NLP           Natural Language Processing
OCR           Optical Character Recognition
POS           Part-Of-Speech                                          Word class or a lexical category
PP            Prepositional Phrase
PSOLA         Pitch Synchronous OverLap and Add
SAMPA         Speech Assessment Methods Phonetic Alphabet
SPTK          Speech signal Processing ToolKit
SSML          Speech Synthesis Markup Language
TDPSOLA       Time-Domain Pitch Synchronous OverLap and Add
TTS           Text-To-Speech
VNSP          VNSpeechCorpus for synthesis
WEKA          Waikato Environment for Knowledge Analysis              A collection of machine learning algorithms for data mining tasks
X-SAMPA       Extended Speech Assessment Methods Phonetic Alphabet
XML           eXtensible Markup Language
p(e | f)      Conditional Probability
Π             Pi                                                      Product of a sequence of numbers
Σ             Sigma Factor

LIST OF TABLES
Table 2.1 Vietnamese syllable structure [94]
Table 2.2 Vietnamese syllable structure [96]
Table 2.3 Vietnamese syllables based on structure
Table 2.4 Hanoi Vietnamese initial consonants
Table 2.5 The letter of initial consonant
Table 2.6 Hanoi Vietnamese final consonant
Table 2.7 Tone of Hanoi Vietnamese [108]
Table 2.8 Muong syllabic structure
Table 2.9 Muong final sound system
Table 2.10 Muong Hoa Binh tone system [115]
Table 2.11 Muong Bi and Muong Tan Son Tone
Table 2.12 Muong and Vietnamese phonetic comparison (orthography in normal, IPA in italic; Vi: Vietnamese; Mb: Muong Bi; Mts: Muong Tan Son)
Table 2.13 Comparing the tone of Vietnamese with Muong Tan Son and Muong Bi
Table 3.1 Muong G2P Result Sample
Table 3.2 Examples of applying transformation rules to convert the Muong text into input text for Vietnamese TTS
Table 3.3. Testing material for emulating tone
Table 3.4. Testing material for emulating phone (the concerning phonemes in bold)
Table 3.5. Testing material for remaining phonemes
Table 3.6 ANOVA Results for MOS Test
Table 3.7 ANOVA Results for Intelligibility Test
Table 4.1 Parameters of acoustic model
Table 4.2 Vietnamese dataset information
Table 4.3 Muong recorded data
Table 4.4 The Muong split data set
Table 4.5 Parameter for optimizer
Table 4.6 Value of parameters when training HiFi-GAN model
Table 4.7 The specifications of the in-domain and out-domain test sets
Table 4.8 Test set samples
Table 4.9 Evaluation results
Table 4.10 ANOVA Results for in-domain MOS Test
Table 4.11 ANOVA Results for out-domain MOS Test
Table 4.12 ANOVA Results for in/out domain MOS Test
Table 5.1 Evaluation Score
Table 5.2 TTS evaluation with in-domain test set
Table 5.3 TTS evaluation with out-domain test set
Table 5.4 ANOVA Results for in-domain MOS Test for Muong Bi
Table 5.5 ANOVA Results for out-domain MOS Test for Muong Bi
Table 5.6 ANOVA Results for Muong Bi in/out domain MOS Test
Table 5.7 ANOVA Results for in-domain MOS Test for Muong Tan Son
Table 5.8 ANOVA Results for out-domain MOS Test for Muong Tan Son
Table 5.9 ANOVA Results for Muong Tan Son in/out domain MOS Test
Table 6.1 Examples of labeling Vietnamese text into an intermediate representation of Muong Bi and Muong Tan Son phonemes
Table 6.2 Text information of Muong language datasets
Table 6.3 TTS evaluation with in-domain test set
Table 6.4 TTS evaluation with out-domain test set
Table 6.5 ANOVA Results for in-domain MOS Test for Muong Bi
Table 6.6 ANOVA Results for out-domain MOS Test for Muong Bi
Table 6.7 ANOVA Results for Muong Bi in/out domain MOS Test
Table 6.8 ANOVA Results for in-domain MOS Test for Muong Tan Son
Table 6.9 ANOVA Results for out-domain MOS Test for Muong Tan Son
Table 6.10 ANOVA Results for Muong Tan Son in/out domain MOS Test
Table A.1 Vietnamese vowels
Table A.2 The Muong initial consonant
Table A.3 Muong vowels system
Table A.4 Correspondences between Vietnamese and Muong in 12 words referring to human body parts [137]
Table A.7 Muong G2P
Table A.8 Muong Vietnamese phone mapping
Table A.9 Muong Hoa Binh volunteers
Table A.10 Muong Phu Tho volunteers

LIST OF FIGURES
Figure 1.1. Basic system architecture of a TTS system [22]..............................................8
Figure 1.2 Neural TTS architecture [3]..........................................................................9
Figure 1.3. General and clustering-based unit-selection scheme: Solid lines represent target
costs and dashed lines represent concatenation costs [13]
.....................................................................................................................................
10
Figure 1.4. Core architecture of HMM-based speech synthesis system [25].....................11
Figure 1.5. General HMM-based synthesis scheme [13, p. 5].........................................12
Figure 1.6. A speech synthesis framework based on a DNN [29]....................................13
Figure 1.7 Encoder and Decoder diagram in Seq2Seq model.........................................14
Figure 1.8 Char2Wav model [23]................................................................................17
Figure 1.9 Model of the Tacotron synthesis system [24].................................................18
Figure 1.10 Block diagram of the Tacotron 2 system architecture [25]............................19
Figure 1.11 Scheme of a HMM-based polyglot synthesizer [48].....................................23
Figure 1.12 Approaches to transfer TTS model from source language to target language
[32]
.....................................................................................................................................
26
Figure 1.13 Examples of sequence to sequence transformation [55]................................28
Figure 1.14 Describe the location of the Attention model in neural machine translation
.....................................................................................................................................
29
Figure 1.15 Example of translating an English input sentence into Chinese based on the
phrase
.....................................................................................................................................

31
Figure 1.16 Illustrate the process of translating a Spanish sentence into an English sentence
[63]
.....................................................................................................................................
33
Figure 1.17 Deploying a statistical translation system [67]............................................34
Figure 1.18 An ordinary voice translation system [11]...................................................35
Figure 1.19 Model of the speech-to-speech machine translation system using intermediate
representation for unwritten language
.....................................................................................................................................
36
Figure 1.20 Voice-to-text translation system [83]..........................................................37
Figure 2.1 Mon-Khmer branch of the Austroasiatic family [109, pp. 175–176]
Figure 2.2 Viet-Muong Group [110]
Figure 2.3 The distribution of the Muong dialects [114, p. 299]
Figure 3.1 Emulating TTS for Muong
Figure 3.2 Muong G2P Module
Figure 3.3 Intelligibility Results for Muong emulating tones
Figure 3.4 Intelligibility Test Result for emulating close phonemes
Figure 3.5 Intelligibility Test Result for Equivalent phonemes
Figure 3.6 MOS Emulating Test Result
Figure 4.1 Low-resourced L2 TTS transfer learning from rich resource L1
Figure 4.2 Block diagram of the speech synthesis system architecture
Figure 4.3 Duration histogram
Figure 4.4 Duration distribution across the M_15m, M_30m, and M_60m datasets
Figure 4.5 Training loss and validation loss of pretrained TTS model
Figure 4.6 Training loss and validation error of HiFi-GAN model
Figure 4.7 Training loss and validation loss of M_15m
Figure 4.8 Training loss and validation loss of M_30m and M_60m
Figure 5.1 System architecture
Figure 5.2 WaveGlow model architecture [136]
Figure 5.3 Muong Phu Tho training loss and validation loss after training acoustic model
Figure 5.4 Muong Hoa Binh training loss and validation loss after training acoustic model
Figure 5.5 Testing interface
Figure 6.1 Training phase TTS L1 text to L2 speech system uses intermediate representation of phoneme level
Figure 6.2 Decoding phase TTS L1 text to L2 speech system uses intermediate representation of phoneme level
Figure 6.3 The result after manual annotation
Figure 6.4 Phone to sound module, as a speech synthesis from phone sequence
Figure 6.5 Muong Hoa Binh Training loss and validation loss after training acoustic model
Figure 6.6 Muong Phu Tho Training loss and validation loss after training acoustic model
Figure 6.7 Testing interface
Figure 6.8 Comparing the synthesized speech results on Muong Hoa Binh using three methods
Figure 6.9 Comparing the synthesized speech results on Muong Hoa Binh using three methods
Figure 6.10 Comparing the synthesized speech results on Muong Phu Tho using two methods
Figure 6.11 Comparing the synthesized speech results on Muong Phu Tho using two methods
Figure 6.12 Summary of direction for low-resourced language speech synthesis
Figure A.1 Raw Muong Hoa Binh: ban vận động thành lập hội trí thức tỉnh ra mắt
Figure A.2 Muong Hoa Binh synthesis: ban vận động thành lập hội trí thức tỉnh ra mắt
Figure A.3 Muong Phu Tho raw: ban vận động thành lập hội trí thức tỉnh ra mắt
Figure A.4 Muong Phu Tho synthesis: ban vận động thành lập hội trí thức tỉnh ra mắt
Figure A.5 Muong Hoa Binh raw - Bố cháu ở nhà hay đi đâu
Figure A.6 Muong Hoa Binh synthesis: Bố cháu ở nhà hay đi đâu
Figure A.7 Muong Phu Tho raw: Bố cháu ở nhà hay đi đâu
Figure A.8 Muong Phu Tho synthesis: Bố cháu ở nhà hay đi đâu


INTRODUCTION
Motivation
Speech-processing technology is now essential in many aspects of human-machine
interaction. Many voice interaction systems have been introduced recently, allowing users to
communicate with devices on various platforms, such as smartphones (Apple Siri, Google
Cloud, Amazon Alexa, etc.), intelligent cars (BMW, Ford, etc.), and smart homes. In these
systems, one of the essential components is speech synthesis, or Text-to-Speech (TTS), which
converts input text into speech. Developing a TTS system for a language involves not only
implementing speech processing techniques but also linguistic studies in areas such as
phonetics, phonology, syntax, and grammar.
According to statistics in the 25th edition of Ethnologue (regarded as the most
comprehensive source of information on linguistic statistics), there are 7,151 living languages in
the world, belonging to 141 language families, of which 2,982 languages are not written. Some
languages, such as dialects of ethnic minorities, have not yet been described in the academic
literature.

Machine learning methods based on big data do not immediately apply to low-resourced
languages, especially unwritten ones. The field of low-resourced and unwritten language
processing has started to receive attention only in the past few years and has yet to produce
many results. However, research in this field is important: in addition to bringing voice
communication technologies, and the products that apply them, to ethnic minority communities,
it also contributes to the conservation of endangered languages.
Regarding Vietnamese language and speech processing, domestic research units have given
the field comprehensive attention and addressed various aspects, ranging from natural
language processing problems such as text processing, syntactic component separation, and
semantics to speech processing problems such as synthesis and recognition. However,
language and speech processing in general, including TTS systems, for minority languages
without a writing system in Vietnam has not received much attention, owing to the scarcity of
data sources such as bilingual text data and speech data, as well as a lack of related
linguistic studies.
The Muong language presents linguistic characteristics, such as tonality and complex
phonetic structures, that make developing a TTS system challenging. This thesis therefore
aims to fill this gap by focusing on developing a TTS system for the Muong language, a
minority language spoken in Vietnam that does not have a writing system (only the Muong Hoa
Binh dialect has had a writing system, since 2016). This research area is novel not only in
Vietnam but also worldwide, and the development of a Muong TTS system can contribute to
preserving and promoting this endangered language.
Context and constraints
This thesis will classify low-resourced languages into two categories: written and
unwritten. The Muong language will be the object of study in both cases:
• Written: The Muong dialect of Hoa Binh will be examined, as it possesses a
written form.
• Unwritten: The Muong dialect of Phu Tho will be investigated, as it lacks a written
form.
In other regions, the Muong people currently do not use written language. They often
read directly from Vietnamese text and convert it into Muong speech for broadcasting and
communication purposes. This research aims to address these challenges and improve the
accessibility of TTS technology for both written and unwritten Muong dialects.
Moreover, this thesis is conducted within the scope of, and in collaboration with, the
project DLCN.20/17: "Research and development automatic translation system from
Vietnamese text to Muong speech, apply to unwritten minority languages in Vietnam"
(Nghiên cứu xây dựng hệ dịch tự động văn bản tiếng Việt ra tiếng nói tiếng Mường,
hướng đến áp dụng cho các ngôn ngữ dân tộc thiểu số chưa có chữ viết ở Việt Nam).
Specific components of this project include:
• Recorded speech from both Muong Hoa Binh and Muong Phu Tho dialects.
• A machine translation tool that converts Vietnamese text to an intermediate
representation of the Muong language.
In turn, the research findings of this thesis have been successfully applied and
integrated into the project above, demonstrating the practical value of the work undertaken
in this thesis.
Challenges
Challenges Faced by Current Research:
• Data Scarcity: The foremost challenge is the paucity of training data. TTS
models demand substantial text-speech pairs for effective training. However,
for low-resourced languages, acquiring such data can be exceedingly difficult,
if not impossible.
• Limited Linguistic Knowledge: Inadequate linguistic knowledge hinders TTS
system development. Understanding language structure, vocabulary, and
prosody is crucial, but this knowledge is frequently absent for low-resourced
languages.
• Lack of Linguistic Studies: Linguistic research serves as the backbone for
building TTS systems. Unfortunately, languages with limited resources often
lack comprehensive linguistic studies, making it arduous to capture essential
linguistic characteristics.
To address these challenges, this work proposes an adaptive TTS approach that
efficiently utilizes limited resources to synthesize high-quality speech for Muong, a
low-resourced language. The approach leverages transfer learning techniques from related
languages and applies unsupervised learning methods to reduce the need for extensive
labelled data. In addition, for a written low-resourced language, emulating the input of a
rich-resourced TTS system is a viable option. For an unwritten low-resourced language, one
adaptation is to use the text, or an intermediate representation, of another language to help
build a better TTS system.
The proposed approach demonstrates the effectiveness of adaptive TTS in synthesizing
speech for low-resourced languages. However, further research and investment in linguistic
studies for low-resourced languages are necessary to improve the quality of TTS systems. With
continued efforts, we can develop more robust TTS systems that provide access to speech
synthesis for all languages, regardless of their resource availability.
Objectives & approaches
This thesis aims to develop a Text-to-Speech (TTS) system for low-resourced languages,
focusing on the Muong language, by utilizing adaptation techniques. We categorize
low-resourced languages into two groups, and for each group, we aim to employ suitable
methods to generate TTS:
• Written low-resourced languages: Using input emulation and an adaptive approach
to enhance the available linguistic resources.
• Unwritten low-resourced languages: Employing intermediate representations or
leveraging text from rich-resourced languages to bridge the gap in linguistic
resources.
In this way, the thesis aims to make TTS technology more accessible to low-resourced
languages, thus expanding its applications and fostering communication across diverse
linguistic communities. By focusing on the Muong language as a specific case study, this
research not only contributes to the broader field of low-resourced languages but also opens
doors for practical applications. For instance, it paves the way for the development of
applications catering to the Muong community, including Muong radio broadcasts and
Muong-speaking newspapers, all generated from Vietnamese text. This demonstrates the
real-world impact of the research, showcasing its potential to empower minority languages
like Muong and preserve their cultural heritage.
Contributions
The thesis presents the following key contributions:
• First contribution: A method for synthesizing speech from written text for a
language with limited data, using the Muong language as a specific application
case. This includes (1) an adaptation technique that utilizes the input of a
Vietnamese speech synthesis system (without requiring training data) and (2)
fine-tuning the Vietnamese speech synthesis model with a small amount of Muong
language data.
• Second contribution: A method for synthesizing speech for an unwritten language
using a closely related language with available resources (generating Muong
speech from Vietnamese text). This approach treats the Muong language as if it
were unwritten. The two proposed methods are: (1) employing an intermediate
representation and (2) directly converting Vietnamese text into Muong speech.
In addition to the two main contributions mentioned above, we also conducted a
comparative study of the Vietnamese and Muong languages, drawing several valuable
conclusions for phonetic studies and natural language processing. We have published various
educational materials and tools for processing text and vocabulary in Vietnamese and Muong.
Dissertation outline
The dissertation is composed of three parts and six chapters, organized as follows:
PART 1: BACKGROUND AND RELATED WORK
• Chapter 1, titled "Overview of speech synthesis and speech synthesis for
Low-Resourced Languages": This chapter concisely reviews the existing literature
to gain a comprehensive understanding of TTS. Research directions for
low-resourced TTS are also detailed in this chapter.
• Chapter 2, titled "Vietnamese and Muong Language": This chapter presents
research on the phonology of Vietnamese and Muong languages.
Computational linguistic resources for Vietnamese speech processing are
described in detail as applied in Vietnamese TTS.
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
• Chapter 3, titled "Emulating Muong TTS Based on Input Transformation of
Vietnamese TTS": This chapter presents the proposal to synthesize Muong speech by
adapting existing Vietnamese TTS systems. This approach can be
experimentally applied to quickly create TTS systems for other Vietnamese ethnic
minority languages.
• Chapter 4, titled "Cross-Lingual Transfer Learning for Muong Speech
Synthesis": In this chapter, we use and experiment with approaches for Muong
TTS that leverage Vietnamese resources. We focus on transfer learning by
building a Vietnamese TTS model, further training it with different Muong datasets,
and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
• Chapter 5, titled "Generating Unwritten Low-Resourced Language's Speech
Directly from Rich-resource Language's Text": This chapter presents our approach for
addressing speech synthesis challenges for unwritten low-resourced
languages by synthesizing L2 speech directly from L1 text. The proposed
system is built using end-to-end neural network technology for text-to-speech.
We use Vietnamese as L1 and Muong as L2 in our experiments.
• Chapter 6, titled "Speech synthesis for Unwritten Low-Resourced Languages
Using Intermediate Representation": This chapter proposes using a phoneme
representation due to its close relationship with speech within a single
language. The proposed method is applied to the Vietnamese and Muong
language pair. Vietnamese text is translated into an intermediate representation
of two unwritten dialects of the Muong language: Muong Bi - Hoa Binh and
Muong Tan Son - Phu Tho. The evaluation reveals relatively high translation
quality for both dialects.

In conclusion, speech synthesis for low-resourced languages is a significant research
area with the potential to positively impact the lives of speakers of these languages.
Despite challenges posed by limited data and linguistic knowledge, advancements in
speech synthesis technology and innovative approaches enable the development of
high-quality speech synthesis systems for low-resourced languages. The work presented in this
dissertation contributes to this field by exploring novel methods and techniques for speech
synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech
synthesis for low-resourced languages, particularly in response to the growing demand for
accessible technology. This can be achieved through ongoing research in transfer learning,
unsupervised learning, and data augmentation. Additionally, there is a need for further
investment in collecting and preserving linguistic data for low-resourced languages and
developing phonological studies for these languages. With these efforts, we can ensure that
speech synthesis technology is accessible to everyone, regardless of their language.


