Pham Van Dong
DOCTORAL DISSERTATION IN COMPUTER SCIENCE
Ha Noi – 2023
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Pham Van Dong
1. Dr. MAC DANG KHOA
2. Assoc. Prof. TRAN DO DAT
Ha Noi - 2023
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
I, Pham Van Dong, declare that the dissertation titled "Speech Synthesis for Low-Resourced Languages based on Adaptation Approach: Application to Muong Language" has been entirely composed by myself. I confirm the following points:
This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
Appropriate acknowledgment has been given within this dissertation where reference has been made to the published work of others.
The dissertation submitted is my own, except where work done in collaboration has been included; the collaborative contributions have been clearly indicated.
Supervisors:
1. Dr. Mac Dang Khoa
2. Assoc. Prof. Tran Do Dat
Foremost, I would like to express my most sincere and deepest gratitude to my thesis advisors, Dr. Mạc Đăng Khoa (Speech Communication Department, MultiLab at MICA) and Prof. Trần Đỗ Đạt (The Ministry of Science and Technology, Vietnam), for their continuous support and guidance during my Ph.D. program, and for providing me with such a rigorous and inspiring research environment. I am grateful to Dr. Mạc Đăng Khoa for his excellent mentorship, caring, patience, and immense Text-To-Speech (TTS) knowledge. His advice helped me in all the research and writing of this thesis. I am very thankful to Prof. Đạt for shaping my thesis at the beginning and for his enthusiasm and encouragement. Prof. Trần Đỗ Đạt substantially facilitated my Ph.D. research, especially when I was new to speech processing and TTS, with his valuable comments on Vietnamese and Muong TTS.
I thank all MICA members for their help during my Ph.D. study. My sincere thanks to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien and Dr. Do Thi Ngoc Diep for giving me much support and valuable advice. Thanks to Nguyen Van Thinh, Nguyen Tien Thanh, Dang Thanh Mai, and Vu Thi Hai Ha for their help. I want to thank my Hanoi University of Mining and Geology colleagues for all their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.
Hanoi, December 8, 2023 Ph.D. Student
Text-to-speech (TTS) synthesis is the automatic conversion of text into speech. Typically, building a high-quality voice requires collecting tens of hours of recordings of a professional speaker with a high-quality microphone. There are about 7,000 languages spoken worldwide, but only a few, such as English, Spanish, Mandarin, and Japanese, have high-quality TTS systems. So-called "low-resourced languages", and even languages that are not yet written, do not have TTS at all. Thus, to apply TTS technology to low-resourced languages, it is necessary to study other TTS methods.
In Vietnam, Vietnamese is the mother tongue and the most widely used language. Muong is a group of dialects spoken by the Muong people of Vietnam. It belongs to the Austroasiatic language family and is closely related to Vietnamese, and the Muong are one of the five largest ethnic groups in the country. However, Muong still lacks an official script and is a typical representative of the low-resourced languages in Vietnam. Therefore, researching TTS technologies to create a TTS system for the Muong language is challenging.
In the first part of this thesis, we give an overview of TTS. Studying the phonetics of the Vietnamese and Muong languages, the thesis has also developed and published several tools that support TTS technology for Vietnamese and Muong. In the rest of the thesis, we conduct various experiments on creating TTS for a low-resourced language, specifically the Muong language. We focus on two main groups of low-resourced languages:
Written: we emulate the reading of the Muong language using a Vietnamese TTS and apply cross-lingual transfer learning.
Unwritten: we experiment with adaptation in two directions. The first is to create Muong speech synthesis directly from Vietnamese text and Muong voice. The second is to create Muong speech synthesis from translation through an intermediate representation.
We hope our findings can serve as an impetus to develop speech synthesis for low-resourced languages worldwide and contribute to the basis for speech synthesis development for 53 ethnic minority languages in Viet Nam.
Hanoi, December 8, 2023 Ph.D. Student
PART 1 : BACKGROUND AND RELATED WORKS ... 5
CHAPTER 1. OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGE ... 6
1.1. Overview of speech synthesis ... 6
1.1.1. Overview ... 6
1.1.2. TTS architecture ... 8
1.1.3. Evolution of TTS methods over time ... 9
1.1.3.1. TTS using unit-selection method ... 10
1.1.3.2. Statistical parameter speech synthesis ... 11
1.1.3.3. Speech synthesis using deep neural networks ... 13
1.1.3.4. Neural speech synthesis ... 14
1.2. Speech synthesis for low-resourced languages ... 19
1.2.1. TTS using emulating input approach ... 20
1.2.2. TTS using the polyglot approach ... 22
1.2.3. Speech synthesis for low-resourced language using the adaptation approach ... 25
1.3. Machine translation ... 27
1.3.1. Neural translation model ... 28
1.3.2. Attention in neural machine translation ... 29
1.3.3. Statistical machine translation based on phrase ... 30
1.3.3.1. Statistical machine translation problem based on phrase ... 30
1.3.3.2. Translation model and language model ... 31
1.3.3.3. Decode the input sentence in the translation system ... 32
1.3.3.4. Model for building a statistical translation system ... 34
1.3.4. Machine translation through intermediate representation ... 34
1.3.5. Speech translation for unwritten low-resourced languages ... 36
1.4. Speech synthesis evaluation metrics ... 38
1.4.1. Mean Opinion Score (MOS) ... 38
1.4.1.1. Definition ... 38
1.4.1.2. Formula ... 38
1.4.1.3. Significance ... 38
1.4.2. Mel Cepstral Distortion (MCD) ... 39
1.4.2.1. Concept ... 39
1.4.2.2. Formula ... 39
1.4.2.3. Significance ... 40
1.4.2.4. MCD with Dynamic Time Warping (MCD – DTW) ... 40
1.4.3. Analysis of variance (Anova) ... 40
2.1.2. Vietnamese phonetic system ... 45
2.1.2.1. Vietnamese syllable structure ... 46
2.1.2.2. Vietnamese phonetic system... 47
2.1.2.3. Vietnamese tone system ... 49
2.2.1.4. Muong written script ... 54
2.2.2. Muong phonetics system ... 55
2.2.2.1. Muong syllable structure ... 55
2.2.2.2. Muong phoneme system ... 55
2.2.2.3. Muong tone system ... 57
2.3. Comparison between Vietnamese and Muong ... 57
2.4. Discussion and proposed approach ... 60
PART 2 : SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE ... 61
CHAPTER 3. EMULATING OF THE MUONG TTS BASED ON INPUT TRANSFORMATION OF THE VIETNAMESE TTS ... 62
3.2.4. Analysis by ANOVA method ... 72
3.2.4.1. MOS analysis by ANOVA ... 72
3.2.4.2. Intelligibility analysis by ANOVA ... 75
3.3. Conclusion ... 77
CHAPTER 4. CROSS-LINGUAL TRANSFER LEARNING FOR MUONG SPEECH SYNTHESIS
4.2.1.2. Muong Project‘s data ... 84
4.2.1.3. Muong fine-tuning data ... 84
4.2.2. Graphemes to phonemes ... 85
4.2.3. Training the pretrained model using Vietnamese dataset. ... 86
4.2.4. Finetuned TTS model on Muong datasets ... 87
CHAPTER 5. GENERATE UNWRITTEN LOW-RESOURCED LANGUAGE’S SPEECH DIRECTLY FROM RICH-RESOURCE LANGUAGE’S TEXT ... 97
5.2.5. MOS analysis by ANOVA ... 105
5.2.5.1. ANOVA analysis in Muong Bi speech synthesis ... 105
5.2.5.2. ANOVA analysis in Muong Tan Son speech synthesis ... 108
5.3. Conclusion ... 111
CHAPTER 6. SPEECH SYNTHESIS FOR UNWRITTEN LOW-RESOURCED LANGUAGE USING INTERMEDIATE REPRESENTATION ... 112
6.1. Proposal Method ... 112
6.2. Experiment ... 114
6.2.1. Database building ... 114
6.2.2. System development ... 114
6.2.2.1. Text to phone translation ... 115
6.2.2.2. Phone to Sound Conversion... 117
<b>6.3.Evaluation ... 119</b>
6.3.1. Evaluation in Muong Bi and Muong Tan Son ... 119
6.3.2. MOS analysis by ANOVA ... 122
6.3.2.1. ANOVA analysis in Muong Bi speech synthesis ... 122
6.3.2.2. ANOVA analysis in Muong Tan Son speech synthesis ... 125
A.3. Muong Vietnamese phone mapping ... 6
A.4. Information of Muong volunteers who participated in the assessment ... 9
A.5. Speech signal samples of the Muong TTS in chapter 5 ... 12
CART: Classification And Regression Tree
HTK: Hidden Markov Model Toolkit, a portable toolkit for building and manipulating hidden Markov models
IPA: International Phonetic Alphabet
MARY (TTS): Modular Architecture for Research on speech sYnthesis
OCR: Optical Character Recognition
WEKA: a collection of machine learning algorithms for data mining tasks
SAMPA: Speech Assessment Methods Phonetic Alphabet
G2P: Grapheme to Phoneme
Argmax: Arguments of the maxima, an operation that finds the argument that gives the maximum value from a target function
Table 2.1 Vietnamese syllable structure [94] ... 46
Table 2.2 Vietnamese syllable structure [96] ... 46
Table 2.3 Vietnamese syllables based on structure ... 47
Table 2.4 Hanoi Vietnamese initial consonants ... 48
Table 2.5 The letter of initial consonant ... 48
Table 2.6 Hanoi Vietnamese final consonant ... 49
Table 2.7 Tone of Hanoi Vietnamese [108] ... 49
Table 2.8 Muong syllabic structure ... 55
Table 2.9 Muong final sound system ... 56
Table 2.10 Muong Hoa Binh tone system [115] ... 57
Table 2.11 Muong Bi and Muong Tan Son Tone ... 57
Table 2.12 Muong and Vietnamese phonetic comparison (orthography in normal, IPA in italic; Vi: Vietnamese; Mb: Muong Bi; Mts: Muong Tan Son) ... 59
Table 2.13 Comparing the tone of Vietnamese with Muong Tan Son and Muong Bi ... 60
Table 3.1 Muong G2P Result Sample ... 64
Table 3.2 Examples of applying transformation rules to convert the Muong text into input text for Vietnamese TTS ... 65
Table 3.3 Testing material for emulating tone ... 66
Table 3.4 Testing material for emulating phone (the concerning phonemes in bold) ... 67
Table 3.5 Testing material for remaining phonemes ... 67
Table 3.6 ANOVA Results for MOS Test ... 73
Table 3.7 ANOVA Results for Intelligibility Test ... 75
Table 4.1 Parameters of acoustic model ... 80
Table 4.2 Vietnamese dataset information ... 83
Table 4.3 Muong recorded data ... 85
Table 4.4 The Muong split data set ... 85
Table 4.5 Parameter for optimizer ... 86
Table 4.6 Value of parameters when training Hifigan model ... 86
Table 4.7 The specifications of the in-domain and out-domain test sets ... 89
Table 4.8 Test set samples ... 89
Table 4.9 Evaluation results ... 90
Table 4.10 ANOVA Results for in-domain MOS Test ... 92
Table 4.11 ANOVA Results for out-domain MOS Test ... 93
Table 4.12 ANOVA Results for in/out domain MOS Test ... 94
Table 5.1 Evaluation Score ... 102
Table 5.2 TTS evaluation with in-domain test set ... 103
Table 5.3 TTS evaluation with out-domain test set ... 104
Table 5.4 ANOVA Results for in-domain MOS Test for Muong Bi ... 106
Table 5.5 ANOVA Results for out-domain MOS Test for Muong Bi ... 107
Table 5.6 ANOVA Results for Muong Bi in/out domain MOS Test ... 107
Table 5.7 ANOVA Results for in-domain MOS Test for Muong Tan Son ... 109
Table 5.8 ANOVA Results for out-domain MOS Test for Muong Tan Son ... 110
Table 5.9 ANOVA Results for Muong Tan Son in/out domain MOS Test ... 110
Table 6.1 Examples of labeling Vietnamese text into an intermediate representation of Muong Bi and Muong Tan Son phonemes ... 117
Table 6.2 Text information of Muong language datasets ... 118
Table 6.5 ANOVA Results for in-domain MOS Test for Muong Bi ... 123
Table 6.6 ANOVA Results for out-domain MOS Test for Muong Bi ... 124
Table 6.7 ANOVA Results for Muong Bi in/out domain MOS Test ... 124
Table 6.8 ANOVA Results for in-domain MOS Test for Muong Tan Son ... 126
Table 6.9 ANOVA Results for out-domain MOS Test for Muong Tan Son ... 127
Table 6.10 ANOVA Results for Muong Tan Son in/out domain MOS Test ... 127
Table A.1 Vietnamese vowels ... 1
Table A.2 The Muong initial consonant ... 1
Table A.3 Muong vowels system ... 2
Table A.4 The correspondences between Vietnamese and Muong in 12 words referring to the human body parts [137] ... 4
Table A.7 Muong G2P ... 4
Table A.8 Muong Vietnamese phone mapping ... 7
Table A.9 Muong Hoa Binh volunteers ... 9
Table A.10 Muong Phu Tho volunteers ... 10
Figure 1.1 Basic system architecture of a TTS system [22] ... 8
Figure 1.2 Neural TTS architecture [3] ... 9
Figure 1.3 General and clustering-based unit-selection scheme: Solid lines represent target costs and dashed lines represent concatenation costs [13] ... 10
Figure 1.4 Core architecture of HMM-based speech synthesis system [25] ... 11
Figure 1.5 General HMM-based synthesis scheme [13, p. 5] ... 12
Figure 1.6 A speech synthesis framework based on a DNN [29] ... 13
Figure 1.7 Encoder and Decoder diagram in Seq2Seq model ... 14
Figure 1.8 Char2Wav model [23] ... 17
Figure 1.9 Model of the Tacotron synthesis system [24] ... 18
Figure 1.10 Block diagram of the Tacotron 2 system architecture [25] ... 19
Figure 1.11 Scheme of a HMM-based polyglot synthesizer [48] ... 23
Figure 1.12 Approaches to transfer TTS model from source language to target language [32] ... 26
Figure 1.13 Examples of sequence to sequence transformation [55] ... 28
Figure 1.14 The location of the Attention model in neural machine translation
Figure 1.17 Deploying a statistical translation system [67] ... 34
Figure 1.18 An ordinary voice translation system [11] ... 35
Figure 1.19 Model of the speech-to-speech machine translation system using intermediate representation for unwritten language ... 36
Figure 1.20 Voice-to-text translation system [83] ... 37
Figure 2.1 Mon-Khmer branch of the Austroasiatic family [109, pp. 175–176] ... 51
Figure 2.2 Viet-Muong Group [110] ... 52
Figure 2.3 The distribution of the Muong dialects [114, p. 299] ... 53
Figure 3.1 Emulating TTS for Muong ... 63
Figure 3.2 Muong G2P Module ... 64
Figure 3.3 Intelligibility Results for Muong emulating tones ... 69
Figure 3.4 Intelligibility Test Result for emulating close phonemes ... 70
Figure 3.5 Intelligibility Test Result for Equivalent phonemes ... 71
Figure 3.6 MOS Emulating Test Result ... 72
Figure 4.1 Low-resourced L2 TTS transfer learning from rich resource L1 ... 79
Figure 4.2 Block diagram of the speech synthesis system architecture ... 80
Figure 4.3 Duration histogram ... 83
Figure 4.4 Duration distribution across the M_15m, M_30m, and M_60m datasets ... 85
Figure 4.5 Training loss and validation loss of pretrained TTS model ... 87
Figure 4.6 Training loss and validation error of Hifigan model ... 87
Figure 4.7 Training loss and validation loss of M_15m ... 88
Figure 4.8 Training loss and validation loss of M_30m and M_60m ... 88
Figure 5.1 System architecture ... 99
Figure 5.2 WaveGlow model architecture [136] ... 99
Figure 5.3 Muong Phu Tho training loss and validation loss after training acoustic model ... 100
Figure 5.5 Testing interface ... 102
Figure 6.1 Training phase TTS L1 text to L2 speech system uses intermediate representation of phoneme level ... 113
Figure 6.2 Decoding phase TTS L1 text to L2 speech system uses intermediate representation of phoneme level ... 113
Figure 6.3 The result after manual annotation ... 114
Figure 6.4 Phone to sound module, as a speech synthesis from phone sequence ... 117
Figure 6.5 Muong Hoa Binh Training loss and validation loss after training acoustic model ... 118
Figure 6.6 Muong Phu Tho Training loss and validation loss after training acoustic model ... 118
Figure 6.7 Testing interface ... 119
Figure 6.8 Comparing the synthesized speech results on Muong Hoa Binh using three
Figure 6.12 Summary of direction for low-resourced language speech synthesis ... 133
Figure A.1 Raw Muong Hoa Binh: ban vận động thành lập hội trí thức tỉnh ra mắt ... 13
Figure A.2 Muong Hoa Binh synthesis: ban vận động thành lập hội trí thức tỉnh ra mắt ... 13
Figure A.3 Muong Phu Tho raw: ban vận động thành lập hội trí thức tỉnh ra mắt ... 14
Figure A.4 Muong Phu Tho synthesis: ban vận động thành lập hội trí thức tỉnh ra mắt ... 14
Figure A.5 Muong Hoa Binh raw: Bố cháu ở nhà hay đi đâu ... 15
Figure A.6 Muong Hoa Binh synthesis: Bố cháu ở nhà hay đi đâu ... 15
Figure A.7 Muong Phu Tho raw: Bố cháu ở nhà hay đi đâu ... 16
Figure A.8 Muong Phu Tho synthesis: Bố cháu ở nhà hay đi đâu ... 16
Today's speech-processing technology is essential in many aspects of human-machine interaction. Many recent voice interaction systems have been introduced, allowing users to communicate with devices on various platforms, such as smartphones (Apple Siri, Google Assistant, Amazon Alexa, etc.), intelligent cars (BMW, Ford, etc.), and smart homes. In these systems, one of the essential components is speech synthesis, or Text-to-Speech (TTS), which converts input text into speech. Developing a TTS system for a language not only involves implementing speech processing techniques but also requires linguistic studies covering phonetics, phonology, syntax, and grammar.
According to statistics in the 25th edition of Ethnologue (regarded as the most comprehensive source of information on linguistic statistics), there are 7,151 living languages in the world, belonging to 141 language families, of which 2,982 languages are not written. Some languages have not been described in academic literature, such as dialects of ethnic minorities. Machine learning methods based on big data do not immediately apply to low-resourced languages, especially unwritten ones. The field of low-resourced and unwritten language processing has only started to attract attention in the past few years and does not yet have many results. However, research in this field is important because, in addition to bringing voice communication technologies to ethnic minority communities, it also contributes to the conservation of endangered languages.
Regarding the Vietnamese language and speech processing field, domestic research units have given it comprehensive attention and addressed various aspects, ranging from natural language processing problems such as text processing, syntactic parsing, and semantics to speech processing problems such as synthesis and recognition. However, the problem of language and speech processing in general, including TTS systems for minority languages without a writing system in Vietnam, has not received much attention due to the scarcity of data sources such as bilingual text data and speech data, as well as a lack of related linguistic studies.
The Muong language presents unique linguistic characteristics, such as tonality and complex phonetic structures, that make it challenging to develop a TTS system. Therefore, this thesis aims to fill this gap by focusing on developing a TTS system for the Muong language, a minority language spoken in Vietnam that largely does not have a writing system (only the Muong Hoa Binh dialect has had a writing system, introduced in 2016). This research area is novel not only in Vietnam but also worldwide, and the development of a Muong TTS system can contribute to preserving and promoting this endangered language.
Context and constraints
This thesis will classify low-resourced languages into two categories: written and unwritten. The Muong language will be the object of study in both cases:
Written: The Muong dialect of Hoa Binh will be examined, as it possesses a written form.
Unwritten: The Muong dialect of Phu Tho will be investigated, as it lacks a written form.
In other regions, the Muong people currently do not use written language. They often read directly from Vietnamese text and convert it into Muong speech for broadcasting and
communication purposes. This research aims to address these challenges and improve the accessibility of TTS technology for both written and unwritten Muong dialects.
Moreover, this thesis is conducted within the scope of, and in collaboration with, the project DLCN.20/17: "Research and development automatic translation system from Vietnamese text to Muong speech, apply to unwritten minority languages in Vietnam"
(Nghiên cứu xây dựng hệ dịch tự động văn bản tiếng Việt ra tiếng nói tiếng Mường, hướng đến áp dụng cho các ngôn ngữ dân tộc thiểu số chưa có chữ viết ở Việt Nam). Specific components of this project include:
Recorded speech from both Muong Hoa Binh and Muong Phu Tho dialects.
A machine translation tool that converts Vietnamese text to an intermediate representation of the Muong language.
In turn, the research findings of this thesis have been successfully applied and integrated into the project above, demonstrating the practical value of the work undertaken in this thesis.
Challenges
Challenges Faced by Current Research:
Data Scarcity: The foremost challenge is the paucity of training data. TTS models demand substantial text-speech pairs for effective training. However, for low-resourced languages, acquiring such data can be exceedingly difficult, if not impossible.
Limited Linguistic Knowledge: Inadequate linguistic knowledge hinders TTS development; knowledge of a language's phonetics, vocabulary, and prosody is crucial, but it is frequently absent for low-resourced languages.
Lack of Linguistic Studies: Linguistic research serves as the backbone for building TTS systems. Unfortunately, languages with limited resources often lack comprehensive linguistic studies, making it arduous to capture essential linguistic characteristics.
To address these challenges, this work proposes an adaptive TTS approach that efficiently utilizes limited resources to synthesize high-quality speech for Muong, a low-resourced language. The approach leverages transfer learning techniques from related languages and applies unsupervised learning methods to reduce the need for extensive labelled data. In addition, emulating the input of a rich-resource TTS is also effective for a written low-resourced language. For an unwritten low-resourced language, one adaptation is to use the text or an intermediate representation of another language to help build a better TTS.
The proposed approach demonstrates the effectiveness of adaptive TTS in synthesizing low-resourced languages. However, further research and investment in linguistic studies for low-resourced languages are necessary to improve the quality of TTS systems. With continued efforts, we can develop more robust TTS systems that provide access to speech synthesis for all languages, regardless of their resource availability.
Objectives & approaches
This thesis aims to develop a Text-to-Speech (TTS) system for low-resourced languages, focusing on the Muong language, by utilizing adaptation techniques. We categorize low-resourced languages into two groups:
Written low-resourced languages: using emulating input and an adaptive approach to enhance the available linguistic resources.
Unwritten low-resourced languages: using intermediate representations or leveraging text from rich-resourced languages to bridge the gap in linguistic resources.
In this way, the thesis aims to make TTS technology more accessible to low-resourced languages, thus expanding its applications and fostering communication across diverse linguistic communities. By focusing on Muong language as a specific case study, this research not only contributes to the broader field of low-resourced languages but also opens doors for practical applications. For instance, it paves the way for the development of applications catering to the Muong community, including Muong radio broadcasts and Muong-speaking newspapers, all generated from Vietnamese text. This demonstrates the real-world impact of the research, showcasing its potential to empower minority languages like Muong and preserve their cultural heritage.
Contributions
The thesis presents the following key contributions:
First contribution: A method for synthesizing speech from written text for a language with limited data, using the Muong language as a specific application case. This includes (1) an adaptation technique that utilizes the input of a Vietnamese speech synthesis system (without requiring training data) and (2) fine-tuning the Vietnamese speech synthesis model with a small amount of Muong language data.
Second contribution: A method for synthesizing speech for an unwritten language using a closely related language with available resources (generating Muong speech from Vietnamese text). This approach treats the Muong language as if it were unwritten. The two proposed methods are: (1) employing an intermediate representation and (2) directly converting Vietnamese text into Muong speech.
In addition to the two main contributions mentioned above, we also researched the comparison of Vietnamese and Muong languages, drawing several valuable conclusions for phonetic studies and natural language processing. We have published various educational materials and tools for processing text and vocabulary in Vietnamese and Muong.
Dissertation outline
The dissertation is composed of three parts and six chapters, organized as follows:
PART 1: BACKGROUND AND RELATED WORKS
Chapter 1, titled "Overview of speech synthesis and speech synthesis for low-resourced language": This chapter concisely reviews the existing literature to gain a comprehensive understanding of TTS. Research directions for low-resourced TTS are also detailed in this chapter.
Chapter 2, titled "Vietnamese and Muong language": This chapter presents research on the phonology of the Vietnamese and Muong languages. Computational linguistic resources for Vietnamese speech processing are described in detail as applied in Vietnamese TTS.
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
Chapter 3, titled "Emulating of the Muong TTS based on input transformation of the Vietnamese TTS": This chapter presents an emulation approach in which Muong text is transformed into input for existing Vietnamese TTS systems. This approach can be experimentally applied to create TTS systems for other Vietnamese ethnic minority languages quickly.
Chapter 4, titled "Cross-lingual transfer learning for Muong speech synthesis": In this chapter, we use and experiment with approaches for Muong TTS that leverage Vietnamese resources. We focus on transfer learning by creating a Vietnamese TTS, further training it with different Muong datasets, and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
Chapter 5, titled "Generate unwritten low-resourced language's speech directly from rich-resource language's text": This chapter presents our approach for addressing speech synthesis challenges for unwritten low-resourced languages by synthesizing L2 speech directly from L1 text. The proposed system is built using end-to-end neural network technology for text-to-speech. We use Vietnamese as L1 and Muong as L2 in our experiments.
Chapter 6, titled "Speech synthesis for unwritten low-resourced language using intermediate representation": This chapter proposes using a phoneme representation due to its close relationship with speech within a single language. The proposed method is applied to the Vietnamese and Muong language pair. Vietnamese text is translated into an intermediate representation of two unwritten dialects of the Muong language: Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho. The evaluation reveals relatively high translation quality for both dialects.
In conclusion, speech synthesis for low-resourced languages is a significant research area with the potential to positively impact the lives of speakers of these languages. Despite challenges posed by limited data and linguistic knowledge, advancements in speech synthesis technology and innovative approaches enable the development of high-quality speech synthesis systems for low-resourced languages. The work presented in this dissertation contributes to this field by exploring novel methods and techniques for speech synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech synthesis for low-resourced languages, particularly in response to the growing demand for accessible technology. This can be achieved through ongoing research in transfer learning, unsupervised learning, and data augmentation. Additionally, there is a need for further investment in collecting and preserving linguistic data for low-resourced languages and developing phonological studies for these languages. With these efforts, we can ensure that speech synthesis technology is accessible to everyone, regardless of their language.
CHAPTER 1. OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGE
This section presents a concise overview of Text-to-Speech (TTS) synthesis and its application to low-resourced languages. It highlights the challenges faced in developing TTS systems for languages with limited resources and data. Additionally, it introduces various approaches and techniques to address these challenges and improve TTS quality for low-resourced languages.
1.1. Overview of speech synthesis
This section offers a brief introduction to the field of speech synthesis. It highlights the key concepts and techniques in converting written text into spoken language. It also provides a foundation for understanding the complexities and challenges of developing speech synthesis systems.
1.1.1. Overview
Speech synthesis is the artificial generation of human speech using technology. A computer system designed for this purpose, known as a speech computer or speech synthesizer, can be realized through software or hardware implementations. A text-to-speech (TTS) system specifically converts standard written language text into audible speech, whereas other systems transform symbolic linguistic representations, such as phonetic transcriptions, into speech [1]. TTS technology has evolved significantly, incorporating advanced algorithms and machine learning techniques to produce more natural-sounding and intelligible speech output. By simulating various aspects of human speech, including pitch, tone, and intonation, TTS systems strive to provide a seamless and user-friendly listening experience.
The development of TTS technology has undergone remarkable progress over time:
From the late 1930s to the 1950s, pioneers like Homer Dudley, with his "VODER", and Franklin S. Cooper, with the "Pattern Playback", laid the foundation for modern TTS systems.
The 1960s brought forth formant-based synthesis, utilizing models of vocal tract resonances to produce speech sounds.
The 1970s introduced linear predictive coding (LPC), enhancing speech signal modeling and producing more natural synthesized speech.
The 1980s saw the emergence of concatenative synthesis, a method that combined pre-recorded speech segments for the final output.
During the 1990s, unit selection synthesis became popular, using extensive databases to select the best-fitting speech units for more natural output.
The 2000s experienced the rise of statistical parametric synthesis techniques, such as Hidden Markov Models (HMMs), providing a data-driven and adaptable approach to TTS.
The 2010s marked the beginning of deep learning-based TTS with models like Google's WaveNet, revolutionizing speech synthesis by generating raw audio waveforms instead of relying on traditional signal processing. End-to-end neural TTS systems like Tacotron streamlined the TTS
process by directly converting text to speech without intermediate stages. Transfer learning and multilingual TTS models have recently enabled the development of high-quality TTS systems for low-resourced languages, expanding the reach of TTS technology.
Today, TTS plays a vital role in everyday life, powering virtual assistants, accessibility tools, and various digital content types.
Some current applications of text-to-speech (TTS) technology include:
Assistive technology for the visually impaired: TTS systems help blind and visually impaired individuals by reading text from books, websites, and other sources, converting it into audible speech.
Learning tools: TTS systems are used in computer-aided learning programs, aiding language learners and students with reading difficulties or dyslexia by providing auditory reinforcement.
Voice output communication aids: TTS technology assists individuals with severe speech impairments by enabling them to communicate through synthesized speech.
Public transportation announcements: TTS provides automated announcements for passengers on buses, trains, and other public transportation systems.
E-books and audiobooks: TTS systems can read electronic books and generate audiobooks, making content accessible to a broader audience.
Entertainment: TTS technology is utilized in video games, animations, and other forms of multimedia entertainment to create realistic and engaging voiceovers.
Email and messaging: TTS systems can read emails, text messages, and other written content aloud, helping users stay connected and informed.
Call center automation: TTS is employed in automated phone systems, allowing users to interact with voice-activated menus and complete transactions through spoken commands.
Virtual assistants: TTS is a crucial component of popular voice-activated virtual assistants like Apple's Siri, Google Assistant, and Amazon's Alexa, enabling them to provide spoken responses to user queries.
Voice search applications: By integrating TTS with speech recognition, users can use speech as a natural input method for searching and retrieving information through voice search apps.
In conclusion, TTS technology has come a long way since its inception, with continuous advancements in algorithms, machine learning, and deep learning techniques. As a result, TTS systems now provide more natural-sounding and intelligible speech, enhancing the user experience across various applications such as assistive technology, learning tools, entertainment, virtual assistants, and voice search. The ongoing development and integration of TTS into our daily lives will continue to shape the future of human-computer interaction and digital accessibility.
1.1.2. TTS architecture
The architecture of a TTS system is generally composed of several components, as depicted in Figure 1.1. The Text Processing component is responsible for preparing the input text for speech synthesis. The G2P Conversion component converts the written words into
their corresponding phonetic representations. The Prosody Modeling component adds appropriate intonation, duration, and other prosodic features to the phonetic sequence. Lastly, the Speech Synthesis component generates the speech waveform based on the parameters derived from the fully tagged phonetic sequence [2].
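To make the division of labour between these components concrete, the sketch below composes the four stages as plain functions. It only illustrates the data flow described above; the function names and bodies are toy placeholders and do not correspond to any particular toolkit.

```python
# Minimal sketch of the classical TTS pipeline described above.
# All functions are illustrative placeholders, not a real toolkit API.

def text_processing(text: str) -> list[str]:
    """Normalize the input and split it into spoken-form words."""
    return text.lower().replace(",", "").split()

def g2p(words: list[str]) -> list[str]:
    """Map each word to a phoneme string (here: a toy per-character mapping)."""
    return [" ".join(word) for word in words]

def prosody_model(phonemes: list[str]) -> list[tuple[str, float]]:
    """Attach a duration (and, in a real system, pitch/stress) to each phoneme string."""
    return [(p, 0.1 * len(p.split())) for p in phonemes]

def speech_synthesis(tagged: list[tuple[str, float]]) -> list[float]:
    """Generate a waveform from the fully tagged phonetic sequence (stubbed)."""
    return [0.0] * int(sum(dur for _, dur in tagged) * 16000)

def tts(text: str) -> list[float]:
    return speech_synthesis(prosody_model(g2p(text_processing(text))))

print(len(tts("Hello, world")))  # 16000 synthetic samples for one second of "speech"
```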
Text processing is crucial for identifying and interpreting all textual or linguistic information that falls outside the realms of phonetics and prosody. Its primary function is to transform non-orthographic elements into words that can be spoken aloud. Through text normalization, symbols, numbers, dates, abbreviations, and other non-orthographic text elements are converted into a standard orthographic transcription, facilitating subsequent phonetic conversion. Additionally, analyzing whitespace, punctuation, and other delimiters is vital for determining document structure and providing context for all subsequent steps. Certain text structure elements may also directly impact prosody. Advanced syntactic and semantic analysis can be achieved through effective text-processing techniques [2, p. 682].
Figure 1.1. Basic system architecture of a TTS system [22]
The phonetic analysis aims to transform orthographic symbols of words into phonetic representations, complete with any diacritic information or lexical tones present in tonal languages. Although future TTS systems might rely on word-sounding units and possess increased storage capacity, homograph disambiguation and grapheme-to-phoneme (G2P) conversion for new words remain essential for accurate pronunciation of every word. G2P conversion is relatively straightforward in languages with a clear relationship between written and spoken forms. A small set of rules can effectively describe this direct correlation, which is characteristic of phonetic languages such as Spanish and Finnish. Conversely, English is not a phonetic language due to its diverse origins, resulting in less predictable letter-to-sound relationships. In these cases, employing general letter-to-sound rules and dictionary lookups can facilitate the conversion of letters to sounds, enabling the correct pronunciation of any word [2, p. 683].
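As an illustration of the dictionary-plus-rules strategy just described, the following sketch combines a small exception lexicon with greedy longest-match letter-to-sound rules. The phoneme symbols and rules are invented for the example and are not a real phone set for any language.

```python
# Toy grapheme-to-phoneme converter combining a lexicon lookup with
# letter-to-sound rules. Symbols and rules are simplified examples only.

LEXICON = {"one": "w ah n"}          # exceptions handled by dictionary lookup
RULES = {"ch": "ch", "sh": "sh", "a": "ae", "e": "eh", "i": "ih",
         "o": "ow", "u": "uw"}       # longest-match letter-to-sound rules

def g2p(word: str) -> str:
    word = word.lower()
    if word in LEXICON:              # 1) dictionary lookup for known exceptions
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):             # 2) greedy longest-match rule application
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += length
                break
        else:                        # no rule matched: fall back to the letter itself
            phones.append(word[i])
            i += 1
    return " ".join(phones)

print(g2p("machine"))  # -> "m ae ch ih n eh"
```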
In TTS systems, prosodic analysis involves examining prosodic features within the text input, such as stress, duration, pitch, and intensity. This information is then utilized to generate more natural and expressive speech. Prosodic analysis helps determine the appropriate stress, intonation, and rhythm for the synthesized speech, resulting in a more human-like output. Predicting prosodic features can be achieved through rule-based or machine-learning methods, including acoustic modeling and statistical parametric speech synthesis. By adjusting the synthesized speech, TTS systems can convey various emotions or speaking styles, enhancing their versatility and effectiveness across diverse applications.
Speech synthesis employs anticipated information from the fully tagged phonetic sequence to generate the corresponding speech waveform. Broadly, two traditional speech synthesis techniques are concatenative and source/filter synthesizers. Concatenative synthesizers assemble pre-recorded human speech components to produce the desired utterance. In contrast, source/filter synthesizers create synthetic voices using a source/filter model based on the parametric description of speech. The first method necessitates assistance in generating high-quality speech using the input text's parametric representation and speech parameters. Meanwhile, the second approach requires a combination of algorithms and signal processing adjustments to ensure smooth and continuous speech, particularly at junctures.
Several improvements have been proposed for high-quality text-to-speech (TTS) systems, drawing from the two fundamental speech synthesis techniques. Among the most prominent state-of-the-art methods are statistical parametric speech synthesis and unit selection techniques, which have been the subject of extensive debate among researchers in the field.
Figure 1.2 Neural TTS architecture [3]
With the advancement of deep learning, neural network-based TTS (neural TTS) systems have been proposed, utilizing (deep) neural networks as the core model for speech synthesis. A neural TTS system comprises three fundamental components: a text analysis module, an acoustic model, and a vocoder. As illustrated in Figure 1.2, the text analysis module transforms a text sequence into linguistic features. The acoustic model then generates acoustic features from these linguistic features, and finally, the vocoder synthesizes the waveform from the acoustic features.
1.1.3. Evolution of TTS methods over time
The evolution of TTS methods has progressed significantly over time, with advancements in technology and research contributing to more natural and intelligible
speech synthesis. Early TTS systems relied on rule-based methods and simple concatenation techniques, which have since evolved into sophisticated machine learning approaches, including neural network-based TTS systems. These modern systems offer improved speech quality, prosody, and adaptability, resulting in more versatile applications across various industries.
1.1.3.1. TTS using unit-selection method
The unit-selection approach allows for the creation of new genuinely sounding utterances by picking relevant sub-word units from a natural speech database [4], based on how well a chosen unit matches a specification/a target unit (and how well two chosen units join together). During synthesis, an algorithm chooses one unit from the available options to discover the best overall sequence of units that meets the specification [1]. The specification and the units are described by a feature set that includes linguistic and speech elements. The feature set is used to do a Viterbi-style search to determine the sequence of units with the lowest total cost.
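The cost minimization described above can be written as a small dynamic-programming search. The sketch below is a generic illustration with placeholder target and concatenation costs (here based only on pitch values); a real system would use much richer linguistic and acoustic features.

```python
# Sketch of the unit-selection search: choose one candidate unit per target
# position so that the sum of target costs and concatenation (join) costs is
# minimal. The cost functions below are illustrative placeholders.

def target_cost(target, unit):
    # how well a database unit matches the target specification
    return abs(target["pitch"] - unit["pitch"])

def concat_cost(prev_unit, unit):
    # how smoothly two consecutive units join
    return abs(prev_unit["end_f0"] - unit["start_f0"])

def select_units(targets, candidates):
    """Viterbi-style search: best[i][j] = (cost, backpointer) for candidate j at target i."""
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + concat_cost(prev, u)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min] + target_cost(targets[i], u), k_min))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])   # lowest total cost
    path = [j]
    for i in range(len(targets) - 1, 0, -1):                      # backtrack
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))

targets = [{"pitch": 120}, {"pitch": 180}]
candidates = [[{"pitch": 118, "end_f0": 119, "start_f0": 117},
               {"pitch": 140, "end_f0": 139, "start_f0": 141}],
              [{"pitch": 185, "end_f0": 184, "start_f0": 120},
               {"pitch": 178, "end_f0": 177, "start_f0": 150}]]
print(select_units(targets, candidates))  # -> [0, 0]
```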
Although they are theoretically quite similar, the review of Zen [4] suggests that there are two fundamental methods in unit-selection synthesis: (i) the selection model [5], shown in Figure 1.3a; (ii) the clustering approach [6], shown in Figure 1.3b, which effectively enables the target cost to be pre-calculated. The second method asks questions about features available at the time of synthesis and groups units of the same type into a decision tree.
In the selection model for TTS synthesis, speech units are chosen based on a cost function calculated in real time during the synthesis process. This cost function considers the acoustic and linguistic similarity between the target text and available speech units in the database, selecting the unit with the lowest cost for synthesis. Conversely, the clustering approach pre-calculates the cost for each speech unit, grouping similar units into a decision tree. This tree allows for rapid speech unit selection during synthesis based on available features, reducing the real-time computation and resulting in faster, more efficient TTS synthesis. Both methods have their advantages and disadvantages, with the selection model offering greater flexibility for adapting to different languages and voices and the clustering approach providing enhanced speed and efficiency. The choice between these methods depends on the specific needs of the TTS system being developed.
Figure 1.3. General and clustering-based unit-selection scheme: Solid lines represent target costs and dashed lines represent concatenation costs [13]
1.1.3.2. Statistical parameter speech synthesis
In a typical statistical parametric speech synthesis system, a set of generative models is used to model the parametric speech representations extracted from a speech database, including spectral and excitation parameters (also known as vocoder parameters, since they are used as inputs of the vocoder). The model parameters are frequently estimated using the Maximum Likelihood (ML) criterion. Then, speech parameters are generated for a specific word sequence to be synthesized from the estimated models so as to maximize their output probabilities. Finally, a speech waveform is built from the parametric representations of speech [4].
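These two steps can be summarized compactly. Using the usual SPSS notation, with λ the generative model parameters, O and W the speech parameters and transcriptions of the training database, and o and w the speech parameters and word sequence of the sentence to be synthesized, training and generation amount to:

```latex
\hat{\lambda} = \arg\max_{\lambda}\, p(\mathbf{O} \mid \mathbf{W}, \lambda),
\qquad
\hat{\mathbf{o}} = \arg\max_{\mathbf{o}}\, p(\mathbf{o} \mid \mathbf{w}, \hat{\lambda})
```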
Any generative model can be employed; however, HMMs are mainly well-known. In HMM-based speech synthesis (HTS) [7], context-dependent HMMs statistically model and produce the speech parameters of a speech unit, such as the spectrum and excitation parameters (for example, fundamental frequency - F0). A typical HMM-based speech
synthesis system's core architecture, as shown in Figure 1.4 [8], consists of two main processes: training and synthesis.
The Expectation Maximization (EM) algorithm is used to do the ML estimation (MLE) during training, and it is similar to speech recognition. The primary distinction is that excitation and spectrum parameters are taken from a database of natural speech that a collection of multi-stream context-dependent HMMs has modeled. Excitation parameters include log F0 and its dynamic properties.
Another distinction is adding prosodic and linguistic circumstances to phonetic settings (called contextual features). The state-duration distribution for each HMM is also used to describe the temporal structure of speech. The Gamma distribution and the Gaussian distribution are options for state-duration distributions. In order to estimate them, the forward-backward method used statistical data that was gathered during the previous iteration.
Figure 1.4. Core architecture of HMM-based speech synthesis system [25]
An inverse speech recognition procedure is carried out throughout the synthesis process. First, the utterance HMM is built by concatenating the context-dependent HMMs according to the label sequence, after a given word sequence is transformed into a context-dependent label sequence. Second, the speech parameter generation algorithm creates spectral and excitation parameter sequences from the utterance HMM. The obtained spectral and excitation parameters are then used to create a speech waveform using a speech synthesis filter and a vocoder with a source-excitation/filter model [4].
Figure 1.5. General HMM-based synthesis scheme [13, p. 5]
Figure 1.5 illustrates the general scheme of HMM-based synthesis [4, p. 5]. An HMM-based TTS system defines a feature system and trains a different model for each feature combination. Because each has its context dependency, the spectrum, excitation, and duration are all modeled simultaneously in an integrated framework of HMMs. Due to the combinatorial explosion of contextual information, their parameter distributions are grouped independently and contextually using phonetic decision trees. The models corresponding to the entire context label sequence, which was predicted from the text, are concatenated to create the speech parameters. The duration model is used to select a state sequence prior to producing parameters. This establishes the number of frames that will be produced from each model state. Actual speech, where the fluctuations in speech characteristics are smoother, would not fit this well.
1.1.3.3. Speech synthesis using deep neural networks
A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers [9], [10]. DNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives [11]. The extra layers enable the composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing external network [9].
Deep architectures contain numerous variations on a few fundamental ideas. Every architecture has achieved success in particular fields. Unless they have been tested on the same data sets, comparing the performance of different architectures is seldom viable. DNNs are typically feedforward networks in which information moves straight from the input layer to the output layer.
Figure 1.6 illustrates a speech synthesis framework based on a DNN. A given text to be synthesized is first converted to a sequence of input features {x_n^t}, where x_n^t denotes the n-th input feature at frame t. The input features include binary answers to questions about linguistic contexts and numeric values [12].
Figure 1.6. A speech synthesis framework based on a DNN [29]
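A minimal sketch of such a frame-level DNN acoustic model is given below, assuming PyTorch; the layer sizes and the input/output dimensions are illustrative guesses rather than the configuration used in [12].

```python
# Frame-level DNN acoustic model: linguistic input features for one frame are
# mapped to vocoder parameters. Dimensions are illustrative only.
import torch
import torch.nn as nn

INPUT_DIM = 400     # e.g. binary context questions + numeric position features
OUTPUT_DIM = 127    # e.g. spectral + excitation (F0) parameters with deltas

model = nn.Sequential(
    nn.Linear(INPUT_DIM, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, OUTPUT_DIM),        # linear output layer for regression
)

frames = torch.randn(200, INPUT_DIM)    # 200 frames of linguistic features
acoustic = model(frames)                # predicted vocoder parameters
loss = nn.functional.mse_loss(acoustic, torch.zeros_like(acoustic))
loss.backward()                         # trained with a frame-wise MSE criterion
print(acoustic.shape)                   # torch.Size([200, 127])
```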
1.1.3.4. Neural speech synthesis
As deep learning evolves, neural network-based TTS (neural TTS for short) is proposed, which uses (deep) neural networks as the speech synthesis model's backbone, as shown in Figure 1.2. SPSS has incorporated early neural models to replace HMM for audio modeling. Later, WaveNet [13] is proposed to produce waveform directly from language information, making it the first contemporary neural TTS model. Other models, such as DeepVoice 2 [14], adhere to the three components of statistical parametric synthesis but enhance them with neural network-based models. Moreover, several end-to-end models (e.g., Tacotron 2 [15, p. 2], Deep Voice 3 [16], and FastSpeech 2 [17]) are proposed to simplify text analysis modules and directly accept character/phoneme sequences as input, as well as to simplify acoustic characteristics with Mel-spectrograms. Later, end-to-end TTS systems such as
ClariNet [18], FastSpeech 2 [17], and EATS [19] are created to generate waveform directly from the text. The advantages of neural network-based speech synthesis over prior TTS systems based on concatenative synthesis and statistical parametric synthesis include great voice quality in terms of both intelligibility and naturalness and reduced need for human preprocessing and feature development.
The End-to-End [15] method proposed by Google in 2017 is based on the Seq2Seq model widely used in machine translation. Seq2Seq includes two components: an encoder and a decoder. Both components are neural networks. The encoder converts the input data (input sequence) into a linguistic representation, while the decoder is responsible for generating the output sound from the linguistic characteristics created by the encoder. This is currently among the best-performing speech synthesis approaches; a typical example is the Tacotron system, which produces voices close to natural human voices.
The End-to-End method has the advantage of having less module processing, so the discrepancy between the predicted results and the input is small, resulting in the voice quality being closest to natural. However, the downside of this method is that the amount of data needed to train the model is enormous, along with the training time that takes tens of hours or even weeks and requires tremendous computer performance. Therefore, the cost of building these systems is enormous [19].
Figure 1.7 Encoder and Decoder diagram in Seq2Seq model
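The following skeleton, written with PyTorch, shows the encoder/decoder split of Figure 1.7 in code: an encoder consumes character IDs and a decoder autoregressively emits acoustic frames. It is a deliberately simplified sketch (the attention mechanism and all Tacotron-specific details are omitted), not an implementation of any published system.

```python
# Schematic Seq2Seq TTS skeleton: text -> encoder representation -> decoder
# emits acoustic frames (e.g. Mel-spectrogram frames) one step at a time.
import torch
import torch.nn as nn

class TinySeq2SeqTTS(nn.Module):
    def __init__(self, n_chars=64, emb=128, hidden=256, n_mels=80):
        super().__init__()
        self.n_mels = n_mels
        self.embed = nn.Embedding(n_chars, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRUCell(n_mels, hidden)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, char_ids, n_frames):
        # enc_out would feed an attention mechanism; it is unused in this sketch
        enc_out, state = self.encoder(self.embed(char_ids))
        h = state[-1]                                        # init decoder from encoder
        frame = torch.zeros(char_ids.size(0), self.n_mels)   # "go" frame
        outputs = []
        for _ in range(n_frames):                            # autoregressive decoding
            h = self.decoder(frame, h)
            frame = self.proj(h)
            outputs.append(frame)
        return torch.stack(outputs, dim=1)                   # (batch, frames, n_mels)

model = TinySeq2SeqTTS()
mel = model(torch.randint(0, 64, (2, 20)), n_frames=100)
print(mel.shape)  # torch.Size([2, 100, 80])
```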
Fully end-to-end TTS models can generate the speech waveform directly from a sequence of characters or phonemes, which simplifies the pipeline and can also cut training, development, and deployment costs [3].
However, training TTS models end-to-end is challenging, primarily due to the differences in modalities between text and speech waveforms and the significant length disparity between character/phoneme sequences and waveform sequences. For a 5-second speech with approximately 20 syllables, the length of the phoneme sequence is around 100, while the length of the waveform sequence is 80,000 (assuming a 16kHz sample rate). Memory constraints make it difficult to include the waveform points of an entire utterance during model training. Additionally, capturing context representations is problematic when using only a short audio clip for end-to-end training.
In end-to-end models [3], only text normalization and grapheme-to-phoneme conversion are preserved to transform characters into phonemes, or the entire text analysis module can be omitted by directly taking characters as input. Acoustic features are simplified, where complex characteristics such as MGC, BAP, and F0 employed in SPSS are consolidated into Mel-spectrograms. Additionally, two or three modules can be replaced with a single end-to-end model. For instance, acoustic models and vocoders can be substituted by a unified vocoder model, like WaveNet.
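As an example of this simplified acoustic target, a Mel-spectrogram can be computed from a waveform with a few lines of librosa; the file name and the frame parameters below are arbitrary illustrative choices rather than values prescribed by any particular system.

```python
# 80-band Mel-spectrogram as the simplified acoustic target of many
# end-to-end models (1024-sample FFT, 256-sample hop at 22.05 kHz).
import librosa
import numpy as np

wav, sr = librosa.load("sample.wav", sr=22050)          # any speech file
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))              # log-compressed target
print(log_mel.shape)                                     # (80, number_of_frames)
```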
Numerous advanced vocoders have emerged in neural TTS systems to enhance speech synthesis quality. One prominent example is WaveNet [13], created by Google's DeepMind. WaveNet is a sophisticated generative model that employs convolutional neural networks (CNNs) to generate raw audio waveforms directly. By modeling temporal dependencies in audio data using a large receptive field, it attains high-quality and natural-sounding speech. The success of WaveNet has generated significant interest in the field of speech synthesis, laying the groundwork for future advancements.
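The key architectural idea, stacking dilated causal convolutions so that the receptive field grows exponentially with depth, can be sketched as follows; this toy module omits WaveNet's gated activations, residual and skip connections, and categorical sample prediction.

```python
# Stack of dilated causal 1-D convolutions in the spirit of WaveNet: with
# kernel size 2 and dilations 1, 2, 4, ..., 512, ten layers cover a
# receptive field of 1024 samples.
import torch
import torch.nn as nn

class DilatedStack(nn.Module):
    def __init__(self, channels=32, n_layers=10):
        super().__init__()
        self.layers = nn.ModuleList()
        self.receptive_field = 1
        for i in range(n_layers):
            d = 2 ** i
            self.layers.append(nn.Conv1d(channels, channels, kernel_size=2,
                                         dilation=d, padding=0))
            self.receptive_field += d

    def forward(self, x):                     # x: (batch, channels, time)
        for i, conv in enumerate(self.layers):
            pad = 2 ** i
            # left-pad so the convolution stays causal (no future samples)
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

net = DilatedStack()
print(net.receptive_field)                    # 1024 samples
y = net(torch.randn(1, 32, 16000))            # one second of 16 kHz features
print(y.shape)                                # torch.Size([1, 32, 16000])
```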
Another prevalent approach to vocoding in neural TTS involves the use of generative adversarial networks (GANs), which has given rise to the development of the GAN-based vocoder family. HiFi-GAN [20], a distinguished example within this group, produces high-fidelity speech from input acoustic features. By employing a multi-scale generative network and a multi-resolution discriminator, it captures both local and global structures in the generated audio, yielding high-quality and natural-sounding speech. The adversarial training process in GAN-based vocoders contributes to refining the synthesized speech, making it more authentic and expressive.
Both WaveNet and GAN-based vocoders, such as HiFi-GAN, have significantly contributed to the advancements in neural TTS. They offer more natural and high-quality synthesized speech, enabling TTS systems to be more versatile and effective across various applications, including virtual assistants, audiobook narration, and accessibility services for visually impaired users.
Fully end-to-end TTS models, which generate speech waveforms directly from character or phoneme sequences, offer several advantages over traditional cascaded approaches:
They require less human annotation and feature development, such as alignment information between text and speech, reducing the need for labor-intensive manual work.
Joint and end-to-end optimization can prevent error propagation common in cascaded models, such as those involving text analysis, acoustic
models, and vocoders. End-to-end models can achieve more accurate and efficient speech synthesis by streamlining the process.
These models can reduce training, development, and deployment costs, making them a more attractive option for various applications.
Overall, fully end-to-end TTS models present a promising direction for the future of speech synthesis technology.
Some notable examples of fully end-to-end TTS models include:
WaveNet [13]: A Generative Model for Raw Audio - Developed by DeepMind, WaveNet introduced a novel deep learning architecture for speech synthesis that generates speech waveforms directly at the sample level, conditioned on features derived from the input text. Inspired by PixelCNN, it employs a dilated causal convolutional network to model the temporal dependencies in the audio signal, producing high-quality, natural-sounding speech that has been widely adopted in applications such as Google Assistant.
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS) – VITS [21] is a state-of-the-art end-to-end TTS model combining the strengths of conditional variational autoencoders (CVAEs) and adversarial learning. It leverages variational inference techniques and generative adversarial networks (GANs) to produce high-quality, natural-sounding speech directly from text input. Jointly optimizing both components, VITS prevents error propagation common in cascaded models and balances speech quality and training efficiency.
NaturalSpeech [22]: End-to-End Text-to-Speech Synthesis with Human-Level Quality - Developed by Microsoft, NaturalSpeech aims to generate human-level quality speech using a fully end-to-end approach, synthesizing speech waveforms directly from character or phoneme sequences. Eliminating the need for intermediate steps such as text analysis, acoustic models, and vocoders, NaturalSpeech streamlines the synthesis process, leading to more accurate and efficient results. This advanced TTS model has the potential to revolutionize the way we generate human-like speech from text, opening up new possibilities for various applications.
These fully end-to-end TTS models exemplify the ongoing advancements in speech synthesis technology, showcasing the potential for a more natural and efficient generation of human-like speech in the future.
Suitable features for synthesis can be identified automatically with deep learning: neural networks can learn appropriate representations starting from the character level of the input text. In end-to-end speech synthesis, the network accepts a string of characters as input, processes it through its hidden layers, and generates an output audio signal. Several architectures have been proposed to build such end-to-end synthesis networks. The advantages of these methods include a lightweight processing pipeline, ease of adaptation to new data, and increased robustness compared to systems with multiple interconnected modules, since errors do not accumulate across components. However, direct text-to-waveform conversion is challenging because the same text can correspond to multiple pronunciations or speaking styles, so training data should cover as many signal-level variations of a given input text as possible.
a) Char2Wav speech synthesis system
The Char2Wav speech synthesis system, developed by Sotelo et al. [23], aims to build an end-to-end system trained with a character string as input. Char2Wav consists of two components: a "reader", which is an encoder-decoder model with an attention mechanism, and a neural vocoder (Figure 1.8). The encoder is a bidirectional recurrent neural network that takes text or phoneme strings as input, and the decoder is an attention-based recurrent neural network. The decoder's outputs are intermediate feature representations fed to a SampleRNN neural vocoder; this intermediate representation consists of the vocoder features defined by the WORLD vocoder. The SampleRNN network is extended to accept previously generated audio samples together with the decoder's feature frames as input, and the system's final output is a sequence of raw waveform samples. Consequently, Char2Wav generates audio directly from text. However, it still relies on WORLD vocoder features, and the sequence-to-sequence and SampleRNN models require separate pre-training.
<small>Figure 1.8 Char2Wav model [23] </small>
b) Tacotron synthesis system
Tacotron is an end-to-end speech synthesis system that takes text input directly [24]. The input consists of a character string, while the output is the spectrogram of the signal. The system model undergoes training from scratch using <text, speech> pairs. Moreover, Tacotron generates frame-level speech, which is faster than the sample-level speech generation methods mentioned previously.
<small>Figure 1.9 Model of the Tacotron synthesis system [24] </small>
Figure 1.9 depicts the system's architecture, which includes an encoder based on the CBHG module (a bank of 1-D convolutional filters, highway networks, and a bidirectional gated recurrent unit) and an attention-based decoder. Each part combines several architectural components. The decoder's output is passed through another CBHG module (the post-processing network) to predict spectral magnitudes on a linear frequency scale, and the Griffin-Lim algorithm is then used to reconstruct the output audio from this spectrogram. This approach does not require the hand-crafted input features of earlier TTS systems such as WaveNet; technical details are presented in [24]. However, a magnitude spectrogram represents speech without carrying phase information, so the Griffin-Lim algorithm is used to estimate the phase before applying the inverse short-time Fourier transform. The audio quality obtained this way is still limited, and the Tacotron authors themselves noted that the spectrogram-to-waveform conversion stage needed further improvement.
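The Griffin-Lim step described above can be illustrated concretely with librosa; in the sketch below the magnitude spectrogram is computed from a recorded waveform rather than predicted by a model, and the STFT parameters are illustrative assumptions rather than Tacotron's exact configuration.

    import librosa
    import numpy as np
    import soundfile as sf

    y, sr = librosa.load("utterance.wav", sr=22050)

    # Magnitude spectrogram (the kind of representation a Tacotron-style model predicts);
    # the phase is discarded, which is exactly what Griffin-Lim has to estimate.
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

    # Iteratively estimate a phase consistent with the magnitudes, then invert.
    y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256)

    sf.write("reconstructed.wav", y_hat, sr)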
By the end of 2017, Tacotron had been developed into version 2 [25], which overcomes the weakness of the spectrogram-to-waveform conversion stage. The system consists of (1) a sequence-to-sequence recurrent feature prediction network that maps character sequences to mel-spectrogram representations, and (2) a modified WaveNet model that generates waveform samples from these spectrogram representations. Using a spectrogram instead of the traditional WaveNet inputs, such as linguistic features, duration, and F0, significantly reduces the size of the WaveNet network. The architecture of the system is depicted in Figure 1.10.
<small>Figure 1.10 Block diagram of the Tacotron 2 system architecture [25] </small>
Tacotron 2 is trained directly on normalized character sequences and the corresponding acoustic waveforms, and it is reported to produce synthetic voices of natural, human-like quality. It is also reported to handle out-of-domain and complex words, learn pronunciation from sentence semantics, pronounce misspelled input words reasonably well, and capture sentence-level intonation.
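As an illustration of this two-stage design (text to mel-spectrogram, then mel-spectrogram to waveform), the sketch below uses the pretrained Tacotron 2 bundle shipped with recent torchaudio releases; the availability of this bundle is an assumption about the installed torchaudio version, and its vocoder is WaveRNN rather than the modified WaveNet used in [25].

    import torch
    import torchaudio

    # Pretrained character-level Tacotron 2 plus a WaveRNN vocoder
    # (assumes a recent torchaudio release that provides this bundle).
    bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
    processor = bundle.get_text_processor()     # text -> symbol IDs
    tacotron2 = bundle.get_tacotron2()          # symbols -> mel-spectrogram
    vocoder = bundle.get_vocoder()              # mel-spectrogram -> waveform

    text = "Text to speech synthesis with a two stage pipeline."
    with torch.inference_mode():
        tokens, lengths = processor(text)
        mel, mel_lengths, _ = tacotron2.infer(tokens, lengths)
        waveforms, wave_lengths = vocoder(mel, mel_lengths)

    torchaudio.save("tacotron2_output.wav", waveforms[0:1].cpu(), vocoder.sample_rate)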
<b>1.2. Speech synthesis for low-resourced languages </b>
The development of interactive systems for under-resourced languages [26] is hampered by the scarcity of data and the limited amount of research in this area. The SLTU-CCURL workshops and SIGUL meetings aim to gather researchers working on speech and NLP for these languages to exchange ideas and experiences. These events foster innovation and encourage cross-disciplinary collaboration between fields such as computer science, linguistics, and anthropology. Their focus is on promoting spoken language technologies for low-resourced languages, covering topics such as speech recognition, text-to-speech synthesis, and dialogue systems. By bringing together academic and industry researchers, these meetings help address the challenges of under-resourced language processing.
Many investigations of low-resourced languages have been conducted recently using a variety of methods, including exploiting speaker characteristics [27], modifying phonemic features [28], [29], and cross-lingual text-to-speech [30], [31]. Yuan-Jui Chen et al. introduced end-to-end TTS with cross-lingual transfer learning [32]. Because a model trained on the source language cannot be applied directly to the target language due to the mismatch between their input symbol spaces, the authors proposed learning a mapping between source and target linguistic symbols. Through this learned mapping, pronunciation information can be
kept throughout the transfer process. Sahar Jamal et al. [33] used transfer learning to cope with the low-resourced scenario: knowledge from a pre-trained model is reused when training on a significantly smaller collection of Urdu data. The authors built both standalone Urdu systems and transfer-learning systems that use pre-trained English and Arabic Tacotron models as parent models. Marlene Staib et al. [34] improved on or matched the performance of several baselines, including a resource-intensive expert phoneme-mapping technique, by replacing Tacotron 2's character input with a manageably small set of IPA-inspired features. This architecture also enables the automatic approximation of sounds not seen during training, and the authors demonstrated that a model trained on one language can produce intelligible speech in a target language even in the absence of acoustic training data for it. A similar transfer-learning approach is used in [35], where a high-resource English source model is fine-tuned with either 15 minutes or 4 hours of transcribed German data. Data augmentation is another approach that researchers apply to the low-resourced language challenge [36]–[38]. For example, an innovative three-step methodology has been developed for building expressive-style voices from as little as 15 minutes of recorded target data, avoiding the costly collection of large amounts of target recordings. First, Goeric Huybrechts et al. [36] augment the data using recordings of other speakers whose speaking styles match the desired one. Next, they train a TTS model on the available recordings together with the synthetic data. Finally, the model is fine-tuned to improve quality.
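The fine-tuning recipe shared by these transfer-learning studies can be sketched as follows; the checkpoint name, layer prefixes, and hyperparameters are hypothetical placeholders for illustration only and do not describe the configuration of any of the cited systems.

    import torch

    # Hypothetical placeholder for a Tacotron-like acoustic model saved as a full
    # object; in practice this would be the source-language (parent) system.
    model = torch.load("parent_english_tts.pt")

    # Optionally freeze early text-encoder layers so the small target-language
    # corpus mainly adapts the decoder and attention to the new language.
    for name, param in model.named_parameters():
        if name.startswith("encoder."):
            param.requires_grad = False

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )

    # target_loader is assumed to yield (phoneme_ids, mel_targets) pairs from
    # the small target-language corpus (e.g., a few hours or less).
    def fine_tune(model, target_loader, n_steps=10_000):
        model.train()
        step = 0
        for phonemes, mels in target_loader:
            loss = model(phonemes, mels)     # assumed to return the training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= n_steps:
                break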
Muthukumar and his colleagues developed a technique for automatically deriving phonetic representations for unwritten languages [39]. Synthesis may be improved by switching to a representation closer to the spoken language than to a written form.
The main challenges to address when developing TTS for under-resourced languages are: (1) synthesizing speech for languages that have a writing system but limited data; and (2) synthesizing speech for languages without a writing system, using input text or speech from another language. Key research directions for tackling these challenges, such as the adaptation and polyglot approaches, will be discussed in detail in the following sections.
<b>1.2.1. TTS using emulating input approach </b>
The rationale behind this approach is to leverage an existing TTS system for a base language (BL) to emulate TTS for an unsupported target language (TL). This strategy aims to assist speakers of unsupported languages in situations where communicating in another language is inconvenient, such as when new immigrants visit a doctor. While TTS can play a role in translating doctor-patient conversations, text-based communication is also essential in healthcare; TTS therefore becomes necessary for users with limited English proficiency or limited literacy in their native language, enabling them to access and understand vital information [40].
The first emulating idea was proposed by Evans et al. [41], who developed the approach for a screen reader. They describe a method that enables text-to-speech synthesizers for new languages to be produced for assistive applications. The method employs a straightforward rule-based text-to-phoneme step, and the resulting phonemes are passed to a phoneme-to-speech system built for another language. They demonstrate that the correspondence between the language to be synthesized and the language on which the phoneme-to-speech system is based is crucial for the perceived quality of the speech, but not necessarily for its comprehensibility. They report experiments in Greek, noting that the same method can be applied with equal success to Albanian, Czech, Welsh, and other languages.
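A toy sketch of this emulation pipeline is given below; both the target-language letter-to-phoneme rules and the phoneme substitution table are invented for illustration and are not taken from Evans et al. [41].

    # Toy emulation of a target language (TL) on a base-language (BL) synthesizer.
    # Step 1: simple rule-based letter-to-phoneme conversion for the TL.
    TL_LETTER_TO_PHONE = {          # hypothetical TL rules
        "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
        "k": "k", "t": "t", "n": "n", "s": "s", "m": "m",
    }

    # Step 2: map each TL phoneme to the closest phoneme the BL system supports.
    TL_TO_BL_PHONE = {              # hypothetical nearest-phoneme substitutions
        "a": "AA", "e": "EH", "i": "IY", "o": "OW", "u": "UW",
        "k": "K", "t": "T", "n": "N", "s": "S", "m": "M",
    }

    def emulate(word: str) -> list[str]:
        tl_phones = [TL_LETTER_TO_PHONE[ch] for ch in word if ch in TL_LETTER_TO_PHONE]
        return [TL_TO_BL_PHONE[p] for p in tl_phones]

    # The resulting BL phoneme string would be sent to the BL phoneme-to-speech backend.
    print(emulate("matuk"))   # ['M', 'AA', 'T', 'UW', 'K']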
Three primary challenges arise when simulating a target language (TL) with a base language (BL). First, BL phonemes that closely resemble the TL phonemes must be chosen. Second, discrepancies in the text-to-phoneme mapping must be minimized. Lastly, a BL whose linguistic features closely align with those of the TL must be selected. These three challenges can lead to different approaches, and the balance ultimately achieved is heavily influenced by the choice of BL [40].
In the study by Evans et al. [41], the evaluation process differs from conventional TTS assessment, and this distinction is important because it highlights the tailored evaluation needed for TTS in under-resourced languages. Whereas the conventional MOS (Mean Opinion Score) is a subjective method for gauging the overall quality of a speech synthesis system, the MRT (Modified Rhyme Test) focuses on the intelligibility of the synthesized speech, which makes it better suited to low-resource settings where clarity and usability matter most. The study used nonsense words and simple sentence structures as test cases to evaluate the Greek TTS system. This choice is deliberate: for under-resourced languages, it is crucial that the TTS system produces clear and understandable speech even for unusual or uncommon linguistic material, and these "fake" test cases allow the evaluation to probe the system's performance and robustness in challenging situations.
Harold Somers and his colleagues proposed an "emulating" approach for developing TTS systems in under-resourced languages, as explored in their publications [40] and [42]. They aimed to create a TTS system for Somali, an under-resourced language, by leveraging an existing TTS system for a well-resourced language. The researchers also discussed various experimental designs for assessing TTS systems developed with this approach, emphasizing the importance of evaluating speech quality, intelligibility, and usefulness. This method utilizes existing resources from well-resourced languages and shows potential for developing TTS systems for under-resourced languages. By investigating different experimental designs and evaluation methods, researchers can better understand the challenges, opportunities, and limitations of the approach.
The advantages and disadvantages of the "emulating" approach for low-resourced languages, as well as its applicability, can be summarized as follows:
Advantages:
Resource efficiency: By leveraging existing TTS systems for rich-resourced languages, the need for extensive data collection and development effort can be reduced.
Faster development: Utilizing existing resources accelerates the development process for TTS systems in low-resourced languages.
Cross-disciplinary collaboration: The "emulating" approach fosters collaboration among researchers in various fields, such as computer science, linguistics, and anthropology.
Disadvantages:
Speech quality: Synthesized speech quality may be compromised due to the mismatch between the base and target languages.
</div><span class="text_page_counter">Trang 38</span><div class="page_container" data-page="38"> Intelligibility: Depending on the similarity between the base and target languages, the intelligibility of the generated speech might be limited. Customizability: The "emulating" approach might not be suitable for
every low-resourced language, especially if there is no closely-related rich-resourced language to use as a base.
Applicability:
Languages with similar phonetic or linguistic characteristics: The "emulating" approach is most applicable when the target low-resourced language shares phonetic or linguistic features with a well-resourced language.
Situations requiring rapid TTS system development: In cases where a TTS system is urgently needed for a low-resourced language, the "emulating" approach can provide a quicker solution than traditional methods.
Initial system development: The "emulating" approach can serve as a starting point for developing a more refined TTS system for low-resourced languages, allowing researchers to identify specific challenges and opportunities for improvement.
In summary, the "emulating" approach presents a promising direction for developing TTS systems for low-resourced languages. However, its success depends on selecting a suitable base language and overcoming the limitations inherent in this method.
<b>1.2.2. TTS using the polyglot approach </b>
Polyglot TTS and multilingual TTS are often used interchangeably, but they can have slightly different meanings depending on the context:
Polyglot TTS: A single TTS model is trained to handle multiple languages simultaneously in the polyglot approach. The model can synthesize speech in various languages using the same architecture and shared parameters. The polyglot approach aims to leverage commonalities among languages and transfer knowledge from rich-resourced languages to low-resourced languages. This approach can be more resource-efficient and scalable compared to building separate TTS models for each language.
Multilingual TTS: Multilingual TTS is a broader term that refers to any TTS system capable of handling multiple languages, regardless of the specific architecture or method used. A multilingual TTS system can include separate TTS models for each language or use a shared model like in the polyglot approach. The main goal of multilingual TTS systems is to support speech synthesis in various languages.
In summary, polyglot TTS is a specific approach to building multilingual TTS systems where a single model is used for multiple languages. On the other hand, multilingual TTS is a more general term that encompasses any TTS system capable of handling multiple languages, whether it uses separate models for each language or a shared model as in the polyglot approach.
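One common way to realize the shared-model polyglot idea in neural systems is to condition a single acoustic model on language and speaker embeddings. The sketch below is a schematic illustration with invented dimensions, not the architecture of any specific system cited in this section.

    import torch
    import torch.nn as nn

    class PolyglotEncoderInput(nn.Module):
        """Builds the input sequence of a shared polyglot acoustic model:
        phoneme embeddings concatenated with language and speaker embeddings."""
        def __init__(self, n_phonemes=200, n_languages=4, n_speakers=50,
                     phone_dim=256, lang_dim=16, spk_dim=64):
            super().__init__()
            self.phone_emb = nn.Embedding(n_phonemes, phone_dim)
            self.lang_emb = nn.Embedding(n_languages, lang_dim)
            self.spk_emb = nn.Embedding(n_speakers, spk_dim)

        def forward(self, phoneme_ids, lang_id, spk_id):
            # phoneme_ids: (batch, time); lang_id, spk_id: (batch,)
            phones = self.phone_emb(phoneme_ids)                   # (B, T, 256)
            lang = self.lang_emb(lang_id).unsqueeze(1)             # (B, 1, 16)
            spk = self.spk_emb(spk_id).unsqueeze(1)                # (B, 1, 64)
            time = phones.size(1)
            return torch.cat(
                [phones, lang.expand(-1, time, -1), spk.expand(-1, time, -1)],
                dim=-1,
            )                                                      # (B, T, 336)

    inputs = PolyglotEncoderInput()(torch.randint(0, 200, (2, 30)),
                                    torch.tensor([0, 2]), torch.tensor([5, 17]))
    print(inputs.shape)   # torch.Size([2, 30, 336])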
Below are a few notable examples of this approach. The first treats one language as the primary language for building cross-linguistic polyglot TTS, as researched by Samsudin [43], [44]; any system built on this framework can synthesize different languages using the same collection of recorded or trained voices. The second is synthesizing speech from mixed-language text, as described by H. Romsdorfer et al. [45]. This technique is advantageous when multiple languages appear within a single text, such as when foreign-language inclusions occur. In such situations, swapping corpora (the datasets used to train TTS systems) for each language in the text would be impractical. Polyglot speech synthesis resolves this issue by enabling a TTS system to synthesize speech from multilingual text seamlessly and coherently. It relies on text analysis and language identification to discern the different languages present in the text and to select the appropriate synthesis resources for each of them, allowing the system to produce coherent and natural-sounding speech even when the text mixes languages. Overall, polyglot speech synthesis is a promising approach for addressing the challenges of mixed-language text and can potentially enhance the quality and effectiveness of TTS systems for low-resourced languages.
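A deliberately simplified sketch of the language-identification-and-routing step for mixed-language text is given below; the script-based detector and the token-level granularity are illustrative assumptions and are far cruder than the linguistic analysis used in the cited work.

    import re

    def detect_language(token: str) -> str:
        """Toy per-token language identification based on character ranges.
        Real polyglot systems use much richer morphological and syntactic analysis."""
        if re.search(r"[\u4e00-\u9fff]", token):
            return "zh"                      # CJK ideographs -> Mandarin
        if re.search(r"[àảãáạăằẳẵắặâầẩẫấậđèẻẽéẹêềểễếệìỉĩíịòỏõóọôồổỗốộơờởỡớợùủũúụưừửữứựỳỷỹýỵ]",
                     token.lower()):
            return "vi"                      # Vietnamese diacritics
        return "en"                          # default/base language

    def route(text: str):
        # Group consecutive tokens of the same language into synthesis chunks, each
        # of which would be passed to the corresponding (shared or separate) voice.
        chunks, current_lang, current = [], None, []
        for token in text.split():
            lang = detect_language(token)
            if lang != current_lang and current:
                chunks.append((current_lang, " ".join(current)))
                current = []
            current_lang = lang
            current.append(token)
        if current:
            chunks.append((current_lang, " ".join(current)))
        return chunks

    print(route("please call bác sĩ Nguyễn tomorrow"))
    # [('en', 'please call'), ('vi', 'bác sĩ Nguyễn'), ('en', 'tomorrow')]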
To describe the polyglot TTS approach in more detail, the architecture is shown in Figure 1.11. In this figure, a speaker-adaptive polyglot synthesizer has two phases: training and synthesis. Section (1) shows the cross-language case, in which the speaker-independent (SI) model is
<small>Figure 1.11 Scheme of an HMM-based polyglot synthesizer [48] </small>
adapted to a speaker of one of the languages included in the training data, and text in any of these languages is synthesized. Sections (2) and (3) show the adaptation and synthesis of extrinsic languages using phone mapping: (1) basic scheme of an HMM-based polyglot synthesizer, (2) adaptation to speakers of extrinsic languages, and (3) synthesis of extrinsic languages. During training, speech corpora in the target languages are analyzed and their spectral features are modeled with hidden Markov models (HMMs). The system first builds speaker-independent HMMs and then applies speaker adaptation to improve the consistency of the synthesized speech quality. The second part of the diagram shows how the system applies phoneme mapping when the target language is not present in the training data. This architecture requires voice recordings and the participation of native speakers to create the language materials.
Recently, with the rapid progress of neural technology, the approach to creating polyglot TTS systems has changed significantly, yielding better results. This is evident in the development of voice-cloning-based polyglot NTTS systems. A common technique involves training multilingual neural text-to-speech (NTTS) models using only monolingual datasets. When training these models, it is crucial to understand how the composition of the training corpus affects the quality of multilingual speech synthesis. For example, given the close relationship between Spanish and Italian, a typical question is, "Would adding more Spanish data improve my Italian synthesis?" Ziyao Zhang et al. [46] carried out an extensive ablation study to determine how training corpus characteristics such as language family affiliation, gender composition, and the number of speakers affect the quality of polyglot synthesis. Their findings include the observation that female speaker data is preferable in most cases, and that having more speakers of the target language variety in the training corpus is not always advantageous. These insights are informative for data acquisition and corpus development.
In summary, polyglot TTS systems have shown great potential in addressing the challenges of synthesizing speech for multilingual and mixed-language texts. These systems use a single model or a shared architecture and parameters across multiple languages to exploit commonalities and facilitate knowledge transfer from rich-resourced to low-resourced languages. This approach proves to be more resource-efficient and scalable than creating separate TTS models for each language.
Research by Samsudin [43], [44], H. Romsdorfer et al. [45], and Ziyao Zhang [46] demonstrates that polyglot TTS systems can generate coherent and natural-sounding speech, even when dealing with mixed-language texts. Furthermore, the development of voice-cloning-based polyglot NTTS systems and their use of monolingual datasets showcase the potential of neural technology to enhance the quality and effectiveness of TTS systems for low-resourced languages.
Polyglot TTS systems offer several advantages:
Resource efficiency and scalability, resulting from a shared architecture and parameters.
The ability to exploit similarities between languages and transfer knowledge from rich-resourced to low-resourced languages.
Seamless and coherent handling of mixed-language texts.
However, polyglot TTS systems also have some drawbacks: