
HA MINH DUC

SUPPORTING VOICE COMMUNICATION IN CHATBOT

Major: COMPUTER SCIENCE
Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, January 2024


THIS THESIS IS COMPLETED AT

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Le Thanh Van, Ph.D.

Examiner 1: Ton Long Phuoc, Ph.D.

Examiner 2: Vo Dang Khoa, Ph.D.

This master's thesis was defended at Ho Chi Minh City University of Technology (HCMUT) – VNU-HCM on January 23, 2024.

Master’s Thesis Committee:

1. Assoc. Prof. Tran Van Hoai, Ph.D. – Chairman
2. Ton Long Phuoc, Ph.D. – Examiner 1
3. Vo Dang Khoa, Ph.D. – Examiner 2
4. Le Thanh Van, Ph.D. – Commissioner
5. Assoc. Prof. Tran Ngoc Thinh, Ph.D. – Secretary

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

Independence – Freedom – Happiness

THE TASK SHEET OF MASTER'S THESIS

Full name: Ha Minh Duc
Student code: 2270348
Date of birth: 20/03/1985
Place of birth: Kien Giang
Major: Computer Science
Major code: 8480101

I. THESIS TITLE: Supporting Voice Communication in Chatbot
(Hỗ trợ giao tiếp bằng giọng nói trong phần mềm chatbot)

II. TASKS AND CONTENTS:

• Task 1: Research and experimentation for Sequence-to-Sequence model development.

The primary objective is to create a powerful Sequence-to-Sequence model tailored for chatbot applications. Sequence-to-Sequence is a neural network architecture known for its success in natural language processing tasks, and this task focuses on exploring, experimenting with, and optimizing the model to improve its performance in the context of chatbot interactions.

• Task 2: Research and experimentation for Automatic Speech Recognition model development.

During this phase, the primary emphasis is on thorough research and experimentation with diverse methods to build high-performance automatic speech recognition models. Exploring various techniques is essential to achieving precise conversion from audio to text. The objective is to pinpoint the most effective model that aligns with the project's requirements.

• Task 3: Sequence-to-Sequence model and Automatic Speech Recognition evaluation and future work.

After developing the Sequence-to-Sequence and Automatic Speech Recognition models, a comprehensive evaluation process will be conducted. The achieved results will be analyzed in detail using appropriate metrics and techniques to assess accuracy. The strengths and weaknesses of each Sequence-to-Sequence model will be identified and assessed meticulously. Based on this analysis, recommendations for future work will be provided, addressing potential improvements and further developments in Automatic Speech Recognition technology.

III. THESIS START DATE: February 6, 2023

IV. THESIS COMPLETION DATE: December 10, 2023

V. SUPERVISOR: Le Thanh Van, Ph.D.

Ho Chi Minh City, January 22, 2024

SUPERVISOR
(Full name and signature)

CHAIR OF PROGRAM COMMITTEE
(Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
(Full name and signature)


ACKNOWLEDGEMENT

First of all, I would like to express my appreciation for the dedicated guidance and support of my lecturers during my work. They not only discussed my plans enthusiastically but also suggested many great ideas that made my thesis much more fruitful and interesting. I am especially grateful to Dr. Le Thanh Van, who was always willing to assist me in any way he could during my thesis.

Furthermore, I would like to acknowledge Ho Chi Minh City University of Technology for the engagement and valuable learning experiences it has given me and my classmates in Vietnam. Thanks to HCMUT's personalized programs, I was able to have an ideal Master of Computer Science course despite my busy working schedule.

Next, I am eternally grateful for my friends Pham Thanh Huu, Nguyen Thi Ty, Vo Thi Kim Nguyet, Le Duc Huy, Nguyen Tan Sang, and Pham Dien Khoa. They not only went through the course with me but also shared life and work tips that are meaningful to a person like me. I am lucky to have them in my life.

Lastly, words cannot describe how thankful I am for the IMP Academic Team's kind support. Without their companionship, I could not have completed my Master of Computer Science course.

Sincerely,
Ha Minh Duc

Ho Chi Minh City, January 2024


ABSTRACT

This master's thesis delves into the improvement of voice-based communication in healthcare chatbots through the integration of cutting-edge natural language processing and automatic speech recognition technologies. The research centers on leveraging a GPT-3-based sequence-to-sequence architecture to enhance natural language understanding and generation. Additionally, it incorporates the Wav2vec 2.0 model to provide robust Automatic Speech Recognition capabilities. The GPT-3 architecture is chosen for its adeptness at comprehending medical contexts, generating contextually relevant responses, and handling dynamic healthcare-related conversational flows. The integration of Wav2vec 2.0 ensures precise and context-aware transcription of voice inputs, improving the accuracy with which healthcare-related information is captured. This research contributes to the field of healthcare technology by presenting a novel approach to improving patient engagement and satisfaction through voice interactions. The combination of GPT-3 and Wav2vec 2.0 not only strengthens the chatbot's ability to understand and generate natural language responses but also extends this proficiency to healthcare-focused voice interactions, thereby widening the applicability and accessibility of chatbot systems with respect to speech recognition.


TÓM TẮT LUẬN VĂN THẠC SĨ

Luận văn thạc sĩ này tập trung vào việc cải thiện giao tiếp dựa trên giọng nói trong chatbot chăm sóc sức khỏe thông qua sự kết hợp của các công nghệ xử lý ngôn ngữ tự nhiên và nhận dạng giọng nói tự động tiên tiến. Nghiên cứu tập trung vào việc sử dụng kiến trúc dựa trên GPT-3 cho quá trình nâng cao hiểu biết và tạo ra ngôn ngữ tự nhiên. Ngoài ra, nó kết hợp mô hình Wav2vec 2.0 sáng tạo để cung cấp khả năng nhận dạng giọng nói tự động mạnh mẽ. Kiến trúc GPT-3 được chọn vì khả năng hiểu biết về ngữ cảnh y tế, tạo ra các phản ứng liên quan đến ngữ cảnh và xử lý các luồng trò chuyện y tế động. Sự tích hợp của Wav2vec 2.0 đảm bảo việc chuyển đổi chính xác và nhận thức ngữ cảnh của đầu vào giọng nói, từ đó nâng cao độ chính xác của việc thu thập thông tin liên quan đến sức khỏe. Nghiên cứu này đóng góp cho lĩnh vực công nghệ chăm sóc sức khỏe bằng cách trình bày một cách tiếp cận mới để cải thiện sự tương tác và sự hài lòng của bệnh nhân thông qua giao tiếp giọng nói. Sự kết hợp giữa GPT-3 và Wav2vec 2.0 không chỉ củng cố khả năng của chatbot trong việc hiểu và tạo ra phản ứng tự nhiên bằng ngôn ngữ, mà còn mở rộng khả năng áp dụng và tiếp cận của hệ thống chatbot trong phương diện nhận dạng tiếng nói.


DECLARATION OF AUTHORSHIP

I hereby declare that this thesis was carried out by myself under the guidance and supervision of Le Thanh Van, Ph.D.; that the work and the results contained in it are genuinely my own; and that it does not violate research ethics. The data and figures presented in this thesis for analysis, comments, and evaluation were gathered from various sources through my own work and are duly acknowledged in the reference section.

In addition, comments, reviews, and data used from other authors and organizations have been acknowledged and explicitly cited.

I take full responsibility for any fraud detected in my thesis. Ho Chi Minh City University of Technology (HCMUT) – VNU-HCM is not responsible for any copyright infringement arising from my work (if any).

Ho Chi Minh City, January 2024
Author

Ha Minh Duc


TABLE OF CONTENTS

CHAPTER 1. THESIS INTRODUCTION
1.1. Overview
1.2. Research Problem
1.3. Target of the Thesis
1.4. Scope of the Thesis
1.5. Contribution
1.6. Thesis Structure

CHAPTER 2. BACKGROUND
2.1. Hidden Markov Model (HMM)
2.2. Deep Neural Networks
2.3. Artificial Neural Networks
2.4. Convolutional Neural Network
2.5. Multilayer Perceptron
2.5.1. Backpropagation
2.5.2. The Vanishing Gradient Problem
2.6. Regularization
2.6.1. Dropout
2.6.2. Weight Decay
2.7. Recurrent Neural Networks
2.8. Long Short-Term Memory
2.8.1. The Long-Term Dependency Problem
2.9. GRU
2.10. Word Embedding Model
2.11. The History of Chatbots
2.12. Using Luong's Attention for Sequence 2 Sequence Model
2.12.1. Sequence to Sequence Model
2.12.2. Encoder
2.12.3. Decoder
2.13. Automatic Speech Recognition
2.13.1. Speech Recognition Using Recurrent Neural Networks
2.13.2. Speech-to-Text Using Deep Learning

CHAPTER 3. PROPOSED MODEL

LIST OF FIGURES

Figure 1.1. The History of IVR [17]
Figure 2.1. HMM-based phone model [1]
Figure 2.2. A Deep Neural Network
Figure 2.3. Neuron Anatomy [53]
Figure 2.4. A Simple Example of the Structure of a Neural Network
Figure 2.5. The McCulloch-Pitts Neuron [4]
Figure 2.6. The Architecture of CNN [57]
Figure 2.7. Sigmoid Function and its Derivative [43]
Figure 2.8. Underfitting, Optimal and Overfitting
Figure 2.9. Underfitting, Optimal Weight Decay and Overfitting
Figure 2.10. The Recurrent Neural Network [5]
Figure 2.11. LSTM Network Architecture [6]
Figure 2.12. RNN and Short-Term Dependencies [7]
Figure 2.13. RNN and Long-Term Dependencies [7]
Figure 2.14. The Repeating Modules in an RNN Contain One Layer [7]
Figure 2.15. The Repeating Modules of an LSTM Contain Four Layers [7]
Figure 2.16. The Architecture of GRU [9]
Figure 2.17. The CBOW Model with One Input [14]
Figure 2.18. The Skip-gram Model [14]
Figure 2.19. ELIZA – The First Chatbot in the World, Built at MIT by Joseph Weizenbaum
Figure 2.20. The Conversation between ELIZA and PARRY [19]
Figure 2.21. Seq2Seq Model with GRUs [15]
Figure 2.22. Bidirectional GRU [18]
Figure 2.23. Attention Mechanism in Seq2Seq Model [20]
Figure 2.24. Luong's Global Attention [21]
Figure 2.25. Basic Voicebot Architecture [56]
Figure 2.26. Graph Showing How the Loss Function Changes Depending on the Size of the Trained Network [50]
Figure 2.27. Graph Showing How the Loss Function Changes Depending on the Size of the Training Set [50]
Figure 3.1. Taxonomy of Sequence to Sequence Models
Figure 3.2. The Transformer and GPT Architecture [22][23]
Figure 3.3. Input Embeddings
Figure 3.4. Multi-headed Attention
Figure 3.5. Dot Product of Query and Key
Figure 3.6. Scaling Down the Attention Scores
Figure 3.7. SoftMax of the Scaled Scores
Figure 3.8. Multiply SoftMax Output with Value Vector
Figure 3.9. Computing Multi-headed Attention
Figure 3.10. Multi-headed Attention Output
Figure 3.11. Residual Connection of the Input and Output
Figure 3.12. Decoder First Multi-Headed Attention
Figure 3.13. Adding Mask to Scaled Matrix
Figure 3.14. Applying SoftMax Function to Attention Score
Figure 3.15. The Process Flow of Multi-headed Attention
Figure 3.16. Final Stage of Transformer's Decoder
Figure 3.17. GPT's Architecture
Figure 3.18. Transformer Architecture and Training Objectives [24]
Figure 3.19. Taxonomy of Speech Recognition
Figure 3.20. Wav2vec 2.0 Architecture
Figure 3.21. Wav2vec 2.0 Latent Feature Encoder
Figure 3.22. Wav2vec 2.0 Quantization Module
Figure 3.23. Wav2vec 2.0 Context Network (Transformer Encoder)
Figure 3.24. Wav2vec 2.0 Contrastive Loss
Figure 4.1. Example of Datasets for Seq2Seq Model
Figure 4.2. Example of Datasets for ASR Model

LIST OF TABLES

Table 2.1. Example of Co-occurrence Matrix
Table 4.1. Intent Recognition Results
Table 4.2. Entity Recognition Results
Table 4.3. Chat Handoff and Fallback
Table 4.4. Conversation Log with Chatbot
Table 4.5. Result of Scenarios
Table 4.6. Result of Wav2vec 2.0 on Vietnamese Audio Files

ACRONYMS

CHAPTER 1. THESIS INTRODUCTION

1.1. Overview

The idea of human-computer interaction through natural language has long been imagined in Hollywood movies. C-3PO is one of the legends of the Rebel Alliance in the world of the Star Wars movies. This robot has served several generations of Skywalkers and is one of the most distinctive personality droids in the universe. Throughout the series, C-3PO not only shows gestures and communication very similar to humans', but sometimes gives great advice to its owner. It is a cinematic creation that was ahead of its era in predicting the future of Artificial Intelligence (AI). The fictional Star Wars universe is set in a galaxy where humans and alien creatures live in harmony with droids, robots capable of assisting people in daily life or on journeys across other planets. In the movie Iron Man (2008), Tony Stark used his supercomputer assistant, JARVIS, for support and help in everyday life and on trips to save the world with other superheroes. Particularly worth mentioning is the film A.I. Artificial Intelligence (2001), adapted from the short story "Supertoys Last All Summer Long", which depicts the 22nd century, when rising sea levels have washed away coastal cities and caused a serious decline in population density. The Mecha robot line simulates real people and is designed to integrate with humans; these robots possess the ability to think well but do not know how to express emotions.

The history of Interactive Voice Response (IVR) systems began in the 1930s, when the Voder machine was created. The technology was the first to analyze the English language and produce human-like sounds. The original speech recognition systems were rudimentary, understanding only numerals, because engineers thought human language was too complex. In 1952, Bell Laboratories designed "Audrey", a system for recognizing digits from a given voice. Ten years later, at the World's Fair, IBM demonstrated a "Shoebox" system that could recognize 16 different English words. The vision behind these projects was that users could communicate with computers through natural language and therefore would not have to learn any specific language or prompts. However, it turns out to be quite complicated to understand spoken language. It can be argued that only entities (humans) living in the real world can effectively understand language: without context, the meaning of a word is incomprehensible.

IVRs became widely used in business in the 1990s, and call queuing and automated call routing became popular in the mid-1990s. In the late 1990s, the move of multimedia into call centers led companies to invest in IVR systems with computer-telephony integration (CTI). This integration allows businesses to tie their call centers into their marketing campaigns, and continuous improvements made IVR systems cheaper for companies to deploy.

Contemporary platforms emerged in the 2010s, with a clear emphasis on integrating IVR with comprehensive analytics, automated SMS messaging, and advanced call monitoring. Modern IVR systems are now part of a larger solution and enable seamless integration of customer communications across channels. Unlike bulky and expensive standalone systems, these all-inclusive platforms offer options, giving customers the opportunity to choose their preferred method of contact.

Today, IVR is integrated into the overall customer experience. It comes with a personalized brand voice, protects customer data, and detects fraud and spam. In addition to routing to the department best placed to address customer needs, the tool is now integrated into marketing efforts. The self-service model has evolved significantly with the arrival of conversational IVR. These AI-enabled technologies replicate the experience of talking to a live agent, and today's IVR systems provide solutions to customers faster, even without a direct operator connection. IVR is useful across many industries and use cases: it can help manage hotel reservations, pay bills, conduct market research, buy tickets, and present information about products and services. These functions meet real market needs. Zendesk's research shows that 69% of people try to solve their own problems before contacting customer service. However, businesses must ensure that they implement IVR self-service best practices to increase customer satisfaction: a poorly designed automated system can hurt a business, especially if it wastes consumers' time without solving their problems.


Figure 1.1. The History of IVR [17]

The basic problem with most IVR systems is that, no matter how useful they may be in providing every possible option for a wide range of customer queries and issues, most customers really just want to talk to a person. Nothing beats direct human-to-human communication. This may seem counterintuitive when an IVR system is used to reduce staffing costs, but adding the option to talk to an advisor on the main menu, without forcing customers to search the verbal maze of menu options, makes most customers more satisfied and less frustrated. Automatic speech recognition (ASR) is a classic feature of many IVR systems and allows users to communicate with the IVR system by voice instead of pressing phone keys or typing on a laptop, which can be difficult for users. But if the ASR is incapable of recognizing what humans are saying, it makes the system frustrating and, worse, useless. Therefore, ensuring accuracy, as well as optimizing the design of interfaces around ASR, is essential to meeting the high demands of users.

1.2. Research Problem

There are several challenges associated with the current implementation of a voice-to-text chatbot. Firstly, there is the quality and coherence of generative chatbots. They generate responses based on statistical patterns learned from large text datasets. While they can produce more diverse and flexible responses than retrieval-based models, the quality and coherence of their output can vary widely: they may generate nonsensical or contextually inappropriate responses, especially when faced with input they have not encountered during training.

Secondly, there is the accuracy of ASR models. ASR is a newer feature of many IVR systems; it allows customers to communicate with the IVR system by voice instead of clicking on phone or laptop keyboards. The accuracy of an ASR system needs to be high if it is to create any value, yet achieving a high level of accuracy is a challenge. According to a recent survey [39], 73% of respondents cited accuracy as the biggest obstacle to adopting speech recognition technology. Before diving into the barriers to accuracy, it is worth noting that the Word Error Rate (WER) is the metric most commonly used to measure the accuracy and performance of speech recognition systems.
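WER is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, and N is the number of reference words. A minimal sketch of the computation, not tied to any particular toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(wer("cách điều trị hen suyễn", "cách điều trị hen suyển"))  # 1 substitution / 5 words = 0.2
```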

Lastly, ASR models often exhibit domain-specific performance. This implies that if a model is trained exclusively on a particular dataset or within a specific domain (such as healthcare, finance, or tourism), it may struggle to recognize and process inputs beyond that training domain. Consequently, this limitation can result in recognition errors and diminished accuracy when the model is confronted with new or unfamiliar data.

1.3. Target of the Thesis

The objective of this master's thesis is to develop and train an intelligent chatbot using freely available data sources from online forums, FAQs, and videos on YouTube. The specific goals include:


• Data Collection and Processing: Research and develop effective methods for collecting dialogue data from free online sources. This involves identifying appropriate data sources, then filtering and cleaning the data to ensure quality and reliability.

• Analysis and Modeling: Analyze the characteristics of dialogue data, such as structure, context, and linguistic diversity. Develop suitable machine learning or deep learning models for training the chatbot, focusing on researching and constructing algorithms for the chatbot using deep learning methods and large language models in a sequence-to-sequence format.

• Chatbot Training: Apply advanced techniques in artificial intelligence and machine learning to train the chatbot to understand and respond accurately and naturally.

• Evaluation and Improvement: Evaluate the performance of the chatbot through testing methods and user feedback. Use the evaluation results for continuous improvement of the chatbot model.

• Practical Application: Explore the potential application of chatbots in the healthcare sector, emphasizing the integration of ASR. Evaluate how this integration impacts user access and interaction to understand the changes in healthcare service delivery.

1.4. Scope of the Thesis

The scope of this master's thesis encompasses several key areas. Firstly, the primary focus of this thesis is on the healthcare sector, utilizing datasets gathered from FAQs on hospital websites in Vietnam. Despite the potential applicability of the methodologies and technologies in other fields, the primary emphasis remains on healthcare. This approach ensures specialized attention to the unique requirements of the healthcare industry but limits the immediate applicability to other domains.

Secondly, in terms of methodology, the thesis employs an advanced Sequence-to-Sequence (Seq2Seq) model, integrating deep learning techniques such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). This model is structured with an encoder Recurrent Neural Network (RNN), which processes the input sequence, and a decoder RNN, responsible for generating the output sequence. Additionally, Luong's attention mechanism is employed to enhance the model's ability to focus on relevant parts of the input while generating each part of the output. To streamline the training process, a greedy decoding strategy is utilized, where the most probable next output is selected at each step. While this combination is effective for complex dialogue modeling, it poses challenges in terms of technical complexity and computational resources.
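To illustrate the greedy decoding strategy just described, the sketch below selects the most probable token at each step until an end-of-sequence token appears. The `decoder_step` function, the special-token ids, and the toy vocabulary are hypothetical stand-ins, not the thesis's trained decoder; only the selection loop is the point.

```python
import numpy as np

SOS, EOS, MAX_LEN = 0, 1, 20  # hypothetical special-token ids and length cap

def decoder_step(token: int, state: np.ndarray):
    """Stand-in for one decoder step: returns (vocab probabilities, next state).
    A real system would run the decoder RNN with attention here."""
    rng = np.random.default_rng(token + int(state.sum()) % 1000)
    logits = rng.normal(size=32)          # toy vocabulary of 32 tokens
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, state + 1               # toy state update

def greedy_decode(init_state: np.ndarray) -> list[int]:
    tokens, state, tok = [], init_state, SOS
    for _ in range(MAX_LEN):
        probs, state = decoder_step(tok, state)
        tok = int(np.argmax(probs))       # greedy: pick the most probable token
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens

print(greedy_decode(np.zeros(4)))
```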

Thirdly, the foundational model used is the Generative Pre-trained Transformer 3 (GPT-3), based on the Transformer architecture. This choice represents the cutting edge in AI technology but restricts flexibility in adapting to new methods and technologies.

Lastly, the performance of the text-to-text model is evaluated using a combination of F1-score, precision, recall, and Bilingual Evaluation Understudy (BLEU), while the speech-to-text model's performance is evaluated using Word Error Rate (WER) and Perplexity (PER), providing a comprehensive and widely accepted set of metrics. However, relying solely on these may not fully capture all performance aspects in practical scenarios, as each metric emphasizes a different aspect of model performance.
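As an illustration of one of these metrics, BLEU can be computed with the NLTK library (assuming the `nltk` package is installed; the sentences are illustrative, not from the thesis dataset):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["cách", "điều", "trị", "hen", "suyễn", "tại", "nhà"]]
candidate = ["cách", "điều", "trị", "hen", "suyễn"]

# Smoothing avoids zero scores when higher-order n-grams are absent.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```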

1.5. Contribution

The contributions of this master's thesis are threefold. Firstly, it compares the performance of traditional Seq2Seq models combined with Luong's attention mechanism against the decoder-only approach of GPT-3, customized for the Vietnamese language. This comparative analysis provides insights into the strengths and weaknesses of each model architecture in the context of structured dialogue processing and language understanding, particularly in the healthcare domain.

Secondly, it enhances the accuracy of the ASR model by selecting and evaluating the experimental configuration with the best results on the dataset. This contribution aims to address the challenges of speech recognition in noisy environments and varying acoustic conditions, thereby improving the overall performance and usability of the ASR technology.


Lastly, it integrates the two models, text-to-text and speech-to-text, to develop a chatbot supporting voice interactions in Vietnamese in the healthcare domain. This integration expands the chatbot's functionality to accommodate users who prefer or require voice-based interactions, thereby enhancing accessibility and user experience in healthcare services.

1.6. Thesis Structure

The thesis "Supporting Voice Communication in Chatbot" includes five chapters with the following main content:

<small></small> <b>Chapter 1, INTRODUCTION: presents an overview of the topic, reasons </b>

for carrying out the research, and the practical significance of the problem, as well as the scope and limitations of the topic. Finally, the tasks and structure of the thesis are described.

<small></small> <b>Chapter 2, BACKGROUND: synthesizes the most relevant academic issues </b>

to be applied to solve the problem, focusing mainly on the content of deep learning, the basic of HMM, from Artificial Neural Network to Recurrent Neural Network, LSTM, GRU in Seq2Seq model and ASR model. This chapter also provides a general overview of related research that has been and is being conducted, as well as the current general trends in solving the problem (Luong’s attention mechanisms, the theory of encoders and decoders, and particularly the GPT-3’s architecture). This section also brings discussions and evaluations for the methods as they form an important basis for the student's research during the thesis process.

<small></small> <b>Chapter 3, PROPOSED MODEL: introduces the proposed model for </b>

Chatbot. At the same time, it presents improvements and motivations for those proposals. Finally, the student presents the steps to conduct experiments on the data set and evaluates the results of the improvements compared to the chosen model.

<small></small> <b>Chapter 4, IMPLEMENTATION: involves selection, training, evaluation, </b>

and integration of models to develop a robust and effective voice-to-text chatbot tailored for the Vietnamese language and healthcare domain.


<small></small> <b>Chapter 5, CONCLUSION: synthesizes the results achieved during the </b>

thesis process from the research and hypothesis construction to the experimental deployment. This section also discusses the limitations and outstanding issues, and finally proposes solutions for future improvements. The Table of Contents, List of Figures, List of Tables and Acronyms are provided at the beginning of the thesis. The references will be presented at the end of the thesis.


CHAPTER 2. BACKGROUND

In this chapter, we explore the fundamental theory needed to implement both the Seq2Seq model and the ASR model. The discussion spans key topics, including the integration of LSTM and GRU with word embedding models such as Word2Vec and Global Vectors (GloVe), as applied in the Seq2Seq model. For the ASR model, we cover content ranging from the Hidden Markov Model to Deep Neural Networks (DNN), Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), the multilayer perceptron, the vanishing gradient problem, and various regularization techniques. This exploration lays out the theoretical foundations essential for effectively deploying these models in practical applications.

2.1. Hidden Markov Model (HMM)

The HMM is a statistical model. In speech recognition, the HMM provides a statistical representation of the sound of words [1]; the architecture of an HMM-based phone model is given in Figure 2.1. The HMM consists of a sequence of states. The current state is hidden, and only the output from each state can be observed; each state corresponds to a frame of the audio input. The model parameters estimated in acoustic training are θ = [{a_ij}, {b_j(·)}], where {a_ij} are the transition probabilities and {b_j(·)} the output observation distributions. The transition probability a_ij is the probability of moving from state i to state j.


Figure 2.1. HMM-based phone model [1]

An important feature of the HMM is that the self-loops a_ii make it possible for the HMM to model the varying duration of a phone. When a transition is made into a new state, a feature vector is generated from the distribution associated with that particular state. The first and last states of the HMM are called non-emitting states; for example, in Figure 2.1, s_1 is the entry state and s_5 is the exit state. They serve as the entrance and exit of the model and simplify the concatenation of HMM phone models to form words.
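To make the role of the transition probabilities {a_ij} and output distributions {b_j(·)} concrete, the sketch below runs the forward recursion of a toy discrete HMM; all numbers are illustrative rather than trained values:

```python
import numpy as np

A = np.array([[0.6, 0.4],      # a_ij: transition probabilities (2 states);
              [0.3, 0.7]])     # the diagonal self-loops a_ii model duration
B = np.array([[0.8, 0.2],      # b_j(o): P(observation | state)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])      # initial state distribution
obs = [0, 1, 1]                # observed symbol indices (stand-ins for frames)

# Forward algorithm: alpha[j] = P(o_1..o_t, state_t = j)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
print("P(observation sequence) =", alpha.sum())
```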

2.2. Deep Neural Networks

An alternative to Gaussian mixture models in speech recognition is the DNN [2]. A DNN is a feed-forward artificial neural network with more than one hidden layer between the input layer and the output layer, as illustrated in Figure 2.2. Weights are attached to the connections between nodes, and the output of every node is calculated by an activation function. Typically, the input to a node is accumulated from the layer below:

$$x_j = b_j + \sum_i y_i \, w_{ij}$$

where b_j is the bias of unit j, i is an index over units in the layer below, and w_ij is the weight on the connection to unit j from unit i in the layer below. The output passed to the upper layer is then calculated as

$$y_j = f(x_j)$$

where f is the activation function.

The hidden layers enable the DNN to model non-linear and complex relationships in the data. For multiclass classification, output unit j converts its total input x_j into a probability using a SoftMax function [2]. In Kaldi, the posterior probabilities for the HMM states are estimated as

$$y_{ut}(s) \triangleq P(s \mid o_{ut}) = \frac{\exp\{a_{ut}(s)\}}{\sum_{s'} \exp\{a_{ut}(s')\}}$$

where o_ut denotes the observation at time t in utterance u and a_ut(s) is the activation at the output layer corresponding to state s [3].
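A small numerical sketch of this SoftMax posterior, with toy activations rather than outputs of a trained network:

```python
import numpy as np

def softmax(a: np.ndarray) -> np.ndarray:
    e = np.exp(a - a.max())        # subtract the max for numerical stability
    return e / e.sum()

a_ut = np.array([2.0, 0.5, -1.0])  # output-layer activations for 3 HMM states
y_ut = softmax(a_ut)               # posterior P(s | o_ut)
print(y_ut, y_ut.sum())            # probabilities sum to 1
```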


The goal of training is to optimize an objective function, updating the weights of internal nodes based on the error information propagated through the model. An important training parameter is the learning rate: the greater the learning rate, the faster but less accurate the training.

Figure 2.2. A Deep Neural Network

2.3. Artificial Neural Networks

Basically, an ANN is a computational model built on the structure and function of biological neural networks (although the structure of an ANN is shaped by the flow of information through it). The network therefore changes depending on its inputs and outputs. We can think of an ANN as a nonlinear statistical model: it defines a complex relationship between input and output, and as a result many different patterns can be represented.

ANNs take their inspiration from how the human brain works: making the right connections. ANNs therefore emulate neurons and dendrites using silicon and wires. The human brain is composed of roughly 86 billion neurons, each connected to thousands of other cells through axons. Because humans receive many different inputs from the senses, the body has many dendrites to help transmit this information; they generate electrical impulses that move information through the neural network. The same goes for an ANN: when different problems need to be dealt with, one neuron sends a message to another neuron.

Figure 2.3. Neuron Anatomy [53]

Therefore, we can say that an ANN consists of many internal nodes that mimic the biological neurons of the human brain. The network connects these neurons by links, and they interact with each other. Nodes take in input data, and the operations performed on the data at each node are simple; after these operations, the results are passed on to other neurons. The output at each node is called its activation value or node value. Every link in the network is associated with a weight, and ANNs have the ability to learn, which takes place by changing the weight values. Figure 2.4 is an illustration of a simple ANN.


Figure 2.4. A Simple Example of the Structure of a Neural Network

A classic but simple type of node in Neural Networks (NN) is the McCulloch-Pitts node [4]. An illustration of this node, or neuron as McCulloch and Pitts liked to call it, can be seen in Figure 2.5. The calculation performed in the McCulloch-Pitts node is essentially a threshold function: the inputs are summed, and if the sum is above a certain threshold the node outputs 1, otherwise it outputs 0. There are also much more complex node types, but the McCulloch-Pitts neuron is a good starting point for understanding the basics of NN.

Figure 2.5. The McCulloch-Pitts Neuron [4]

Along with the nodes, a neural network is also formed by so-called "edges", which carry the network's weights. What they do is essentially multiply the output of a node by their weight before transferring it to the next neuron they are connected to. By updating these weights, depending on the output generated for an example, researchers can teach the network to distinguish which input data should produce which output; a minimal sketch of a weighted neuron is given below. The procedure for updating the weights is called backpropagation and is described in the next section.
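A minimal sketch of a McCulloch-Pitts-style neuron with weighted edges (the weights and threshold here are illustrative):

```python
def mcp_neuron(inputs, weights, threshold=1.0):
    """Weighted sum of inputs; fire (1) if the sum reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# An AND-like gate: both inputs must be active for the neuron to fire.
print(mcp_neuron([1, 1], [0.6, 0.6]))  # 1 (0.6 + 0.6 >= 1.0)
print(mcp_neuron([1, 0], [0.6, 0.6]))  # 0 (0.6 < 1.0)
```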

2.4. Convolutional Neural Network

The CNN is one of the most advanced deep learning models. CNNs allow researchers to build intelligent systems with extremely high accuracy. Figure 2.6 shows a basic architectural form of the CNN.


Figure 2.6. The architecture of CNN [57]

A convolution is a sliding-window operation on a matrix. Convolutional layers have parameters that are learned, adjusting themselves to retrieve the most accurate information without manual feature selection. Convolution is the element-wise multiplication of matrix entries; the sliding window, also known as the kernel, filter, or feature detector, is a small matrix.

A CNN compares images piece by piece, and these pieces are called features. Instead of matching whole images against each other, a CNN finds similarity by searching for raw features of two images that match each other. Each feature can be considered a mini-image, i.e., a small 2-dimensional array. Features correspond to particular aspects of the image, and they can fit together. The following are the basic layers of a CNN.

• Convolutional layer. This is the most important layer of the CNN and is tasked with doing all the computation. The important elements of a convolutional layer are stride, padding, filter maps, and feature maps:

- Stride is the number of pixels by which the filter map shifts as it moves across the input from left to right.

- Padding consists of zero values added around the input.

- The CNN applies filters to regions of the image. These filter maps are 3-dimensional matrices of numbers, and those numbers are parameters.

- The feature map represents the result of each filter map's scan over the input; a computation is performed after each scan.


• Rectified Linear Unit (ReLU) layer. The ReLU layer applies an activation function, which simulates the firing rate of a neuron along its axon. Activation functions include ReLU, Leaky ReLU, tanh, sigmoid, maxout, and others. The ReLU function is currently the most widely and commonly used for training neural networks, because it brings outstanding advantages such as faster computation. When using ReLU, attention must be paid to tuning the learning rate and tracking dead units. ReLU layers are used after the filter map has been computed, applying the ReLU function to the values of the filter map.

• Pooling layer. When the input is too large, pooling layers are placed between the convolutional layers to reduce the number of parameters. There are two main types of pooling: max pooling and average pooling.

• Fully connected layer. This layer is responsible for producing the result after the convolutional and pooling layers have received and transformed the image. At this point the model has read the information of the image, and to link the features together and produce the output we use fully connected layers. The fully connected layer flattens the image representation into an undivided vector; this is somewhat like votes being cast and evaluated to select the highest-quality image.

2.5. Multilayer Perceptron

2.5.1. Backpropagation

The name backpropagation comes from the term used by Rosenblatt (1962) for attempts to generalize the perceptron learning algorithm to several layers, even if none of the many attempts to do so in the 1960s and 1970s was particularly successful [40]. The backpropagation algorithm is one of the most important tools of artificial neural networks: it is the part that handles the training of the network, i.e., where the network learns. During this process, the network updates the weights of all edges so that it produces the correct output for a particular input.


Stochastic Gradient Descent (SGD) is used to train the NN by iteratively reducing the cost function. The loss function primarily follows the principle of maximum likelihood: the goal is to reduce the cross-entropy error between the target output and the predicted output. The gradient update equation is

$$w \leftarrow w - \eta \, \nabla_w L(w)$$

where η is the learning rate. In the equations below, the activation is represented by φ, and the error contributed by the j-th neuron of the l-th layer is denoted by the term δ_j^l. According to the chain rule, δ_j^l can be written as

$$\delta_j^{l} = \varphi'(z_j^{l}) \sum_k w_{kj}^{l+1} \, \delta_k^{l+1}$$

Hence the errors of the hidden-layer neurons can be obtained, and the gradients for the model weights are calculated as

$$\frac{\partial L}{\partial w_{ji}^{l}} = \delta_j^{l} \, y_i^{l-1}$$

where y_i^{l-1} is the activation of neuron i in the previous layer.
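A compact sketch of one SGD/backpropagation step for a network with a single hidden layer and sigmoid activations, following the equations above (the sizes, data, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))            # input vector
t = np.array([[1.0], [0.0]])           # target output
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass
z1 = W1 @ x; y1 = sigmoid(z1)
z2 = W2 @ y1; y2 = sigmoid(z2)

# Backward pass: sigmoid output with cross-entropy loss gives delta = y - t,
# then the hidden-layer error follows from the chain rule
delta2 = y2 - t
delta1 = (W2.T @ delta2) * y1 * (1 - y1)

# Gradient update with learning rate eta
eta = 0.1
W2 -= eta * delta2 @ y1.T
W1 -= eta * delta1 @ x.T
```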

</div><span class="text_page_counter">Trang 32</span><div class="page_container" data-page="32">

2.5.2. The Vanishing Gradient Problem

The problem lies in the many layers that use certain activation functions, such as the sigmoid function. This function squashes a large input space into the small range from 0 to 1, so a significant change in the input produces only a small change in the output; thus, its derivative becomes small.

Figure 2.7. Sigmoid Function and its Derivative [43]

This is caused by the chain rule used in backpropagation, which calculates gradients by moving from the last layer back to the first. According to the chain rule, each successive derivative is multiplied into the computed value. Therefore, when there are n hidden layers using the sigmoid as activation function, n small derivatives are multiplied together, and the gradient value decreases exponentially as the backpropagation algorithm progresses toward the early layers [43].

The following are solutions that avoid this problem (a small numerical illustration of the shrinking gradients follows the list).

• The first solution is to use another activation function, such as ReLU, which does not cause a small derivative.

• The next solution is residual networks (ResNets). They provide residual connections straight to later layers, effectively bypassing the activation functions. This preserves larger derivative values and makes it possible to train much deeper networks [44].

• The last one is batch normalization. The batch normalization method normalizes the input to a predefined scale, keeping it in the region where the sigmoid derivative is not small.
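A two-line numerical illustration: the sigmoid's derivative is at most 0.25, so the product of n such derivatives shrinks exponentially with depth:

```python
sigmoid_deriv_max = 0.25               # sigma'(x) = sigma(x)(1 - sigma(x)) peaks at 0.25
for n_layers in (5, 10, 20):
    print(n_layers, sigmoid_deriv_max ** n_layers)
# 5 -> ~9.8e-04, 10 -> ~9.5e-07, 20 -> ~9.1e-13: gradients vanish exponentially
```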

2.6. Regularization

It is important to consider the problem of overfitting when building NNs or other machine learning algorithms. Overfitting is when the model begins to learn features that are too specific to the training set. Basically, the model learns not only the general rules that lead from input to output, but also additional rules that describe the training set yet are not necessarily valid in general. This process reduces the training error but increases the evaluation error; as a result, the model performs worse on unseen data because of the overly specific rules it has learned from the training set. If overfitting occurs when the model fits the training set too closely, the opposite phenomenon, when the model learns rules that are too general, is called underfitting. An illustration of these phenomena can be found in Figure 2.8.

Figure 2.8. Underfitting, Optimal and Overfitting

2.6.1. Dropout

Dropout is the skipping of units (i.e., network nodes) in a random way during training. An omitted unit is not considered in the forward and backward passes. Accordingly, p is the probability of retaining a network node in each training phase, so the probability of it being eliminated is (1 − p). This method helps to avoid overfitting [45]: if a fully connected layer has too many parameters and holds most of the network's parameters, its nodes become too interdependent during training, which limits the power of each node and leads to excessive co-adaptation.
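A minimal sketch of (inverted) dropout with retention probability p:

```python
import numpy as np

def dropout(activations: np.ndarray, p: float = 0.8) -> np.ndarray:
    """Keep each unit with probability p; scale so the expected value is unchanged."""
    mask = np.random.default_rng().random(activations.shape) < p
    return activations * mask / p      # inverted dropout: nothing changes at test time

print(dropout(np.ones(10), p=0.8))
```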

2.6.2. Weight Decay

L2 regularization, or weight decay, is a regularization technique applied to the weights of a neural network. The technique minimizes a loss function that combines the original loss with a penalty on the weight norm. Weight decay is thus simply an addition to the network's loss function and can be described by the following equation:

$$L_{new}(w) = L_{original}(w) + \lambda \, w^{T} w$$

where λ is a value that determines the strength of the penalty and L(w) is the chosen loss function. If λ is very small, weight decay will not help regularize the network. In contrast, if λ is too large, the penalty term dominates, and the network will mainly aim to keep its weights near zero. This effect can be seen in Figure 2.9.
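In gradient terms, the penalty simply shrinks the weights a little at every update, as in this sketch (λ and η are illustrative):

```python
import numpy as np

def sgd_step_weight_decay(w, grad, eta=0.01, lam=1e-4):
    """Gradient of L_new = L_original + lambda * w^T w adds 2*lambda*w."""
    return w - eta * (grad + 2 * lam * w)

w = np.array([1.0, -2.0, 3.0])
print(sgd_step_weight_decay(w, grad=np.zeros(3)))  # weights decay toward zero
```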

Figure 2.9. Underfitting, Optimal Weight Decay and Overfitting

2.7. Recurrent Neural Networks

RNNs [12] have revolutionized the field of customer service by enabling the creation of chatbots that can engage in more natural and effective dialogues. Unlike traditional NNs that process inputs in isolation, RNNs are designed to recognize and remember patterns over sequences of words, making them ideal for parsing customer queries and maintaining the context of a conversation. This sequential memory allows chatbots to provide more accurate and contextually relevant responses, improving the customer experience. RNNs can be trained on vast datasets of customer interactions, allowing them to understand a wide range of requests and issues. However, they do face limitations with longer sequences, where they may struggle to maintain context over extensive dialogues.

Figure 2.10. The Recurrent Neural Network [5]

The image shows an RNN and how it unfolds through time. At the heart of an RNN is a loop that allows information to persist; in the diagram, the loop is unrolled to show the entire sequence of operations over time. It works in the following steps:

<small></small> <i><b>Input (x): At each time step t, the RNN takes an input x</b><small>t</small></i> and the previous

<i>hidden state s<small>t</small></i>−1.

<small></small> <i><b>Hidden State (s): The hidden state s</b><small>t</small> at time t is updated by applying a weight matrix U to the input x<small>t</small> and another weight matrix W to the previous hidden state s<small>t</small></i>−1. The function of the hidden state is to capture and carry forward information through time.

<small></small> <i><b>Output (o): The output o</b><small>t</small> at time t is then computed using the current hidden state s<small>t</small> and a weight matrix V. In many applications, this output is then passed </i>

through a function, such as SoftMax, to make decisions or predictions based on the learned patterns.

<small></small> <i><b>Weights: There are three sets of weights: U for input to hidden, W for hidden </b></i>

<i>to hidden, which is the recurrent connection, and V for hidden to output. These </i>

weights are shared across all time steps, which allows the RNN to generalize across sequences of varying lengths.


RNNs are powerful because they can, in theory, use their internal state (memory) to process input sequences of any length. However, they can be difficult to train due to issues like vanishing and exploding gradients, which can occur during backpropagation through the many layers of the network. As a result, RNNs face limitations with longer sequences, where they may struggle to maintain context over extensive dialogues. This is often addressed by using advanced RNN architectures like LSTM or GRU, which are better at capturing long-term dependencies and can maintain context over longer conversations, a critical aspect of customer-service interactions.

2.8. Long Short-Term Memory

LSTM networks, a specialized form of RNNs, are crafted to capture long-term dependencies within data sequences effectively. The architecture of an LSTM features a cell state, akin to a conveyor belt, which traverses the length of the network with minimal linear interaction, ensuring the preservation of information over time. This cell state is modulated by a series of gates: the forget gate uses a sigmoid function to decide which information the cell state should discard; the input gate decides which values to update and integrates new candidate values created by a tanh layer; and the output gate determines the next hidden state by filtering the cell state through a tanh layer and applying the sigmoid function's output. The hidden state, responsible for predictions, is updated with information from the cell state, giving the LSTM the capability to keep sequential data relevant over long periods. This quality is particularly beneficial for language modeling, where understanding context from extensive data sequences is paramount. These cells have three gates that control the flow of information, as below [6] (a standard formulation of the gate equations is given after the list):

• Input gate: decides the degree to which new input should affect the memory.

• Forget gate: determines what portions of the existing memory should be forgotten.

• Output gate: selects the parts of the memory to be used in the output.
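For reference, a standard formulation of these gates (σ is the sigmoid, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] denotes concatenation; this is the common textbook form rather than a thesis-specific variant):

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) &&\text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) &&\text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) &&\text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{(cell-state update)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) &&\text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) &&\text{(hidden state)}
\end{aligned}
$$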


Figure 2.11. LSTM Network Architecture [6]

2.8.1. The Long-Term Dependency Problem

A key feature of RNNs is the idea of using preceding information to make predictions for the present, similar to how one uses previous scenes in a movie to understand the current scene. If RNNs could do this effectively, they would be incredibly useful; whether they can depends on the specific case. Sometimes, revisiting only the most recently obtained information is sufficient to understand the current situation. For example, in the sentence "Cách điều trị hen suyễn" ("How to treat asthma"), once one has read "Cách điều trị hen", it is enough to predict that the next word will be "suyễn". In this scenario, the distance to the information needed to make the prediction is short, so RNNs are entirely capable of learning it.

Figure 2.12. RNN and Short-Term Dependencies [7]

But in many situations, we need more context to infer. For example, consider predicting the last word in the phrase "Tôi bị bệnh hen suyễn… hen suyễn dị ứng, từ phấn hoa" ("I have asthma… allergic asthma, from pollen"). Clearly, the recent information ("hen suyễn dị ứng, từ phấn hoa") tells us that what follows will be the name of some disease, but it is impossible to know exactly which disease; to understand it, we need additional context from much earlier in the sequence. Obviously, the distance to the needed information may already be quite large, and unfortunately, as that distance grows, RNNs start to struggle with remembering and learning.

Theoretically, RNNs are fully capable of handling "long-term dependencies", meaning that current information can be derived from a long sequence of previous information. In practice, however, RNNs lack this capability. This issue has been highlighted by Hochreiter and Bengio, among others, as a fundamental challenge for the RNN. LSTM networks were designed precisely to avoid this problem: retaining memories over long periods, without any external intervention, is practically their default behavior.

All recurrent networks have the form of a chain of repeating neural network modules. In standard RNNs, these repeating modules have a very simple structure, typically a single tanh layer.


Figure 2.14. The Repeating Modules in an RNN Contain One Layer [7]

Similarly, the LSTM also has a chain architecture, but the modules within it have a different structure from those of standard RNNs: instead of a single neural network layer, there are four layers that interact in a very special way.

Figure 2.15. The Repeating Modules of an LSTM Contain Four Layers [7]

2.9. GRU

The GRU is a newer generation of RNN and is quite similar to the LSTM [8]. GRUs are designed to solve the vanishing gradient problem that can occur in standard RNNs. They do this by using gating mechanisms to control the flow of information. The architecture consists of two gates: a reset gate and an update gate (a standard formulation of both is given after the list).

• Reset gate: determines how much of the past information needs to be forgotten. It can be thought of as a way to decide how much past information to discard, which helps the model drop irrelevant information from the past.

• Update gate: decides how much of the past information will carry over to the current state. It is akin to a combination of the forget and input gates in an LSTM, allowing the model to determine how much of the past information should influence the current state.
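For reference, a standard formulation of the two gates (same notation as the LSTM equations above; bias terms omitted for brevity):

$$
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t]) &&\text{(update gate)} \\
r_t &= \sigma(W_r [h_{t-1}, x_t]) &&\text{(reset gate)} \\
\tilde{h}_t &= \tanh(W [r_t \odot h_{t-1}, x_t]) &&\text{(candidate hidden state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t &&\text{(final hidden state)}
\end{aligned}
$$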

Figure 2.16. The Architecture of GRU [9]

During its operation, the GRU first takes in the input and the previous hidden state to inform its gates. The reset gate uses this information to decide which parts of the past data should be forgotten, while the update gate determines how much of the previous hidden state should be carried forward. The old information is blended with the new input to form a candidate hidden state, which is then combined with the old state, modulated by the update gate, to produce the final hidden state for the current time step.

This structure allows GRUs to keep relevant backpropagation error signals alive, making them capable of learning over many time steps, which is particularly useful for tasks that require an understanding of long-term dependencies, such as language modeling and time-series analysis. GRUs offer a simpler and more computationally efficient alternative to LSTMs while providing similar benefits.

2.10. Word Embedding Model

Word Embedding is a general term for language models and feature-learning methods in Natural Language Processing (NLP) in which words or phrases are mapped to numerical vectors (usually of real numbers). This tool plays a crucial role in most NLP tasks.

</div>

×