

<span class="text_page_counter">Page 1</span><div class="page_container" data-page="1">

<b>UNIVERSITY OF ENGINEERING AND TECHNOLOGY </b>

<b>AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM </b>

</div><span class="text_page_counter">Page 2</span><div class="page_container" data-page="2">

<b>UNIVERSITY OF ENGINEERING AND TECHNOLOGY Major: Computer Science </b>

<b>Supervisors: Assoc. Prof. Tran Trong Hieu, MSc. Can Duy Cat</b>

<b>AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM </b>

</div><span class="text_page_counter">Page 3</span><div class="page_container" data-page="3">

Automatic question answering (QA) systems help consumers quickly address everyday questions. During the COVID-19 pandemic, healthcare has been one of the topics users care about most. In an era of information explosion, distilling helpful information from QA system responses takes time. Multi-answer summarization is studied to address this problem. A model for this task takes the customer’s question and all answers as input and returns a summary. Such summaries have been shown to aid information absorption.

This thesis focuses on the extractive summarization problem and presents ontology-based improvements to the baseline multi-answer summarization model in the consumer health question answering system. The work has two main sub-tasks: ontology construction and building an extractive multi-answer summarization model. The ontology construction task focuses on building an ontology, which is leveraged to extend biological knowledge such as related terms, chemicals, diseases, and symptoms. Additionally, WordNet is used to enhance common-sense knowledge. In the summarization phase, several sentence scoring methods are proposed that make use of the expanded keywords. The improved model outperforms the baseline by a large margin. As a result, the proposed model outperforms current state-of-the-art systems, reaching 0.511 ROUGE-2 F1. An application is also built that creates a question-answering summarization model from the websites of five of the world’s leading independent biotechnology companies in Japan.

Keywords: multi-answer summarization, extractive summarization, query-based summarization, ontology construction, ROUGE.


</div><span class="text_page_counter">Page 4</span><div class="page_container" data-page="4">

I want to thank my supervisors, Assoc. Prof. Tran Trong Hieu and MSc. Can Duy Cat. They always offered insightful comments on both my work and this thesis. Their dedication gave me more motivation to complete the thesis as well as I could.

Furthermore, I am very thankful to Dr. Le Hoang Quynh and the members of the Data Science and Knowledge Technology Laboratory at the VNU University of Engineering and Technology. We had many discussion meetings, and their comments will help me improve myself and become more mature in the future.

Finally, my deep thanks go to my family, relatives, and friends, who are always with me during the most challenging times and always encourage me in life and at work.

Although I have done my best to complete this thesis, it undoubtedly contains minor errors, and I would sincerely welcome the understanding and guidance of my teachers and professors.

</div><span class="text_page_counter">Page 5</span><div class="page_container" data-page="5">

I declare that this thesis has been composed by myself and that the work has not been submitted for any other degree or professional qualification. I confirm that the work submitted is my own, except where work which has formed part of jointly-authored publications has been included. My contribution and those of the other authors to this work have been explicitly indicated below. I confirm that appropriate credit has been given within this thesis where reference has been made to the work of others.

I certify that, to the best of my knowledge, my thesis does not infringe upon anyone’s copyright nor violate any proprietary rights, and that any ideas, techniques, quotations, or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with standard referencing practices.

I take full responsibility for these commitments and accept all prescribed disciplinary actions. I declare that this thesis has not been submitted for a higher degree to any other University or Institution.

Nguyen Quoc An

</div><span class="text_page_counter">Page 6</span><div class="page_container" data-page="6">

1.3 Difficulties and Challenges

1.4 Contributions of the thesis

3.1.2 Single-answer extractive summarization

3.1.3 Multi-answer extractive summarization

</div><span class="text_page_counter">Page 7</span><div class="page_container" data-page="7">

3.2.4 Independence Ontology Construction

3.2.5 Ontologies Integration

3.2.6 Ontology Population

3.3 Apply Ontology-based Improvements to Summarization model

3.3.1 Baseline Model Improvements

3.3.2 Question’s Keyword Expanding

3.3.3 Customised scoring methods

4 Experiments and Results

4.1 Implementation and Configurations

4.2 Dataset and Evaluation methods

4.2.1 Metrics and Evaluation

</div><span class="text_page_counter">Page 8</span><div class="page_container" data-page="8">

List of Figures

1.1 The evolution of MEDLINE citations between 1986 and 2019

1.2 Typical tasks / competitions in the field of natural language processing for biomedical data

1.3 Classification of Text Summarization Approaches

1.4 Multi-Answer Summarization pipeline

2.1 Summarization approaches

3.1 Summarization baseline model

3.2 Overview of proposed ontology construction

3.3 CTD disease-chemical relations

3.4 Proposed summarization model overview

3.5 Ontology expanding method

3.6 WordNet expanding method

4.1 The statistic of nodes and terms in three independent ontologies

4.2 The statistic of nodes and terms in three integrated ontology

4.3 The reduction of ROUGE-2 F1 per each scoring method when replacing the proposed weighted score with the before version

4.4 Ablation test results for various components

</div><span class="text_page_counter">Page 9</span><div class="page_container" data-page="9">

List of Tables

1.1 The result summary example responses to a question in medical question and answer system (MEDIQA)

3.1 MeSH’s topic category list

4.1 Configurations and parameters of proposed model

4.2 The statistics of extract summary in datasets

4.3 The statistic of relations and terms in ontology population

4.4 Comparison model’s results of the MEDIQA 2021 Task 2 - Extractive Summarization

4.5 Examples of some errors in test set

4.6 Five biotechnology companies’ websites in Japan

</div><span class="text_page_counter">Page 10</span><div class="page_container" data-page="10">

Chapter 1 Introduction

This chapter presents the motivation for and the urgency of the thesis topic in section 1.1. The summarization problem and the query-based summarization problem are then discussed in section 1.2.

Many experts and leaders have identified data as an invaluable asset in the era of information explosion. For example, Clive Humby, a British mathematician and entrepreneur in the field of data science, said “Data is the new oil”. Indeed, exploiting data effectively brings great value. Biomedical text mining is a topic of increasing interest in the research community. For example, the expansion of MEDLINE<small>1</small>, one of the largest and best-known biomedical online databases in the world, is depicted in Figure 1.1 [20]. The number of citations grew from 1 million in 1970 to 13.5 million in 2005, and then roughly doubled in 14 years to 26.2 million in 2019.

However, in this age of information overload, the sheer volume of data has made it difficult for humans to absorb. In that context, automatic question answering systems have been built. For example, a question answering system can provide information about treatments for common symptoms of COVID-19 from reliable data, which allows users to handle infection situations more scientifically and easily.

<small>1 The US National Library of Medicine’s biomedical database</small>

</div><span class="text_page_counter">Page 11</span><div class="page_container" data-page="11">

Figure 1.1: The evolution of MEDLINE citations between 1986 and 2019.

The vertical axis represents the number of citations (in millions). For a clearer representation, the statistics from before 2005 are shown at 5-year intervals.

Nowadays, several automatic health question answering systems are available, such as PubMed<small>2</small>, CHiQA<small>3</small>, and Google<small>4</small>. Although the answers returned by these search engines have been filtered, independent answers from different sources still overlap. For instance, for the question “How long has SARS-CoV-2 existed?”, PubMed provides about 1000 long answers, and Google returns 5,070,000,000 responses.<small>5</small>

The idea is to use a summary engine to condense all the responses into a short paragraph. The summary answer gathers all of the necessary information and eliminates any duplicates. Users can therefore read one paragraph instead of a massive number of documents. This thesis focuses on the summarization model in the health question answering system. However, question answering and summarization are the two most demanding tasks for biomedical text (according to experts, Figure 1.2 [6]).

Realizing the potential of biomedical summarization, a number of competitions have been launched in recent years to support research and development in this field. The BioNLP workshop series, which is co-hosted by the ACL SIGBIOMED

</div><span class="text_page_counter">Page 12</span><div class="page_container" data-page="12">

Figure 1.2: Typical tasks / competitions in the field of natural language processing for biomedical data

specialized research community, has grown into an exceptional yearly event for researchers to present their research ideas in the field of natural language processing for biological and medical data (bioNLP). In 2021, the BioNLP workshop hosted the shared task MEDIQA 2021: Summarization in the Medical Domain<small>6</small>, consisting of three separate tasks. My team chose the Multi-Answer Summarization task, which resembles the summary engine in a question answering system. Our team won second prize (in extractive summarization) and third prize (in abstractive summarization) in this contest. In addition, our team won second prize in the student scientific research competition at my university and has four papers on summarization.

After the competition, error analysis indicated that the model focused only on the terms mentioned in the question. Meanwhile, related terms such as synonyms, related chemicals, and related diseases also carry a certain degree of importance. This is the main reason this thesis continues the research on question-driven improvements. The thesis proposes several ontology-based improvements that yield a significant gain over the previous model.
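The idea of enriching the question's terms with related terms can be illustrated with a minimal sketch. It assumes a toy lookup table, `RELATED_TERMS`, standing in for the ontology; both the table entries and the function name `expand_keywords` are hypothetical and are not the thesis's actual ontology or method, which is described in Chapter 3.

```python
# Hypothetical related-terms table standing in for an ontology lookup.
# The entries are illustrative only, not taken from the thesis's ontology.
RELATED_TERMS = {
    "covid-19": {"sars-cov-2", "coronavirus"},
    "fever": {"pyrexia", "high temperature"},
}


def expand_keywords(keywords, related_terms):
    """Return the lower-cased keywords plus any related terms found in the table."""
    expanded = set()
    for keyword in keywords:
        key = keyword.lower()
        expanded.add(key)
        # Keywords missing from the table contribute only themselves.
        expanded |= related_terms.get(key, set())
    return expanded


if __name__ == "__main__":
    print(sorted(expand_keywords(["COVID-19", "treatment"], RELATED_TERMS)))
```

With such an expansion, a sentence that mentions "coronavirus" but not "COVID-19" can still be matched against the question.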

</div><span class="text_page_counter">Page 13</span><div class="page_container" data-page="13">

1.2 Problem Statement

Text summarization aims to select or generate important information from the original text(s) to create a shorter version [7]. Humans usually read all the documents to develop an understanding, and then write a summary highlighting the main points. Because machines lack human experience and understanding, generating a text summary is exceedingly difficult and time-consuming for them.

Based on the different characteristics of the summary, text summarization can be classified in several ways, as shown in Figure 1.3 [3].

<small>Text Summarization Approach</small>

Figure 1.3: Classification of Text Summarization Approaches

• According to the input document(s): Single-document summarization and Multi-document summarization. The difference is that single-document summarization focuses on a single text, while multi-document summarization uses multiple documents as input.

• According to the summary usage: Generic and Query-based. The generic approach does not focus on a specific topic or aspect and produces an overview of the sources. In the query-based approach, the result is focused on the user's question.

• According to techniques: Supervised and Unsupervised. Unsupervised approaches rely on algorithms that do not depend on human support, such as labelling training datasets. These models are suitable for big data, such as website data. Supervised methods are based on sentence-level classification, where the model learns to distinguish summary sentences from non-summary sentences.

</div><span class="text_page_counter">Page 14</span><div class="page_container" data-page="14">

• According to output characteristics: Extractive summarization and Abstractive summarization. The extractive method extracts the most important sentences from the documents, and the summary is then made by combining them. As a result, every sentence in the summary belongs to the original document. The abstractive approach, in contrast, tries to rewrite the summary based on the original sentences.

Formal definition According to the Multi-Answer Summarization task requirements<small>7</small>, different answers can bring complementary perspectives that are likely to benefit the users of QA systems. The purpose of this task is to develop a multi-answer summarization model that can handle the challenge of summarizing numerous relevant answers to a medical question. The input to the model is the customer’s question Q and all answers A = {A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n</sub>}. The output is a summary that answers the given question (Figure 1.4). Table 1.1 shows an example of the resulting summary.

<small>User's question; Multiple related answers; Summarization</small>

Figure 1.4: Multi-Answer Summarization pipeline
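The input and output of this task can be illustrated with a minimal sketch of a query-based, unsupervised, extractive summarizer. It assumes a naive regex sentence splitter and a simple word-overlap score between each sentence and the question. The function names and the scoring rule are illustrative assumptions only, not the baseline or the proposed model of this thesis.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def summarize(question, answers, max_sentences=2):
    """Pick the sentences from all answers that share the most words with the question."""
    question_terms = tokenize(question)
    candidates = []  # tuples of (score, position, sentence)
    seen = set()
    for answer in answers:
        for sentence in split_sentences(answer):
            if sentence in seen:  # drop verbatim duplicates across answers
                continue
            seen.add(sentence)
            score = len(question_terms & tokenize(sentence))
            candidates.append((score, len(candidates), sentence))
    # Highest score first; ties are broken by the original position.
    top = sorted(candidates, key=lambda c: (-c[0], c[1]))[:max_sentences]
    # Restore the original order so the summary reads naturally.
    top.sort(key=lambda c: c[1])
    return " ".join(sentence for _, _, sentence in top)


if __name__ == "__main__":
    question = "What bone graft materials are used for spinal fusion?"
    answers = [
        "A synthetic bone substitute is used. Surgery can take 3 to 4 hours.",
        "A bone graft can be taken from a bone bank. Surgery can take 3 to 4 hours.",
    ]
    print(summarize(question, answers))
```

Because sentences are only selected and never rewritten, every sentence in the output appears verbatim in one of the input answers, which is the defining property of the extractive approach.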

Thesis scope In this work, the model focuses on the query-based multi-document extractive summarization approach. In terms of the classification above, the model has four properties: Multi-document, Query-based, Unsupervised, and Extractive. The extractive approach has many advantages, such as (i) short summarization time, (ii) low hardware cost, and (iii) summary quality that is easy to control. In addition, compressing multiple replies into a single answer saves users time and effort. The paragraph is summarized based on the user’s question, which makes the approach highly applicable.

<small>7</small>

</div><span class="text_page_counter">Page 15</span><div class="page_container" data-page="15">

Table 1.1: Example answers to a question and the resulting summary in a medical question answering system (MEDIQA).

<small>Question: What bone graft materials are used for spinal fusion?</small>

<small>Answer 1: You will be asleep and feel no pain (general anesthesia). The doctor will make a surgical cut (incision) to view the spine. Other surgery, such as a diskectomy, laminectomy, or a foraminotomy, is almost always done first. Spinal fusion may be done. On your back or neck over the spine. You will be lying face down. Muscles and tissue will be separated to expose the spine. On your side, if you are having surgery on your lower back. The surgeon will use tools called retractors to gently separate, hold the soft tissues and blood vessels apart, and have room to work. A synthetic bone substitute is used. With a cut on the front of the neck, toward the side. The surgeon will use a graft (such as bone) to hold (or fuse) the bones together permanently. There are several ways of fusing vertebrae together. Strips of bone graft material may be placed over the back part of the spine. Bone graft material may be placed between the vertebrae. Special cages may be placed between the vertebrae. These cages are packed with bone graft material. The surgeon may get the bone graft from different places. From another part of your body (usually around your pelvic bone). This is called an autograft. Your surgeon will make a small cut over your hip and remove some bone from the back of the rim of the pelvis. From a bone bank. This is called an allograft. A synthetic bone substitute can also be used. The vertebrae may also fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed. Surgery can take 3 to 4 hours.</small>

<small>Answer 2: A bone graft can be taken from the person’s own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. You will be asleep and feel no pain (general anesthesia). During surgery, the surgeon makes a cut over the bone defect. The bone graft can be taken from areas close to the bone defect or more commonly from the pelvis. The bone graft is shaped and inserted into and around the area. The bone graft can be held in place with pins, plates, or screws.</small>

<small>Extractive summary: A bone graft can be taken from the person’s own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. The vertebrae may also fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed.</small>

</div>